Binomial and normal distributions
Business Statistics 41000
Fall 2015
Topics
1. Sums of random variables
2. Binomial distribution
3. Normal distribution
4. Vignettes
Topic: sums of random variables
Sums of random variables are important for two reasons:
1. Because we often care about aggregates and totals (sales, revenue, employees, etc.).
2. Because averages are essentially sums, and probabilities are essentially averages (of dummy variables), we will end up using sums of random variables a lot when we go to estimate probabilities.
This second point is the topic of the next lecture. For now, we focus on
the direct case.
A sum of two random variables
Suppose X is a random variable denoting the profit from one wager and
Y is a random variable denoting the profit from another wager.
If we want to consider our total profit, we may consider the random
variable that is the sum of the two wagers, S = X + Y .
To determine the distribution of S, we must first know the joint
distribution of (X , Y ).
A sum of two random variables
Suppose that (X, Y) has the following joint distribution:

             X = -$200   X = $100   X = $200
  Y = $0         0          1/9        3/9
  Y = $100      1/9         2/9        2/9
So S can take the values {−200, −100, 100, 200, 300}.
Notice that there are two ways that S can be $200.
A sum of two random variables
We can directly determine the distribution of S as:
s        Event                        P(S = s)
-$200    -$200 + $0                   0
-$100    -$200 + $100                 1/9
$100     $100 + $0                    1/9
$200     $100 + $100 or $200 + $0     2/9 + 3/9 = 5/9
$300     $200 + $100                  2/9

When determining the distribution of sums of random variables, we lose
information about individual values and aggregate the probability of
events giving the same sum.
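As a sanity check, here is a minimal Python sketch of this aggregation, with the joint distribution above hard-coded as a dictionary:

```python
from fractions import Fraction
from collections import defaultdict

# Joint distribution of (X, Y): P(X = x, Y = y)
joint = {
    (-200, 0):   Fraction(0, 9),
    (100, 0):    Fraction(1, 9),
    (200, 0):    Fraction(3, 9),
    (-200, 100): Fraction(1, 9),
    (100, 100):  Fraction(2, 9),
    (200, 100):  Fraction(2, 9),
}

# Aggregate the probability of all (x, y) pairs giving the same sum s
dist_S = defaultdict(Fraction)
for (x, y), p in joint.items():
    dist_S[x + y] += p

for s in sorted(dist_S):
    print(f"P(S = {s}) = {dist_S[s]}")
```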
Topic: binomial distribution
A binomial random variable can be constructed as the sum of
independent Bernoulli random variables.
Familiarity with the binomial distribution eases many practical probability
calculations.
See OpenIntro sections 3.4 and 3.6.4.
Sums of Bernoulli RVs
When rolling two dice, what is the probability of rolling two ones?
By independence we can calculate this probability as

  P(1, 1) = (1/6) × (1/6) = 1/36.
Now with three dice, what is the probability of rolling exactly two 1’s?
Sums of Bernoulli RVs (cont’d)
The event A = “rolling a one” can be described as a Bernoulli random
variable with p = 1/6.
We can denote the three independent rolls by writing

  Xi ∼ Bernoulli(p), iid, for i = 1, 2, 3.

The notation iid is shorthand for “independent and identically
distributed”.
Determining the probability of rolling exactly two 1’s can be done by
considering the random variable Y = X1 + X2 + X3 and asking for
P(Y = 2).
Sums of Bernoulli random variables (cont’d)
Consider the distribution of Y = X1 + X2 + X3.

Event                 y    P(Y = y)
000                   0    (1 − p)³
001 or 100 or 010     1    (1 − p)(1 − p)p + p(1 − p)(1 − p) + (1 − p)p(1 − p)
011 or 110 or 101     2    (1 − p)p² + p²(1 − p) + p(1 − p)p
111                   3    p³

Remember that for this example p = 1/6.
Sums of Bernoulli random variables (cont’d)
Determining the probability of a certain number of successes requires
knowing 1) the probability of each individual success and 2) the number
of ways that number of successes can arise.

Event                 y    P(Y = y)
000                   0    (1 − p)³
001 or 100 or 010     1    3(1 − p)²p
011 or 110 or 101     2    3(1 − p)p²
111                   3    p³

We find that P(Y = 2) = 3p²(1 − p) = 3(1/36)(5/6) = 5/72.
Sums of Bernoulli random variables (cont’d)
What if we had four rolls, and the probability of success was 1/3?

0000  1000  0100  1100
0010  1010  0110  1110
0001  1001  0101  1101
0011  1011  0111  1111
Sums of Bernoulli random variables (cont’d)
Summing up the probabilities for each of the values of Y, we find:

y    P(Y = y)
0    (1 − p)⁴
1    4(1 − p)³p
2    6(1 − p)²p²
3    4(1 − p)p³
4    p⁴

Substituting p = 1/3 we can now find P(Y = y) for any y = 0, 1, 2, 3, 4.
Definition: N choose y
The number of ways we can arrange y successes among N trials can be
calculated efficiently by a computer. We denote this number with a
special expression.
N choose y
The notation

  (N choose y) = N! / ((N − y)! y!)

designates the number of ways that y items can be assigned to N
possible positions.
This notation can be used to summarize the entries in the previous tables
for various values of N and y .
Definition: Binomial distribution
Binomial distribution
A random variable Y has a binomial distribution with parameters N and
p if its probability distribution function is of the form:

  p(y) = (N choose y) p^y (1 − p)^(N−y)

for integer values of y between 0 and N.
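A minimal Python sketch of this pmf, checked against the three-dice example above (N = 3, p = 1/6):

```python
import math

def binom_pmf(y, N, p):
    """P(Y = y) for Y ~ Bin(N, p)."""
    return math.comb(N, y) * p**y * (1 - p)**(N - y)

# Three dice, success = rolling a one: P(Y = 2) = 5/72
print(binom_pmf(2, 3, 1/6))   # 0.0694...
```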
Example: drunk batter
What is the probability that our alcoholic major-leaguer gets more than 2
hits in a game in which he has 5 at bats?
Let X = “number of hits”. We model X as a binomial random variable
with parameters N = 5 and p = 0.316.

x    P(X = x)
0    (1 − p)⁵
1    5(1 − p)⁴p
2    10(1 − p)³p²
3    10(1 − p)²p³
4    5(1 − p)p⁴
5    p⁵

Substituting p = 0.316 we calculate P(X > 2) = 0.185.
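One way to reproduce this number, assuming scipy is available (binom.sf(k, ...) gives the upper tail P(X > k)):

```python
from scipy.stats import binom

# X ~ Bin(5, 0.316); P(X > 2) = P(X = 3) + P(X = 4) + P(X = 5)
print(binom.sf(2, n=5, p=0.316))   # ~0.185
```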
Example: winning a best-of-seven play-off
Assume that the Chicago Bulls have probability 0.4 of beating the Miami
Heat in any given game and that the outcomes of individual games are
independent.
What is the probability that the Bulls win a seven game series against the
Heat?
Example: winning a best-of-seven play-off (cont’d)
Consider the number of games won by the Bulls over a full seven games
against the Heat. We model this as a binomial random variable Y with
parameters N = 7 and p = 0.4, which we express with the notation
Y ∼ Bin(7, 0.4).
The symbol “∼” is read “distributed as”. “Bin” is short for “binomial”.
The numbers which follow are the values of the two binomial parameters,
the number of independent Bernoulli trials (N) and the probability of
success at each trial (p).
Example: winning a best-of-seven play-off (cont’d)
Although we never see all seven games played (because the series stops
as soon as one team wins four games), we note that in this expanded
event space:

- any event with at least four Bulls wins corresponds to an observable
  Bulls series win,
- any event corresponding to an observed Bulls series win has at least
  four total Bulls wins.
Example: winning a best-of-seven play-off (cont’d)
For example, the observable sequence 011011 (where a 1 stands for a
Bulls win) has two possible completions, 0110110 or 0110111. Any
hypothetical games played beyond the series-ending fourth win can only
increase the total number of wins tallied by Y .
Conversely, the sequence 1010111 is an event corresponding to Y = 5
and we can associate it with the observable subsequence 101011, a Bulls
series win in six games.
Example: winning a best-of-seven play-off (cont’d)
Therefore, the events corresponding to “Bulls win the series” are
precisely those corresponding to Y ≥ 4.
We may conclude that the probability of a series win for the Bulls is
  P(Y ≥ 4) = P(Y = 4) + P(Y = 5) + P(Y = 6) + P(Y = 7) = 0.29.
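The same tail probability, assuming scipy is available:

```python
from scipy.stats import binom

# Y ~ Bin(7, 0.4); P(Y >= 4) = P(Y > 3)
print(binom.sf(3, n=7, p=0.4))   # ~0.290
```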
Example: winning a best-of-seven play-off (cont’d)
We can arrive at this answer without reference to the binomial random
variable Y if we are willing to do our own counting.
  P(Bulls series win) = p⁴ + (4 choose 3) p⁴(1 − p)
                           + (5 choose 3) p⁴(1 − p)²
                           + (6 choose 3) p⁴(1 − p)³
                      = 0.29.

This calculation explicitly accounts for the fact that Bulls series wins
necessarily conclude with a Bulls game win.
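A sketch of this counting argument in Python: a series win in g games requires exactly 3 Bulls wins in the first g − 1 games, then a Bulls win in game g:

```python
import math

p = 0.4
# Win in g games: math.comb(g - 1, 3) arrangements of 3 wins in the
# first g - 1 games, times the probability of the final clinching win.
prob = sum(math.comb(g - 1, 3) * p**3 * (1 - p)**(g - 4) * p
           for g in range(4, 8))
print(prob)   # ~0.290, matching P(Y >= 4)
```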
Example: double lottery winners
In 1971, Jane Adams won the lottery twice in one year! If you read of a
double winner in your daily newspaper, how surprised should you be?
To answer this question we need to make some assumptions. Consider 40
state lotteries. Assume that each ticket has a 1 in 18 million chance of
winning. Assume that each lottery has 1 million people who play it daily
(say, 250 times a year), each buying 5 tickets.
Given these conditions, what is the probability that in one calendar year
there is at least one double winner?
Example: double lottery winners (cont’d)
Let Xi be the random variable denoting how many winning tickets person
i has:

  Xi ∼ Binomial(5 × 250, p = (1/18) × 10⁻⁶).

Now let Yi be the dummy variable for the event Xi > 1, which is the
event that person i is a double (or more) winner:

  Yi ∼ Bernoulli(q).

We can compute q = 1 − P(Xi = 0) − P(Xi = 1) = 2.4 × 10⁻⁹.
Example: double lottery winners (cont’d)
To account for the million people playing the lottery in each of 40
states, we consider Z = Y1 + Y2 + · · · + YN, which is another binomial
random variable:

  Z ∼ Binomial(N = 4 × 10⁷, q).

Finally, the probability that Z > 0 can be found as

  1 − P(Z = 0) = 1 − (1 − q)^N = 1/11.

Not so rare!
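A quick numerical check of both steps, assuming scipy is available:

```python
from scipy.stats import binom

p_win = 1 / 18e6                 # chance a single ticket wins
tickets = 5 * 250                # tickets per person per year
# q: probability a given person wins twice or more in a year
q = binom.sf(1, n=tickets, p=p_win)
print(q)                         # ~2.4e-9

# Chance of at least one double winner among 40 million players
n_players = 40 * 10**6
print(1 - (1 - q) ** n_players)  # ~0.09, about 1 in 11
```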
Example: rural vs. urban hospitals
About as many boys as girls are born in hospitals. In a small Country Hospital
only a few babies are born every week. In the urban center, many babies are
born every week at City General. Say that a normal week is one where between
45% and 55% of the babies are female. An unusual week is one where more
than 55% are girls or more than 55% are boys.
Which of the following is true?

- Unusual weeks occur equally often at Country Hospital and at City
  General.
- Unusual weeks are more common at Country Hospital than at City
  General.
- Unusual weeks are less common at Country Hospital than at City General.
Example: rural vs. urban hospital (cont’d)
We can model the births in the two hospitals as two independent random
variables. Let X = “number of baby girls born at Country Hospital” and
Y =“number of baby girls born at City General”.
X ∼ Binomial(N1 , p)
Y ∼ Binomial(N2 , p)
Assume that p = 0.5. The key difference is that N1 is much smaller than
N2 . To illustrate, assume that N1 = 20 and N2 = 500.
Example: rural vs. urban hospital (cont’d)
During a usual week at the rural hospital between 0.45N1 = 0.45(20) = 9
and 0.55N1 = 0.55(20) = 11 baby girls are born.
The probability of a usual week is P(9 ≤ X ≤ 11) ≈ 0.50, so the
probability of an unusual week is

  1 − P(9 ≤ X ≤ 11) = P(X < 9) + P(X > 11) ≈ 0.5.

Note: satisfying the condition X < 9 is the same as not satisfying the
condition X ≥ 9; strict versus non-strict inequalities make a difference.
Example: rural vs. urban hospital (cont’d)
[Figure: distribution of female births X ∼ Bin(20, 0.5) at Country Hospital; births 0–20 on the x-axis, probability (0.00–0.20) on the y-axis.]
Example: rural vs. urban hospital (cont’d)
In a usual week at the city hospital between 0.45N2 = 0.45(500) = 225
and 0.55N2 = 0.55(500) = 275 baby girls are born.
Then the probability of a usual week is P(225 ≤ Y ≤ 275) = 0.978, so
the probability of an unusual week is

  1 − P(225 ≤ Y ≤ 275) = P(Y < 225) + P(Y > 275) = 0.022.
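Both tail calculations, assuming scipy is available:

```python
from scipy.stats import binom

# Country Hospital: X ~ Bin(20, 0.5); usual week: 9 <= X <= 11
usual_country = binom.cdf(11, 20, 0.5) - binom.cdf(8, 20, 0.5)
print(1 - usual_country)   # ~0.50 of weeks are unusual

# City General: Y ~ Bin(500, 0.5); usual week: 225 <= Y <= 275
usual_city = binom.cdf(275, 500, 0.5) - binom.cdf(224, 500, 0.5)
print(1 - usual_city)      # ~0.022 of weeks are unusual
```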
Example: rural vs. urban hospital (cont’d)
[Figure: distribution of female births Y ∼ Bin(500, 0.5) at City General; births 200–290 on the x-axis, probability (0.000–0.030) on the y-axis.]
Variance of a sum of independent random variables
A useful fact:
Variance of linear combinations of independent random variables
A weighted sum/difference of independent random variables,
Y = a1X1 + · · · + amXm, has variance

  V(Y) = a1²V(X1) + · · · + am²V(Xm).

How can this be used to derive the expression for the variance of a
binomial random variable?
Variance of binomial random variable
Variance of a binomial random variable
A binomial random variable X with parameters N and p has variance
V(X ) = Np(1 − p).
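A quick simulation check of this formula (a sketch, assuming numpy is available):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 20, 0.5
# A Bin(N, p) draw is a sum of N independent Bernoulli(p) draws,
# so its variance should be N * p * (1 - p)
draws = rng.binomial(N, p, size=10**6)
print(draws.var())        # ~5.0
print(N * p * (1 - p))    # 5.0
```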
Variance of a proportion
By dividing through by the total number of babies born each week we
can consider the proportion of girl babies. Define the random variables

  P1 = X/N1   and   P2 = Y/N2.

Then it follows that

  V(P1) = V(X)/N1² = N1 p(1 − p)/N1² = p(1 − p)/N1

and

  V(P2) = V(Y)/N2² = N2 p(1 − p)/N2² = p(1 − p)/N2.
Law of Large Numbers
An arithmetical average of random variables is itself a random variable.
As more and more individual random variables are averaged up, the
variance decreases but the mean stays the same.
As a result, the distribution of the averaged random variable becomes
more and more concentrated around its expected value.
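A small simulation illustrating this (assuming numpy): the mean of the sample proportion stays near p while its spread shrinks:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.7
for N in [10, 20, 50, 150, 300]:
    # 100,000 simulated sample proportions from Bin(N, p) counts
    props = rng.binomial(N, p, size=100_000) / N
    print(f"N={N}: mean={props.mean():.3f}, sd={props.std():.3f}")
# The mean stays near 0.7 while the spread shrinks like 1/sqrt(N)
```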
Law of Large Numbers

[Figures: distributions of the sample proportion for p = 0.7 and N = 10, 20, 50, 150, 300. Each distribution is centered near 0.7, and its spread shrinks as N grows.]
Example: Schlitz Super Bowl taste test
Bell curve approximation to binomial
The binomial distributions can be approximated by a smooth density
function for large N.
[Figure: normal approximation for the binomial distribution with N = 20, p = 0.5; probability mass / density against x = 0, …, 20.]
Bell curve approximation to binomial
[Figure: normal approximation for the binomial distribution with N = 60, p = 0.1; probability mass / density against x = 0, …, 16.]
Bell curve approximation to binomial
[Figure: normal approximation for the binomial distribution with N = 500, p = 0.8; probability mass / density against x = 340, …, 460.]
What are some reasons that very small p or small N lead to bad
approximations?
Central limit theorem
The normal distribution can be “justified” via its relationship to the
binomial distribution. Roughly: if a random outcome is the combined
result of many individual random events, its distribution will follow a
normal curve.
The quincunx or Galton box is a device which physically simulates such
a scenario using ball bearings and pins stuck in a board.
The CLT can be stated more precisely, but the practical impact is just
this: random variables which arise as sums of many other random
variables (not necessarily normally distributed) tend to be normally
distributed.
Normal distributions
The normal family of densities has two parameters, typically denoted µ
and σ 2 , which govern the location and scale, respectively.
[Figure: Gaussian densities for various location parameters, plotted over x from −4 to 4.]
Normal distributions (cont’d)
I will use the terms normal distribution, normal density and normal
random variable more or less interchangeably.
[Figure: mean-zero Gaussian densities with differing scale parameters, plotted over x from −4 to 4.]
The normal distribution is also called the Gaussian distribution or the
bell curve.
Normal means and variances
Mean and variance of a normal random variable
A normal random variable X with parameters µ and σ² is denoted
X ∼ N(µ, σ²). The mean and variance of X are

  E(X) = µ,   V(X) = σ².

The density function is symmetric and unimodal, so the median and
mode of X are also given by the location parameter µ. The standard
deviation of X is σ.
Normal approximation to binomial
The binomial distributions can be approximated by a normal distribution.
Normal approximation to the binomial
A Bin(N, p) distribution can be approximated by a N(Np, Np(1 − p))
distribution for N “large enough”.
Notice that this just “matches” the mean and variance of the two
distributions.
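A quick comparison of the exact pmf with the matching normal density for the N = 20, p = 0.5 case shown earlier, assuming scipy is available:

```python
import numpy as np
from scipy.stats import binom, norm

N, p = 20, 0.5
mu, sd = N * p, np.sqrt(N * p * (1 - p))
for x in [8, 10, 12]:
    # exact binomial probability vs. the matching normal density at x
    print(x, f"{binom.pmf(x, N, p):.4f}", f"{norm.pdf(x, mu, sd):.4f}")
```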
Linear transformation of normal RVs
We can add a fixed number to a normal random variable and/or multiply
it by a fixed number and get a new normal random variable. This sort of
operation is called a linear transformation.
Linear transformation of normal random variables
If X ∼ N(µ, σ²) and Y = a + bX for fixed numbers a and b, then
Y ∼ N(a + bµ, b²σ²).
For example, if X ∼ N(1, 2) and Y = 3 − 5X, then Y ∼ N(−2, 50).
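A simulation sanity check of this example (assuming numpy):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=1, scale=np.sqrt(2), size=10**6)  # X ~ N(1, 2)
y = 3 - 5 * x                                        # Y = 3 - 5X
print(y.mean(), y.var())                             # ~ -2 and ~50
```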
Standard normal RV
Standard normal
A standard normal random variable is one with mean 0 and variance 1.
It is often denoted by the letter Z :
Z ∼ N(0, 1).
We can write any normal random variable as a linear transformation of a
standard normal RV. For normal random variable X ∼ N(µ, σ 2 ), we can
write
X = µ + σZ .
The “empirical rule”
It is convenient to characterize where the “bulk” of the probability mass
of a normal distribution resides by providing an interval, in terms of
standard deviations, about the mean.
[Figure: N(µ, σ) density with 68% of the probability mass shaded between µ − σ and µ + σ.]
The “empirical rule” (cont’d)
The widespread applicability of the normal distribution has led this
rule of thumb to be dubbed the empirical rule.
[Figure: N(µ, σ) density with 95% of the probability mass shaded between µ − 2σ and µ + 2σ.]
The “empirical rule” (cont’d)
It is, for obvious reasons, sometimes called the 68-95-99.7 rule.
[Figure: N(µ, σ) density with 99.7% of the probability mass shaded between µ − 3σ and µ + 3σ.]
The “empirical rule” (cont’d)
To revisit some earlier examples:

- 68% of Chicago daily highs in the winter season are between 19 and
  48 degrees.
- 95% of NBA players are between 6ft and 7ft 2in.
- In 99.7% of weeks, the proportion of baby girls born at City General
  is between 0.4985 and 0.5015.
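These percentages can be recovered from the standard normal CDF (assuming scipy):

```python
from scipy.stats import norm

# Probability mass within k standard deviations of the mean
for k in [1, 2, 3]:
    print(k, f"{norm.cdf(k) - norm.cdf(-k):.4f}")
# 1 -> 0.6827, 2 -> 0.9545, 3 -> 0.9973
```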
Sums of normal random variables
Weighted sums of normal random variables are also normally distributed.
For example if X1 ∼ N(5, 20) and X2 ∼ N(1, 0.5), then for
Y = 0.1X1 + 0.9X2,

  Y ∼ N(m, v),

where m = 0.1(5) + 0.9(1) = 1.4 and v = 0.1²(20) + 0.9²(0.5) = 0.605.
Linear combinations of normal RVs
Linear combinations of independent normal random variables
For i = 1, . . . , n, let Xi ∼ N(µi, σi²) independently. Define
Y = a1X1 + · · · + anXn for weights a1, a2, . . . , an. Then

  Y ∼ N(m, v)

where

  m = a1µ1 + · · · + anµn   and   v = a1²σ1² + · · · + an²σn².
Example: two-stock portfolio
Consider two stocks, A and B, with annual returns (in percent of
investment) distributed according to normal distributions

  XA ∼ N(5, 20)   and   XB ∼ N(1, 0.5).

What fraction of our investment should we put into stock A, with the
remainder put in stock B?
Example: two-stock portfolio (cont’d)
For a given fraction α, the total return on our portfolio is

  Y = αXA + (1 − α)XB

with distribution Y ∼ N(m, v), where m = 5α + (1 − α) and
v = 20α² + 0.5(1 − α)².
Example: two-stock portfolio (cont’d)
Suppose we want to find α so that P(Y ≤ 0) is as small as possible.
[Figure: return densities for Stock A and Stock B over percent returns −5 to 20, along with densities (blue in the original) for portfolios with varying values of α.]
Example: two-stock portfolio (cont’d)
We can plot the probability of a loss as a function of α.
[Figure: probability of a loss, P(Y ≤ 0), as a function of α from 0 to 1; the curve dips to a minimum near α = 0.11.]

We see that this probability is minimized when α ≈ 0.11.
This is the LLN at work!
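A sketch of the underlying search (assuming numpy and scipy): evaluate P(Y ≤ 0) on a grid of α values and take the minimizer:

```python
import numpy as np
from scipy.stats import norm

alphas = np.linspace(0, 1, 101)
m = 5 * alphas + (1 - alphas)                  # portfolio mean
v = 20 * alphas**2 + 0.5 * (1 - alphas)**2     # portfolio variance
loss_prob = norm.cdf(0, loc=m, scale=np.sqrt(v))
print(alphas[loss_prob.argmin()])              # ~0.11
```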
Variance of a sum of correlated random variables
For correlated (dependent) random variables, we have a modified formula:
Variance of linear combinations of two correlated random variables
A weighted sum/difference of random variables Y = a1X1 + a2X2 has
variance

  V(Y) = a1²V(X1) + a2²V(X2) + 2a1a2 Cov(X1, X2).
There is a homework problem that asks you to find the variance of
portfolios of stocks, as in the example above, for stocks which are related
to one another (in a common industry, for example).
Vignettes
1. Differential dispersion
2. Average number of sex partners
3. Mean reversion
Vignette: a difference in dispersion
In this vignette we observe how selection (in the sense of evolution, or
hiring, or admissions) can turn higher variability into over-representation.
The analysis uses the ideas of random variables, distribution functions,
and conditional probability.
For more background, read the article “Sex Ed” from the February 2005
issue of the New Republic (available at the course home page).
A difference in dispersion
Consider two groups of college graduates with “employee fitness scores”
following the distributions shown below.
Score       -5     -4     -3     -2     -1      0      1      2      3      4      5
Group A   0.043  0.051  0.064  0.085  0.128  0.256  0.128  0.085  0.064  0.051  0.043
Group B   0.003  0.008  0.023  0.063  0.171  0.464  0.171  0.063  0.023  0.008  0.003

These distributions have the same mean, the same median, and the same
mode. But they differ in their dispersion, or variability.
A difference in dispersion (cont’d)
Let X denote the random variables recording the scores and let A and B
denote membership in the respective groups.
V (X | A) = 5.87 and V (X | B) = 1.666.
The corresponding standard deviations are σ(X | A) = 2.42 and
σ(X | B) = 1.29.
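These moments follow directly from the tabulated probabilities; a minimal Python check:

```python
scores = range(-5, 6)
pA = [0.043, 0.051, 0.064, 0.085, 0.128, 0.256, 0.128, 0.085, 0.064, 0.051, 0.043]
pB = [0.003, 0.008, 0.023, 0.063, 0.171, 0.464, 0.171, 0.063, 0.023, 0.008, 0.003]

def variance(ps):
    mean = sum(x * q for x, q in zip(scores, ps))   # 0 by symmetry
    return sum((x - mean) ** 2 * q for x, q in zip(scores, ps))

print(variance(pA), variance(pB))   # ~5.87 and ~1.67
```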
A difference in dispersion (cont’d)
But now consider only elite jobs, for which it is necessary that fitness
score X ≥ 4.
We can use Bayes’ rule to calculate P(A | X ≥ 4) and P(B | X ≥ 4).
A difference in dispersion (cont’d)
If we assume a priori that P(A) = P(B) = 1/2, we find

  P(A | X ≥ 4) = P(X ≥ 4 | A)P(A) / [P(X ≥ 4 | A)P(A) + P(X ≥ 4 | B)P(B)]
               = 0.094(0.5) / [0.094(0.5) + 0.012(0.5)]
               = 0.89.

Why don’t we need to calculate P(B | X ≥ 4) separately?
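A one-line Bayes'-rule check (the tail probabilities come from the table above; the slide rounds P(X ≥ 4 | B) up to 0.012, hence 0.89):

```python
prior_A = prior_B = 0.5
tail_A = 0.051 + 0.043   # P(X >= 4 | A) = 0.094
tail_B = 0.008 + 0.003   # P(X >= 4 | B) = 0.011 (slide rounds to 0.012)

post_A = tail_A * prior_A / (tail_A * prior_A + tail_B * prior_B)
print(post_A)            # ~0.90; with 0.012 this is the slide's 0.89
```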
Larry Summers and women-in-science
“Summers’s critics have repeatedly mangled his suggestion that
innate differences might be one cause of gender disparities ... into
the claim that they must be the only cause. And they have
converted his suggestion that the statistical distributions of men’s
and women’s abilities are not identical to the claim that all men are
talented and all women are not–as if someone heard that women
typically live longer than men and concluded that every woman lives
longer than every man. . . .
In many traits, men show greater variance than women, and are
disproportionately found at both the low and high ends of the
distribution. Boys are more likely to be learning disabled or retarded
but also more likely to reach the top percentiles in assessments of
mathematical ability, even though boys and girls are similar in the
bulk of the bell curve. . . .”
Stephen Pinker in The New Republic
Example: gender and aptitudes revisited
Assume that job “aptitude” can be represented as a continuous random
variable and that the distribution of scores differs by gender.
[Figure: aptitude score densities for women and men over scores −6 to 6; the men’s density is more spread out.]
For women, 93.7% of the scores are between the vertical dashed lines,
whereas only 68.6% of the men’s scores fall in this range.
Example: gender and aptitudes revisited (cont’d)
The corresponding CDFs reveal the same difference.

[Figure: cumulative distribution functions F(x) of the two aptitude distributions over scores −6 to 6.]
These distributions are meant to be illustrative rather than factual.
Sex partners vignette: which average?
Here is a torn-from-the-headlines example of why it pays to know a little
probability.
“Everyone knows men are promiscuous by nature...Surveys bear
this out. In study after study and in country after country, men
report more, often many more, sexual partners than women...
But there is just one problem, mathematicians say. It is
logically impossible for heterosexual men to have more partners
on average than heterosexual women. Those survey results
cannot be true.”
A sex-partners statistical model
Question: is it possible for men to have more sex partners, on average,
than women?
To answer this question, we will consider a “toy” probability model for
homo sapiens mating behavior.
         Sally   Chastity   Maude
John     0.07      0.5       0.05
Lenny    0.06      0.5       0.04
Romeo    0.05      0.5       0.09

Each entry is the probability that the given pair partners.

Let’s call it the “summer camp” model.
A sex-partners random variable
The quantity of interest is the number of sex partners. In our model, this
will be a number between 0 and 3.
For each individual we can compute the distribution of this random
variable. We will denote individuals by their first initial and describe
each outcome by the set of men who were partners.
We will assume independence. This means, for example, that Sally
hooking up with Romeo makes it neither more nor less likely that she will
hook up with Lenny.
Sally’s sex-partner distribution
Event                    x    P(Xs = x)
∅ (no partners)          0    (1 − 0.07)(1 − 0.06)(1 − 0.05)
{J}, {L}, or {R}         1    (0.07)(1 − 0.06)(1 − 0.05) + (1 − 0.07)(0.06)(1 − 0.05) + (1 − 0.07)(1 − 0.06)(0.05)
{J,L}, {L,R}, or {J,R}   2    (0.07)(0.06)(1 − 0.05) + (1 − 0.07)(0.06)(0.05) + (0.07)(1 − 0.06)(0.05)
{J,L,R}                  3    (0.07)(0.06)(0.05)

Can you see the probability laws in action here?
Sally’s sex-partner distribution
Event                    x    ps(x) = P(Xs = x)
∅                        0    0.83
{J}, {L}, or {R}         1    0.16
{J,L}, {L,R}, or {J,R}   2    0.01
{J,L,R}                  3    0.0002

Here is what it looks like after the calculation (rounded a bit). We can
do similarly for each individual.
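A minimal enumeration of Sally's distribution in Python, assuming independence across the three pairings:

```python
from itertools import product

# Sally's partnering probabilities with John, Lenny, Romeo
probs = [0.07, 0.06, 0.05]

dist = {x: 0.0 for x in range(4)}
# Enumerate all 2^3 partner/no-partner outcomes; independence lets us
# multiply the per-pair probabilities.
for outcome in product([0, 1], repeat=3):
    pr = 1.0
    for happened, q in zip(outcome, probs):
        pr *= q if happened else 1 - q
    dist[sum(outcome)] += pr

print(dist)   # {0: ~0.8305, 1: ~0.1592, 2: ~0.0101, 3: ~0.0002}
```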
Sally’s sex-partners distribution
Here is a picture of Sally’s sex partner distribution.
[Figure: Sally’s sex-partner distribution: P(0) = 0.8305, P(1) = 0.1592, P(2) = 0.0101, P(3) = 0.0002.]

The mean is 0(0.83) + 1(0.16) + 2(0.01) + 3(0.0002) = 0.18. What is the
mode? What is the median?
Female sex-partner distribution
To get the distribution for all females, we sum over the individual women.
We apply the law of total probability using all three conditional
distributions:
pfemale (x) = ps (x)P(Sally) + pc (x)P(Chastity) + pm (x)P(Maude).
We assume that the women are selected at random with equal probability
P(Maude) = P(Chastity) = P(Sally) = 1/3.
Female sex-partner distribution
At the end we get a distribution like this.
[Figure: female sex-partner distribution: P(0) = 0.5951, P(1) = 0.2315, P(2) = 0.1315, P(3) = 0.0418.]

The mean is 0.62, the mode is 0, and the median is 0.
Male sex-partner distribution
We can do the same thing for the males, and we get this.
[Figure: male sex-partner distribution: P(0) = 0.4417, P(1) = 0.4983, P(2) = 0.0583, P(3) = 0.0017.]

The mean is 0.62, the mode is 1, and the median is 1.
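A sketch that reproduces both group distributions from the pairing table, mixing the three members of each group with weight 1/3 (the law of total probability):

```python
from itertools import product

# Pair probabilities: rows = John, Lenny, Romeo; columns = Sally, Chastity, Maude
P = [[0.07, 0.50, 0.05],
     [0.06, 0.50, 0.04],
     [0.05, 0.50, 0.09]]

def partner_dist(probs):
    """Distribution of the number of partners, assuming independent pairings."""
    dist = [0.0] * (len(probs) + 1)
    for outcome in product([0, 1], repeat=len(probs)):
        pr = 1.0
        for happened, q in zip(outcome, probs):
            pr *= q if happened else 1 - q
        dist[sum(outcome)] += pr
    return dist

women = [[row[w] for row in P] for w in range(3)]   # one column per woman
men = P                                             # one row per man

for label, group in [("female", women), ("male", men)]:
    dists = [partner_dist(member) for member in group]
    # Mix the three members with weight 1/3 each
    mixture = [sum(d[x] for d in dists) / 3 for x in range(4)]
    print(label, [round(v, 4) for v in mixture])
```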
Sex-partners vignette recap
The narrow lesson is that it pays to be specific about which measure of
central tendency you’re talking about!
The more general lesson is that using probability models and a little bit
of algebra can help us see a situation more clearly.
This example uses the concepts of random variable, independence,
conditional distribution, mean, median...and others.
Idea: statistical “null” hypotheses
The hypothesis that events are independent often makes a nice contrast
to other explanations, namely that random events are somehow related.
This vantage point allows us to judge if those other explanations fit the
facts any better than the uninteresting “null” explanation that events are
independent.
Vignette: making better pilots
Flight instructors have a policy of berating pilots who make bad landings.
They notice that good landings met with praise mostly result in
subsequently less-good landings, while bad landings met with harsh
criticism mostly result in subsequently improved landings.
Is their causal reasoning necessarily valid?
To stress-test their judgment that “criticism works” we consider the
evidence in light of the null hypothesis that subsequent landings are in
fact independent of one another, regardless of criticism or praise.
Example: making better pilots (cont’d)
Contrary to the assumptions of the instructors, consider each landing as
independent of subsequent landings (irrespective of feedback).
Assume that landings can be classified into three types: bad, adequate,
or good. Further assume the following probabilities:

Event       Probability
bad         pb
adequate    pa
good        pg

Remember that pb + pa + pg = 1.
Example: making better pilots (cont’d)
Assume that the policy of criticism is judged to work when a bad
landing is followed by a not-bad landing. Then

  P(criticism seems to work) = P(not bad2 | bad1) = P(not bad2) = pa + pg

by independence.
Conversely, the policy of praise appears to work when a good landing is
followed by another good landing. So

  P(praise seems to work) = P(good2 | good1) = P(good2) = pg.

Praise always appears to work less often than criticism!
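A small simulation of this null model (assuming numpy; the landing probabilities pb = 0.2, pa = 0.5, pg = 0.3 are illustrative values, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
landings = rng.choice(["bad", "adequate", "good"],
                      p=[0.2, 0.5, 0.3], size=100_000)

prev, nxt = landings[:-1], landings[1:]
# After a bad landing (criticized), how often does the next one improve?
print((nxt[prev == "bad"] != "bad").mean())    # ~0.8 = pa + pg
# After a good landing (praised), how often is the next one also good?
print((nxt[prev == "good"] == "good").mean())  # ~0.3 = pg
```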
Remark: null and alternative hypotheses
The previous example shows that the evidence can appear to favor
criticism over praise even if criticism and praise are totally irrelevant.
Does this mean that criticism does not work?
No, it just means that the observed facts are not compelling evidence
that criticism works, because they are entirely consistent with the null
hypothesis that landing quality is independent of previous landings and
feedback.
In cases like this we say we “fail to reject the null hypothesis”. We’ll
revisit this terminology a couple weeks from now.
Example: making better pilots (continuous version)
What if we want to take pilot skill into account?
We will model this situation using normal random variables and see if the
same conclusions (that praise appears to hurt performance and criticism
seems to boost it) could arise by chance.
Example: making better pilots (continuous version, cont’d)
Assume that each pilot has a certain ability level, call it A. Each
individual landing score arises as a combination of this ability and
certain random fluctuations, call them ε. The landing score at time t
can be expressed as

  St = A + εt.

Assuming that εt ∼ iid N(0, σ²), then

  St ∼ N(A, σ²).
Example: making better pilots (continuous version, cont’d)
Denote an average landing score as M. Consider a pilot with A > M.
When he makes an exceptional landing, because ε1 > 2σ, he is unlikely to
best it on his next landing.

[Figure: density of the next landing score S2 ∼ N(A, σ²), with M, A, and A + ε1 marked; A + ε1 sits far in the right tail.]

For this reason, praise is unlikely to work even though landings are
independent of one another.
Example: making better pilots (continuous version, cont’d)
For a poor pilot with A < M a similar argument holds. When he makes a
very poor landing, because ε1 < −2σ, he is unlikely to do worse on his
next landing.

[Figure: density of the next landing score S2 ∼ N(A, σ²), with A + ε1, A, and M marked; A + ε1 sits far in the left tail.]

For this reason, criticism is likely to “work” even though landings are
independent.
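A simulation sketch of this effect (assuming numpy; the ability A and scale σ are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)
A, sigma = 1.0, 1.0                      # one pilot's ability and noise scale
s1 = A + rng.normal(0, sigma, 100_000)   # first landing scores
s2 = A + rng.normal(0, sigma, 100_000)   # second landings, independent of s1

exceptional = s1 > A + 2 * sigma
# After an exceptional landing, the next one is almost always worse
print((s2[exceptional] < s1[exceptional]).mean())   # ~0.98
```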
Idea: mean reversion
The previous example illustrates an idea known as mean reversion.
This name refers to the fact that subsequent observations tend to be
“pulled back” towards the overall mean even if the events are
independent of one another.
Mean reversion describes a probabilistic fact, not a physical process.
What might the flight instructors have done (as an experiment) to really
get to the bottom of their question?