Lectures 12–21
Random Variables
Definition: A random variable (rv or RV) is a real valued function defined on the sample space.
The term “random variable” is a misnomer, in view of the normal usage of function and variable.
Random variables are denoted by capital letters from the end of the alphabet, e.g. X, Y , Z,
but other letters are used as well, e.g., B, M , etc.
Hence: X : S −→ R and X(e) for e ∈ S is a particular value of X. Since the e are random
outcomes of the experiment, it will make the function values X(e) random as a consequence.
Hence the terminology “random variable”, although “random function value” might have been
less confusing. Random variables are simply a bridge from the sample space to the realm of
numbers, where we can perform arithmetic.
A random variable is different from a random function, where the evaluation for each e is a
function trajectory, e.g., against time as in a stock market index for a given day. Such random
functions are also known as stochastic processes. We will only have limited exposure to them.
Example 1 (Roll of Two Dice): The sum X of the two numbers facing up is an rv.
Example 2 (Toss of Three Coins): The number X of heads in the toss of three coins is a
random variable. Compute P (X = i) = P ({X = i}) for i = 0, 1, 2, 3. The event {X = i} is
shorthand for {e ∈ S : X(e) = i}.
Example 3 (Urn Problem): Three balls are randomly selected (without replacement) from an
urn containing 20 balls labeled 1, 2, . . . , 20. We bet that we will get at least one label ≥ 17.
What is the probability of winning the bet? This problem could be solved without involving the
notion of a random variable. For the sake of working with the concept of a random variable let X
be the maximum number of the three balls drawn. Hence we are interested in P (X ≥ 17), which
is computed as follows:
P (X ≥ 17) = P (X = 17) + P (X = 18) + P (X = 19) + P (X = 20) = 1 − P (X ≤ 16)
with
P (X = i) = \binom{i−1}{2} / \binom{20}{3}   for i = 3, 4, . . . , 20 ,

=⇒ P (X ≥ 17) = [ \binom{16}{2} + \binom{17}{2} + \binom{18}{2} + \binom{19}{2} ] / \binom{20}{3}
             = 2/19 + 34/285 + 51/380 + 3/20 = .50877 = 1 − \binom{16}{3} / \binom{20}{3}
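As an added numerical check (not part of the original notes), the same probability can be computed in R with choose():

  p.max <- function(i) choose(i - 1, 2) / choose(20, 3)   # P(X = i)
  sum(p.max(17:20))                    # 0.5087719
  1 - choose(16, 3) / choose(20, 3)    # same value via the complement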
Example 4 (Coin Toss with Stopping Rule): A coin (with probability p of heads) is tossed
until either a head is obtained or until n tosses are made. Let X be the number of tosses made.
Find P (X = i) for i = 1, . . . , n.
Solution: P (X = i) = (1 − p)^{i−1} p for i = 1, . . . , n − 1 and P (X = n) = (1 − p)^{n−1} .
Check that probabilities add to 1.
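A small added R sketch, with n and p chosen arbitrarily, confirms that they do:

  n <- 7; p <- 0.3                       # arbitrary illustrative values
  probs <- c((1 - p)^(0:(n - 2)) * p,    # P(X = i), i = 1, ..., n - 1
             (1 - p)^(n - 1))            # P(X = n)
  sum(probs)                             # 1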
Example 5 (Coupon Collector Problem): There are N types of coupons. Each time a coupon
is obtained, it is, independently of previous selections, equally likely to be one of the N types. We
are interested in the random variable T = the number of coupons that needs to be collected to get
a full set of N coupons. Rather than get P (T = n) immediately, we obtain P (T > n).
Let Aj be the event that coupon j is not among the first n collected coupons. By the inclusion-exclusion formula we have

P (T > n) = P ( \bigcup_{i=1}^{N} A_i ) = \sum_{i=1}^{N} P (A_i) − \sum_{i_1 < i_2} P (A_{i_1} A_{i_2}) + . . . + (−1)^{N+1} P (A_1 A_2 . . . A_N)

with P (A_1 A_2 . . . A_N) = 0, of course, and for i_1 < i_2 < . . . we have

P (A_i) = ((N − 1)/N)^n ,   P (A_{i_1} A_{i_2}) = ((N − 2)/N)^n ,   . . . ,   P (A_{i_1} A_{i_2} . . . A_{i_k}) = ((N − k)/N)^n

and thus

P (T > n) = N ((N − 1)/N)^n − \binom{N}{2} ((N − 2)/N)^n + \binom{N}{3} ((N − 3)/N)^n − . . . + (−1)^N \binom{N}{N−1} (1/N)^n
         = \sum_{i=1}^{N−1} (−1)^{i+1} \binom{N}{i} ((N − i)/N)^n
and P (T > n − 1) = P (T = n) + P (T > n), hence P (T = n) = P (T > n − 1) − P (T > n).
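The inclusion-exclusion sum is easy to evaluate numerically; here is a short added R sketch (N and n are arbitrary choices):

  P.T.gt <- function(n, N) {           # P(T > n) by inclusion-exclusion
    i <- 1:(N - 1)
    sum((-1)^(i + 1) * choose(N, i) * ((N - i) / N)^n)
  }
  P.T.eq <- function(n, N) P.T.gt(n - 1, N) - P.T.gt(n, N)   # P(T = n)
  P.T.gt(20, 6)                        # chance 20 coupons do not yet contain all 6 types
  sum(sapply(6:300, P.T.eq, N = 6))    # P(T = n) sums to (essentially) 1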
Distribution Functions
Example 6 (First Heads): Toss fair coin until first head lands up. Let X be the number of
tosses required. Then P (X ≤ k) = 1 − P (X ≥ k + 1) = 1 − P (k tails in first k tosses) = 1 − 0.5^k .
Definition: The cumulative distribution function (cdf or CDF) or more simply the distribution
function F of the random variable X is defined for all real numbers b as
F (b) = P (X ≤ b) = P ({e : X(e) ≤ b}) .
Example 7 (Using a CDF): Suppose the cdf of the random variable X is given by

F (x) = 0 for x < 0 ,   x/2 for 0 ≤ x < 1 ,   2/3 for 1 ≤ x < 2 ,   11/12 for 2 ≤ x < 3 ,   1 for 3 ≤ x

[Figure: plot of this step-function cdf F(x) against x, with jumps at x = 1, 2, 3.]

Compute P (X < 3), P (X = 1), P (X > .5) and P (2 < X ≤ 4).
Answers: P (X < 3) = 11/12 ,   P (X = 1) = 2/3 − 1/2 = 1/6 ,   P (X > .5) = 1 − .5/2 = .75 ,   P (2 < X ≤ 4) = 1 − 11/12 = 1/12 .
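A small added R sketch of this cdf reproduces the four answers (left limits approximated numerically):

  F <- function(x) ifelse(x < 0, 0,
                   ifelse(x < 1, x / 2,
                   ifelse(x < 2, 2 / 3,
                   ifelse(x < 3, 11 / 12, 1))))
  F(3 - 1e-9)           # P(X < 3)      = 11/12
  F(1) - F(1 - 1e-9)    # P(X = 1)      = 1/6
  1 - F(0.5)            # P(X > .5)     = .75
  1 - F(2)              # P(2 < X <= 4) = 1/12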
L12 ends
Discrete Random Variables
Definition: A random variable X which can take on at most a countable number of values is
called a discrete random variable. For such a discrete random variable we define its probability
mass function (pmf ) p(a) of X by
p(a) = P (X = a) = P ({e : X(e) = a}) for all a ∈ R .
p(a) is positive for at most a countable number of values of a. If X assumes only one of the
following values x1 , x2 , x3 , . . . then
p(xi ) ≥ 0 for i = 1, 2, 3, . . .
and p(x) = 0 for all other values of x
Graphical representation of p(x) (one die, sum of two dice):
Example 8 (Poisson): Suppose the discrete random variable X has pmf p(i) = c λ^i /i! for
i = 0, 1, 2, . . . where λ is some positive value and c = exp(−λ) makes the probabilities add to
one. Find P (X = 0) and P (X > 2). P (X = 0) = c = exp(−λ),
P (X > 2) = 1 − P (X = 0) − P (X = 1) − P (X = 2) = 1 − exp(−λ)(1 + λ + λ^2 /2).
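These are exactly the quantities returned by dpois and ppois; an added R check with λ = .5 as an arbitrary choice:

  lambda <- 0.5
  dpois(0, lambda)                                 # P(X = 0) = exp(-lambda)
  1 - ppois(2, lambda)                             # P(X > 2)
  1 - exp(-lambda) * (1 + lambda + lambda^2 / 2)   # same, from the formula above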
The cdf F of a discrete random variable X can be expressed as
F (a) = \sum_{x: x ≤ a} p(x) .
The c.d.f. of a discrete random variable X is a step function with a possible step at each of its
possible values x1 , x2 , . . . and being flat in between.
Example 9 (Discrete CDF): p(1) = .25, p(2) = .5, p(3) = .125 and p(4) = .125, construct the
c.d.f. and graph it. Interpret the step size.
Expected Value or Mean of X
A very important concept in probability theory is that of the expected value or mean of a
random variable X. For a discrete RV it is defined as
E[X] = E(X) = µ = µ_X = \sum_{x: p(x)>0} x · p(x) = \sum_{x} x · p(x) ,
the probability weighted average of all possible values of X.
If X takes on the two values 0 and 1 with probabilities p(0) = .5 and p(1) = .5 then
E[X] = 0 · .5 + 1 · .5 = .5, which is half way between 0 and 1.
When p(1) = 2/3 and p(0) = 1/3, then E[X] = 2/3, twice as close to 1 as to 0. That’s because
the probability of 1 is twice that of 0. The double weight 2/3 at 1 balances the weight of 1/3 at 0,
when the fulcrum of balance is set at 2/3 = E[X].
weight1 · moment arm1 = weight2 · moment arm2 or 2/3 · 1/3 = 1/3 · 2/3, where the moment arm
is measured as the distance of the weight from the fulcrum, here at 2/3 = E[X].
moment arm1 = |2/3 − 1| = 1/3 and moment arm2 = |2/3 − 0| = 2/3
This is a general property of E[X], not just limited to RVs with two values.
If a is the location of the fulcrum, then we get balance when
\sum_{x<a} (a − x) p(x) = \sum_{x>a} (x − a) p(x)   or   0 = − \sum_{x<a} (a − x) p(x) + \sum_{x>a} (x − a) p(x) = \sum_{x} (x − a) p(x)

or   0 = \sum_{x} x p(x) − a \sum_{x} p(x)   or   0 = E[X] − a   or   a = E[X]
The term expectation can again be linked to our long run frequency motivation for probabilities.
If we play the same game repeatedly, say a large number N times, with payoffs being one of the
amounts x1 , x2 , x3 , . . ., then we would roughly see these amounts with approximate relative
frequencies p(x1 ), p(x2 ), p(x3 ), . . ., i.e., with approximate frequencies N p(x1 ), N p(x2 ), N p(x3 ), . . .,
thus realizing in N such games the following total payoff:
N p(x1 ) · x1 + N p(x2 ) · x2 + N p(x3 ) · x3 + . . .
i.e., on a per game basis
[ N p(x1) · x1 + N p(x2) · x2 + N p(x3) · x3 + . . . ] / N = p(x1) · x1 + p(x2) · x2 + p(x3) · x3 + . . . = E[X]
On average we expect to win E[X] (or lose E[X], if E[X] < 0).
Example 10 (Rolling a Fair Die): If X is the number showing face up on a fair die, we get
E[X] = (1/6) · 1 + (1/6) · 2 + (1/6) · 3 + (1/6) · 4 + (1/6) · 5 + (1/6) · 6 = (3 · 7)/6 = 7/2
Indicator Variable: For any event E we can define the indicator RV
I = I_E(e) = 1 if e ∈ E and I_E(e) = 0 if e ∉ E   =⇒   E[I] = P (E) · 1 + P (E^c) · 0 = P (E)
Example 11 (Quiz Show): You are asked two different types of questions, but the second one
only when you answer the first correctly. When you answer a question of type i correctly you get
a prize of Vi dollars. In which order should you attempt to answer the question types, when you
know your chances of answering questions of type i are Pi , i = 1, 2, respectively? Or does the
order even matter? Assume that the events of answering the questions correctly are independent.
If you choose to answer a question of type i = 1 first, your winnings are
0 with probability 1 − P1
V1 with probability P1 (1 − P2 )
V1 + V2 with probability P1 P2
with expected winnings W1 : E[W1 ] = V1 P1 (1 − P2 ) + (V1 + V2 )P1 P2 .
When answering the type 2 question first, you get the same expression with indices exchanged,
i.e., E[W2 ] = V2 P2 (1 − P1 ) + (V1 + V2 )P1 P2 . Thus
E[W1 ] > E[W2 ] ⇐⇒ V1 P1 (1 − P2 ) > V2 P2 (1 − P1 ) ⇐⇒ V1 P1 / (1 − P1 ) > V2 P2 / (1 − P2 ) ,
i.e., the choice should be ordered by odds-weighted payoffs. Example: P1 = .8, V1 = 900, P2 = .4,
V2 = 6000, then 900 · .8/.2 = 3600 < 6000 · .4/.6 = 4000 and E[W1 ] = 2640 < E[W2 ] = 2688.
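An added R check of this numerical example:

  P1 <- .8; V1 <- 900; P2 <- .4; V2 <- 6000
  V1 * P1 * (1 - P2) + (V1 + V2) * P1 * P2    # E[W1] = 2640
  V2 * P2 * (1 - P1) + (V1 + V2) * P1 * P2    # E[W2] = 2688
  c(V1 * P1 / (1 - P1), V2 * P2 / (1 - P2))   # odds-weighted payoffs 3600 and 4000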
L13 ends
Expectation of g(X): For an RV X and a function g : R → R, we can view Y = g(X) again as
a random variable. Find its expectation.
Two ways: Find the pX (x)-weighted average of all g(X) values, or given the pmf pX (x) of X, find
the pmf pY (y) of Y = g(X), then its expectation as the pY (y) weighted average of all Y values.
Example: Let X have values −2, 2, 4 with pX (−2) = .25, pX (2) = .25, pX (4) = .5, respectively.
=⇒

   x    pX (x)   x^2   x^2 pX (x)              y = x^2   pY (y)   y pY (y)
  −2     .25      4        1                       4       .5        2
   2     .25      4        1          =⇒         16       .5        8
   4     .50     16        8                                        --
                          --                                        10
                          10

\sum_{x} x^2 pX (x) = (−2)^2 · .25 + 2^2 · .25 + 4^2 · .5 = 4 · (.25 + .25) + 16 · .5 = \sum_{y} y pY (y) = 10 ,

where the term 4 · (.25 + .25) exhibits pY (4) = .25 + .25 = .5.
What we see in this special case holds in general for discrete RVs X and functions Y = g(X)
E[Y ] = \sum_{y} y pY (y) = \sum_{x} g(x) pX (x) = E[g(X)]
The formal proof idea is already contained in the above example, so we skip it; see the book for a
formal proof in full notation, or the tables above.
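The two routes to E[X^2] in the example can be checked with a few added lines of R:

  x  <- c(-2, 2, 4); px <- c(.25, .25, .5)
  sum(x^2 * px)                                    # g(x)-route: 10
  y  <- unique(x^2)                                # values of Y = X^2
  py <- sapply(y, function(v) sum(px[x^2 == v]))   # pY(4) = .5, pY(16) = .5
  sum(y * py)                                      # 10 again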
Example 12 (Business Planning): A seasonal product (say skis), when sold in timely fashion,
yields a net profit of b dollars for each unit sold, and a net loss of ℓ dollars, when it needs to be
sold at season’s end at a fire sale. Assume that the customer demand for the number X of units
is an RV with pmf pX (x) = p(x) and assume s units are stocked. When X > s, the excess orders
cannot be filled. Then the realized profit Q(s) is an RV, namely
Q(s) = bX − (s − X)ℓ  if X ≤ s ,   and   Q(s) = sb  if X > s ,
with expected profit
E[Q(s)] = \sum_{i=0}^{s} (bi − (s − i)ℓ) p(i) + sb \sum_{i=s+1}^{∞} p(i)
        = (b + ℓ) \sum_{i=0}^{s} i p(i) − sℓ \sum_{i=0}^{s} p(i) + sb [ 1 − \sum_{i=0}^{s} p(i) ]
        = sb + (b + ℓ) \sum_{i=0}^{s} i p(i) − s(b + ℓ) \sum_{i=0}^{s} p(i) = sb + (b + ℓ) \sum_{i=0}^{s} (i − s) p(i)
Find the value s that maximizes this expected value.
We examine what happens to E[Q(s)] as we increase s to s + 1.
E[Q(s + 1)] = (s + 1)b + (b + ℓ) \sum_{i=0}^{s+1} (i − s − 1) p(i) = (s + 1)b + (b + ℓ) \sum_{i=0}^{s} (i − s − 1) p(i)

=⇒ E[Q(s + 1)] − E[Q(s)] = b − (b + ℓ) \sum_{i=0}^{s} p(i) > 0 ⇐⇒ \sum_{i=0}^{s} p(i) < b/(b + ℓ)
Since \sum_{i=0}^{s} p(i) increases with s and since b/(b + ℓ) is constant, there is a largest s, say s∗ , for
which this inequality holds, and thus the maximum expected profit is E[Q(s∗ + 1)], achieved
when stocking s∗ + 1 items. We need to know p(i), i = 0, 1, . . ., e.g., from past experience.
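An added R sketch of this rule; the demand pmf (Poisson with mean 20) and the values of b and ℓ are made-up assumptions, not part of the notes:

  b <- 30; ell <- 10                  # assumed profit and loss per unit
  p <- function(i) dpois(i, 20)       # assumed demand pmf
  s.star <- max(which(cumsum(p(0:100)) < b / (b + ell))) - 1   # largest s with F(s) < b/(b+ell)
  s.star + 1                          # stock s* + 1 units
  EQ <- function(s) s * b + (b + ell) * sum((0:s - s) * p(0:s))   # E[Q(s)] from above
  which.max(sapply(0:100, EQ)) - 1    # direct maximization gives the same answer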
Examples of E[g(X)]: 1) Let g(x) = ax + b, with constants a, b, then
E[aX + b] = \sum_{x} (ax + b) p(x) = a \sum_{x} x p(x) + b \sum_{x} p(x) = a E[X] + b

2) Let g(x) = x^n , then

E[X^n ] = \sum_{x} x^n p(x)

is called the nth moment of X, and E[X] is also known as the first moment.
The Variance of X
While E[X] is a measure of the center of a distribution given by a pmf p(x), we also like to have
some measure of the spread or variation of a distribution. While E[X] = 0 for X ≡ 0 with
probability 1, or X = ±1 with probability 1/2 each or X = ±100 with probability 1/2 each, we
would view the variabilities of these three situations quite differently. One plausible measure
would be the expected absolute difference of X from its mean, i.e., E[|X − µ|], where µ = E[X].
For the above three situations we would get E[|X − µ|] = 0, 1, 100, respectively. While this was
easy enough, it turns out that the absolute value function |X − µ| is not very conducive to
manipulations.
L14 ends
We introduce a different measure that can be exploited much more conveniently
as we will see later on.
Definition: The variance of a random variable X with mean µ = E[X] is defined as
var(X) = E[(X − µ)2 ]
An alternate formula, and example of the manipulative capability of the variance definition, is
var(X) = E[X^2 − 2µX + µ^2 ] = \sum_{x} (x^2 − 2µx + µ^2) p(x) = \sum_{x} x^2 p(x) − \sum_{x} 2µx p(x) + \sum_{x} µ^2 p(x)
       = E[X^2 ] − 2µE[X] + µ^2 = E[X^2 ] − µ^2 = E[X^2 ] − (E[X])^2
Example 13 (Variance of a Fair Die): If X denotes the face up of a randomly rolled fair die
then
E[X^2 ] = 1^2 · (1/6) + 2^2 · (1/6) + 3^2 · (1/6) + 4^2 · (1/6) + 5^2 · (1/6) + 6^2 · (1/6) = 91/6

and var(X) = 91/6 − (7/2)^2 = 35/12
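An added R check:

  x <- 1:6
  EX  <- sum(x * (1 / 6))     # 3.5 = 7/2
  EX2 <- sum(x^2 * (1 / 6))   # 15.1667 = 91/6
  EX2 - EX^2                  # 2.9167 = 35/12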
Variance of aX + b: For constants a and b we have var(aX + b) = a^2 var(X), since
var(aX + b) = E[{aX + b − (aµ + b)}^2 ] = E[a^2 (X − µ)^2 ] = a^2 E[(X − µ)^2 ] = a^2 var(X)
In analogy to the center of gravity interpretation of E[X] we can view var(X) as the moment of
inertia of the pmf p(x), when viewing p(x) as weight in mechanics.
Squaring the deviation of X around µ in the definition of var(X) creates a distortion
and changes any units of measurement to squared units. To bring matters back to the original
units we take the square root of the variance, i.e., the standard deviation SD(X), as the
appropriate measure of spread
SD(X) = σ = σ_X = \sqrt{var(X)}
We now discuss several special discrete distributions.
Bernoulli and Binomial Random Variables
Aside from the constant random variable which takes on only one value, the next level of
simplicity is a random variable with only two values, most often 0 and 1, (canonical choice).
Definition (Bernoulli Random Variable): A random variable X which can take on only the
two values 0 and 1 is called a Bernoulli random variable. We indicate its distribution by
X ∼ B(p). In liberal notational usage we also write P (X ≤ x) = P (B(p) ≤ x).
Such random variables are often employed when we focus on an event E in a particular random
experiment. Let p = P (E). If E occurs we say the experiment results in a success and otherwise
we call it a failure. The Bernoulli rv X is then defined as follows: X(e) = 1 if e ∈ E and
X(e) = 0 if e ∉ E. Hence X counts the number of successes in one performance of the experiment.
Often the following alternate notation is used: IE (e) = 1 if e ∈ E and IE (e) = 0 otherwise. IE is
then also called the indicator function of E.
The probability mass function of X or IE is
p(0) = P (X = 0) = P ({e : X(e) = 0}) = P (E c ) = 1 − p
p(1) = P (X = 1) = P ({e : X(e) = 1}) = P (E) = p
where p is usually called the success probability. The mean and variance of X ∼ B(p) are
E[X] = (1 − p) · 0 + p · 1 = p and var(X) = E[X^2 ] − (E[X])^2 = E[X] − p^2 = p − p^2 = p(1 − p)
where we exploited X ≡ X^2 .
If we perform n independent repetitions of this basic experiment, i.e. n independent trials, then
we can talk of another random variable Y , namely the number of successes in these n trials. Y is
called a binomial random variable and we indicate its distribution by Y ∼ Bin(n, p), again
liberally writing P (Y ≤ y) = P (Bin(n, p) ≤ y).
For parameters n and p, the probability mass function of Y is (as derived previously)
p(i) = P (Y = i) = \binom{n}{i} p^i (1 − p)^{n−i}   for i = 0, 1, 2, . . . , n .^1
Example 14 (Coin Flips): Flip 5 fair coins and denote by X the number of heads in these 5
flips. Get the probability mass function of X.
Example 15 (Quality Assurance): A company produces parts. The probability that any
given part will be defective is .01. The parts are shipped in batches of 10 and the promise is
made that any batch with two or more defectives will be replaced by two new batches of 10 each.
What proportion of the batches will need to be replaced?
Solution: 1 − P (X = 0) − P (X = 1) = 1 − (1 − p)^{10} − 10p(1 − p)^9 = .0043 where p = .01.^2 Hence
about .4% of the batches will be affected.
Example 16: (Chuck-a-luck): A player bets on a particular number i = 1, 2, 3, 4, 5, 6 of a fair
die. The die is rolled 3 times and if the chosen bet number appears k = 1, 2, 3 times the player
wins k units, otherwise loses 1 unit. If X denotes the payoff, what is the expected value E[X] of
the game?
P (X = −1) = \binom{3}{0} (1/6)^0 (5/6)^3 = 125/216 ,   P (X = 1) = \binom{3}{1} (1/6)^1 (5/6)^2 = 75/216 ,

P (X = 2) = \binom{3}{2} (1/6)^2 (5/6)^1 = 15/216 ,   P (X = 3) = \binom{3}{3} (1/6)^3 (5/6)^0 = 1/216

=⇒   E[X] = −1 · 125/216 + 1 · 75/216 + 2 · 15/216 + 3 · 1/216 = −17/216 ,

with an expected loss of 0.0787 units per game in the long run.
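An added R check, exact and by simulation:

  payoff <- c(-1, 1, 2, 3)
  probs  <- dbinom(0:3, 3, 1 / 6)     # P(0, 1, 2, 3 matches of the bet number)
  sum(payoff * probs)                 # -17/216 = -0.0787
  k <- replicate(1e5, sum(sample(1:6, 3, replace = TRUE) == 1))   # simulate, betting on "1"
  mean(ifelse(k == 0, -1, k))         # close to -0.0787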
^1 With appropriate values for i, n and p you get p(i) via the command dbinom(i,n,p) in R, while pbinom(i,n,p)
returns P (Y ≤ i). In EXCEL get these via =BINOMDIST(i,n,p,FALSE) and =BINOMDIST(i,n,p,TRUE), respectively.
You may also use the spreadsheet available within the free OpenOffice http://www.openoffice.org/.
^2 1-pbinom(1,10,.01) in R and in EXCEL via =1-BINOMDIST(1,10,.01,TRUE).
L15 ends
Example 17 (Genetics): A particular trait (eye color or left-handedness) on a person is
governed by a particular gene pair, which can either be {d, d}, {d, r} or {r, r}. The dominant
gene d dominates over the recessive r, i.e., the trait shows whenever there is a d in the gene pair.
An offspring from two parents inherits randomly one gene from each gene pair of its parents. If
both parents are hybrids ({d, r}) what is the chance that of 4 offspring at least 3 show the
outward appearance of the dominant gene?
Solution: p = 3/4 is the probability that any given offspring will have gene pair {d, d} or {d, r}.
Hence P = 4(3/4)^3 (1/4) + 1 · (3/4)^4 = 189/256 = .74.
Example 18 (Reliability): On an aircraft we want to compare the reliability (probability of
functioning) of a 3 out of 5 system with a 2 out of 3 system. A k out of n system functions
whenever at least k of its n subsystems function properly (a majority when k = (n + 1)/2). Usually n is chosen as odd. We assume
that the probability of failure 1 − p is the same for all subsystems and that failures occur
independently. A 3 out of 5 system has a higher reliability than a 2 out of 3 system whenever
\binom{5}{3} p^3 (1 − p)^2 + \binom{5}{4} p^4 (1 − p) + \binom{5}{5} p^5 > \binom{3}{2} p^2 (1 − p) + \binom{3}{3} p^3 ⇐⇒ (1 − p)(2p − 1) > 0 ⇐⇒ p > 1/2

[Figure: reliability of the 3 out of 5 and the 2 out of 3 system plotted against p, over 0 ≤ p ≤ 1 and zoomed in over .95 ≤ p ≤ 1.]
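An added R sketch of the comparison; pbinom gives the two reliabilities directly:

  rel <- function(p, k, n) 1 - pbinom(k - 1, n, p)   # P(at least k of n subsystems work)
  p <- c(0.3, 0.5, 0.9)
  rbind(three.of.five = rel(p, 3, 5), two.of.three = rel(p, 2, 3))
  # the 3 out of 5 system is better exactly when p > 1/2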
Mean and Variance of X ∼ Bin(n, p): Using the simple identities
i \binom{n}{i} = n \binom{n−1}{i−1}   and   i(i − 1) \binom{n}{i} = n(n − 1) \binom{n−2}{i−2}

we get

E[X] = \sum_{i=0}^{n} i \binom{n}{i} p^i (1 − p)^{n−i} = np \sum_{i=1}^{n} \binom{n−1}{i−1} p^{i−1} (1 − p)^{n−1−(i−1)}

and, substituting i − 1 = j,

E[X] = np \sum_{j=0}^{n−1} \binom{n−1}{j} p^j (1 − p)^{n−1−j} = np

Note the connection to Bernoulli RVs Xi , indicating success or failure in the ith trial and
E[X] = E[X1 + . . . + Xn ] = E[X1 ] + . . . + E[Xn ] = np.
Expectation of a sum = sum of the individual (finite) expectations.

Similarly

E[X(X − 1)] = \sum_{i=0}^{n} i(i − 1) \binom{n}{i} p^i (1 − p)^{n−i} = n(n − 1)p^2 \sum_{i=2}^{n} \binom{n−2}{i−2} p^{i−2} (1 − p)^{n−2−(i−2)}

and, substituting i − 2 = j,

E[X(X − 1)] = n(n − 1)p^2 \sum_{j=0}^{n−2} \binom{n−2}{j} p^j (1 − p)^{n−2−j} = n(n − 1)p^2

=⇒   n(n − 1)p^2 = E[X(X − 1)] = E[X^2 − X] = \sum_{x} (x^2 − x) p(x) = \sum_{x} x^2 p(x) − \sum_{x} x p(x) = E[X^2 ] − E[X] = E[X^2 ] − np

=⇒   E[X^2 ] = np + n(n − 1)p^2 = np(1 − p) + (np)^2

=⇒   var(X) = E[X^2 ] − (E[X])^2 = np(1 − p)
Note again var(X) = var(X1 + . . . + Xn ) = var(X1 ) + . . . + var(Xn ) = np(1 − p)
Variance of a sum of independent RVs = sum of the (finite) variances of those RVs.
Qualitative Behavior of Binomial Probability Mass Function: If X is a binomial random
variable with parameters (n, p) then the probability mass function p(x) of X first increases
monotonically and then decreases monotonically, reaching its largest value when x is the largest
integer ≤ (n + 1)p.
Proof: Look at p(x + 1)/p(x) = [p/(1 − p)] · (n − x)/(x + 1) > 1 or < 1 ⇐⇒ (n + 1)p > x + 1 or < x + 1. Of course it
is possible that p(x) is entirely monotone (when?). Illustrate with Pascal’s triangle.
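An added R check of this claim for one arbitrary choice of n and p:

  n <- 10; p <- 0.37
  which.max(dbinom(0:n, n, p)) - 1   # value x at which the pmf is largest: 4
  floor((n + 1) * p)                 # largest integer <= (n+1)p: 4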
L16 ends
The Poisson Random Variable
Definition: A random variable X with possible values 0, 1, 2, . . . is called a Poisson random
variable, indicated by X ∼ P ois(λ), if for some constant λ > 0 its pmf is given by
p(i) = P (X = i) = P (P ois(λ) = i) = e^{−λ} λ^i / i!   for i = 0, 1, 2, . . . .^3
^3 Check summation to 1. In R get p(i) = P (X = i) via the command dpois(i,lambda), while P (X ≤ i) is obtained by ppois(i,lambda).
In EXCEL you get the same by =POISSON(i,lambda,FALSE) and =POISSON(i,lambda,TRUE), respectively.
Approximation to a binomial random variable for small p and large n: Let X be a
binomial rv with parameters n and p. Let n get large and let p get small so that λ = np neither
degenerates to 0 nor diverges to ∞. Then

P (X = i) = n!/[i!(n − i)!] p^i (1 − p)^{n−i} = n!/[i!(n − i)!] (λ/n)^i (1 − λ/n)^{n−i}
          = [ n(n − 1) · · · (n − i + 1)/n^i ] · [ λ^i / i! ] · [ (1 − λ/n)^n / (1 − λ/n)^i ] ≈ e^{−λ} λ^i / i! .
Since np represents the expected or average number of successes of the n trials represented by the
binomial random variable it should not be surprising that the Poisson parameter λ should be
interpreted as the average or expected count for such a Poisson random variable.
Actually, for the approximation to work it can be shown that small p is sufficient. In fact, if for
i = 1, 2, 3, . . . , n the Xi are independent Bernoulli random variables with respective success
probabilities pi and if S = X1 + · · · + Xn and if Y is a Poisson random variable with parameter
λ = \sum_{i=1}^{n} p_i , then

|P (S ≤ x) − P (Y ≤ x)| ≤ 3 (max(p_1 , . . . , p_n))^{1/3}   for all x

or one can show that

|P (S ≤ x) − P (Y ≤ x)| ≤ 2 \sum_{i=1}^{n} p_i^2   for all x .
Poisson-Binomial Approximation, see class web page.
A Poisson random variable often serves as a good model for the count of rare events.
Examples:
Number of misprints on a page
Number of telephone calls coming through an exchange
Number of wrong numbers dialed
Number of lightning strikes on commercial aircraft
Number of bird ingestions into the engine of a jet
Number of engine failures on a jet
Number of customers coming into a post office on a given day
Number of meteoroids striking an orbiting space station
Number of discharged α–particles from some radioactive source.
Example 19: (Typos): Let X be the number of typos on a single page of a given book.
Assume that X is Poisson with parameter λ = .5, i.e. we expect about half an error per page or
about one error per every two pages. Find the probability of at least one error. Solution:
P (X ≥ 1) = 1 − P (X = 0) = 1 − exp(−.5) = .393.
Example 20 (Defectives): A machine produces 10% defective items, i.e. an item coming off
the machine has a chance of .1 of being defective. What is the chance that in the next 10 items
coming off the machine we find at most one defective item? Solution: Let X be the number of
defective items among the 10.
P (X ≤ 1) = P (X = 0) + P (X = 1) = (.1)^0 (.9)^{10} + 10 (.1)^1 (.9)^9 = .7361
whereas using a Poisson random variable Y with parameter λ = 10(.1) = 1 we get
P (Y ≤ 1) = P (Y = 0) + P (Y = 1) = e^{−1} + e^{−1} = .7358
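The closeness of the two answers is easy to inspect with an added R comparison:

  cbind(binomial = dbinom(0:4, 10, 0.1),
        poisson  = dpois(0:4, 1))      # the two pmfs side by side
  pbinom(1, 10, 0.1)                   # .7361
  ppois(1, 1)                          # .7358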
Mean and Variance of the Poisson Distribution: Based on the approximation of the
Binomial(n, p) by a Poisson(λ = np) distribution when p is small, we would expect that
E[Y ] ≈ np = λ and var(Y ) ≈ np(1 − p) ≈ λ. We now show that these approximations are in fact
exact.
E[Y ] = \sum_{i=0}^{∞} i e^{−λ} λ^i / i! = λ \sum_{i=1}^{∞} e^{−λ} λ^{i−1} / (i − 1)! = λ \sum_{j=0}^{∞} e^{−λ} λ^j / j! = λ

E[Y^2 ] = \sum_{i=0}^{∞} i^2 e^{−λ} λ^i / i! = λ \sum_{i=1}^{∞} i e^{−λ} λ^{i−1} / (i − 1)! = λ \sum_{j=0}^{∞} (j + 1) e^{−λ} λ^j / j!
        = λ \sum_{j=0}^{∞} j e^{−λ} λ^j / j! + λ \sum_{j=0}^{∞} e^{−λ} λ^j / j! = λ^2 + λ

=⇒   var(Y ) = E[Y^2 ] − (E[Y ])^2 = λ^2 + λ − λ^2 = λ
Poisson Distribution for Events in Time (Another Justification): Sometimes we observe
random incidents occurring in time, e.g. arrival of customers, meteoroids, lightning etc. Quite
often these random phenomena appear to satisfy the following basic assumptions for some
positive constant λ:
1. The probability that exactly one incident occurs during an interval of length h is λh + o(h)
where o(h) is a function of h which goes to 0 faster than h, i.e. o(h)/h → 0 as h → 0 (e.g.
o(h) = h^2 ). The concept/notation of o(h) was introduced by Edmund Landau.
2. The probability that two or more incidents occur in an interval of length h is the same for
all such intervals and equal to o(h). No clustering of incidents!
3. For any integers n, j1 , . . ., jn and any set of nonoverlapping intervals the events E1 , . . ., En ,
with Ei denoting the occurrence of exactly ji incidents in the ith interval, are independent.
If N (t) denotes the number of incidents in a given interval of length t then it can be shown that
N (t) is a Poisson random variable with parameter λt, i.e. P (N (t) = k) = e^{−λt} (λt)^k /k!.
Proof: Take as time interval [0, t] and divide it into n equal parts. P (N (t) = k) = P (k of the
intervals contain exactly one incident and n − k contain 0 incidents)+P (N (t) = k and at least
one subinterval contains two or more incidents). The second probability can be bounded by
\sum_{i=1}^{n} P (ith interval contains at least two incidents) ≤ \sum_{i=1}^{n} o(t/n) = n · o(t/n) → 0 .

The probability of 0 incidents in a particular interval of length t/n is 1 − [λ(t/n) + o(t/n)] so that
the first probability above becomes (in cavalier fashion, not quite air tight. See Poisson-Binomial
Approximation on the class web page for a clean argument.)

n!/[k!(n − k)!] · [ λt/n + o(t/n) ]^k [ 1 − λt/n − o(t/n) ]^{n−k}

which converges to exp(−λt)(λt)^k /k!.
Example 21 (Space Debris): It is estimated that the space station will be hit by space debris
beyond a critical size and velocity on the average about once in 500 years. What is the chance
that the station will survive the first 20 years without such a hit?
Solution: T = 500 then λT = 1 or λ = 1/500. Now t = 20 and
P (N (t) = 0) = exp(−λt) = exp(−20/500) = .9608.
L17 ends
Geometric, Negative Binomial and Hypergeometric
Random Variables
Definition: In independent trials with success probability p the number X of trials required to
get the first success is called a geometric random variable. We write X ∼ Geo(p) to indicate its
distribution. Its probability mass function is
p(n) = P (X = n) = P (Geo(p) = n) = (1 − p)^{n−1} p   for n = 1, 2, 3, . . .
Check summation to 1. Some texts (and software, e.g., R and EXCEL as a special negative
binomial) treat X0 = X − 1 = number of failures before the first success as the geometric RV.
Then P (X0 = n) = P (X = n + 1) = (1 − p)^n p for n = 0, 1, 2, . . ..
Example 22 (Urn Problem): An urn contains N white and M black balls. Balls are drawn
with replacement until the first black ball is obtained. Find P (X = n) and P (X ≥ k), the latter
in two ways. Probability of success = p = M/(M + N ).
P (X = n) = (1 − p)^{n−1} p   and   P (X ≥ k) = (1 − p)^{k−1}

P (X ≥ k) = \sum_{i=k}^{∞} (1 − p)^{i−1} p = (1 − p)^{k−1} \sum_{i=k}^{∞} (1 − p)^{i−k} p
          = p (1 − p)^{k−1} \sum_{j=0}^{∞} (1 − p)^j = p (1 − p)^{k−1} · 1/(1 − (1 − p)) = (1 − p)^{k−1}
Mean and Variance of X ∼ Geo(p):
E[X] − 1 = E[X − 1] = \sum_{n=1}^{∞} (n − 1)(1 − p)^{n−1} p = (1 − p) \sum_{n=2}^{∞} (n − 1)(1 − p)^{n−2} p
         = (1 − p) \sum_{i=1}^{∞} i (1 − p)^{i−1} p = (1 − p) E[X]   =⇒   E[X](1 − (1 − p)) = 1 or E[X] = 1/p
Fits intuition: If p = 1/1000, then it takes on average 1/p = 1000 trials to see one success.
L18 ends
E[X^2 ] − 2E[X] + 1 = E[(X − 1)^2 ] = \sum_{n=1}^{∞} (n − 1)^2 (1 − p)^{n−1} p = (1 − p) \sum_{n=2}^{∞} (n − 1)^2 (1 − p)^{n−2} p
                    = (1 − p) \sum_{i=1}^{∞} i^2 (1 − p)^{i−1} p = (1 − p) E[X^2 ]

=⇒   E[X^2 ](1 − (1 − p)) = 2/p − 1   or   E[X^2 ] = 2/p^2 − 1/p

or   var(X) = E[X^2 ] − (E[X])^2 = 2/p^2 − 1/p − 1/p^2 = (1 − p)/p^2
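An added simulation check (note that R's rgeom counts the failures before the first success, hence the + 1):

  p <- 0.2
  x <- rgeom(1e5, p) + 1        # number of trials to the first success
  c(mean(x), 1 / p)             # both near 5
  c(var(x), (1 - p) / p^2)      # both near 20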
Definition: In independent trials with success probability p the number X of trials required to
get the first r successes accumulated is called a negative binomial random variable. We write
X ∼ N egBin(r, p) to indicate its distribution. Its probability mass function is
p(n) = P (X = n) = P (N egBin(r, p) = n) = \binom{n−1}{r−1} (1 − p)^{n−r} p^r   for n = r, r + 1, r + 2, . . .
For r = 1 we get the geometric distribution as a special case.
Exploiting the equivalence of the two statements:
“it takes at least m trials to get r successes” and
“in the first m − 1 trials we have at most r − 1 successes” we have
P (N egBin(r, p) ≥ m) = 1 − P (N egBin(r, p) ≤ m − 1) = P (Bin(m − 1, p) ≤ r − 1)        (1)
This facilitates the computation of the negative binomial cumulative probabilities in terms of
appropriate binomial cumulative probabilities.
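Identity (1) is easy to verify numerically in R; recall that pnbinom works with the failure count X0 = X − r (an added check, with r, p, m chosen arbitrarily):

  r <- 5; p <- 0.2; m <- 10
  1 - pnbinom(m - 1 - r, r, p)   # P(NegBin(r,p) >= m), via X0 = X - r
  pbinom(r - 1, m - 1, p)        # P(Bin(m-1,p) <= r-1), the same value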
We can view X as the sum of independent geometric random variables Y1 , . . . , Yr , each with
success probability p. Here Y1 denotes the number of trials to the first success, Y2 the number of
additional trials to the next success thereafter, and so on. Clearly, for i1 , . . . , ir ∈ {1, 2, 3, . . .} we
have
P (Y1 = i1 , . . . , Yr = ir ) = P (Y1 = i1 ) · . . . · P (Yr = ir )        (2)
since the individual statements concern what specifically happens in the first i1 + . . . + ir trials,
all of which are independent, namely we have i1 − 1 failures, then a success, then i2 − 1 failures,
then a success, and so on. From (2) it follows that for E1 , . . . , Er ⊂ {1, 2, 3, . . .} we have
P (Y1 ∈ E1 , . . . , Yr ∈ Er ) = \sum_{i_1 ∈ E_1} . . . \sum_{i_r ∈ E_r} P (Y1 = i1 , . . . , Yr = ir )
                             = \sum_{i_1 ∈ E_1} . . . \sum_{i_r ∈ E_r} P (Y1 = i1 ) · . . . · P (Yr = ir )
                             = \sum_{i_1 ∈ E_1} P (Y1 = i1 ) · . . . · \sum_{i_r ∈ E_r} P (Yr = ir )     (distributive law of arithmetic)
                             = P (Y1 ∈ E1 ) · . . . · P (Yr ∈ Er )
The same holds for any subset of the Y1 , . . . , Yr , since (2) also holds for any subset. For example,
summing the left and right side over all i1 = 1, 2, 3, . . . yields
\sum_{i_1=1}^{∞} P (Y1 = i1 , . . . , Yr = ir ) = \sum_{i_1=1}^{∞} P (Y1 = i1 ) · . . . · P (Yr = ir )
P (Y1 < ∞, Y2 = i2 , . . . , Yr = ir ) = P (Y1 < ∞) · P (Y2 = i2 ) · . . . · P (Yr = ir )
P (Y2 = i2 , . . . , Yr = ir ) = P (Y2 = i2 ) · . . . · P (Yr = ir )
and similarly by summing over any other and further indices. In particular we get
1 = P (Y1 < ∞) · . . . · P (Yr < ∞) = P (Y1 < ∞, . . . , Yr < ∞) ≤ P (Y1 + . . . + Yr < ∞) = P (X < ∞)
This means that the negative binomial pmf sums to 1, i.e.,
1 = P (X < ∞) = \sum_{n=r}^{∞} P (X = n) = \sum_{n=r}^{∞} \binom{n−1}{r−1} (1 − p)^{n−r} p^r
Some texts (and software such as R and EXCEL) treat X0 = X − r = number of failures prior to
the rth success as a negative binomial RV. Then P (X0 = n) = P (X = n + r) for n = 0, 1, 2, . . . .^4
^4 P (X0 = n) and P (X0 ≤ n) can be obtained in R by the commands dnbinom(n,r,p) and pnbinom(n,r,p),
respectively, while in EXCEL use =NEGBINOMDIST(n,r,p) and =1-BINOMDIST(r-1,n+r,p,TRUE) based on (1).
E.g., pnbinom(4,5,.2) and =1-BINOMDIST(4,9,0.2,TRUE) return 0.01958144.
L19 ends
Example 23 (r Successes Before m Failures): If independent trials are performed with
success probability p what is the chance of getting r successes before m failures?
Solution: Let X be the number of trials required to get the first r successes. Then we need to
find: P (X ≤ m + r − 1) = P (X0 ≤ m − 1).
Mean and Variance of X ∼ N egBin(r, p): Using n \binom{n−1}{r−1} = r \binom{n}{r} and V ∼ N egBin(r + 1, p) we get

E[X^k ] = \sum_{n=r}^{∞} n^k \binom{n−1}{r−1} p^r (1 − p)^{n−r} = (r/p) \sum_{n=r}^{∞} n^{k−1} \binom{n}{r} p^{r+1} (1 − p)^{n−r}
        = (r/p) \sum_{n+1=r+1}^{∞} (n + 1 − 1)^{k−1} \binom{n + 1 − 1}{r + 1 − 1} p^{r+1} (1 − p)^{n+1−(r+1)}
        = (r/p) \sum_{m=r+1}^{∞} (m − 1)^{k−1} \binom{m − 1}{r + 1 − 1} p^{r+1} (1 − p)^{m−(r+1)} = (r/p) E[(V − 1)^{k−1} ]

=⇒   E[X] = r/p   and   E[X^2 ] = (r/p) E[V − 1] = (r/p) ( (r + 1)/p − 1 )

=⇒   var(X) = (r/p) · (r + 1)/p − r/p − (r/p)^2 = r(1 − p)/p^2
If we write X again as X = Y1 + . . . + Yr with independent Yi ∼ Geo(p), i = 1, . . . , r, we note again
E[X] = E[Y1 + . . . + Yr ] = E[Y1 ] + . . . + E[Yr ] = r/p
var(X) = var(Y1 + . . . + Yr ) = var(Y1 ) + . . . + var(Yr ) = r (1 − p)/p^2
Definition: If a sample of size n is chosen randomly and without replacement from an urn
containing N balls, of which M = N p are white and N − M = N − N p are black, then the
number X of white balls in the sample is called a hypergeometric random variable. To indicate
its distribution we write X ∼ Hyper(n, M, N ). Its possible values are x = 0, 1, . . . , n with pmf
p(k) = P (X = k) = \binom{M}{k} \binom{N−M}{n−k} / \binom{N}{n}        (3)
which is positive only if 0 ≤ k and k ≤ M and 0 ≤ n − k and n − k ≤ N − M , i.e. if
max(0, n − N + M ) ≤ k ≤ min(n, M ).^5
Expression (3) also applies when drawing the n balls one by one without replacement since then

P (X = k) = \binom{n}{k} [ M (M − 1) · · · (M − k + 1) · (N − M )(N − M − 1) · · · (N − M − (n − k) + 1) ] / [ N (N − 1) · · · (N − n + 1) ]
          = \binom{M}{k} \binom{N−M}{n−k} / \binom{N}{n}
^5 In R we can obtain P (X = k) and P (X ≤ k) by the commands dhyper(k,M,N-M,n) and phyper(k,M,N-M,n),
respectively. EXCEL only gives P (X = k) directly via =HYPGEOMDIST(k,n,M,N). For example, for M = 40, N =
100, n = 30 and k = 15, dhyper(15,40,60,30) and =HYPGEOMDIST(15,30,40,100) return P (X = 15) = .07284917,
while phyper(15,40,60,30) returns P (X ≤ 15) = .9399093.
Example 24 (Animal Counts): r animals are caught and tagged and released. After a
reasonable time interval n animals are captured and the number X of tagged ones are counted.
The total number N of animals is unknown. Then
pN (i) = P (X = i) = \binom{r}{i} \binom{N−r}{n−i} / \binom{N}{n}
Find N which maximizes this pN (i) for the observed value X = i.
pN (i) / pN −1 (i) = (N − r)(N − n) / [ N (N − r − n + i) ] ≥ 1
if and only if N ≤ rn/i. Hence our maximum likelihood estimate is N̂ = largest integer ≤ rn/i.
Another way of motivating this estimate is to appeal to r/N ∼ i/n.
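An added R sketch (r, n, i are made-up numbers) that maximizes pN (i) over N and compares with the formula:

  r <- 100; n <- 60; i <- 7         # assumed tagged, recaptured, tagged-in-recapture counts
  N <- max(r, n):5000               # candidate population sizes
  pN <- dhyper(i, r, N - r, n)      # pN(i) for each candidate N
  N[which.max(pN)]                  # maximum likelihood estimate: 857
  floor(r * n / i)                  # 857, the largest integer <= rn/i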
Example 25 (Quality Control): Shipments of 1000 items each are inspected by selecting 10
without replacement. If the sample contains more than one defective then the whole shipment is
rejected. What is the chance for rejecting a shipment if at most 5% of the shipment is bad? The
probability of no rejection is
P (X = 0) + P (X = 1) = \binom{50}{0} \binom{950}{10} / \binom{1000}{10} + \binom{50}{1} \binom{950}{9} / \binom{1000}{10} = .91469 ,
hence the chance of rejecting a shipment is at most .08531.
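With R's hypergeometric functions (added check):

  phyper(1, 50, 950, 10)       # P(X <= 1) = .91469
  1 - phyper(1, 50, 950, 10)   # chance of rejection, .08531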
Expectations and Variances of X1 + . . . + Xn :
First we prove a basic alternate formula for the expectation of a single random variable X:
E[X] = \sum_{x} x pX (x) = \sum_{s} X(s) p(s)
where the first expression involves the pmf pX (x) of X and sums over all possible values of X and the
second expression involves the probability p(s) = P ({s}) for all elements s in the sample space S.
The equivalence is seen as follows. For any of the possible values x of X let Sx = {s ∈ S : X(s) = x}.
For different values x the events/sets Sx are disjoint and their union over all x is S. Thus
\sum_{x} x pX (x) = \sum_{x} x P ({s : X(s) = x}) = \sum_{x} x \sum_{s ∈ Sx} p(s) = \sum_{x} \sum_{s ∈ Sx} x p(s) = \sum_{x} \sum_{s ∈ Sx} X(s) p(s) = \sum_{s} X(s) p(s)
From this we get immediately
E[X1 + . . . + Xn ] = \sum_{s} (X1 (s) + . . . + Xn (s)) p(s) = \sum_{s} [ X1 (s)p(s) + . . . + Xn (s)p(s) ]
                   = \sum_{s} X1 (s)p(s) + . . . + \sum_{s} Xn (s)p(s) = E[X1 ] + . . . + E[Xn ]

provided the individual expectations are finite.
Next we will address a corresponding formula for the variance of a sum of independent discrete
random variables X1 , . . . , Xn , namely
var(X1 + . . . + Xn ) = var(X1 ) + . . . + var(Xn ) provided the individual variances are finite.
First we need to define the concept of independence for a pair of random variables X and Y in
concordance with the previously introduced independence of events. X and Y are independent,
whenever for all possible values x and y of X and Y we have
P (X = x, Y = y) = P ({s ∈ S : X(s) = x, Y (s) = y})
= P ({s ∈ S : X(s) = x})P ({s ∈ S : Y (s) = y}) = P (X = x)P (Y = y)
As a consequence we have for independent X and Y with finite expectations the following property
E[XY ] = E[X]E[Y ] , i.e., E[XY ] − E[X]E[Y ] = cov(X, Y ) = 0
where cov(X, Y ) is the covariance of X and Y , equivalently defined as
cov(X, Y ) = E[(X − E[X])(Y − E[Y ])] = E[XY − XE[Y ] − Y E[X] + E[X]E[Y ]]
= E[XY ] − E[X]E[Y ] − E[X]E[Y ] + E[X]E[Y ] = E[XY ] − E[X]E[Y ]
Proof of independence =⇒ cov(X, Y ) = 0: Let Sxy = {s ∈ S : X(s) = x, Y (s) = y}
E[XY ] = \sum_{s} X(s)Y (s) p(s) = \sum_{x,y} \sum_{s ∈ Sxy} X(s)Y (s) p(s)     (by stepwise summation)
       = \sum_{x,y} \sum_{s ∈ Sxy} x y p(s) = \sum_{x,y} x y \sum_{s ∈ Sxy} p(s)     (by distributive law)
       = \sum_{x,y} x y P (X = x, Y = y) = \sum_{x,y} x y P (X = x) P (Y = y)     (by independence)
       = \sum_{x} x P (X = x) · \sum_{y} y P (Y = y) = E[X]E[Y ]     (by distributive law)

Furthermore,

E[(X1 + . . . + Xn )^2 ] = E[ \sum_{i=1}^{n} Xi^2 + 2 \sum_{i<j} Xi Xj ] = \sum_{i=1}^{n} E[Xi^2 ] + 2 \sum_{i<j} E[Xi Xj ]

(E[X1 + . . . + Xn ])^2 = (E[X1 ] + . . . + E[Xn ])^2 = \sum_{i=1}^{n} (E[Xi ])^2 + 2 \sum_{i<j} E[Xi ]E[Xj ]

=⇒   var(X1 + . . . + Xn ) = E[(X1 + . . . + Xn )^2 ] − (E[X1 + . . . + Xn ])^2
                          = \sum_{i=1}^{n} ( E[Xi^2 ] − (E[Xi ])^2 ) + 2 \sum_{i<j} ( E[Xi Xj ] − E[Xi ]E[Xj ] )
                          = \sum_{i=1}^{n} var(Xi ) + 2 \sum_{i<j} cov(Xi , Xj ) = \sum_{i=1}^{n} var(Xi )
where the last = holds for pairwise independence of Xi and Xj for i < j.
The above rules for mean and variance of Y = X1 + . . . + Xn are now illustrated for two situations.
Let X1 , . . . , Xn be indicator RVs indicating a success or failure in the ith of n trials. In the first
situation we assume these trials are independent and have success probability p each. Then, as
observed previously, from the mean and variance results for Bernoulli RVs we get
E[Y ] = E(X1 + . . . + Xn ) = \sum_{i=1}^{n} E[Xi ] = np   and   var(Y ) = var(X1 + . . . + Xn ) = \sum_{i=1}^{n} var(Xi ) = np(1 − p)
In the second situation we view the trials in the hypergeometric context, where Xi = 1 when the
ith ball drawn is white and Xi = 0 otherwise. We argued previously that P (Xi = 1) = M/N = p =
proportion of white balls in the population
=⇒   E[Y ] = E(X1 + . . . + Xn ) = \sum_{i=1}^{n} E[Xi ] = nM/N
For var(Y ) we need to involve the covariance terms in our formula for var(X1 + . . . + Xn ). We find
E[Xi Xj ] = P (Xi = 1, Xj = 1) = [ M (M − 1)(N − 2) · · · (N − n + 1) ] / [ N (N − 1)(N − 2) · · · (N − n + 1) ] = M (M − 1) / [ N (N − 1) ]

cov(Xi , Xj ) = E[Xi Xj ] − E[Xi ]E[Xj ] = M (M − 1) / [ N (N − 1) ] − (M/N )(M/N ) = − (M/N ) (N − M ) / [ N (N − 1) ] = − p(1 − p)/(N − 1)
var(Y ) = var(X1 + . . . + Xn ) = \sum_{i=1}^{n} var(Xi ) + 2 \sum_{i<j} cov(Xi , Xj )
        = np(1 − p) − 2 \binom{n}{2} p(1 − p)/(N − 1) = np(1 − p) ( 1 − (n − 1)/(N − 1) ) = np(1 − p) (N − n)/(N − 1)
The factor 1 − (n − 1)/(N − 1) is called the finite population correction factor. For fixed n it
gets close to 1 when N is large, in which case it does not matter much whether we draw with or
without replacement. One easily shows (exercise, or see Text p. 162) that
P (Hyper(n, M, N ) = k) −→ P (Bin(n, p) = k) = \binom{n}{k} p^k (1 − p)^{n−k}   as N −→ ∞, where p = M/N
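An added R illustration of this limit, holding n and p = M/N fixed while N grows:

  n <- 10; p <- 0.4
  for (N in c(50, 500, 5000)) {
    M <- N * p
    print(max(abs(dhyper(0:n, M, N - M, n) - dbinom(0:n, n, p))))   # shrinks as N grows
  }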
We will now pull forward material from Ch. 8, namely the inequalities of Markov and Chebychev.^6
Markov’s Inequality: Let X be a nonnegative discrete RV with finite expectation E[X] then for
any a > 0 we have
P (X ≥ a) ≤ E[X]/a
Proof:
E[X] = \sum_{x ≥ a} x pX (x) + \sum_{x < a} x pX (x) ≥ \sum_{x ≥ a} x pX (x) ≥ \sum_{x ≥ a} a pX (x) = a P (X ≥ a)
Markov’s inequality is only meaningful for a > E[X]. It bounds the probability of X falling far beyond the mean
or expectation of X, in concordance with our previous center of gravity interpretation of E[X].
^6 Scholz ← Lehmann ← Neyman ← Sierpinsky ← Voronoy ← Markov ← Chebychev
While this inequality is usually quite crude, it can be sharp, i.e., result in equality. Namely, let X
take the two values 0 and a with probability 1 − p and p. Then p = P (X ≥ a) = E[X]/a.
Chebychev’s Inequality: Let X be a discrete RV with finite variance E[(X − µ)^2 ] = σ^2 , then
for any k > 0 we have

P (|X − µ| ≥ k) ≤ σ^2 /k^2

Proof by Markov’s inequality using Y = (X − µ)^2 as our nonnegative RV:

P (|X − µ| ≥ k) = P ((X − µ)^2 ≥ k^2 ) ≤ E[(X − µ)^2 ]/k^2 = σ^2 /k^2
These inequalities hold also for RVs that are not discrete, but why wait that long for the following.
We will now combine the above results into a theorem that proves the long run frequency notion
that we have alluded to repeatedly, in particular when introducing the axioms of probability.
Let X̄ = (X1 + . . . + Xn )/n be the average of n independent and identically distributed random
variables (telegraphically expressed as iid RVs), each with same mean µ = E[Xi ] and variance
σ^2 = var(Xi ). Such random variables can be the result of repeatedly observing a random variable
X in independent repetitions of the same random experiment, like repeatedly tossing a coin or
rolling a die, and denoting the resulting RVs by X1 , . . . , Xn .
Using the rules of expectation and variance (under independence) we have
E[X̄] = (1/n) E[X1 + . . . + Xn ] = (1/n)(µ + . . . + µ) = µ

var(X̄) = (1/n^2 ) var(X1 + . . . + Xn ) = (1/n^2 )(σ^2 + . . . + σ^2 ) = σ^2 /n

and by Chebychev’s inequality applied to X̄ we get for any ε > 0

P (|X̄ − µ| ≥ ε) ≤ (σ^2 /n) (1/ε^2 ) −→ 0 as n −→ ∞,

i.e., the probability that X̄ will differ from µ by at least ε > 0 becomes vanishingly small.
We say X̄ converges to µ in probability and write X̄ −→^P µ as n −→ ∞.
This result is called the weak law of large numbers (WLLN or LLN without emphasis on weak).
When our random experiment consists of observing whether a certain event E occurs or not, we
observe an indicator variable X = IE with values 1 and 0. If we repeatedly do this experiment
(independently), we observe X1 , . . . , Xn , each with mean µ = p = P (E) and variance σ^2 = p(1 − p).
In that case X̄ is the proportion of 1’s among the X1 , . . . , Xn , i.e., the proportion of times we observe
the event E.
The above law of large numbers gives us X̄ −→^P µ = p = P (E) as n −→ ∞, i.e., in the long run
the observed proportion or relative frequency of observing the event E converges to P (E).
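A small added simulation illustrates this for coin tossing, where p = P (E) = 1/2:

  set.seed(1)
  x    <- rbinom(1e4, 1, 0.5)         # indicators of the event E in 10000 trials
  xbar <- cumsum(x) / seq_along(x)    # running proportions
  xbar[c(10, 100, 1000, 10000)]       # settle down near 0.5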
Properties of Distribution Functions F :
1. F is nondecreasing, i.e. F (a) ≤ F (b) for a, b with a ≤ b.
2. lim_{b→∞} F (b) = 1
3. lim_{b→−∞} F (b) = 0
4. F is right continuous, i.e. if bn ↓ b then F (bn ) ↓ F (b) or lim_{n→∞} F (bn ) = F (b).
Proof:
1. For a ≤ b we have {e : X(e) ≤ a} ⊂ {e : X(e) ≤ b}.
2., 3. and 4. ⇐= P (lim_{n→∞} En ) = lim_{n→∞} P (En ) for properly chosen monotone sequences En .
E.g., if bn ↓ b then En = {e : X(e) ≤ bn } ↓ E = {e : X(e) ≤ b} = \bigcap_{n=1}^{∞} En
All probability questions about X can be answered in terms of the cdf F of X. For example,
• P (a < X ≤ b) = F (b) − F (a) for all a ≤ b
• P (X < b) = lim_{n→∞} F (b − 1/n) =: F (b−)
• F (b) = P (X ≤ b) = P (X < b) + P (X = b) = F (b−) + (F (b) − F (b−))