Moments and Discrete Random Variables
OPRE 7310 Lecture Notes by Metin Çakanyıldırım
Compiled at 13:03 on Thursday 13th October, 2016
1 Random Variables
In a probability space (Ω, F , P), the element ω ∈ Ω is not necessarily a numerical value. In the experiment of
tossing a coin once, ω is either H or T. It is useful to convert outcomes of an experiment to numerical values.
Then these numerical values can be used in the usual algebraic operations of addition, subtraction, etc. To convert each ω to a numerical value (a real number), we use a function X(ω) : Ω → ℜ; that is, X maps the sample space to the real numbers.
Example: Let Ω_n = {H, T} be the sample space for the nth toss of a coin. We can define a random variable as the indicator function X_n(ω_n) = 1I_{ω_n = H}. Then we can count the number of heads in N experiments as ∑_{n=1}^{N} X_n(ω_n), or we can consider the average (1/N) ∑_{n=1}^{N} X_n(ω_n). We can also consider the difference between the number of heads obtained in even-numbered and odd-numbered experiments among the first 2N experiments: ∑_{n=1}^{N} X_{2n}(ω_{2n}) − ∑_{n=1}^{N} X_{2n−1}(ω_{2n−1}). ⋄
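As a quick illustration, here is a minimal R sketch that simulates this setup (the number of tosses N = 1000 and the fair-coin assumption p = 1/2 are choices made for the sketch, not part of the example above):

# Simulate N coin tosses and use indicator random variables to count heads.
set.seed(1)
N <- 1000
omega <- sample(c("H", "T"), size = N, replace = TRUE)   # outcome of each toss
X <- as.numeric(omega == "H")                            # indicator X_n = 1 if toss n is a head
sum(X)                                                   # number of heads in the N experiments
mean(X)                                                  # average (1/N) * sum of the X_n
sum(X[seq(2, N, 2)]) - sum(X[seq(1, N, 2)])              # heads in even-numbered minus odd-numbered tosses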
Example: A truck will arrive at a distribution center within the next hour; the arrival time ω, measured in minutes, belongs to Ω = [0, 60]. In this case the outcomes are real numbers and we can use them directly as random variables: X(ω) = ω. We can consider another truck arriving in the same time window with arrival time Y ∈ [0, 60]. The earliness of the second truck with respect to the first is (X − Y)⁺ = max{X − Y, 0}. The earliness is then another random variable. ⋄
As illustrated above, a function of random variable(s) is another random variable. Since random variables
are real numbers, we can average them to obtain means and variances.
Example: A truck arrives at a distribution center over 7 days, and each day it arrives within a one-hour time window [0, 1]. Then X_n(ω_n) denotes the arrival time on the nth day for n = 1, . . . , 7. The average arrival time is ∑_{n=1}^{7} X_n / 7. ⋄
A random variable can be conditioned on another random variable or on an event. When such a conditioning happens, the convention is to talk about the functions characterizing the conditioned random variable rather than writing X|Y or X|A for random variables X, Y and event A. One such function is the cumulative distribution function (cdf) F_X of the random variable X. The cdf maps each real number a to a value in the interval [0, 1] and is defined through the probability measure:

F_X(a) := P(X(ω) ∈ (−∞, a]).
Since (−∞, a] ⊆ (−∞, b] for a ≤ b, we have F_X(a) ≤ F_X(b), so F_X is a non-decreasing function. Recalling also P(Ω) = 1 and P(∅) = 0, we obtain three properties of the cdf:
i) F_X is non-decreasing.
ii) lim_{a→∞} F_X(a) = 1.
iii) lim_{a→−∞} F_X(a) = 0.
Example: Consider the experiment of tossing a fair coin twice with the sample space Ω = { HT, TH, HH, TT }.
Let X(ω) be the number of heads in two tosses, so that X(ω) ∈ {0, 1, 2}. For the cdf F_X we have

F_X(a) =
  0/4 if a < 0,
  1/4 if 0 ≤ a < 1,
  3/4 if 1 ≤ a < 2,
  4/4 if 2 ≤ a.  ⋄
We can consider the right-hand side limits of F_X(a) and use the continuity of the probability measure to obtain

lim_{x→a⁺} F_X(x) = lim_{x→a⁺} P(X(ω) ∈ (−∞, x]) = P(X(ω) ∈ ∩_{x→a⁺} (−∞, x]) = P(X(ω) ∈ (−∞, a]) = F_X(a).

Note that (−∞, x] as x → a⁺ is a decreasing sequence of closed sets, their limit is the intersection of these sets, and an intersection of infinitely many closed sets is closed. In particular, a ∈ ∩_{n=1}^{∞} (−∞, a + 1/n] = ∩_{x→a⁺} (−∞, x].
Hence, cdfs are right-continuous for all random variables. For the left-hand side limits, we obtain

lim_{x→a⁻} F_X(x) = lim_{x→a⁻} P(X(ω) ∈ (−∞, x]) = P(X(ω) ∈ ∪_{x→a⁻} (−∞, x]) = P(X(ω) ∈ (−∞, a)) ≤ F_X(a).

Note that (−∞, x] as x → a⁻ is an increasing sequence of closed sets, their limit is the union of these sets, and a union of infinitely many closed sets is not necessarily closed. In particular, a ∉ (−∞, a − 1/n] for any n, and so a ∉ ∪_{n=1}^{∞} (−∞, a − 1/n] = ∪_{x→a⁻} (−∞, x]. So cdfs are not necessarily left-continuous, as illustrated by the F_X in the last example.
Example: 4 people including you and a friend line up at random. Let X be the number of people between you and your friend and find F_X. X = 0 means that you are standing next to each other. Treating the two of you as a pair, we have 2 people and a pair to order in 3! ways. By reordering you and your friend within the pair, we obtain 2 × 3! ways. The total number of ways of ordering 4 people is 4!, so P(X = 0) = 2 × 3!/4! = 1/2. X = 1 means that there is exactly one person between you two. If either of you is at the beginning of the line, we can depict the permutation as △ ◦ ▲ ⋆, where △ and ▲ are you and your friend, and ◦ and ⋆ are the other two people. By exchanging the positions of △ with ▲ and ◦ with ⋆, we can create 4 permutations: △ ◦ ▲ ⋆, ▲ ◦ △ ⋆, △ ⋆ ▲ ◦, ▲ ⋆ △ ◦. The other 4 permutations can be created by placing either of you at the second place in the line: ◦ △ ⋆ ▲, ◦ ▲ ⋆ △, ⋆ △ ◦ ▲, ⋆ ▲ ◦ △. The total number of permutations with X = 1 is 8, so P(X = 1) = 8/4! = 1/3. X = 2 means exactly two people between you, and all four such permutations are △ ◦ ⋆ ▲, △ ⋆ ◦ ▲, ▲ ◦ ⋆ △, ▲ ⋆ ◦ △. The total number of permutations with X = 2 is 4, so P(X = 2) = 4/4! = 1/6.

F_X(a) =
  0 if a < 0,
  3/6 if 0 ≤ a < 1,
  5/6 if 1 ≤ a < 2,
  6/6 if 2 ≤ a.  ⋄
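These probabilities can be verified by brute-force enumeration of all 4! orderings, as in the small R sketch below (labeling you and your friend as persons 1 and 2 is an assumption made only for this check):

# Enumerate all 4! orderings of people 1..4; you are person 1, your friend is person 2.
perms <- expand.grid(p1 = 1:4, p2 = 1:4, p3 = 1:4, p4 = 1:4)
perms <- perms[apply(perms, 1, function(r) length(unique(r)) == 4), ]   # keep true permutations
X <- apply(perms, 1, function(r) abs(which(r == 1) - which(r == 2)) - 1) # people between persons 1 and 2
table(X) / nrow(perms)   # should give 1/2, 1/3, 1/6 for X = 0, 1, 2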
Example: Suppose a fair coin is tossed 4 times; we examine the head runs. A head run is a consecutive appearance of heads. Let X be the length of the longest head run and find P(X = x) for 0 ≤ x ≤ 4. The sample space is Ω = { HHHH, HHHT, HHTH, HTHH, THHH, HHTT, HTHT, THHT, HTTH, THTH, TTHH, HTTT, THTT, TTHT, TTTH, TTTT }.
HTTT, THTT, TTHT, TTTH, TTTT }. X ( HHHH ) = 4, X ( HHHT ) = 3, X ( THHH ) = 3, X ( HHTH ) = 2,
X ( HTHH ) = 2, X ( HHTT ) = 2, X ( THHT ) = 2, X ( TTHH ) = 2, X ( TTTT ) = 0 and all the remaining 7
outcomes yield X = 1. So P( X = 0) = 1/16, P( X = 1) = 7/16, P( X = 2) = 5/16, P( X = 3) = 2/16,
P( X = 4) = 1/16. ⋄
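The same kind of enumeration verifies this pmf; a short R sketch over all 2^4 = 16 outcomes (the helper function name is mine):

# Enumerate all 16 outcomes of 4 tosses and compute the longest head run in each.
outcomes <- expand.grid(rep(list(c("H", "T")), 4))
longest_run <- function(tosses) {
  r <- rle(tosses == "H")                        # run-length encoding of heads/tails
  if (any(r$values)) max(r$lengths[r$values]) else 0
}
X <- apply(outcomes, 1, longest_run)
table(X) / nrow(outcomes)   # 1/16, 7/16, 5/16, 2/16, 1/16 for X = 0,...,4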
2 Discrete Random Variables
Most of the examples above had countable sample spaces. A discrete random variable has a countable sample space. As the sample space is countable, we can let it be Ω = {x_1, x_2, x_3, . . . }. The probability measure is nonzero only at the elements of Ω and zero elsewhere. At the elements of Ω, we can define the nonzero probability mass function (pmf) p_X as p_X(a) = P(X = a). Then we have

F_X(a) = ∑_{x_i ≤ a} p_X(x_i).
In real life, averages are used to summarize empirical observations: the average fuel consumption of my car is 32 miles per gallon; the average August temperature in Dallas is 88° Fahrenheit. On the other hand, expected values summarize theoretical probability distributions. The expectation of a discrete random variable X with pmf p_X is given by

E(X) = ∑_{x: p_X(x)>0} x p_X(x).

Expectation is a weighted average where the weights are p_X(x).
Note that a function of a random variable is another random variable, so we can consider the expected value of this function. That is, starting with the random variable X, let us define a new random variable Y = g(X). Then

E(Y) = ∑_{x: p_X(x)>0} g(x) p_X(x).

For a given random variable X, the function can in particular be g(x) = (x − E(X))², which defines the new random variable Y = g(X) = (X − E(X))². The expected value of this new random variable is the variance of X:

V(X) = ∑_{x: p_X(x)>0} (x − E(X))² p_X(x).

The variance of a random variable is the expected value of the squared deviations from its mean. Another particular g function is g(x) = x^k for an integer k; it defines the kth moment of the random variable X:

E(X^k) = ∑_{x: p_X(x)>0} x^k p_X(x).
Example: Consider the example of 4 people lining up and X – the number of people between you and your friend. The pmf of X is p(0) = 3/6, p(1) = 2/6 and p(2) = 1/6. Then the expected value of the number of people between you is

E(X) = 0(3/6) + 1(2/6) + 2(1/6) = 2/3.

The variance of the number of people between you is

V(X) = (0 − 2/3)²(3/6) + (1 − 2/3)²(2/6) + (2 − 2/3)²(1/6) = (4/9)(3/6) + (1/9)(2/6) + (16/9)(1/6) = 30/54 = 5/9. ⋄
Example: Consider the example of 4 people lining up and X – the number of people between you and your friend. Suppose you want to pass a glass of water to your friend, but 1/4 of the water spills as the glass moves from one person to another. What is the expected percentage of water remaining in the glass when your friend receives it? Let g(x) be the percentage of water left in the glass after x + 1 passes. Note that x people between you cause x + 1 passes. Then g(0) = 3/4, g(1) = (3/4)², g(2) = (3/4)³. So

E(g(X)) = (3/4)(3/6) + (3/4)²(2/6) + (3/4)³(1/6) = 81/128 = 0.6328.

63.28% of the water remains on average when your friend receives the glass. This example can be extended to model the amount of information loss over a telecommunication network as the information is sent from one node to another. ⋄
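The expectations and variances in the last two examples can be recomputed directly from the pmf, as in the brief R sketch below (only the pmf of the lineup example is used; variable names are mine):

# pmf of X, the number of people between you and your friend (4-person lineup).
x  <- c(0, 1, 2)
px <- c(3/6, 2/6, 1/6)
EX <- sum(x * px)                 # expected value, 2/3
VX <- sum((x - EX)^2 * px)        # variance, 5/9
g  <- function(x) (3/4)^(x + 1)   # fraction of water left after x + 1 passes
Eg <- sum(g(x) * px)              # E(g(X)) = 81/128, about 0.6328
c(EX = EX, VX = VX, Eg = Eg)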
Expectations and variances have some properties that are straightforward to show:
i) Linearity of expectation: For constant c, E(cg(X)) = cE(g(X)). For functions g₁, g₂, . . . , g_N, E(g₁(X) + g₂(X) + · · · + g_N(X)) = E(g₁(X)) + E(g₂(X)) + · · · + E(g_N(X)).
ii) For constant c, E(c) = c and V(c) = 0.
iii) Nonlinearity of variance: For constant c, V(cg(X)) = c² V(g(X)).
iv) V(X) = E(X²) − (E(X))².
3 Independent Discrete Random Variables
Let X and Y be two random variables and consider the two events A := [X ≤ a] and B := [Y ≤ b] for real numbers a, b. Suppose that these two events are independent, so we have the middle equality below:

P(X ≤ a, Y ≤ b) = P(A ∩ B) = P(A)P(B) = P(X ≤ a)P(Y ≤ b).

We extend this equality to all numbers a, b to define the independence of random variables. If we have

P(X ≤ a, Y ≤ b) = P(X ≤ a)P(Y ≤ b) for all a, b ∈ ℜ,

the random variables X and Y are said to be independent. To indicate independence, we can write X ⊥ Y.
Starting from P(X ≤ a, Y ≤ b) = P(X ≤ a)P(Y ≤ b), we can obtain P(X = a, Y = b) for integer-valued random variables X, Y in terms of their pmfs:

P(X = a, Y = b) = P(X ≤ a, Y ≤ b) − P(X ≤ a, Y ≤ b − 1) − (P(X ≤ a − 1, Y ≤ b) − P(X ≤ a − 1, Y ≤ b − 1))
= F_X(a) F_Y(b) − F_X(a) F_Y(b − 1) − (F_X(a − 1) F_Y(b) − F_X(a − 1) F_Y(b − 1))
= F_X(a) p_Y(b) − F_X(a − 1) p_Y(b)
= p_X(a) p_Y(b).

This can be extended to non-integer-valued random variables, but a new notation is needed to express the largest value of X smaller than a. If you adopt some notation for this purpose, the steps above can be repeated to obtain

P(X = a, Y = b) = p_X(a) p_Y(b)

for two independent discrete random variables X and Y.
Example: For two independent random variables, we check the equality E(X₁X₂) = E(X₁)E(X₂):

E(X₁X₂) = ∑_{x₁,x₂: P(X₁=x₁, X₂=x₂)>0} (x₁x₂) P(X₁ = x₁, X₂ = x₂) = ∑_{x₁,x₂: p_{X₁}(x₁), p_{X₂}(x₂)>0} x₁x₂ p_{X₁}(x₁) p_{X₂}(x₂)
= ( ∑_{x₁: p_{X₁}(x₁)>0} x₁ p_{X₁}(x₁) ) ( ∑_{x₂: p_{X₂}(x₂)>0} x₂ p_{X₂}(x₂) ) = E(X₁)E(X₂).

The second equality above uses the fact that, by independence, the probability of the event [X₁ = x₁, X₂ = x₂] is p_{X₁}(x₁) p_{X₂}(x₂). ⋄
Example: For two independent random variables, we check the equality V(X₁ + X₂) = V(X₁) + V(X₂):

V(X₁ + X₂) = E(X₁² + 2X₁X₂ + X₂²) − E(X₁ + X₂)E(X₁ + X₂)
= E(X₁²) − E(X₁)² + E(X₂²) − E(X₂)² + 2E(X₁X₂) − 2E(X₁)E(X₂)
= V(X₁) + V(X₂). ⋄
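A quick Monte Carlo sanity check of the two equalities just verified (the particular distributions B(10, 0.3) and Po(2) are arbitrary choices for the check):

# Two independent discrete random variables: X1 ~ B(10, 0.3), X2 ~ Po(2).
set.seed(7)
n  <- 1e5
x1 <- rbinom(n, size = 10, prob = 0.3)
x2 <- rpois(n, lambda = 2)
c(mean(x1 * x2), mean(x1) * mean(x2))   # E(X1 X2) versus E(X1) E(X2)
c(var(x1 + x2), var(x1) + var(x2))      # V(X1 + X2) versus V(X1) + V(X2)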
The equalities established in the last two examples are summarized below:
v) For two independent random variables X₁ and X₂, E(X₁X₂) = E(X₁)E(X₂).
vi) Additivity of variance: for two independent random variables X₁ and X₂, V(X₁ + X₂) = V(X₁) + V(X₂).
4 Common Discrete Random Variables
Some commonly used discrete random variables are introduced below.
4.1 Bernoulli Random Variable
An experiment is said to consist of Bernoulli trials if
• the outcome of each trial can take two values, which can be interpreted as, say, success and failure,
• each trial is independent of the other trials,
• each trial is identical in terms of probability to the other trials, so for all trials P(success) = p = 1 − P(failure).
If an experiment has a single Bernoulli trial, the number of successes is called a Bernoulli random variable.
Example: Let us consider the experiment of tossing a coin once. Let “success” ≡ “head”, so P(Success) = P(Head) = p, where p is the probability that the coin yields a head, and P(Failure) = P(Tail) = 1 − p. Let X be the number of heads on a single toss. Then E(X) = 1p + 0(1 − p) = p and V(X) = E(X²) − p² = 1²p + 0²(1 − p) − p² = p(1 − p). ⋄
Example: Let us consider the price of a stock, which can increase or decrease over a day. Let “success” ≡
“price increase”, so P(Success)= P(Price increase) and P(Failure)=P(Price decrease). Suppose that the price
either increases by $1 or decreases by $1. Then the expected price change over a day is
Today’s price + 1P(Price increase) − 1P(Price decrease) − Today’s price
= P(Price increase) − P(Price decrease) = 2P(Price increase) − 1.
Price increases on average if P(Price increase) > 1/2. This probability of price increase can be estimated
statistically for stocks. ⋄
4.2 Binomial Random Variable
Number of successes in n independent Bernoulli trials, each with success probability p, is a Binomial random
variable with parameters ( p, n).
Example: Let us consider the experiment of tossing a coin many times. On each toss, P(Success) = P(Head) = p and P(Failure) = P(Tail) = 1 − p. Let X_i be the number of heads on the ith toss; X_i is a Bernoulli random variable. Consider the total number X of heads in n tosses. Then X is a Binomial random variable and

X = ∑_{i=1}^{n} X_i. ⋄
To denote a Binomial random variable, we write X ∼ B(n, p). To obtain the probability mass function of X, consider the event [X = m] of having m successes among n trials for m ≤ n. If the first m trials are successes, we get a string of

S S . . . S S F F . . . F F   (m successes followed by n − m failures).

Probability of this string is p^m (1 − p)^{n−m}. If we take a permutation of this string, the probability remains p^m (1 − p)^{n−m}. The number of permutations that we can make up from m Ss and n − m Fs is C^n_m. Hence,

P(X = m) = C^n_m p^m (1 − p)^{n−m} for m = 0, 1, 2, . . . , n.
Example: For Xi Bernoulli random variable with P(Success)=p, we have E( Xi ) = p and V( Xi ) = p(1 − p).
By using X = ∑_{i=1}^{n} X_i, linearity of expectation and additivity of variance for independent random variables,
E( X ) = np and V( X ) = np(1 − p). ⋄
Example: An urn contains 6 white balls and 2 black balls. A ball is drawn and put back into the urn. The process is repeated 4 times. Let X be the number of black balls drawn out of 4 draws. Find P(X = 2), E(X) and V(X). Convince yourself that X ∼ B(4, 2/8), so P(X = 2) = C^4_2 (2/8)² (1 − 2/8)², E(X) = 4(2/8) and V(X) = 4(2/8)(6/8). ⋄
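For this urn example, base R's dbinom gives the same numbers as the formula (a small check; only n = 4 and p = 2/8 come from the example):

# X ~ B(4, 2/8): number of black balls in 4 draws with replacement.
n <- 4; p <- 2/8
dbinom(2, size = n, prob = p)        # P(X = 2)
choose(4, 2) * (2/8)^2 * (6/8)^2     # same value, from the formula above
c(mean = n * p, variance = n * p * (1 - p))   # E(X) = 1, V(X) = 0.75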
Suppose that B(n_j, p) is the number of successes in n_j trials. Then ∑_j B(n_j, p) is the number of successes in ∑_j n_j Bernoulli trials with success probability p. This in turn leads to ∑_j B(n_j, p) ∼ B(∑_j n_j, p), where ∼ indicates having the same distribution. For example, B(n₁, p) + B(n₂, p) is another Binomial random variable with n₁ + n₂ trials and success probability p, i.e., B(n₁, p) + B(n₂, p) ∼ B(n₁ + n₂, p).
4.3 Geometric Random Variable
Number of Bernoulli trials until the first success is a Geometric random variable. When counting the number
of trials, we include the trial that resulted in the first success. For example, the first success in the string FFFFSFFSFS occurs on the fifth trial. Some books do not include the trial in which the success happened when counting the number of trials until the first success, and say that it takes four trials to get a success in FFFFSFFSFS. This convention in other books changes the pmf, cdf, mean, and variance formulas, so one has to be careful when reading other books. With our convention of counting trials, a geometric random variable X takes
values in {1, 2, 3, . . . }.
For the Geometric random variable X to be equal to n, we need n − 1 failures first and a success afterwards:

F F . . . F F S   (n − 1 failures, then the first success; the trials after it are irrelevant).

Recalling that p is the success probability, the pmf for the geometric random variable is

P(X = n) = (1 − p)^{n−1} p.
Unlike the binomial pmf, the geometric pmf does not have a combinatorial term because the sequence of
failures and the success cannot be altered.
Example: Consider a die rolling experiment that continues until a 6 comes up. The experiment can potentially have infinitely many trials and so a countably infinite sample space. Let “success” = “6 appears” and “failure” = “1, 2, 3, 4, or 5 appears”, so P(Success) = 1/6 and P(Failure) = 5/6. Then the probability of having “6” for the first time in the nth trial is P(X = n) = (5/6)^{n−1}(1/6). ⋄
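Base R's dgeom follows the other convention discussed above (it counts failures before the first success), so this section's pmf is obtained by shifting its argument; a short check with the die example's p = 1/6:

# P(X = n) under this section's convention (X counts trials including the success).
p <- 1/6
n <- 1:6
cbind(n, manual = (1 - p)^(n - 1) * p, dgeom_shifted = dgeom(n - 1, prob = p))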
The cumulative distribution function for the Geometric random variable has a simple form:

F_X(n) = P(X ≤ n) = ∑_{i=1}^{n} (1 − p)^{i−1} p = p ∑_{i=0}^{n−1} (1 − p)^i = p (1 − (1 − p)^n)/(1 − (1 − p)) = 1 − (1 − p)^n.

Hence, F̄ := 1 − F, the complementary cdf, is also simple:

F̄_X(n) = 1 − F_X(n) = P(X > n) = (1 − p)^n.
Example: A graduate student applies to jobs one by one. Within a few days after his job interview, he is told whether or not he gets the position. Each interview can be thought of as a Bernoulli trial where success is obtaining a position. The chance of getting a position after an interview is estimated to be 0.2. What is the probability that the student cannot get a job after the first five interviews? We want P(X > 5) for a Geometric random variable X with success probability 0.2. So, P(X > 5) = (1 − 0.2)^5. ⋄
If the first m trials result in failures, what is the probability that the first n trials result in failures as well, for n ≥ m? In particular, we want to see if the probability that the remaining n − m trials turn out to be failures depends on the failures of the first m trials. In other words, are the outcomes of the last n − m trials independent of the first m trials? This can be answered formally by examining the probability of the event [X > n | X > m]:

P(X > n | X > m) = P(X > n, X > m)/P(X > m) = P(X > n)/P(X > m) = (1 − p)^n/(1 − p)^m = (1 − p)^{n−m}.

On the other hand, (1 − p)^{n−m} is the probability of failures in n − m trials, so (1 − p)^{n−m} = P(X > n − m). Hence, the outcomes of the last n − m trials do not depend on the knowledge (memory) of the first m trials. We have the memoryless property for the Geometric random variable:

P(X > n | X > m) = P(X > n − m).
Example: What other discrete random variable ranging over {1, 2, . . . } has the memoryless property? The
memoryless property can be written as P( X > n + m) = P( X > m)P( X > n) for nonnegative integers m, n.
In terms of the complementary cdf, we have F̄(m + n) = F̄(m)F̄(n). Also F̄(0) = P(X > 0) = 1. We can set q = F̄(1) with 0 ≤ q ≤ 1. Then we recursively obtain F̄(n) = q^n. Interpreting q as the failure probability and defining p = 1 − q as the success probability, we obtain F(n) = 1 − (1 − p)^n, which is the cdf of the Geometric random variable. As a result, the only discrete random variable ranging over {1, 2, . . . } and satisfying the memoryless property is the Geometric random variable. ⋄
4.4 Hypergeometric Random Variable
One way to obtain a Binomial random variable is to consider an urn with s success balls and f failure balls: we pick a ball from the urn, record it, and return it to the urn. Each pick is either a success or a failure, with success probability s/(s + f). The probability of m successes after n repetitions of this experiment is given by the binomial probability

C^n_m (s/(s + f))^m (f/(s + f))^{n−m}.

In the urn experiment above, each ball is returned back to the urn. If we do not return the balls back to the urn, the success and failure probabilities are no longer constant. Suppose we again pick n balls, now without replacement, and wonder about the probability of m balls being successes (and equivalently n − m balls being failures). For this event, we need m ≤ s and n − m ≤ f (or n ≤ s + f). The number of ways of choosing m success balls out of s success balls is C^s_m, the number of ways of choosing n − m failure balls out of f failure balls is C^f_{n−m}, and the number of ways of choosing n balls out of s + f balls is C^{s+f}_n. Let X be the number of successes in this experiment. Then the probability of m success balls is

P(X = m) = C^s_m C^f_{n−m} / C^{s+f}_n.
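Base R's dhyper computes this pmf (with its own argument order); a short check with illustrative numbers s = 6, f = 4, n = 5, which are not from the text:

# Hypergeometric pmf: m successes when drawing n balls without replacement
# from an urn with s success balls and f failure balls.
s <- 6; f <- 4; n <- 5
m <- 0:n
cbind(m,
      manual = choose(s, m) * choose(f, n - m) / choose(s + f, n),
      dhyper = dhyper(m, s, f, n))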
This probability, obtained without replacement, is different from the binomial probability obtained with replacement. So the random variable X deserves a new name: the Hypergeometric random variable.
The difference between the Binomial and Hypergeometric random variables is caused by the “replacement” of the balls. The lack of replacement changes the success probability significantly if n is close to s + f. Otherwise,
success probabilities do not change much by taking a ball out of the urn. For example, with s = 300 and
f = 700, the success probability for the first ball is 300/1000 for the second it is either 300/999 or 299/999,
both of which are very close to 300/1000. This inspires us to check the Hypergeometric random variable as
N := s + f → ∞. We also define p = s/N so s = N p and f = N (1 − p). With this reparametrization, we can
show that the Hypergeometric pmf converges to the Binomial pmf as N increases:
C^{Np}_m C^{N(1−p)}_{n−m} / C^N_n = (Np)! (N(1−p))! n! (N − n)! / [(Np − m)! m! (N(1−p) − n + m)! (n − m)! N!]
= C^n_m [(Np)(Np − 1) . . . (Np − m + 1)] [N(1−p)(N(1−p) − 1) . . . (N(1−p) − n + m + 1)] / [N(N − 1) . . . (N − n + 1)]
= C^n_m [(p)(p − 1/N) . . . (p − (m − 1)/N)] [(1 − p)((1 − p) − 1/N) . . . ((1 − p) − (n − m − 1)/N)] / [1(1 − 1/N) . . . (1 − (n − 1)/N)]
→ C^n_m p^m (1 − p)^{n−m} as N → ∞.
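A numeric look at this convergence (the values p = 0.3, n = 10, m = 4 and the sequence of N values are arbitrary choices for the illustration):

# Hypergeometric pmf approaches the Binomial pmf as the urn grows with p = s/N fixed.
p <- 0.3; n <- 10; m <- 4
for (N in c(20, 100, 1000, 10000)) {
  hyp <- dhyper(m, N * p, N * (1 - p), n)
  cat("N =", N, " hypergeometric:", round(hyp, 5),
      " binomial:", round(dbinom(m, n, p), 5), "\n")
}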
Example: A package of light bulbs has 6 bulbs. 4 of these are non-defective and 2 are defective. 3 bulbs are
taken from the package for inspection. Let X be the number of defective bulbs observed during inspection.
Find P( X = m) for m ∈ {0, 1, 2}.
P(X = 0) = C^2_0 C^4_3 / C^6_3,   P(X = 1) = C^2_1 C^4_2 / C^6_3,   P(X = 2) = C^2_2 C^4_1 / C^6_3. ⋄
Example: The sum of the Hypergeometric pmf must yield 1. Let us check this sum:

∑_{m=0}^{s} P(X = m) = ∑_{m=0}^{s} C^s_m C^f_{n−m} / C^{s+f}_n.

We claim that C^{s+f}_n = ∑_{m=0}^{s} C^s_m C^f_{n−m}. To convince yourself of this equality, consider C^{s+f}_n, the number of n-element subsets of an (s + f)-element set. An indirect way to obtain these subsets is to split the (s + f)-element set into two sets with s and f elements. To obtain n-element subsets, we can pick m elements from the s-element set and n − m elements from the f-element set. The indirect way also gives us all the n-element subsets, and in ∑_{m=0}^{s} C^s_m C^f_{n−m} ways. So C^{s+f}_n = ∑_{m=0}^{s} C^s_m C^f_{n−m}. ⋄
Example: Find the expected value of the Hypergeometric random variable X.

E(X) = ∑_{m=0}^{s} m C^s_m C^f_{n−m} / C^{s+f}_n = s ∑_{m=1}^{s} C^{s−1}_{m−1} C^f_{n−m} / C^{s+f}_n = (sn/(s + f)) ∑_{m=1}^{s} C^{s−1}_{m−1} C^f_{n−m} / C^{s+f−1}_{n−1}
= (sn/(s + f)) ∑_{m=0}^{s−1} C^{s−1}_m C^f_{n−m−1} / C^{s+f−1}_{n−1} = sn/(s + f).
4.5 Poisson Random Variable
Starting with a Binomial random variable B(n, p), what happens to B(n, p) as n → ∞ and p → 0 while np remains constant? These limits can be considered by setting λ = np; then n → ∞ implies p = λ/n → 0.

lim_{n→∞} P(B(n, λ/n) = m) = lim_{n→∞} C^n_m (λ/n)^m (1 − λ/n)^{n−m}
= (λ^m/m!) lim_{n→∞} [n(n − 1) . . . (n − m + 1)/n^m] (1 − λ/n)^n (1 − λ/n)^{−m}
= exp(−λ) λ^m/m!   for m = 0, 1, 2, . . . ,

since n(n − 1) . . . (n − m + 1)/n^m → 1, (1 − λ/n)^n → exp(−λ) and (1 − λ/n)^{−m} → 1 as n → ∞.
The limiting random variable has the pmf p(m) = exp(−λ)λ^m/m! and is called the Poisson distribution, denoted Po(λ).
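A quick numeric look at this limit (λ = 1 and the chosen values of n are illustrative):

# Binomial(n, lambda/n) pmf at m approaches the Poisson(lambda) pmf at m as n grows.
lambda <- 1; m <- 0:4
rbind(n10     = dbinom(m, 10, lambda / 10),
      n100    = dbinom(m, 100, lambda / 100),
      n10000  = dbinom(m, 10000, lambda / 10000),
      poisson = dpois(m, lambda))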
Example: Classical FM plays about 1 Led Zeppelin song every 10 hours. Suppose that the number of LZ songs played every 10 hours has a Po(1) distribution. P(No LZ song played in the last 10 hours) = P(Po(1) = 0) = exp(−1)1^0/0! = exp(−1) = 0.3679. P(1 LZ song played in the last 10 hours) = P(Po(1) = 1) = exp(−1)1^1/1! = exp(−1) = 0.3679. P(2 or more LZ songs played in the last 10 hours) = P(Po(1) ≥ 2) = 1 − P(Po(1) ≤ 1) = 1 − exp(−1) − exp(−1) = 0.2642. ⋄
Expected value of Po(λ) is λ:

E(Po(λ)) = ∑_{i=0}^{∞} i exp(−λ)λ^i/i! = λ ∑_{i=1}^{∞} exp(−λ)λ^{i−1}/(i − 1)! = λ ∑_{j=0}^{∞} exp(−λ)λ^j/j! = λ.
Example: A customer enters a store in a given 1-second interval with probability p = 0.01; no customer enters the store in that second with probability 0.99. Suppose that we consider n = 100 seconds. What is the probability of having exactly 1 customer in 100 seconds? This probability is P(B(100, 0.01) = 1) = C^{100}_1 (0.01)^1 (0.99)^{99} = 0.99^{99} = 0.3697. We can also use B(100, 0.01) ≈ Po(1) and obtain P(Po(1) = 1) = exp(−1) = 0.3679. ⋄
A Poisson random variable Po(λ) can be interpreted as the number of arrivals in a time interval. Suppose that Po(λ₁) and Po(λ₂) are the numbers of type 1 and type 2 customers arriving, independently of each other, in an interval. The total number of customers arriving in the interval is Po(λ₁) + Po(λ₂). This is the limiting distribution of B(n₁, p) + B(n₂, p) for λ₁ = n₁p and λ₂ = n₂p. Since B(n₁, p) + B(n₂, p) ∼ B(n₁ + n₂, p) → Po(λ₁ + λ₂), we obtain Po(λ₁) + Po(λ₂) ∼ Po(λ₁ + λ₂). That is, the sum of Poisson random variables is another Poisson random variable. Formally, for an integer m,

P(B(n₁, p) + B(n₂, p) = m) = P(B(n₁ + n₂, p) = m),
P(B(n₁, p) + B(n₂, p) = m) → P(Po(λ₁) + Po(λ₂) = m),
P(B(n₁ + n₂, p) = m) → P(Po(λ₁ + λ₂) = m).

If two sequences are equal and converge, they must converge to the same number: P(Po(λ₁) + Po(λ₂) = m) = P(Po(λ₁ + λ₂) = m). Note that the convergence used here is the convergence of sequences made up of real numbers.
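The closure under addition can be spot-checked by convolving two Poisson pmfs, as sketched below (λ₁ = 1 and λ₂ = 2 are illustrative values):

# P(Po(1) + Po(2) = m) computed by convolution, compared with dpois(m, 3).
lambda1 <- 1; lambda2 <- 2
conv <- function(m) sum(dpois(0:m, lambda1) * dpois(m:0, lambda2))
m <- 0:6
rbind(convolution = sapply(m, conv), poisson_sum = dpois(m, lambda1 + lambda2))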
5 Solved Examples
1. A broadcasting company runs 4 different commercials for a car company and 96 commercials for other
companies. The total of 100 commercials are randomly shown on TV one by one.
a) What is the expected number of commercials to see until (including) the first commercial of the car company?
ANSWER The probability that a given commercial is a car commercial is 4/100, so it takes an expected 1/(4/100) = 25 commercials until a car commercial is seen.
b) What is the expected number of commercials to see until (including) all of the commercials of the car company are seen?
ANSWER The probability that a given commercial is a car commercial is 4/100, so it takes an expected 1/(4/100) = 25 commercials until a car commercial is seen. From then on, the probability of seeing a new (distinct) car commercial is 3/100, so it takes an expected 1/(3/100) = 100/3 commercials until the second car commercial is seen. From then on, the probability of seeing a distinct car commercial is 2/100, so it takes 1/(2/100) = 100/2 commercials until the third car commercial is seen. From then on, the probability of seeing a distinct car commercial is 1/100, so it takes 1/(1/100) = 100 commercials until the fourth car commercial is seen. In total, it takes an expected number of 100/4 + 100/3 + 100/2 + 100/1 commercials to see all car commercials.
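This answer reads the showings as independent draws in which a commercial may repeat; under that reading, the coupon-collecting argument can be simulated as follows (labeling the car commercials 1 to 4 is my choice):

# Expected number of showings until all 4 car commercials (labels 1-4 of 100) are seen,
# assuming each showing is an independent uniform draw from the 100 commercials.
set.seed(42)
watch_until_all_cars <- function() {
  seen <- integer(0); count <- 0
  while (length(seen) < 4) {
    count <- count + 1
    ad <- sample(1:100, 1)
    if (ad <= 4) seen <- union(seen, ad)
  }
  count
}
mean(replicate(2000, watch_until_all_cars()))   # close to 100/4 + 100/3 + 100/2 + 100/1, about 208.33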
6 Exercises
1. 5 people including you and a friend line up at random. Let X be the number of people between you
and your friend and find FX .
2. N people including you and a friend line up at random. Let X be the number of people between you
and your friend and find P( X = 0) and P( X = N ).
3. Suppose a fair coin is tossed 5 times. Let X be the length of the longest head run and find P( X = x ) for
0 ≤ x ≤ 5.
4. In the case of Geometric random variable, we have Bernoulli trials with success probability p and we
are interested in the number of trial in which the first success happens. Let X (i, p) be the number of
Bernoulli trial in which the ith success happens in a series of Bernoulli trials with success probability
of p. Clearly, X (1, p) is the Geometric random variable. Find the pmf of X (i, p).
5. Let X (i, p) be the number of Bernoulli trial in which the ith success happens in a series of Bernoulli trials
with success probability of p. When X (1, p) is the Geometric random variable, Y (1, p) := X (1, p) − 1 is
the number of failures until the first success. Find the pmf of Y(1, p) and F̄_{Y(1,p)}(n). Does Y(1, p) have the memoryless property?
ANSWER P(Y(1, p) = m) = p(1 − p)^m. F̄_{Y(1,p)}(n) = P(Y(1, p) > n) = P(X(1, p) > n + 1) = (1 − p)^{n+1}.
P(Y(1, p) > n)/P(Y(1, p) > m) = (1 − p)^{n−m} ≠ (1 − p)^{n−m+1} = P(Y(1, p) > n − m), so Y(1, p) does not have the memoryless property.
6. Consider a series of Bernoulli trials with success probability of p and let X (i, p) be the number of
Bernoulli trial in which the ith success happens. Recall that B(n, p) is the number of successes in n
trials. For an integer a, we have
P( X (i, p) > n) = P( B(n, p) < i + a).
What should a be to make this equality valid? After finding this a, check that P( X (i, p) > n) =
P( B(n, p) < i + a) holds for i > n.
7. Let X (i j , p) be the number of Bernoulli trial in which the i j th success happens in a series of Bernoulli
trials with success probability of p. Suppose that X (i j , p)s are independent. Establish that
∑_j X(i_j, p) ∼ X(∑_j i_j, p),
where ∼ indicates having the same distribution.
8. For a random variable X parametrized by (r, p) we have the pmf
P(X(r, p) = k) = C^{k+r−1}_k p^k (1 − p)^r for k = 0, 1, 2, . . . .
Show that the mean of this distribution is rp/(1 − p).
9. For a random variable X parametrized by (r, p) we have the pmf
P(X(r, p) = k) = [Γ(k + r)/(k! Γ(r))] p^k (1 − p)^r.
What happens to this pmf as r → ∞ and p → 0 while the mean rp/(1 − p) remains constant? Identify the limiting pmf and determine which common random variable it corresponds to.
ANSWER Let λ := rp/(1 − p) so that p = λ/(r + λ). Then
P(X(r, p) = k) = [Γ(k + r)/(k! Γ(r))] p^k (1 − p)^r = [λ^k/k!] [(1 + λ/r)^{−r}] [Γ(k + r)/(Γ(r)(r + λ)^k)].
The first term is constant in r, the second term converges to e^{−λ} as r → ∞, and the third term converges to 1 as r → ∞. Note that the third term can be written as Π_{i=1}^{k} (k/r + 1 − i/r)/(1 + λ/r). Hence, P(X(r, p) = k) → [λ^k/k!] e^{−λ}. This is the Poisson distribution with parameter λ = rp/(1 − p).
10. Let X (i, p) be the number of Bernoulli trial in which the ith success happens in a series of Bernoulli
trials with success probability of p. Let Y (i, p) be the number of Bernoulli trials resulting in failure
until the ith success happens. What is the relation between X (i, p) and Y (i, p). Find the pmf of Y (i, p).
11. Show that the variance of the Hypergeometric random variable is (ns/(s + f ))(1 − s/(s + f ))((s + f −
n)/(s + f − 1)).
12. Use the parametrization N = s + f , s = N p and f = N (1 − p) in the Hypergeometric variance (ns/(s +
f ))(1 − s/(s + f ))((s + f − n)/(s + f − 1)) and simplify this variance as N → ∞. Relate this limiting
variance to other variances that you may know.
13. For fixed parameters n ≥ 1 and 0 < q < 1, the pmf of a random variable X whose range is {n, n + 1, n + 2, . . . } satisfies p(n) = (1 − q)^n and

p(m)/p(m − 1) = q(m − 1)/(m − n) for m > n.

Find the expected value of X. You may use ∑_{i=0}^{∞} C^{n+i}_i q^i = (1 − q)^{−n−1}.
ANSWER We can check that p(m) = C^{m−1}_{n−1} (1 − q)^n q^{m−n} for m ≥ n. The mean is ∑_{m=n}^{∞} m C^{m−1}_{n−1} (1 − q)^n q^{m−n} = ∑_{m=n}^{∞} n C^m_n (1 − q)^n q^{m−n} = n ∑_{i=0}^{∞} C^{n+i}_n (1 − q)^n q^i = n(1 − q)^n ∑_{i=0}^{∞} C^{n+i}_n q^i = n(1 − q)^n (1 − q)^{−n−1} = n/(1 − q). You may realize that X(n, q) is the number of Bernoulli trial in which the nth success happens in a series of Bernoulli trials with success probability of 1 − q.
14. In a social experiment to measure the honesty of several cities, Reader’s Digest reporters 1 intentionally
dropped 12 wallets one by one in a central area in each of the following cities and counted the number
of wallets returned to the owners by the people who found the wallets.
City        Index   # returned      City        Index   # returned
Lisbon      1       1               Ljubljana   9       6
Madrid      2       2               Berlin      10      6
Prague      3       3               Amsterdam   11      7
Zurich      4       4               Moscow      12      7
Rio         5       4               New York    13      8
Bucharest   6       4               Budapest    14      8
Warsaw      7       5               Mumbai      15      9
London      8       5               Helsinki    16      11
a) Let X_{i,j} = 1 if the jth dropped wallet is returned in city i; otherwise, it is zero. What are the ranges of values for i and j in the Reader's Digest experiment? For distinct integers i, j, k from the appropriate ranges, consider the pairs of random variables (X_{i,j}, X_{i,j+1}), (X_{i,j}, X_{i,j+2}), (X_{i,j}, X_{i+1,j}) and (X_{i,j}, X_{i+2,j}). Which of these pairs include random variables that are more likely to be independent within the pair?
b) Consider the random variables Y_j = ∑_i X_{i,j} and Z_i = ∑_j X_{i,j}, where the sums are taken over all possible index values of i and j. Express these sums in English.
¹ Most Honest Cities: The Readers Digest Lost Wallet Test at http://www.rd.com/culture/most-honest-cities-lost-wallet-test/
c) If possible, list the assumptions needed to conclude that Y3 is a Binomial random variable and specify
the parameters of the associated Binomial distribution. Otherwise, explain why this is not possible.
d) If possible, list the assumptions needed to conclude that Z5 is a Binomial random variable and
specify the parameters of the associated Binomial distribution. Otherwise, explain why this is not
possible.
e) If possible, list the assumptions needed to conclude that Y3 + Z5 is a Binomial random variable
and specify the parameters of the associated Binomial distribution. Otherwise, explain why this is not
possible.
15. 5 people take an elevator from the bottom floor of a building to the higher floors: first, second, third
and fourth floor. Each person on the elevator is independently going to one of the floors with equal
probability. Let i = 0 for the bottom floor and let Si be 1 if the elevator stops on floor i = 1, 2, 3, 4;
otherwise it is 0.
a) What is P(Elevator does not stop on floor i ) for i = 1, 2, 3, 4? What is the distribution of Si ?
ANSWER The elevator does not stop on floor i if all 5 people go to a floor different from i. This happens with probability P(Elevator does not stop on floor i) = (3/4)^5. S_i has a binomial distribution with parameters 1 and 1 − (3/4)^5.
b) When the elevator arrives at the first floor, two people get off at the first floor and 3 remain. Subject to this information, find the range of S₂ + S₃ + S₄ and P(S₂ + S₃ + S₄ = k) for appropriate values of k. Is the distribution of S₂ + S₃ + S₄ binomial or not? Do you expect it to be binomial, and why? Hint: Part c) is shorter so you may solve that first.
ANSWER S₂ + S₃ + S₄ = k ∈ {1, 2, 3}. The number of ways of distributing 3 people to 3 floors is the number of non-negative solutions to x₂ + x₃ + x₄ = 3. There are C^{3+3−1}_3 = 10 solutions. Let X_i be the number of people getting off on floor i. P(S₂ + S₃ + S₄ = 1) = P(X₂ = 3, X₃ = 0, X₄ = 0) + P(X₂ = 0, X₃ = 3, X₄ = 0) + P(X₂ = 0, X₃ = 0, X₄ = 3) = 3 × (3!/(3!0!0!))(1/3)³ = 1/9. P(S₂ + S₃ + S₄ = 2) = P(X₂ = 2, X₃ = 1, X₄ = 0) + P(X₂ = 1, X₃ = 2, X₄ = 0) + P(X₂ = 2, X₃ = 0, X₄ = 1) + P(X₂ = 1, X₃ = 0, X₄ = 2) + P(X₂ = 0, X₃ = 2, X₄ = 1) + P(X₂ = 0, X₃ = 1, X₄ = 2) = 6 × (3!/(2!1!0!))(1/3)³ = 6/9. P(S₂ + S₃ + S₄ = 3) = P(X₂ = 1, X₃ = 1, X₄ = 1) = (3!/(1!1!1!))(1/3)³ = 2/9. The distribution is not binomial. We do not expect it to be binomial because S₂, S₃, S₄ are not independent. For example, S₂ + S₃ + S₄ ≥ 1 breaks the independence.
c) When the elevator arrives at the second floor, one person gets off at the second floor and 2 remain.
Subject to this information, find the range of S3 + S4 and P(S3 + S4 = k) for appropriate values of k.
ANSWER S₃ + S₄ = k ∈ {1, 2}. The number of ways of distributing 2 people to 2 floors is the number of non-negative solutions to x₃ + x₄ = 2. There are C^{2+2−1}_2 = 3 solutions. P(S₃ + S₄ = 1) = P(X₃ = 0, X₄ = 2) + P(X₃ = 2, X₄ = 0) = 2 × (2!/(2!0!))(1/2)² = 1/2. P(S₃ + S₄ = 2) = P(X₃ = 1, X₄ = 1) = 1 × (2!/(1!1!))(1/2)² = 1/2.
d) When the elevator arrives at the second floor, one person gets off at the second floor and 2 remain.
But these two are talking about a joint project and are going to the same floor. Subject to this information, find the range of S3 + S4 and P(S3 + S4 = k) for appropriate values of k.
ANSWER S3 + S4 = k ∈ {1}, so P(S3 + S4 = 1) = 1.
16. An urn contains 100 marbles, one of which is defective.
a) A marble is picked from the urn. If it is the defective one, we stop. Otherwise, we return the picked
marble to the urn and we pick another marble. We continue to pick marbles from the urn until the
defective is found. Let M be the number of picks, find the range of M and P( M = k ) for appropriate
values of k.
b) A marble is picked from the urn. If it is the defective one, we stop. Otherwise, we put the picked
marble aside (outside the urn) and we pick another marble. We continue to pick marbles from the
urn until the defective is found. Let N be the number of picks, find the range of N and P( N = k) for
appropriate values of k.
17. An urn contains 100 marbles, two of which are defective.
a) Two marbles are picked at once from the urn. If these are the defective ones, we stop. Otherwise, we
return the picked marbles to the urn and we pick two other marbles. We continue to pick two marbles
from the urn until the defectives are found. Let M be the number of picks of two marbles, find the range
of M and P( M = k) for appropriate values of k.
b) A marble is picked from the urn. Then another marble is picked. If these are the defective ones,
we stop by recording 2 picks. Otherwise, 1 or 2 of the picked marbles is/are non-defective. We put
the non-defective marbles aside (outside the urn). If 1 marble put aside is non-defective, we pick 1
marble from the urn to have two marbles in hand. If 2 marbles put aside are non-defective, we pick 2
marbles at once from the urn to have two marbles in hand. If both marbles in hand are defective, we
stop. Otherwise, we continue to pick marbles from the urn until the defectives are found. Let N be the
number of picks, find the range of N and P( N = k ) for appropriate values of k.
7 Appendix: Negatives in Factorials and in Binomial Coefficients
For nonnegative integer n, n! = n(n − 1)(n − 2) . . . 1 and 0! = 1. For positive non-integers as well as negative real numbers, this definition does not help. Instead, the factorial is defined through the Gamma function Γ:

n! = Γ(n + 1) = ∫_0^∞ x^n e^{−x} dx.

Note that 0! = Γ(1) = ∫_0^∞ e^{−x} dx = 1. Also, integration by parts with u(x) = x^n (or du = nx^{n−1}) and v(x) = −e^{−x} (or dv = e^{−x}) confirms

Γ(n + 1) = ∫_0^∞ x^n e^{−x} dx = u(∞)v(∞) − u(0)v(0) − ∫_0^∞ nx^{n−1}(−e^{−x}) dx = n ∫_0^∞ x^{n−1} e^{−x} dx = nΓ(n).
Because of Γ(n + 1) = nΓ(n), knowing Γ over (0, 1) is sufficient to find Γ for any positive real number. For example, Γ(7/2) = (5/2)(3/2)(1/2)Γ(1/2).
We can use u := √x to obtain

Γ(1/2) = ∫_0^∞ x^{−1/2} e^{−x} dx = ∫_0^∞ (e^{−u²}/u) 2u du = 2 ∫_0^∞ e^{−u²} du.

Then

Γ(1/2)² = 4 ∫_0^∞ e^{−u²} du ∫_0^∞ e^{−v²} dv = 4 ∫_0^∞ ∫_0^∞ e^{−(u² + v²)} du dv.
Using polar coordinates,

Γ(1/2)² = 4 ∫_0^∞ ∫_0^{π/2} e^{−r²} r dθ dr = 2π ∫_0^∞ e^{−r²} r dr = π ∫_0^∞ e^{−t} dt = π

(substituting t = r² in the last step). Hence, Γ(1/2) = √π.
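The Gamma values used above can be confirmed numerically with base R's gamma function:

# Gamma(1/2) = sqrt(pi) and the recursion Gamma(7/2) = (5/2)(3/2)(1/2) Gamma(1/2).
c(gamma(0.5), sqrt(pi))
c(gamma(3.5), (5/2) * (3/2) * (1/2) * gamma(0.5))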
We also obtain

Γ(0) = ∫_0^∞ x^{−1} e^{−x} dx ≥ lim_{n→0} ∫_n^1 x^{−1} e^{−x} dx ≥ e^{−1} lim_{n→0} ∫_n^1 x^{−1} dx = e^{−1} lim_{n→0} (ln(1) − ln(n)) → ∞.
Γ is undefined at 0. By Γ(n) = Γ(n + 1)/n, Γ is undefined at all negative integers as well. Γ takes the value ∞ or −∞ depending on whether we approach 0 or a negative integer from the left or the right. But in general, we should remember |n!| = |Γ(n + 1)| = ∞ for a negative integer n. The Γ function is plotted in Figure 1; see that the function diverges at the nonpositive integers 0, −1, −2, −3, −4.
Having presented negative integer factorials, we consider the binomial coefficients C^n_m for nonnegative integers n, m but n < m:

C^n_m = n(n − 1) . . . (m + 1)/(n − m)! = 0 for 0 ≤ n < m
because |(n − m)!| → ∞ for n − m < 0. With this understanding, we can write the binomial formula as

(1 + x)^n = ∑_{i=0}^{n} C^n_i x^i = ∑_{i=0}^{∞} C^n_i x^i.
Would this binomial formula hold when n < 0? For integers n < 0 and m > 0, we need to define C^n_m:

C^n_m := (n)(n − 1) . . . (n − m + 2)(n − m + 1)/m!   for n < 0 ≤ m,
[Figure 1 about here.]
Figure 1: Γ(n) for n ∈ {−4.975, −4.925, . . . , 4.975}, produced by the R commands x <- -5.025 + (1:200)/20; y <- gamma(x); plot(x, y).
where the factorial notation is applied only to the nonnegative m. This binomial coefficient can also be written as

C^n_m = (−1)^m (−n + m − 1)(−n + m − 2) . . . (−n + 1)(−n)/m! = (−1)^m C^{−n+m−1}_m   for n < 0 ≤ m.
Then the negative binomial expansion for n < 0 and |x| < 1 (a technical condition for convergence) is

(1 + x)^n = ∑_{i=0}^{∞} C^n_i x^i = ∑_{i=0}^{∞} (−1)^i C^{−n+i−1}_i x^i.

Interestingly, the formula (1 + x)^n = ∑_i C^n_i x^i applies both for n ≥ 0 and n < 0 with the appropriate interpretation of the binomial coefficients for 0 ≤ n < m and n < 0 ≤ m.
Example: Find (1 − x)^{−n−1} for n ≥ 0 and 0 < x < 1.

(1 − x)^{−n−1} = ∑_{i=0}^{∞} C^{−n−1}_i (−x)^i = ∑_{i=0}^{∞} C^{n+1+i−1}_i (−1)^i (−1)^i x^i = ∑_{i=0}^{∞} C^{n+i}_i x^i.

We then obtain the following useful identity:

∑_{i=0}^{∞} C^{n+i}_i x^i = 1/(1 − x)^{n+1}. ⋄
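The identity can be spot-checked by truncating the series (n = 3, x = 0.4 and the truncation point are arbitrary choices; the neglected tail is negligible):

# Partial sum of sum_{i>=0} choose(n+i, i) x^i versus 1/(1-x)^(n+1).
n <- 3; x <- 0.4
i <- 0:200
c(series = sum(choose(n + i, i) * x^i), closed_form = (1 - x)^(-(n + 1)))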
8 Some Identities Involving Combinations
C^n_m + C^n_{m+1} = C^{n+1}_{m+1}.
When making up (m + 1)-element subsets from n + 1 elements, you can fix one element of the n + 1 elements. If this fixed element is in your subset, you can choose the remaining m elements from the other n elements in C^n_m ways. Otherwise, you have m + 1 elements to choose from n elements, in C^n_{m+1} ways.
∑_{i=0}^{k} C^m_i C^n_{k−i} = C^{m+n}_k.
A combination of k objects from a group of m + n objects must have 0 ≤ i ≤ k objects from a group of m objects and the remaining from a group of n objects.
∑_{i=m}^{n} C^i_m = C^{n+1}_{m+1} for integers m ≤ n.
∑_{i=m}^{n} C^i_m = C^{m+1}_{m+1} + C^{m+1}_m + C^{m+2}_m + C^{m+3}_m + · · · + C^n_m = C^{m+2}_{m+1} + C^{m+2}_m + C^{m+3}_m + · · · + C^n_m, where we use C^m_m = C^{m+1}_{m+1} and C^{m+1}_{m+1} + C^{m+1}_m = C^{m+2}_{m+1}. Then C^{m+2}_{m+1} + C^{m+2}_m + C^{m+3}_m + · · · + C^n_m = C^{m+3}_{m+1} + C^{m+3}_m + · · · + C^n_m, where we use C^{m+2}_{m+1} + C^{m+2}_m = C^{m+3}_{m+1}. Continuing this way we eventually obtain ∑_{i=m}^{n} C^i_m = C^n_{m+1} + C^n_m = C^{n+1}_{m+1}.
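All three identities can be spot-checked numerically, as in the sketch below (the values of n, m, k are arbitrary):

# Spot checks of the Pascal, Vandermonde, and hockey-stick identities.
n <- 8; m <- 3; k <- 5
c(choose(n, m) + choose(n, m + 1), choose(n + 1, m + 1))        # Pascal
c(sum(choose(m, 0:k) * choose(n, k - 0:k)), choose(m + n, k))   # Vandermonde
c(sum(choose(m:n, m)), choose(n + 1, m + 1))                    # hockey stick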
If you know of some other useful identities, you can ask me to list them here.