8 Discrete Random Variables
Intuitively, to tell whether a random variable is discrete, we simply
consider the possible values of the random variable. If the random
variable is limited to only a finite or countably infinite number of
possibilities, then it is discrete.
Example 8.1. Voice Lines: A voice communication system for
a business contains 48 external lines. At a particular time, the
system is observed, and some of the lines are being used. Let the
random variable X denote the number of lines in use. Then, X
can assume any of the integer values 0 through 48. [15, Ex 3-1]
Definition 8.2. A random variable X is said to be a discrete random variable if there exists a countable number of distinct real numbers xk such that
∑_k P [X = xk] = 1.    (11)
In other words, X is a discrete random variable if and only if X has a countable support.
Example 8.3. For the random variable N in Example 7.5 (Three
Coin Tosses),
For the random variable S in Example 7.6 (Sum of Two Dice),
8.4. Recall that the support SX of a random variable X is defined as any set S such that P [X ∈ S] = 1. For a discrete random variable, SX is usually taken to be {x : pX (x) > 0}, the set of all "possible values" of X.
Definition 8.5. Important Special Case: An integer-valued random variable is a discrete random variable whose xk in (11)
above are all integers.
8.6. Recall, from 7.14, that the probability distribution of a
random variable X is a description of the probabilities associated
with X.
For a discrete random variable, the distribution is often characterized by just a list of the possible values (x1 , x2 , x3 , . . .) along
with the probability of each:
(P [X = x1 ] , P [X = x2 ] , P [X = x3 ] , . . . , respectively) .
In some cases, it is convenient to express the probability in
terms of a formula. This is especially useful when dealing with a
random variable that has an unbounded number of outcomes. It
would be tedious to list all the possible values and the corresponding probabilities.
8.1 PMF: Probability Mass Function
Definition 8.7. When X is a discrete random variable satisfying
(11), we define its probability mass function (pmf) by29
pX (x) = P [X = x].
• Sometimes, when we only deal with one random variable or
when it is clear which random variable the pmf is associated
with, we write p(x) or px instead of pX (x).
• The argument (x) of a pmf ranges over all real numbers. Hence, the pmf is also defined for x that is not among the xk in (11). In such cases, the pmf is simply 0. This is usually expressed as “pX (x) = 0, otherwise” when we specify a pmf for a particular r.v.
29
Many references (including [15] and MATLAB) use fX (x) for pmf instead of pX (x). We will
NOT use fX (x) for pmf. Later, we will define fX (x) as a probability density function which
will be used primarily for another type of random variable (continuous r.v.)
Example 8.8. Continue from Example 7.5. N is the number of
heads in a sequence of three coin tosses.
8.9. Graphical Description of the Probability Distribution: Traditionally, we use stem plot to visualize pX . To do this, we graph
a pmf by marking on the horizontal axis each value with nonzero
probability and drawing a vertical bar with length proportional to
the probability.
8.10. Any pmf p(·) satisfies two properties:
(a) p(·) ≥ 0;
(b) there exist numbers x1, x2, x3, . . . such that ∑_k p(xk) = 1 and p(x) = 0 for other x.
When you are asked to verify that a function is a pmf, check these two properties.
8.11. Finding probability from pmf: for any subset B of R, we can find
P [X ∈ B] = ∑_{xk ∈ B} P [X = xk] = ∑_{xk ∈ B} pX (xk).
In particular, for integer-valued random variables,
P [X ∈ B] = ∑_{k ∈ B} P [X = k] = ∑_{k ∈ B} pX (k).
8.12. Steps to find a probability of the form P [some condition(s) on X] when the pmf pX (x) is known:
(a) Find the support of X.
(b) Consider only the x inside the support. Find all values of x that satisfy the condition(s).
(c) Evaluate the pmf at each x found in the previous step.
(d) Add the pmf values from the previous step.
Example 8.13. Suppose a random variable X has pmf
pX (x) = c/x for x = 1, 2, 3, and pX (x) = 0 otherwise.
(a) The value of the constant c is
(b) Sketch of pmf
(c) P [X = 1]
(d) P [X ≥ 2]
(e) P [X > 3]
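A quick numerical check of these answers in MATLAB (a sketch; the vectors below are introduced only for this illustration):
x = 1:3;
w = 1./x;                     % unnormalized weights proportional to c/x
c = 1/sum(w)                  % normalizing constant, c = 6/11
p = c*w;                      % pmf values pX(1), pX(2), pX(3)
p(1)                          % P[X = 1]  = 6/11
sum(p(x >= 2))                % P[X >= 2] = 5/11
sum(p(x > 3))                 % P[X > 3]  = 0 (no support there)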
8.14. Any function p(·) on R which satisfies
(a) p(·) ≥ 0, and
(b) there exist numbers x1, x2, x3, . . . such that ∑_k p(xk) = 1 and p(x) = 0 for other x
is a pmf of some discrete random variable.
8.2 CDF: Cumulative Distribution Function
Definition 8.15. The (cumulative) distribution function (cdf )
of a random variable X is the function FX (x) defined by
FX (x) = P [X ≤ x] .
• The argument (x) of a cdf ranges over all real numbers.
• From its definition, we know that 0 ≤ FX ≤ 1.
• Think of it as a function that collects the “probability mass”
from −∞ up to the point x.
8.16. From pmf to cdf: In general, for any discrete random variable with possible values x1, x2, . . ., the cdf of X is given by
FX (x) = P [X ≤ x] = ∑_{xk ≤ x} pX (xk).
Example 8.17. Continue from Examples 7.5, 7.11, and 8.8 where N is defined as the number of heads in a sequence of three coin tosses. We have
pN (0) = pN (3) = 1/8 and pN (1) = pN (2) = 3/8.
(a) FN (0)
(b) FN (1.5)
(c) Sketch of cdf
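A small MATLAB sketch (using the pmf values above) that evaluates and plots this staircase cdf:
xN = 0:3;
pN = [1 3 3 1]/8;                     % pmf of N
FN = @(x) sum(pN(xN <= x));           % cdf: add pmf values at support points <= x
FN(0)                                 % = 1/8
FN(1.5)                               % = 1/8 + 3/8 = 1/2
stairs([-1 xN 4], [0 cumsum(pN) 1])   % staircase plot of the cdf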
8.18. Facts:
• For any discrete r.v. X, FX is a right-continuous, staircase function of x with jumps at a countable set of points xk.
• When you are given the cdf of a discrete random variable, you can derive its pmf from the locations and sizes of the jumps. If a jump happens at x = c, then pX (c) is the same as the amount of jump at c. At the location x where there is no jump, pX (x) = 0.
Example 8.19. Consider a discrete random variable X whose cdf FX (x) is shown in Figure 9.

Figure 9: CDF for Example 8.19

Determine the pmf pX (x).
8.20. Characterizing properties of cdf:

CDF1 FX is non-decreasing (monotone increasing).

CDF2 FX is right-continuous (continuous from the right):
∀x, FX (x+) := lim_{y→x+} FX (y) = FX (x) = P [X ≤ x].
The left limit FX (x−) := lim_{y→x−} FX (y) = P [X < x] also exists, and
P [X = x] = FX (x) − FX (x−) = the jump or saltus in FX at x.

CDF3 lim_{x→−∞} FX (x) = 0 and lim_{x→∞} FX (x) = 1.

Figure 10: Right-continuous function at jump point

8.21. FX can be written as
FX (x) = ∑_{xk} pX (xk) u(x − xk),
where u(x) = 1_{[0,∞)}(x) is the unit step function.

30 These properties hold for any type of random variables. Moreover, for any function F that satisfies these three properties, there exists a random variable X whose CDF is F.
8.3 Families of Discrete Random Variables
Many physical systems can be modeled by the same or similar
random experiments and random variables. In this subsection,
we present the analysis of several discrete random variables that
frequently arise in applications.31
Definition 8.22. X is uniformly distributed on a finite set S if
pX (x) = P [X = x] = 1/|S| for x ∈ S, and 0 otherwise.
• We write X ∼ U(S) or X ∼ Uniform(S).
• Read “X is uniform on S” or “X is a uniform random variable
on set S”.
• The pmf is usually referred to as the uniform discrete distribution.
• Simulation: When the support S contains only consecutive integers32 , it can be generated by the command randi in MATLAB
(R2008b).
31 As mentioned in 7.12, we often omit a discussion of the underlying sample space of the random experiment and directly describe the distribution of a particular random variable.
32 Or, with minor manipulation, only uniformly spaced numbers.
Example 8.23. X is uniformly distributed on 1, 2, . . . , n if
In MATLAB, X can be generated by randi(10).
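For instance, a quick simulation sketch (n = 10 assumed) that compares the relative frequencies with the uniform pmf 1/n:
n = 10;
X = randi(n, 1e5, 1);              % 100,000 samples, uniform on {1,...,n}
relFreq = hist(X, 1:n)/numel(X);   % relative frequency of each value
disp([(1:n)' relFreq' repmat(1/n, n, 1)])   % value, estimate, true pmf 1/n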
Example 8.24. The uniform pmf is used when the random variable can take a finite number of “equally likely” or “totally random” values.
• Classical game of chance / classical probability
• Fair gaming devices (well-balanced coins and dice, well-shuffled decks of cards)
Example 8.25. Roll a fair die. Let X be the outcome.
Definition 8.26. X is a Bernoulli random variable if
pX (x) = 1 − p for x = 0, p for x = 1, and 0 otherwise, where p ∈ (0, 1).
• Write X ∼ B(1, p) or X ∼ Bernoulli(p)
• X takes only two values: 0 or 1
Definition 8.27. X is a binary random variable if
pX (x) = 1 − p for x = a, p for x = b, and 0 otherwise, where p ∈ (0, 1) and b > a.
• X takes only two values: a or b
Definition 8.28. X is a binomial random variable with size n ∈ N and parameter p ∈ (0, 1) if
pX (x) = (n choose x) p^x (1 − p)^(n−x) for x ∈ {0, 1, 2, . . . , n}, and 0 otherwise.    (12)
• Write X ∼ B(n, p) or X ∼ binomial(n, p).
◦ Observe that B(1, p) is Bernoulli with parameter p.
• To calculate pX (x), can use binopdf(x,n,p) in MATLAB.
• Interpretation: X is the number of successes in n independent
Bernoulli trials.
Example 8.29. An optical inspection system is to distinguish
among different part types. The probability of a correct classification of any part is 0.98. Suppose that three parts are inspected
and that the classifications are independent.
(a) Let the random variable X denote the number of parts that
are correctly classified. Determine the probability mass function of X. [15, Q3-20]
(b) Let the random variable Y denote the number of parts that
are incorrectly classified. Determine the probability mass
function of Y .
Solution:
(a) X is a binomial random variable with n = 3 and p = 0.98. Hence,
pX (x) = (3 choose x) (0.98)^x (0.02)^(3−x) for x ∈ {0, 1, 2, 3}, and 0 otherwise.    (13)
In particular, pX (0) = 8 × 10−6 , pX (1) = 0.001176, pX (2) =
0.057624, and pX (3) = 0.941192. Note that in MATLAB, these
probabilities can be calculated by evaluating
binopdf(0:3,3,0.98).
(b) Y is a binomial random variable with n = 3 and p = 0.02. Hence,
pY (y) = (3 choose y) (0.02)^y (0.98)^(3−y) for y ∈ {0, 1, 2, 3}, and 0 otherwise.    (14)
In particular, pY (0) = 0.941192, pY (1) = 0.057624, pY (2) =
0.001176, and pY (3) = 8 × 10−6 . Note that in MATLAB, these
probabilities can be calculated by evaluating
binopdf(0:3,3,0.02).
Alternatively, note that there are three parts. If X of them are
classified correctly, then the number of incorrectly classified
parts is n − X, which is what we defined as Y . Therefore,
Y = 3 − X. Hence, pY (y) = P [Y = y] = P [3 − X = y] =
P [X = 3 − y] = pX (3 − y).
Example 8.30. Daily Airlines flies from Amsterdam to London
every day. The price of a ticket for this extremely popular flight
route is $75. The aircraft has a passenger capacity of 150. The
airline management has made it a policy to sell 160 tickets for this
flight in order to protect themselves against no-show passengers.
Experience has shown that the probability of a passenger being
a no-show is equal to 0.1. The booked passengers act independently of each other. Given this overbooking strategy, what is the
probability that some passengers will have to be bumped from the
flight?
Solution: This problem can be treated as 160 independent trials of a Bernoulli experiment with a success rate of p = 9/10, where a passenger who shows up for the flight is counted as a success. Use the random variable X to denote the number of passengers that show up for a given flight. The random variable X is binomially distributed with parameters n = 160 and p = 9/10. The probability in question is given by
P [X > 150] = 1 − P [X ≤ 150] = 1 − FX (150).
In MATLAB, we can enter 1-binocdf(150,160,9/10) to get 0.0359.
Thus, the probability that some passengers will be bumped from
any given flight is roughly 3.6%. [22, Ex 4.1]
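As a sanity check, the same probability can be estimated by simulation (a sketch; the sample size 1e5 is arbitrary):
n = 160; p = 9/10; trials = 1e5;
X = binornd(n, p, trials, 1);      % number of show-ups on each simulated flight
mean(X > 150)                      % Monte Carlo estimate of P[X > 150], about 0.036
1 - binocdf(150, n, p)             % exact value for comparison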
Definition 8.31. A geometric random variable X is defined by the fact that for some constant β ∈ (0, 1),
pX (k + 1) = β × pX (k)
for all k ∈ S where S can be either N or N ∪ {0}.
(a) When its support is N = {1, 2, . . .},
pX (x) = (1 − β) β^(x−1) for x = 1, 2, . . . , and 0 otherwise.
• Write X ∼ G1 (β) or geometric1 (β).
• In MATLAB, to evaluate pX (x), use geopdf(x-1,1-β).
• Interpretation: X is the number of trials required in Bernoulli trials to achieve the first success.
In particular, in a series of Bernoulli trials (independent trials with constant probability p of a success), let the random variable X denote the number of trials until the first success. Then X is a geometric random variable with parameter β = 1 − p and
pX (x) = (1 − β) β^(x−1) = p (1 − p)^(x−1) for x = 1, 2, . . . , and 0 otherwise.
(b) When its support is N ∪ {0},
pX (x) = (1 − β) β^x = p (1 − p)^x for x = 0, 1, 2, . . . , and 0 otherwise.
• Write X ∼ G0 (β) or geometric0 (β).
• In MATLAB, to evaluate pX (x), use geopdf(x,1-β).
• Interpretation: X is the number of failures in Bernoulli trials before the first success occurs.
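A short MATLAB sketch (β = 0.8 chosen arbitrarily) that checks the G1(β) pmf against a simulation of "trials until the first success":
beta = 0.8; p = 1 - beta; trials = 1e5;
X = geornd(p, trials, 1) + 1;                 % geornd counts failures; +1 gives trials until first success
x = 1:10;
empirical = arrayfun(@(k) mean(X == k), x);   % simulated P[X = k]
[empirical; geopdf(x-1, p)]                   % row 1: simulation, row 2: G1(beta) pmf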
8.32. In 1837, the famous French mathematician Poisson introduced a probability distribution that would later come to be known
as the Poisson distribution, and this would develop into one of the
most important distributions in probability theory. As is often remarked, Poisson did not recognize the huge practical importance of
the distribution that would later be named after him. In his book,
he dedicates just one page to this distribution. It was Bortkiewicz
in 1898, who first discerned and explained the importance of the
Poisson distribution in his book Das Gesetz der Kleinen Zahlen
(The Law of Small Numbers). [22]
Definition 8.33. X is a Poisson random variable with parameter α > 0 if
pX (x) = e^(−α) α^x / x! for x = 0, 1, 2, . . . , and 0 otherwise.
• In MATLAB, use poisspdf(x,alpha).
• Write X ∼ P (α) or Poisson(α).
• We will see later in Example 9.7 that α is the “average” or
expected value of X.
• Instead of X, a Poisson random variable is usually denoted by Λ. The parameter α is often replaced by λτ, where λ is referred to as the intensity/rate parameter of the distribution.
Example 8.34. The first use of the Poisson model is said to have
been by a Prussian (German) physician, Bortkiewicz, who found
that the annual number of late-19th-century Prussian (German)
soldiers kicked to death by horses fitted a Poisson distribution [6,
p 150],[3, Ex 2.23]33 .
33
I. J. Good and others have argued that the Poisson distribution should be called the
Bortkiewicz distribution, but then it would be very difficult to say or write.
Example 8.35. The number of hits to a popular website during
a 1-minute interval is given by N ∼ P(α) where α = 2.
(a) Find the probability that there is at least one hit between
3:00AM and 3:01AM.
(b) Find the probability that there are at least 2 hits during the
time interval above.
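These two probabilities can be evaluated in MATLAB, for example:
alpha = 2;
1 - poisspdf(0, alpha)     % P[N >= 1] = 1 - e^(-2), about 0.8647
1 - poisscdf(1, alpha)     % P[N >= 2], about 0.5940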
8.36. One of the reasons why the Poisson distribution is important is that many natural phenomena can be modeled by Poisson processes.
Definition 8.37. A Poisson process (PP) is a random arrangement of “marks” (denoted by “×” below) on the time line.
The “marks” may indicate the arrival times or occurrences of
event/phenomenon of interest.
Example 8.38. Examples of processes that can be modeled by
Poisson process include
(a) the sequence of times at which lightning strikes occur or mail
carriers get bitten within some region
(b) the emission of particles from a radioactive source
(c) the arrival of
• telephone calls at a switchboard or at an automatic phoneswitching system
• urgent calls to an emergency center
• (filed) claims at an insurance company
• incoming spikes (action potential) to a neuron in human
brain
(d) the occurrence of
• serious earthquakes
• traffic accidents
• power outages
in a certain area.
(e) page view requests to a website
8.39. It is convenient to consider the Poisson process in terms of
customers arriving at a facility.
We focus on a type of Poisson process that is called homogeneous
Poisson process.
Definition 8.40. For a homogeneous Poisson process, there is only one parameter that describes the whole process. This number is called the rate and is usually denoted by λ.
Example 8.41. If you think about modeling customer arrival as
a Poisson process with rate λ = 5 customers/hour, then it means
that during any fixed time interval of duration 1 hour (say, from
noon to 1PM), you expect to have about 5 customers arriving in
that interval. If you consider a time interval of duration two hours
(say, from 1PM to 3PM), you expect to have about 2 × 5 = 10
customers arriving in that time interval.
8.42. One important fact which we will revisit later is that, for a
homogeneous Poisson process, the number of arrivals during a time
interval of duration T is a Poisson random variable with parameter
α = λT .
Example 8.43. Examples of Poisson random variables:
• #photons emitted by a light source of intensity λ [photons/second] in time τ
• #atoms of radioactive material undergoing decay in time τ
• #clicks in a Geiger counter in τ seconds when the average number of clicks in 1 second is λ.
• #dopant atoms deposited to make a small device such as an
FET
• #customers arriving in a queue or workstations requesting
service from a file server in time τ
• Counts of demands for telephone connections in time τ
• Counts of defects in a semiconductor chip.
Example 8.44. Thongchai produces a new hit song every 7 months
on average. Assume that songs are produced according to a Poisson process. Find the probability that Thongchai produces more
than two hit songs in 1 year.
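Under the Poisson-process model of 8.42, the number of hit songs in one year is Poisson with α = λT = (1/7)(12) = 12/7, so the answer can be computed as, for example:
alpha = 12/7;                 % lambda = 1/7 songs/month, T = 12 months
1 - poisscdf(2, alpha)        % P[X > 2], about 0.25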
8.45. Poisson approximation of Binomial distribution: When p is small and n is large, B(n, p) can be approximated by P(np).
(a) In a large number of independent repetitions of a Bernoulli trial having a small probability of success, the total number of successes is approximately Poisson distributed with parameter α = np, where n = the number of trials and p = the probability of success. [22, p 109]
(b) More specifically, suppose Xn ∼ B(n, pn). If pn → 0 and npn → α as n → ∞, then
P [Xn = k] = (n choose k) pn^k (1 − pn)^(n−k) → e^(−α) α^k / k!.
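A quick numerical comparison of the two pmfs (n = 1000 and p = 0.002 chosen purely for illustration):
n = 1000; p = 0.002; k = 0:8;
[binopdf(k, n, p); poisspdf(k, n*p)]   % the two rows are nearly identical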
Example 8.46. Consider Xn ∼ B(n, 1/n).
Example 8.47. Recall that Bortkiewicz applied the Poisson model
to the number of Prussian cavalry deaths attributed to fatal horse
kicks. Here, indeed, one encounters a very large number of trials
(the Prussian cavalrymen), each with a very small probability of
“success” (fatal horse kick).
8.48. Summary:

X ∼                        Support SX              pX (x)
Uniform Un                 {1, 2, . . . , n}       1/n
Uniform U{0,1,...,n−1}     {0, 1, . . . , n − 1}   1/n
Bernoulli B(1, p)          {0, 1}                  1 − p for x = 0; p for x = 1
Binomial B(n, p)           {0, 1, . . . , n}       (n choose x) p^x (1 − p)^(n−x)
Geometric G0 (β)           N ∪ {0}                 (1 − β) β^x
Geometric G1 (β)           N                       (1 − β) β^(x−1)
Poisson P(α)               N ∪ {0}                 e^(−α) α^x / x!

Table 3: Examples of probability mass functions. Here, p, β ∈ (0, 1), α > 0, n ∈ N.
8.4 Some Remarks
8.49. Sometimes, it is useful to define and think of pmf as a vector
p of probabilities.
When you use MATLAB, it is also useful to keep track of the
values of x corresponding to the probabilities in p. This can be
done via defining a vector x.
Example 8.50. For B(3, 1/3), we may define
x = [0, 1, 2, 3]
and
p = [ (3 choose 0)(1/3)^0 (2/3)^3, (3 choose 1)(1/3)^1 (2/3)^2, (3 choose 2)(1/3)^2 (2/3)^1, (3 choose 3)(1/3)^3 (2/3)^0 ]
  = [ 8/27, 4/9, 2/9, 1/27 ].
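In MATLAB, these two vectors can be obtained directly, e.g.:
x = 0:3;
p = binopdf(x, 3, 1/3)    % returns [8/27 4/9 2/9 1/27], i.e. [0.2963 0.4444 0.2222 0.0370]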
8.51. At this point, we have a couple of ways to define probabilities that are associated with a random variable X:
(a) We can define P [X ∈ B] for all possible sets B.
(b) For a discrete random variable, we only need to define its pmf pX (x), which is defined as P [X = x] = P [X ∈ {x}].
(c) We can also define the cdf FX (x).
Definition 8.52. If pX (c) = 1, that is P [X = c] = 1, for some constant c, then X is called a degenerate random variable.
9 Expectation and Variance
Two numbers are often used to summarize a probability distribution for a random variable X. The mean is a measure of the center or middle of the probability distribution, and the variance is a
measure of the dispersion, or variability in the distribution. These
two measures do not uniquely identify a probability distribution.
That is, two different distributions can have the same mean and
variance. Still, these measures are simple, useful summaries of the
probability distribution of X.
9.1 Expectation of Discrete Random Variable
The most important characteristic of a random variable is its expectation. Synonyms for expectation are expected value, mean,
and first moment.
The definition of expectation is motivated by the conventional
idea of numerical average. Recall that the numerical average of n
numbers, say a1, a2, . . . , an, is
(1/n) ∑_{k=1}^{n} ak.
We use the average to summarize or characterize the entire collection of numbers a1 , . . . , an with a single value.
Example 9.1. Consider 10 numbers: 5, 2, 3, 2, 5, -2, 3, 2, 5, 2.
The average is
(5 + 2 + 3 + 2 + 5 + (−2) + 3 + 2 + 5 + 2)/10 = 27/10 = 2.7.
We can rewrite the above calculation as
−2 × (1/10) + 2 × (4/10) + 3 × (2/10) + 5 × (3/10).
Definition 9.2. Suppose X is a discrete random variable, we define the expectation (or mean or expected value) of X by
EX = ∑_x x × P [X = x] = ∑_x x × pX (x).    (15)
In other words, the expected value of a discrete random variable is a weighted mean of the values the random variable can take on, where the weights come from the pmf of the random variable.
• Some references use mX or µX to represent EX.
• For conciseness, we simply write x under the summation symbol in (15); this means that the sum runs over all x values in
the support of X. (Of course, for x outside of the support,
pX (x) is 0 anyway.)
9.3. Analogy: In mechanics, think of point masses on a line with
a mass of pX (x) kg. at a distance x meters from the origin.
In this model, EX is the center of mass (the balance point).
This is why pX (x) is called probability mass function.
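With the pmf stored as vectors x and p (as in 8.49 and Example 8.50), the sum in (15) is a single dot product; a small MATLAB sketch:
x = 0:3;  p = binopdf(x, 3, 1/3);   % pmf of B(3, 1/3) from Example 8.50
EX = sum(x .* p)                    % expected value; equals n*p = 1 here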
Example 9.4. When X ∼ Bernoulli(p) with p ∈ (0, 1),
Note that, since X takes only the values 0 and 1, its expected
value p is “never seen”.
9.5. Interpretation: The expected value is in general not a typical value that the random variable can take on. It is often helpful to interpret the expected value of a random variable as the long-run average value of the variable over many independent repetitions of an experiment.
Example 9.6. pX (x) = 1/4 for x = 0, 3/4 for x = 2, and 0 otherwise.
Example 9.7. For X ∼ P(α),
EX = ∑_{i=0}^{∞} i e^(−α) α^i / i! = ∑_{i=1}^{∞} i e^(−α) α^i / i! + 0 = e^(−α) α ∑_{i=1}^{∞} α^(i−1)/(i − 1)! = e^(−α) α ∑_{k=0}^{∞} α^k / k! = e^(−α) α e^α = α.
Example 9.8. For X ∼ B(n, p),
EX = ∑_{i=0}^{n} i (n choose i) p^i (1 − p)^(n−i) = ∑_{i=1}^{n} i [n!/(i!(n − i)!)] p^i (1 − p)^(n−i)
   = n ∑_{i=1}^{n} [(n − 1)!/((i − 1)!(n − i)!)] p^i (1 − p)^(n−i) = n ∑_{i=1}^{n} (n−1 choose i−1) p^i (1 − p)^(n−i).
Let k = i − 1. Then,
EX = n ∑_{k=0}^{n−1} (n−1 choose k) p^(k+1) (1 − p)^(n−(k+1)) = np ∑_{k=0}^{n−1} (n−1 choose k) p^k (1 − p)^(n−1−k).
We now have the expression in the form that we can apply the binomial theorem which finally gives
EX = np (p + (1 − p))^(n−1) = np.
We shall revisit this example again using another approach in Example 10.45.
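A numerical check of EX = np (the values n = 160 and p = 0.9 from Example 8.30 are used here only as an illustration):
n = 160; p = 0.9; x = 0:n;
sum(x .* binopdf(x, n, p))    % returns 144, which is n*p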
Example 9.9. Pascal’s wager : Suppose you concede that you
don’t know whether or not God exists and therefore assign a 50
percent chance to either proposition. How should you weigh these
odds when deciding whether to lead a pious life? If you act piously
and God exists, Pascal argued, your gain–eternal happiness–is infinite. If, on the other hand, God does not exist, your loss, or
negative return, is small–the sacrifices of piety. To weigh these
possible gains and losses, Pascal proposed, you multiply the probability of each possible outcome by its payoff and add them all up,
forming a kind of average or expected payoff. In other words, the
mathematical expectation of your return on piety is one-half infinity (your gain if God exists) minus one-half a small number (your
loss if he does not exist). Pascal knew enough about infinity to
know that the answer to this calculation is infinite, and thus the
expected return on piety is infinitely positive. Every reasonable
person, Pascal concluded, should therefore follow the laws of God.
[14, p 76]
• Pascal's wager is often considered the founding of the mathematical discipline of game theory, the quantitative study of optimal decision strategies in games.
9.10. Technical issue: Definition (15) is only meaningful if the
sum is well defined.
The sum of infinitely many nonnegative terms is always well-defined, with +∞ as a possible value for the sum.
• Infinite Expectation: Consider a random variable X whose
pmf is defined by
pX (x) = 1/(c x²) for x = 1, 2, 3, . . . , and 0 otherwise.
Then, c = ∑_{n=1}^{∞} 1/n², which is a finite positive number (π²/6). However,
EX = ∑_{k=1}^{∞} k pX (k) = ∑_{k=1}^{∞} k × 1/(c k²) = (1/c) ∑_{k=1}^{∞} 1/k = +∞.
Some care is necessary when computing expectations of signed
random variables that take infinitely many values.
• The sum over countably infinite many terms is not always well
defined when both positive and negative terms are involved.
• For example, the infinite series 1 − 1 + 1 − 1 + . . . has the sum
0 when you sum the terms according to (1 − 1) + (1 − 1) + · · · ,
whereas you get the sum 1 when you sum the terms according
to 1 + (−1 + 1) + (−1 + 1) + (−1 + 1) + · · · .
• Such abnormalities cannot happen when all terms in the infinite summation are nonnegative.
It is the convention in probability theory that EX should be evaluated as
EX = ∑_{x ≥ 0} x pX (x) − ∑_{x < 0} (−x) pX (x),
• If at least one of these sums is finite, then it is clear what
value should be assigned as EX.
• If both sums are +∞, then no value is assigned to EX, and
we say that EX is undefined.
Example 9.11. Undefined Expectation: Let
pX (x) = 1/(2c x²) for x = ±1, ±2, ±3, . . . , and 0 otherwise.
Then,
EX = ∑_{k=1}^{∞} k pX (k) − ∑_{k=−∞}^{−1} (−k) pX (k).
The first sum gives
∑_{k=1}^{∞} k pX (k) = ∑_{k=1}^{∞} k × 1/(2c k²) = (1/(2c)) ∑_{k=1}^{∞} 1/k = ∞.
The second sum gives
∑_{k=−∞}^{−1} (−k) pX (k) = ∑_{k=1}^{∞} k pX (−k) = ∑_{k=1}^{∞} k × 1/(2c k²) = (1/(2c)) ∑_{k=1}^{∞} 1/k = ∞.
Because both sums are infinite, we conclude that EX is undefined.
9.12. More rigorously, to define EX, we let X + = max {X, 0} and
X − = − min {X, 0}. Then observe that X = X + − X − and that
both X + and X − are nonnegative r.v.’s. We say that a random
variable X admits an expectation if EX + and EX − are not
both equal to +∞. In which case, EX = EX + − EX − .
9.2 Function of a Discrete Random Variable
Given a random variable X, we will often have occasion to define
a new random variable by Y ≡ g(X), where g(x) is a real-valued
function of the real-valued variable x. More precisely, recall that
a random variable X is actually a function taking points of the
sample space, ω ∈ Ω, into real numbers X(ω). Hence, we have the
following definition
Definition 9.13. The notation Y = g(X) is actually shorthand
for Y (ω) := g(X(ω)).
• The random variable Y = g(X) is sometimes called derived
random variable.
Example 9.14. Let
pX (x) = x²/c for x = ±1, ±2, and 0 otherwise,
and
Y = X⁴.
Find pY (y) and then calculate EY.
9.15. For a discrete random variable X, the pmf of a derived random variable Y = g(X) is given by
pY (y) = ∑_{x: g(x)=y} pX (x).
Note that the sum is over all x in the support of X which satisfy g(x) = y.
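Applied to Example 9.14 (a sketch; the value c = 10 follows from the normalization 1 + 4 + 4 + 1 = 10):
c = 10;
x  = [-2 -1 1 2];
pX = x.^2 / c;
y  = x.^4;                                   % values of Y = X^4
yVals = unique(y);                           % support of Y: {1, 16}
pY = arrayfun(@(v) sum(pX(y == v)), yVals)   % pY = [0.2 0.8]
EY = sum(yVals .* pY)                        % = 13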
Example 9.16. A “binary” random variable X takes only two
values a and b with
P [X = b] = 1 − P [X = a] = p.
X can be expressed as X = (b − a)I + a, where I is a Bernoulli
random variable with parameter p.
9.3 Expectation of a Function of a Discrete Random Variable
Recall that for a discrete random variable X, the pmf of a derived random variable Y = g(X) is given by
pY (y) = ∑_{x: g(x)=y} pX (x).
If we want to compute EY , it might seem that we first have to
find the pmf of Y . Typically, this requires a detailed analysis of g
which can be complicated, and it is avoided by the following result.
9.17. Suppose X is a discrete random variable.
E [g(X)] = ∑_x g(x) pX (x).
This is referred to as the law/rule of the lazy/unconscious statistician (LOTUS) [23, Thm 3.6 p 48],[9, p. 149],[8, p. 50] because it is so much easier to use the above formula than to first find the pmf of Y. It is also called the substitution rule [22, p 271].
Example 9.18. Back to Example 9.14. Recall that
pX (x) = x²/c for x = ±1, ±2, and 0 otherwise.
(a) When Y = X⁴, EY =
(b) E [2X − 1]
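Using LOTUS with the vectors from Example 9.14 (c = 10 assumed):
x = [-2 -1 1 2];  pX = x.^2 / 10;
sum(x.^4 .* pX)        % E[X^4] = 13, with no need to find pY first
sum((2*x - 1) .* pX)   % E[2X - 1] = -1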
9.19. Caution: A frequently made mistake of beginning students is to set E [g(X)] equal to g (EX). In general, E [g(X)] ≠ g (EX).
(a) In particular, E [1/X] is not the same as 1/EX.
(b) An exception is the case of a linear function g(x) = ax + b. See also (9.23).
Example 9.20. Continue from Example 9.4. For X ∼ Bernoulli(p),
(a) EX = p
(b) E [X²] = 0² × (1 − p) + 1² × p = p ≠ (EX)².
Example 9.21. Continue from Example 9.7. Suppose X ∼ P(α).
E [X²] = ∑_{i=0}^{∞} i² e^(−α) α^i / i! = e^(−α) α ∑_{i=1}^{∞} i α^(i−1)/(i − 1)!    (16)
We can evaluate the infinite sum in (16) by rewriting i as i − 1 + 1:
∑_{i=1}^{∞} i α^(i−1)/(i − 1)! = ∑_{i=1}^{∞} (i − 1 + 1) α^(i−1)/(i − 1)! = ∑_{i=1}^{∞} (i − 1) α^(i−1)/(i − 1)! + ∑_{i=1}^{∞} α^(i−1)/(i − 1)!
= α ∑_{i=2}^{∞} α^(i−2)/(i − 2)! + ∑_{i=1}^{∞} α^(i−1)/(i − 1)! = α e^α + e^α = e^α (α + 1).
Plugging this back into (16), we get
E [X²] = α (α + 1) = α² + α.
9.22. Continue from Example 9.8. For X ∼ B(n, p), one can find
E [X²] = np(1 − p) + (np)².
9.23. Some Basic Properties of Expectations
(a) For c ∈ R, E [c] = c
(b) For c ∈ R, E [X + c] = EX + c and E [cX] = cEX
(c) For constants a, b, we have
E [aX + b] = aEX + b.
(d) For constants c1 and c2 ,
E [c1 g1 (X) + c2 g2 (X)] = c1 E [g1 (X)] + c2 E [g2 (X)] .
(e) For constants c1, c2, . . . , cn,
E [∑_{k=1}^{n} ck gk (X)] = ∑_{k=1}^{n} ck E [gk (X)].
Definition 9.24. Some definitions involving expectation of a function of a random variable:
(a) Absolute moment: E [|X|^k], where we define E [|X|^0] = 1.
(b) Moment: mk = E [X^k] = the kth moment of X, k ∈ N.
• The first moment of X is its expectation EX.
• The second moment of X is E [X²].
9.4 Variance and Standard Deviation
An average (expectation) can be regarded as one number that
summarizes an entire probability model. After finding an average,
someone who wants to look further into the probability model
might ask, “How typical is the average?” or, “What are the
chances of observing an event far from the average?” A measure
of dispersion/deviation/spread is an answer to these questions
wrapped up in a single number. (The opposite of this measure is
the peakedness.) If this measure is small, observations are likely
to be near the average. A high measure of dispersion suggests that
it is not unusual to observe events that are far from the average.
Example 9.25. Consider your score on the midterm exam. After
you find out your score is 7 points above average, you are likely to
ask, “How good is that? Is it near the top of the class or somewhere
near the middle?”.
Example 9.26. In the case that the random variable X is the
random payoff in a game that can be repeated many times under
identical conditions, the expected value of X is an informative
measure on the grounds of the law of large numbers. However, the
information provided by EX is usually not sufficient when X is
the random payoff in a nonrepeatable game.
Suppose your investment has yielded a profit of $3,000 and you
must choose between the following two options:
• the first option is to take the sure profit of $3,000 and
• the second option is to reinvest the profit of $3,000 under the
scenario that this profit increases to $4,000 with probability
0.8 and is lost with probability 0.2.
The expected profit of the second option is
0.8 × $4, 000 + 0.2 × $0 = $3, 200
and is larger than the $3,000 from the first option. Nevertheless,
most people would prefer the first option. The downside risk is
too big for them. A measure that takes into account the aspect of
risk is the variance of a random variable. [22, p 35]
9.27. The most important measures of dispersion are the
standard deviation and its close relative, the variance.
Definition 9.28. Variance:
Var X = E [(X − EX)²].    (17)
• Read "the variance of X".
• Notation: DX, or σ²(X), or σ²X, or VX [23, p. 51].
• In some references, to avoid confusion from the two expectation symbols, they first define m = EX and then define the variance of X by
Var X = E [(X − m)²].
• We can also calculate the variance via another identity:
Var X = E [X²] − (EX)².
• The units of the variance are squares of the units of the random variable.
9.29. Basic properties of variance:
• Var X ≥ 0.
• Var X ≤ E [X²].
• Var[cX] = c² Var X.
• Var[X + c] = Var X.
• Var[aX + b] = a² Var X.
Definition 9.30. Standard Deviation:
σX = √(Var X).
• It is useful to work with the standard deviation since it has
the same units as EX.
• Informally we think of outcomes within ±σX of EX as being
in the center of the distribution. Some references would informally interpret sample values within ±σX of the expected
value, x ∈ [EX − σX , EX + σX ], as “typical” values of X and
other values as “unusual”.
• σaX+b = |a| σX .
9.31. σX and Var X: Note that the √· function is a strictly increasing function. Because σX = √(Var X), if one of them is large, the other one is also large. Therefore, both values quantify the amount of spread/dispersion in the RV X (which can be observed from the spread or dispersion of the pmf or the histogram or the relative frequency graph). However, Var X does not have the same unit as the RV X.
9.32. In finance, standard deviation is a key concept and is used
to measure the volatility (risk) of investment returns and stock
returns.
It is common wisdom in finance that diversification of a portfolio
of stocks generally reduces the total risk exposure of the investment. We shall return to this point in Example 10.65.
Example 9.33. Continue from Example 9.25. If the standard
deviation of exam scores is 12 points, the student with a score of
+7 with respect to the mean can think of herself in the middle of
the class. If the standard deviation is 3 points, she is likely to be
near the top.
Example 9.34. Suppose X ∼ Bernoulli(p).
(a) E [X²] = 0² × (1 − p) + 1² × p = p.
(b) Var X = E [X²] − (EX)² = p − p² = p(1 − p).
Alternatively, if we directly use (17), we have
Var X = E [(X − EX)²] = (0 − p)² × (1 − p) + (1 − p)² × p = p(1 − p)(p + (1 − p)) = p(1 − p).
Example 9.35. Continue from Example 9.7 and Example 9.21. Suppose X ∼ P(α). We have
Var X = E [X²] − (EX)² = α² + α − α² = α.
Therefore, for a Poisson random variable, the expected value is the same as the variance.
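A numerical check of this fact (α = 2; the support is truncated at 100, which carries essentially all of the probability mass):
alpha = 2;  x = 0:100;  p = poisspdf(x, alpha);
EX   = sum(x .* p)              % about 2
VarX = sum(x.^2 .* p) - EX^2    % about 2 as well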
Example 9.36. Consider the two pmfs shown in Figure 11. The random variable X with pmf at the left has a smaller variance than the random variable Y with pmf at the right because more of its probability mass is concentrated near zero (their mean) in the graph at the left than in the graph at the right. [9, p. 85]

Figure 11: Example 9.36 shows that a random variable whose probability mass is concentrated near the mean has smaller variance. [9, Fig. 2.9]

9.37. We have already talked about variance and standard deviation as a number that indicates the spread/dispersion of the pmf. More specifically, let's imagine a pmf that is shaped like a bell curve. As the value of σX gets smaller, the spread of the pmf will be smaller and hence the pmf would "look sharper". Therefore, the probability that the random variable X would take a value that is far from the mean would be smaller.
The next property involves the use of σX to bound “the tail
probability” of a random variable.
9.38. Chebyshev's Inequality:
P [|X − EX| ≥ α] ≤ σ²X / α²,
or equivalently,
P [|X − EX| ≥ nσX] ≤ 1/n².
• Useful only when α > σX
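A quick empirical illustration of the bound (a Poisson(2) random variable and n = 2 are arbitrary choices):
alpha = 2; sigma = sqrt(alpha); N = 1e5; n = 2;
X = poissrnd(alpha, N, 1);
mean(abs(X - alpha) >= n*sigma)   % observed tail probability, about 0.05 here
1/n^2                             % Chebyshev bound: 0.25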
Example 9.39. If X has mean m and variance σ², it is sometimes convenient to introduce the normalized random variable
Y = (X − m)/σ.
Definition 9.40. Central Moments: A generalization of the variance is the nth central moment which is defined to be
µn = E [(X − EX)^n].
(a) µ1 = E [X − EX] = 0.
(b) µ2 = σ²X = Var X: the second central moment is the variance.
10 Continuous Random Variables
10.1 From Discrete to Continuous Random Variables
In many practical applications of probability, physical situations
are better described by random variables that can take on a continuum of possible values rather than a discrete number of values.
For this type of random variable, the interesting fact is that
• any individual value has probability zero:
P [X = x] = 0 for all x
(18)
and that
• the support is always uncountable.
These random variables are called continuous random variables.
10.1. We can see from (18) that the pmf is going to be useless for
this type of random variable. It turns out that the cdf FX is still
useful and we shall introduce another useful function called probability density function (pdf) to replace the role of pmf. However,
integral calculus34 is required to formulate this continuous analog
of a pmf.
10.2. In some cases, the random variable X is actually discrete
but, because the range of possible values is so large, it might be
more convenient to analyze X as a continuous random variable.
34
This is always a difficult concept for the beginning student.
Example 10.3. Suppose that current measurements are read from
a digital instrument that displays the current to the nearest onehundredth of a mA. Because the possible measurements are limited, the random variable is discrete. However, it might be a more
convenient, simple approximation to assume that the current measurements are values of a continuous random variable.
Example 10.4. If you can measure the heights of people with
infinite precision, the height of a randomly chosen person is a continuous random variable. In reality, heights cannot be measured
with infinite precision, but the mathematical analysis of the distribution of heights of people is greatly simplified when using a
mathematical model in which the height of a randomly chosen
person is modeled as a continuous random variable. [22, p 284]
Example 10.5. Continuous random variables are important models for
(a) voltages in communication receivers
(b) file download times on the Internet
(c) velocity and position of an airliner on radar
(d) lifetime of a battery
(e) decay time of a radioactive particle
(f) time until the occurrence of the next earthquake in a certain
region
Example 10.6. The simplest example of a continuous random
variable is the “random choice” of a number from the interval
(0, 1).
• In MATLAB, this can be generated by the command rand.
In Excel, use rand().
• The generation is “unbiased” in the sense that “any number
in the range is as likely to occur as another number.”
• Histogram is flat over (0, 1).
• Formally, this is called a uniform RV on the interval (0, 1).
Definition 10.7. We say that X is a continuous random variable35 if we can find a (real-valued) function36 f such that, for any
set B, P [X ∈ B] has the form
P [X ∈ B] = ∫_B f (x) dx.    (19)
• In particular,
P [a ≤ X ≤ b] = ∫_a^b f (x) dx.    (20)
In other words, the area under the graph of f (x) between
the points a and b gives the probability P [a ≤ X ≤ b].
• The function f is called the probability density function
(pdf) or simply density.
• When we want to emphasize that the function f is a density
of a particular random variable X, we write fX instead of f .
35
To be more rigorous, this is the definition for absolutely continuous random variable. At
this level, we will not distinguish between the continuous random variable and absolutely
continuous random variable. When the distinction between them is considered, a random
variable X is said to be continuous (not necessarily absolutely continuous) when condition (18)
is satisfied. Alternatively, condition (18) is equivalent to requiring the cdf FX to be continuous.
Another fact worth mentioning is that if a random variable is absolutely continuous, then it
is continuous. So, absolute continuity is a stronger condition.
36
Strictly speaking, δ-“function” is not a function; so, can’t use δ-function here.
Uniform Random Variable on (0,1): histograms of samples generated in MATLAB by
>> X = rand(1e3,1); hist(X,10)
>> X = rand(1e5,1); hist(X,10)
Figure 12: For a continuous random variable, the probability distribution is
described by a curve called the probability density function, f (x). The total
area beneath the curve is 1.0, and the probability that X will take on some
value between a and b is the area beneath the curve between points a and b.
Example 10.8. For the random variable generated by the rand
command in MATLAB37 or the rand() command in Excel,
Definition 10.9. Recall that the support SX of a random variable
X is any set S such that P [X ∈ S] = 1. For continuous random
variable, SX is usually set to be {x : fX (x) > 0}.
37
The rand command in MATLAB is an approximation for two reasons:
(a) It produces pseudorandom numbers; the numbers seem random but are actually the
output of a deterministic algorithm.
(b) It produces a double precision floating point number, represented in the computer by 64 bits. Thus MATLAB distinguishes no more than 2^64 unique double precision floating point numbers. By comparison, there are uncountably many real numbers in the interval from 0 to 1.
10.2 Properties of PDF and CDF for Continuous Random Variables
10.10. fX is determined only almost everywhere38 . That is, given
a pdf f for a random variable X, if we construct a function g by
changing the function f at a countable number of points39 , then g
can also serve as a pdf for X.
10.11. The cdf of any kind of random variable X is defined as
FX (x) = P [X ≤ x] .
Note that even though there can be more than one valid pdf for any given random variable, the cdf is unique. There is only one cdf for each random variable.
10.12. For a continuous random variable, given the pdf fX (x), we can find the cdf of X by
FX (x) = P [X ≤ x] = ∫_{−∞}^{x} fX (t) dt.
10.13. Given the cdf FX (x), we can find the pdf fX (x) by
• If FX is differentiable at x, we will set fX (x) = (d/dx) FX (x).
• If FX is not differentiable at x, we can set the values of fX (x) to be any value. Usually, the values are selected to give a simple expression. (In many cases, they are simply set to 0.)
38
39
Lebesgue-a.e, to be exact
More specifically, if g = f Lebesgue-a.e., then g is also a pdf for X.
Example 10.14. For the random variable generated by the rand
command in MATLAB or the rand() command in Excel,
Example 10.15. Suppose that the lifetime X of a device has the cdf
FX (x) = 0 for x < 0, (1/4) x² for 0 ≤ x ≤ 2, and 1 for x > 2.
Observe that it is differentiable at each point x except at x = 2. The probability density function is obtained by differentiation of the cdf which gives
fX (x) = (1/2) x for 0 < x < 2, and 0 otherwise.
At x = 2 where FX has no derivative, it does not matter what values we give to fX. Here, we set it to be 0.
10.16. In many situations when you are asked to find pdf, it may
be easier to find cdf first and then differentiate it to get pdf.
Exercise 10.17. A point is “picked at random” in the inside of a
circular disk with radius r. Let the random variable X denote the
distance from the center of the disk to this point. Find fX (x).
10.18. Unlike the cdf of a discrete random variable, the cdf of a
continuous random variable has no jumps and is continuous everywhere.
10.19. pX (x) = P [X = x] = P [x ≤ X ≤ x] = ∫_x^x fX (t) dt = 0.
Again, it makes no sense to speak of the probability that X will
take on a pre-specified value. This probability is always zero.
10.20. P [X = a] = P [X = b] = 0. Hence,
P [a < X < b] = P [a ≤ X < b] = P [a < X ≤ b] = P [a ≤ X ≤ b]
• The corresponding integrals over an interval are not affected
by whether or not the endpoints are included or excluded.
• When we work with continuous random variables, it is usually
not necessary to be precise about specifying whether or not
a range of numbers includes the endpoints. This is quite different from the situation we encounter with discrete random
variables where it is critical to carefully examine the type of
inequality.
10.21. fX is nonnegative and ∫_R fX (x) dx = 1.
Example 10.22. Random variable X has pdf
fX (x) = c e^(−2x) for x > 0, and 0 otherwise.
Find the constant c and sketch the pdf.
Definition 10.23. A continuous random variable is called exponential if its pdf is given by
fX (x) = λ e^(−λx) for x > 0, and 0 for x ≤ 0,
for some λ > 0.
Theorem 10.24. Any nonnegative40 function that integrates to
one is a probability density function (pdf) of some random
variable [9, p.139].
40
or nonnegative a.e.
10.25. Intuition/Interpretation:
The use of the word "density" originated with the analogy to the distribution of matter in space. In physics, any finite volume, no matter how small, has a positive mass, but there is no mass at a single point. A similar description applies to continuous random variables.
Approximately, for a small ∆x,
P [X ∈ [x, x + ∆x]] = ∫_x^{x+∆x} fX (t) dt ≈ fX (x) ∆x.
This is why we call fX the density function.

Figure 13: P [x ≤ X ≤ x + ∆x] is the area of the shaded vertical strip under the density.

In other words, the probability of random variable X taking on a value in a small interval around point c is approximately equal to fX (c) ∆c when ∆c is the length of the interval.
• In fact, fX (x) = lim_{∆x→0} P [x < X ≤ x + ∆x] / ∆x.
• The number fX (x) itself is not a probability. In particular, it does not have to be between 0 and 1.
• fX (c) is a relative measure for the likelihood that random variable X will take a value in the immediate neighborhood of point c.
Stated differently, the pdf fX (x) expresses how densely the probability mass of random variable X is smeared out in the neighborhood of point x. Hence, the name density function.
10.26. Histogram and pdf approximation [22, p 143 and 145]:

Figure 14: From histogram to pdf.
(a) A (probability) histogram is a bar chart that divides the
range of values covered by the samples/measurements into
intervals of the same width, and shows the proportion (relative frequency) of the samples in each interval.
• To make a histogram, you break up the range of values
covered by the samples into a number of disjoint adjacent
intervals each having the same width, say width ∆. The
height of the bar on each interval [j∆, (j + 1)∆) is taken
such that the area of the bar is equal to the proportion
of the measurements falling in that interval (the proportion of measurements within the interval is divided by the
width of the interval to obtain the height of the bar).
• The total area under the histogram is thus standardized/normalized to one.
(b) If you take sufficiently many independent samples from a continuous random variable and make the width ∆ of the base
intervals of the probability histogram smaller and smaller, the
graph of the histogram will begin to look more and more like
the pdf.
(c) Conclusion: A probability density function can be seen as a
“smoothed out” version of a probability histogram
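A small MATLAB sketch of this idea (exponential samples with λ = 1 are used here purely as an illustration):
lambda = 1; N = 1e5;
X = exprnd(1/lambda, N, 1);            % N samples; exprnd takes the mean 1/lambda as its parameter
[counts, centers] = hist(X, 80);       % histogram with 80 equal-width bins
dx = centers(2) - centers(1);
heights = counts/(N*dx);               % rescale so the total bar area equals 1
bar(centers, heights, 1); hold on
xx = linspace(0, max(X), 200);
plot(xx, lambda*exp(-lambda*xx), 'r')  % true exponential pdf for comparison
hold off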
10.3 Expectation and Variance
10.27. Expectation: Suppose X is a continuous random variable
with probability density function fX (x).
EX = ∫_{−∞}^{∞} x fX (x) dx    (21)
E [g(X)] = ∫_{−∞}^{∞} g(x) fX (x) dx    (22)
In particular,
E [X²] = ∫_{−∞}^{∞} x² fX (x) dx,
Var X = ∫_{−∞}^{∞} (x − EX)² fX (x) dx = E [X²] − (EX)².
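For instance, for the exponential pdf of Definition 10.23 (λ = 2 chosen arbitrarily), these integrals can be checked numerically with MATLAB's integral function:
lambda = 2;
f    = @(x) lambda*exp(-lambda*x);               % pdf on (0, inf)
EX   = integral(@(x) x.*f(x),    0, Inf)         % = 1/lambda = 0.5
EX2  = integral(@(x) x.^2.*f(x), 0, Inf);
VarX = EX2 - EX^2                                % = 1/lambda^2 = 0.25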
Example 10.28. For the random variable generated by the rand
command in MATLAB or the rand() command in Excel,
Example 10.29. For the exponential random variable introduced
in Definition 10.23,
10.30. If we compare other characteristics of discrete and continuous random variables, we find that with discrete random variables,
many facts are expressed as sums. With continuous random variables, the corresponding facts are expressed as integrals.
10.31. All of the properties for the expectation and variance of
discrete random variables also work for continuous random variables as well:
(a) Intuition/interpretation of the expected value: As n → ∞,
the average of n independent samples of X will approach EX.
This observation is known as the “Law of Large Numbers”.
(b) For c ∈ R, E [c] = c
(c) For constants a, b, we have E [aX + b] = aEX + b.
(d) E [∑_{i=1}^{n} ci gi (X)] = ∑_{i=1}^{n} ci E [gi (X)].
(e) Var X = E [X²] − (EX)².
(f) Var X ≥ 0.
(g) Var X ≤ E [X²].
(h) Var[aX + b] = a2 Var X.
(i) σaX+b = |a| σX .
10.32. Chebyshev's Inequality:
P [|X − EX| ≥ α] ≤ σ²X / α²,
or equivalently,
P [|X − EX| ≥ nσX] ≤ 1/n².
• This inequality uses the variance to bound the "tail probability" of a random variable.
• Useful only when α > σX
Example 10.33. A circuit is designed to handle a current of 20
mA plus or minus a deviation of less than 5 mA. If the applied
current has mean 20 mA and variance 4 mA2 , use the Chebyshev
inequality to bound the probability that the applied current violates the design parameters.
Let X denote the applied current. Then X is within the design
parameters if and only if |X − 20| < 5. To bound the probability
that this does not happen, write
P [|X − 20| ≥ 5] ≤ Var X / 5² = 4/25 = 0.16.
Hence, the probability of violating the design parameters is at most
16%.
10.34. Interesting applications of expectation:
(a) fX (x) = E [δ (X − x)]
(b) P [X ∈ B] = E [1B (X)]
10.4 Families of Continuous Random Variables
Theorem 10.24 states that any nonnegative function f (x) whose
integral over the interval (−∞, +∞) equals 1 can be regarded as
a probability density function of a random variable. In real-world
applications, however, special mathematical forms naturally show
up. In this section, we introduce a couple families of continuous
random variables that frequently appear in practical applications.
The probability densities of the members of each family all have the
same mathematical form but differ only in one or more parameters.
10.4.1 Uniform Distribution
Definition 10.35. For a uniform random variable on an interval
[a, b], we denote its family by uniform([a, b]) or U([a, b]) or simply
U(a, b). Expressions that are synonymous with “X is a uniform
random variable” are “X is uniformly distributed”, “X has a uniform distribution”, and “X has a uniform density”. This family is
characterized by
fX (x) = 1/(b − a) for a ≤ x ≤ b, and 0 for x < a or x > b.
• The random variable X is just as likely to be near any value
in [a, b] as any other value.
• In MATLAB,
(a) use X = a+(b-a)*rand or X = random(’Uniform’,a,b)
to generate the RV,
(b) use pdf(’Uniform’,x,a,b) and cdf(’Uniform’,x,a,b)
to calculate the pdf and cdf, respectively.
Exercise 10.36. Show that FX (x) = 0 for x < a, (x − a)/(b − a) for a ≤ x ≤ b, and 1 for x > b.
Figure 15: The pdf and cdf for the uniform random variable. [16, Fig. 3.5]
Example 10.37 (F2011). Suppose X is uniformly distributed on the interval (1, 2). (X ∼ U(1, 2).)
(a) Plot the pdf fX (x) of X.
(b) Plot the cdf FX (x) of X.
10.38. The uniform distribution provides a probability model for selecting a point at random from the interval [a, b].
• Use with caution to model a quantity that is known to vary randomly between a and b but about which little else is known.
Example 10.39. [9, Ex. 4.1 p. 140-141] In coherent radio communications, the phase difference between the transmitter and the
receiver, denoted by Θ, is modeled as having a uniform density on
[−π, π].
(a) P [Θ ≤ 0] = 1/2
(b) P [Θ ≤ π/2] = 3/4
Exercise 10.40. Show that EX = (a + b)/2, Var X = (b − a)²/12, and E[X²] = (1/3)(b² + ab + a²).

10.4.2 Gaussian Distribution
10.41. This is the most widely used model for the distribution
of a random variable. When you have many independent random
variables, a fundamental result called the central limit theorem
(CLT) informally says that the sum (or, suitably normalized, the average) of them can often be approximated by a normal distribution.
Definition 10.42. Gaussian random variables:
(a) Often called normal random variables because they occur so
frequently in practice.
(b) In MATLAB, use X = random('Normal',m,σ) or X = σ*randn + m.
(c) fX(x) = (1/(√(2π) σ)) e^(−(1/2)((x−m)/σ)²).
• In Excel, use NORMDIST(x,m,σ,FALSE). In MATLAB, use normpdf(x,m,σ) or pdf('Normal',x,m,σ).
• Figure 16 displays the famous bell-shaped graph of the
Gaussian pdf. This curve is also called the normal curve.
(d) FX(x) has no closed-form expression. However, see 10.48.
• In MATLAB, use normcdf(x,m,σ) or cdf('Normal',x,m,σ).
• In Excel, use NORMDIST(x,m,σ,TRUE).
(e) We write X ∼ N(m, σ²).

Figure 16: The pdf and cdf of N(µ, σ²). [16, Fig. 3.6]
10.43. EX = m and Var X = σ².

10.44. Important probabilities:
P [|X − µ| < σ] = 0.6827;  P [|X − µ| > σ] = 0.3173;
P [|X − µ| > 2σ] = 0.0455;  P [|X − µ| < 2σ] = 0.9545.
These values are illustrated in Figure 19.

Example 10.45. Figure 20 compares several deviation scores and the normal distribution:
(a) Standard scores have a mean of zero and a standard deviation of 1.0.
(b) Scholastic Aptitude Test scores have a mean of 500 and a standard deviation of 100.
109
3.5 The Gaussian random variable and process
(a)
0.6
0.4
Signal amplitude (V)
0.2
0
−0.2
−0.4
−0.6
−0.8
0
0.2
0.4
0.6
0.8
1
t (s)
(b)
4
Histogram
Gaussian fit
Laplacian fit
3.5
3
fx(x) (1/V)
2.5
2
1.5
1
0.5
Fig. 3.14
0
−1
−0.5
0
x (V)
0.5
1
(a) A sample skeletal muscle (emg) signal, and (b) its histogram and pdf fits.
Figure 17: Electrical activity of a skeletal muscle: (a) A sample skeletal muscle
(emg) signal, and (b) its histogram and pdf fits. [16, Fig. 3.14]
Figure 18: Plots of the zero-mean Gaussian pdf for different values of standard deviation, σX. [16, Fig. 3.15]

Figure 19: Probability density function of X ∼ N(µ, σ²).

Figure 20: Comparison of Several Deviation Scores and the Normal Distribution. [Beck, Applying Psychology: Critical and Creative Thinking, © 1992 Prentice-Hall, Inc.; reproduced by permission of Pearson Education, Inc.]
(c) Binet Intelligence Scale41 scores have a mean of 100 and a
standard deviation of 16.
In each case there are 34 percent of the scores between the
mean and one standard deviation, 14 percent between one and
two standard deviations, and 2 percent beyond two standard
deviations. [Source: Beck, Applying Psychology: Critical and
Creative Thinking.]
10.46. N (0, 1) is the standard Gaussian (normal) distribution.
• In Excel, use NORMSINV(RAND()).
In MATLAB, use randn.
• The standard normal cdf is denoted by Φ(z).
◦ It inherits all properties of cdf.
◦ Moreover, note that Φ(−z) = 1 − Φ(z).
10.47. Relationship between N (0, 1) and N (m, σ 2 ).
(a) An arbitrary Gaussian random variable with mean m and
variance σ 2 can be represented as σZ +m, where Z ∼ N (0, 1).
41
Alfred Binet, who devised the first general aptitude test at the beginning of the 20th
century, defined intelligence as the ability to make adaptations. The general purpose of the
test was to determine which children in Paris could benefit from school. Binet's test, like its
subsequent revisions, consists of a series of progressively more difficult tasks that children of
different ages can successfully complete. A child who can solve problems typically solved by
children at a particular age level is said to have that mental age. For example, if a child can
successfully do the same tasks that an average 8-year-old can do, he or she is said to have a
mental age of 8. The intelligence quotient, or IQ, is defined by the formula:
IQ = 100 × (Mental Age/Chronological Age)
There has been a great deal of controversy in recent years over what intelligence tests measure.
Many of the test items depend on either language or other specific cultural experiences for
correct answers. Nevertheless, such tests can rather effectively predict school success. If
school requires language and the tests measure language ability at a particular point of time
in a child's life, then the test is a better-than-chance predictor of school performance.
This relationship can be used to generate a general Gaussian RV from a standard Gaussian RV.
(b) If X ∼ N(m, σ²), the random variable
Z = (X − m)/σ
is a standard normal random variable. That is, Z ∼ N(0, 1).
• Creating a new random variable by this transformation is referred to as standardizing.
• The standardized variable is called the "standard score" or "z-score".
10.48. It is impossible to express the integral of a Gaussian PDF
between non-infinite limits (e.g., (20)) as a function that appears
on most scientific calculators.
• An old but still popular technique to find integrals of the
Gaussian PDF is to refer to tables that have been obtained
by numerical integration.
◦ One such table is the table that lists Φ(z) for many values
of positive z.
◦ For X ∼ N(m, σ²), we can show that the CDF of X can be calculated by
FX(x) = Φ((x − m)/σ).
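In MATLAB the table lookup is unnecessary because normcdf evaluates Φ directly. A small sketch (the parameters m = 2 and σ = 3 are made-up values) computes an interval probability from FX(x) = Φ((x − m)/σ) and recovers the first number in 10.44:

  m = 2; sigma = 3;                        % assumed parameters of X ~ N(m, sigma^2)
  F_X = @(x) normcdf((x - m)/sigma);       % cdf of X via 10.48
  p = F_X(m + sigma) - F_X(m - sigma)      % P[|X - m| < sigma], approx 0.6827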
Example 10.49. Suppose Z ∼ N (0, 1). Evaluate the following
probabilities.
(a) P [−1 ≤ Z ≤ 1]
(b) P [−2 ≤ Z ≤ 2]
Example 10.50. Suppose X ∼ N (1, 2). Find P [1 ≤ X ≤ 2].
10.51. Q-function: Q(z) = ∫_z^∞ (1/√(2π)) e^(−x²/2) dx corresponds to P[X > z] where X ∼ N(0, 1); that is, Q(z) is the probability of the "tail" of N(0, 1). The Q function is then a complementary cdf (ccdf).
Figure 21: Q-function
(a) Q is a decreasing function with Q(0) = 1/2.
(b) Q (−z) = 1 − Q (z) = Φ(z)
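MATLAB has no built-in Q function, but one can be defined from the complementary error function of 10.52. A sketch checking the two properties above (the test value z = 1 is arbitrary):

  Q = @(z) 0.5*erfc(z/sqrt(2));   % Q(z) = P[X > z] for X ~ N(0,1)
  Q(0)                            % = 0.5, property (a)
  z = 1;
  [Q(-z), 1 - Q(z)]               % equal, property (b)
  [Q(z), 1 - normcdf(z)]          % Q is the ccdf of N(0,1)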
10.52. Error function (MATLAB): erf(z) = (2/√π) ∫_0^z e^(−x²) dx = 1 − 2Q(√2 z).
(a) It is an odd function of z.
(b) For z ≥ 0, it corresponds to P[|X| < z] where X ∼ N(0, 1/2).
(c) lim_{z→∞} erf(z) = 1
(d) erf(−z) = −erf(z)
(e) Φ(x) = (1/2)[1 + erf(x/√2)] = (1/2) erfc(−x/√2)
(f) The complementary error function:
erfc(z) = 1 − erf(z) = 2Q(√2 z) = (2/√π) ∫_z^∞ e^(−x²) dx
Remark: For X ∼ N(µ, σ²), the central moments satisfy E[(X − µ)^k] = 0 when k is odd and 1·3·5···(k − 1) σ^k when k is even [Papoulis p. 111]; also Var[X²] = 4µ²σ² + 2σ⁴.

Figure 22: erf-function and Q-function
10.4.3 Exponential Distribution
Definition 10.53. The exponential distribution is denoted by
E (λ).
(a) λ > 0 is a parameter of the distribution, often called the rate
parameter.
(b) Characterized by
• fX(x) = λ e^(−λx),  x > 0,
          0,          x ≤ 0.
• FX(x) = 1 − e^(−λx),  x > 0,
          0,            x ≤ 0.
• Survival-, survivor-, or reliability-function:
(c) MATLAB:
• X = exprnd(1/λ) or random(’exp’,1/λ)
• fX (x) = exppdf(x,1/λ) or pdf(’exp’,x,1/λ)
• FX (x) = expcdf(x,1/λ) or cdf(’exp’,x,1/λ)
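A short sketch (the rate λ = 2 and the point x = 0.7 are arbitrary choices) generating exponential samples and checking the mean and cdf formulas above; note that MATLAB parameterizes the exponential family by its mean 1/λ:

  lambda = 2; n = 1e6;            % assumed rate and sample size
  X = exprnd(1/lambda, n, 1);     % samples of X ~ E(lambda)
  [mean(X), 1/lambda]             % empirical vs. theoretical mean (see 10.57)
  x = 0.7;
  [mean(X <= x), expcdf(x, 1/lambda), 1 - exp(-lambda*x)]   % three ways to get F_X(x)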
Example 10.54. Suppose X ∼ E(λ), find P [1 < X < 2].
Exercise 10.55. Exponential random variable as a continuous
version of geometric random variable: Suppose X ∼ E(λ). Show that ⌊X⌋ ∼ G0(e^(−λ)) and ⌈X⌉ ∼ G1(e^(−λ)).
Example 10.56. The exponential distribution is intimately related to the Poisson process. It is often used as a probability
model for the (waiting) time until a “rare” event occurs.
• time elapsed until the next earthquake in a certain region
• decay time of a radioactive particle
• time between independent events such as arrivals at a service
facility or arrivals of customers in a shop.
• duration of a cell-phone call
• time it takes a computer network to transmit a message from
one node to another.
10.57. EX = 1/λ.
Example 10.58. Phone Company A charges $0.15 per minute
for telephone calls. For any fraction of a minute at the end of
a call, they charge for a full minute. Phone Company B also
charges $0.15 per minute. However, Phone Company B calculates
its charge based on the exact duration of a call. If T , the duration
of a call in minutes, is exponential with parameter λ = 1/3, what
are the expected revenues per call E [RA ] and E [RB ] for companies
A and B?
Solution: First, note that ET = 1/λ = 3. Hence,
E[RB] = E[0.15 × T] = 0.15 ET = $0.45,
and
E[RA] = E[0.15 × ⌈T⌉] = 0.15 E[⌈T⌉].
Now, recall that ⌈T⌉ ∼ G1(e^(−λ)). Hence, E[⌈T⌉] = 1/(1 − e^(−λ)) ≈ 3.53. Therefore,
E[RA] = 0.15 E[⌈T⌉] ≈ 0.5292.
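The two answers can also be checked by simulation; a sketch (using mean 1/λ = 3 as in the example, with ceil implementing Company A's rounding up to a full minute):

  n = 1e6;
  T = exprnd(3, n, 1);               % call durations, T ~ E(1/3), so ET = 3 minutes
  R_B = 0.15*T;                      % exact-duration billing
  R_A = 0.15*ceil(T);                % full-minute billing
  [mean(R_B), 0.45]                  % should be close to $0.45
  [mean(R_A), 0.15/(1 - exp(-1/3))]  % should be close to $0.5292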
10.59. Memoryless property : The exponential r.v. is the only
continuous42 r.v. on [0, ∞) that satisfies the memoryless property:
P [X > s + x |X > s] = P [X > x]
for all x > 0 and all s > 0 [18, p. 157–159]. In words, the future
is independent of the past. The fact that it hasn’t happened yet,
tells us nothing about how much longer it will take before it does
happen.
• Imagining that the exponentially distributed random variable
X represents the lifetime of an item, the residual life of an item
has the same exponential distribution as the original lifetime,
regardless of how long the item has been already in use. In
42
Among discrete random variables, the geometric random variables satisfy the memoryless property.
other words, there is no deterioration/degradation over time.
If it is still currently working after 20 years of use, then today,
its condition is “just like new”.
• In particular, suppose we define the set B+x to be {x + b : b ∈ B}.
For any x > 0 and set B ⊂ [0, ∞), we have
P [X ∈ B + x|X > x] = P [X ∈ B]
because
P[X ∈ B + x | X > x] = P[X ∈ B + x] / P[X > x] = (∫_{B+x} λ e^(−λt) dt) / e^(−λx)
 = (∫_B λ e^(−λ(τ+x)) dτ) / e^(−λx)    (substituting τ = t − x)
 = ∫_B λ e^(−λτ) dτ = P[X ∈ B].

10.5 Function of Continuous Random Variables: SISO
Reconsider the derived random variable Y = g(X).
Recall that we can find EY easily by (22):
EY = E[g(X)] = ∫_R g(x) fX(x) dx.
However, there are cases when we have to evaluate probability directly involving the random variable Y or find fY(y) directly.
Recall that for discrete random variables, it is easy to find pY(y) by adding pX(x) over all x such that g(x) = y:
pY(y) = Σ_{x: g(x)=y} pX(x).    (23)
For continuous random variables, it turns out that we can’t43 simply integrate the pdf of X to get the pdf of Y .
43
If you applied Equation (23) to continuous random variables, what you would get is 0 = 0, which is true but neither interesting nor useful.
10.60. For Y = g(X), if you want to find fY (y), the following
two-step procedure will always work and is easy to remember:
(a) Find the cdf FY (y) = P [Y ≤ y].
(b) Compute the pdf from the cdf by "finding the derivative":
fY(y) = (d/dy) FY(y) (as described in 10.13).
10.61. Linear Transformation: Suppose Y = aX + b. Then, the cdf of Y is given by
FY(y) = P[Y ≤ y] = P[aX + b ≤ y] = P[X ≤ (y − b)/a],  a > 0,
                                    P[X ≥ (y − b)/a],  a < 0.
Now, by definition, we know that
P[X ≤ (y − b)/a] = FX((y − b)/a),
and
P[X ≥ (y − b)/a] = P[X > (y − b)/a] + P[X = (y − b)/a] = 1 − FX((y − b)/a) + P[X = (y − b)/a].
For continuous random variable, P[X = (y − b)/a] = 0. Hence,
FY(y) = FX((y − b)/a),       a > 0,
        1 − FX((y − b)/a),   a < 0.
Finally, the fundamental theorem of calculus and the chain rule give
fY(y) = (d/dy) FY(y) = (1/a) fX((y − b)/a),    a > 0,
                       −(1/a) fX((y − b)/a),   a < 0.
Note that we can further simplify the final formula by using the |·| function:
fY(y) = (1/|a|) fX((y − b)/a),  a ≠ 0.    (24)
Graphically, to get the plot of fY, we scale fX horizontally by a factor of a (which also reflects the graph when a < 0), scale it vertically by a factor of 1/|a|, and then shift it to the right by b.
Of course, if a = 0, then we get the uninteresting degenerate random variable Y ≡ b.
10.62. Suppose X ∼ N(m, σ²) and Y = aX + b for some constants a and b. Then, we can use (24) to show that Y ∼ N(am + b, a²σ²).
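A simulation sketch of this fact (the values m = 1, σ = 2, a = −3, b = 4 are arbitrary): the samples of Y = aX + b should have mean am + b, variance a²σ², and a histogram matching the N(am + b, a²σ²) density.

  m = 1; sigma = 2; a = -3; b = 4; n = 1e5;   % arbitrary parameters
  X = sigma*randn(n,1) + m;                   % X ~ N(m, sigma^2)
  Y = a*X + b;                                % linear transformation
  [mean(Y), a*m + b; var(Y), a^2*sigma^2]     % empirical vs. theoretical mean/variance
  histogram(Y, 'Normalization', 'pdf'); hold on;
  y = linspace(min(Y), max(Y), 200);
  plot(y, normpdf(y, a*m + b, abs(a)*sigma), 'LineWidth', 2); hold off;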
Example 10.63. Amplitude modulation in certain communication systems can be accomplished using various nonlinear devices
such as a semiconductor diode. Suppose we model the nonlinear
device by the function Y = X 2 . If the input X is a continuous
random variable, find the density of the output Y = X 2 .
Exercise 10.64 (F2011). Suppose X is uniformly distributed on
the interval (1, 2). (X ∼ U(1, 2).) Let Y = 1/X².
(a) Find fY (y).
(b) Find EY .
Exercise 10.65 (F2011). Consider the function
g(x) = x,   x ≥ 0,
       −x,  x < 0.
Suppose Y = g(X), where X ∼ U(−2, 2).
Remark: The function g operates like a full-wave rectifier in
that if a positive input voltage X is applied, the output is Y = X,
while if a negative input voltage X is applied, the output is Y =
−X.
(a) Find EY .
(b) Plot the cdf of Y .
(c) Find the pdf of Y
P[X ∈ B]:
  Discrete: Σ_{x∈B} pX(x)
  Continuous: ∫_B fX(x) dx
P[X = x]:
  Discrete: pX(x) = F(x) − F(x⁻)
  Continuous: 0
Interval prob.:
  Discrete: P_X((a, b]) = F(b) − F(a);  P_X([a, b]) = F(b) − F(a⁻);  P_X([a, b)) = F(b⁻) − F(a⁻);  P_X((a, b)) = F(b⁻) − F(a)
  Continuous: P_X((a, b]) = P_X([a, b]) = P_X([a, b)) = P_X((a, b)) = ∫_a^b fX(x) dx = F(b) − F(a)
EX:
  Discrete: Σ_x x pX(x)
  Continuous: ∫_{−∞}^{+∞} x fX(x) dx
For Y = g(X):
  Discrete: pY(y) = Σ_{x: g(x)=y} pX(x)
  Continuous: fY(y) = (d/dy) P[g(X) ≤ y]; alternatively, fY(y) = Σ_k fX(x_k)/|g′(x_k)|, where the x_k are the real-valued roots of the equation y = g(x)
For Y = g(X), P[Y ∈ B]:
  Discrete: Σ_{x: g(x)∈B} pX(x)
  Continuous: ∫_{x: g(x)∈B} fX(x) dx
E[g(X)]:
  Discrete: Σ_x g(x) pX(x)
  Continuous: ∫_{−∞}^{+∞} g(x) fX(x) dx
E[X²]:
  Discrete: Σ_x x² pX(x)
  Continuous: ∫_{−∞}^{+∞} x² fX(x) dx
Var X:
  Discrete: Σ_x (x − EX)² pX(x)
  Continuous: ∫_{−∞}^{+∞} (x − EX)² fX(x) dx

Table 4: Important Formulas for Discrete and Continuous Random Variables
11 Multiple Random Variables
One is often interested not only in individual random variables, but
also in relationships between two or more random variables. Furthermore, one often wishes to make inferences about one random
variable on the basis of observations of other random variables.
Example 11.1. If the experiment is the testing of a new medicine,
the researcher might be interested in cholesterol level, blood pressure, and the glucose level of a test person.
11.1 A Pair of Discrete Random Variables
In this section, we consider two discrete random variables, say X
and Y , simultaneously.
11.2. The analysis here is different from Section 9.2 in two main
aspects. First, there may be no deterministic relationship (such as
Y = g(X)) between the two random variables. Second, we want
to look at both random variables as a whole, not just X alone or
Y alone.
Example 11.3. Communication engineers may be interested in
the input X and output Y of a communication channel.
Example 11.4. Of course, to rigorously define (any) random variables, we need to go back to the sample space Ω. Recall Example
7.4 where we considered several random variables defined on the
sample space Ω = {1, 2, 3, 4, 5, 6} where the outcomes are equally
likely. In that example, we define X(ω) = ω and Y (ω) = (ω − 3)2 .
Example 11.5. Consider the scores of 20 students below:
10, 9, 10, 9, 9, 10, 9, 10, 10, 9 (Room #1),  1, 3, 4, 6, 5, 5, 3, 3, 1, 3 (Room #2).
The first ten scores are from (ten) students in room #1. The last
10 scores are from (ten) students in room #2.
Suppose we have a score report card for each student. Then,
in total, we have 20 report cards.
Figure 23: In Example 11.5, we pick a report card randomly from a pile of
cards.
I pick one report card up randomly. Let X be the score on that
card.
• What is the chance that X > 5? (Ans: P [X > 5] = 11/20.)
• What is the chance that X = 10? (Ans: pX (10) = P [X = 10] =
5/20 = 1/4.)
Now, let the random variable Y denote the room# of the student
whose report card is picked up.
• What is the probability that X = 10 and Y = 2?
• What is the probability that X = 10 and Y = 1?
• What is the probability that X > 5 and Y = 1?
• What is the probability that X > 5 and Y = 2?
Now suppose someone informs me that the report card which I
picked up is from a student in room #1. (He may be able to tell
this by the color of the report card of which I have no knowledge.)
I now have an extra information that Y = 1.
• What is the probability that X > 5 given that Y = 1?
• What is the probability that X = 10 given that Y = 1?
11.6. Recall that, in probability, “,” means “and”. For example,
P [X = x, Y = y] = P [X = x and Y = y]
and
P [3 ≤ X < 4, Y < 1] = P [3 ≤ X < 4 and Y < 1]
= P [X ∈ [3, 4) and Y ∈ (−∞, 1)] .
In general,
[“Some condition(s) on X”,“Some condition(s) on Y ”]
is the same as the intersection of the individual statements:
[“Some condition(s) on X”] ∩ [“Some condition(s) on Y ”]
which simply means both statements happen.
More technically,
[X ∈ B, Y ∈ C] = [X ∈ B and Y ∈ C] = [X ∈ B] ∩ [Y ∈ C]
and
P [X ∈ B, Y ∈ C] = P [X ∈ B and Y ∈ C]
= P ([X ∈ B] ∩ [Y ∈ C]) .
Remark: Linking back to the original sample space, this shorthand actually says
[X ∈ B, Y ∈ C] = [X ∈ B and Y ∈ C]
 = {ω ∈ Ω : X(ω) ∈ B and Y(ω) ∈ C}
 = {ω ∈ Ω : X(ω) ∈ B} ∩ {ω ∈ Ω : Y(ω) ∈ C}
 = [X ∈ B] ∩ [Y ∈ C].
11.7. The concept of conditional probability can be straightforwardly applied to discrete random variables. For example,
P [“Some condition(s) on X” | “Some condition(s) on Y ”] (25)
is the conditional probability P (A|B) where
A = [“Some condition(s) on X”] and
B = [“Some condition(s) on Y ”].
Recall that P(A|B) = P(A ∩ B)/P(B). Therefore,
P[X = x | Y = y] = P[X = x and Y = y] / P[Y = y],
and
P[3 ≤ X < 4 | Y < 1] = P[3 ≤ X < 4 and Y < 1] / P[Y < 1].
More generally, (25) is
P([“Some condition(s) on X”] ∩ [“Some condition(s) on Y”]) / P([“Some condition(s) on Y”])
 = P([“Some condition(s) on X”, “Some condition(s) on Y”]) / P([“Some condition(s) on Y”])
 = P[“Some condition(s) on X”, “Some condition(s) on Y”] / P[“Some condition(s) on Y”].
More technically,
P[X ∈ B | Y ∈ C] = P([X ∈ B] | [Y ∈ C]) = P([X ∈ B] ∩ [Y ∈ C]) / P([Y ∈ C]) = P[X ∈ B, Y ∈ C] / P[Y ∈ C].
Definition 11.8. Joint pmf : If X and Y are two discrete random variables (defined on a same sample space with probability
measure P ), the function pX,Y (x, y) defined by
pX,Y (x, y) = P [X = x, Y = y]
is called the joint probability mass function of X and Y .
(a) We can visualize the joint pmf via stem plot. See Figure 24.
(b) To evaluate the probability for a statement that involves both
X and Y random variables:
We first find all pairs (x, y) that satisfy the condition(s) in the statement, and then add up the corresponding values of the joint pmf.
More technically, we can then evaluate P[(X, Y) ∈ R] by
P[(X, Y) ∈ R] = Σ_{(x,y): (x,y)∈R} pX,Y(x, y).
Example 11.9 (F2011). Consider random variables X and Y whose joint pmf is given by
pX,Y(x, y) = c(x + y),  x ∈ {1, 3} and y ∈ {2, 4},
             0,         otherwise.
(a) Check that c = 1/20.
(b) Find P[X² + Y² = 13].
In most situations, it is much more convenient to focus on the
“important” part of the joint pmf. To do this, we usually present
the joint pmf (and the conditional pmf) in their matrix forms:
Definition 11.10. When both X and Y take finitely many values (both have finite supports), say SX = {x1, . . . , xm} and SY = {y1, . . . , yn}, respectively, we can arrange the probabilities pX,Y(xi, yj) in an m × n matrix

  [ pX,Y(x1, y1)  pX,Y(x1, y2)  · · ·  pX,Y(x1, yn) ]
  [ pX,Y(x2, y1)  pX,Y(x2, y2)  · · ·  pX,Y(x2, yn) ]      (26)
  [      ...           ...       ...       ...      ]
  [ pX,Y(xm, y1)  pX,Y(xm, y2)  · · ·  pX,Y(xm, yn) ]

• We shall call this matrix the joint pmf matrix.
• The sum of all the entries in the matrix is one.
Figure 24: Example of the plot of a joint pmf. [9, Fig. 2.8]
• pX,Y(x, y) = 0 if x ∉ SX or y ∉ SY.44 In other words, we don't have to consider x and y outside the supports of X and Y, respectively.
44
To see this, note that pX,Y (x, y) can not exceed pX (x) because P (A ∩ B) ≤ P (A). Now,
suppose at x = a, we have pX (a) = 0. Then pX,Y (a, y) must also = 0 for any y because it can
not exceed pX (a) = 0. Similarly, suppose at y = a, we have pY (a) = 0. Then pX,Y (x, a) = 0
for any x.
11.11. From the joint pmf, we can find pX(x) and pY(y) by
pX(x) = Σ_y pX,Y(x, y)    (27)
pY(y) = Σ_x pX,Y(x, y)    (28)
In this setting, pX(x) and pY(y) are called the marginal pmfs (to distinguish them from the joint one).
(a) Suppose we have the joint pmf matrix in (26). Then, the sum of the entries in the ith row is45 pX(xi), and the sum of the entries in the jth column is pY(yj):
pX(xi) = Σ_{j=1}^{n} pX,Y(xi, yj)  and  pY(yj) = Σ_{i=1}^{m} pX,Y(xi, yj).
(b) In MATLAB, suppose we save the joint pmf matrix as P_XY; then the marginal pmf (row) vectors p_X and p_Y can be found by
p_X = (sum(P_XY,2))'
p_Y = sum(P_XY,1)
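For instance, with a hypothetical 2 × 3 joint pmf matrix (the numbers below are made up and are not taken from any example in this section):

  P_XY = [0.1 0.2 0.1;         % hypothetical joint pmf matrix (rows: x values,
          0.3 0.2 0.1];        %  columns: y values); entries sum to 1
  p_X = (sum(P_XY,2))'         % marginal pmf of X: [0.4 0.6]
  p_Y = sum(P_XY,1)            % marginal pmf of Y: [0.4 0.4 0.2]
  sum(P_XY(:))                 % total probability, should be 1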
Example 11.12. Consider the following joint pmf matrix
45
To see this, we consider A = [X = xi] and a collection defined by Bj = [Y = yj] and B0 = [Y ∉ SY]. Note that the collection B0, B1, . . . , Bn partitions Ω. So, P(A) = Σ_{j=0}^{n} P(A ∩ Bj). Of course, because the support of Y is SY, we have P(A ∩ B0) = 0. Hence, the sum can start at j = 1 instead of j = 0.
Definition 11.13. The conditional pmf of X given Y is defined
as
pX|Y (x|y) = P [X = x|Y = y]
which gives
pX,Y (x, y) = pX|Y (x|y)pY (y) = pY |X (y|x)pX (x).
(29)
11.14. Equation (29) is quite important in practice. In most
cases, systems are naturally defined/given/studied in terms of their
conditional probabilities, say pY |X (y|x). Therefore, it is important
the we know how to construct the joint pmf from the conditional
pmf.
Example 11.15. Consider a binary symmetric channel. Suppose
the input X to the channel is Bernoulli(0.3). At the output Y of
this channel, the crossover (bit-flipped) probability is 0.1. Find
the joint pmf pX,Y (x, y) of X and Y .
Exercise 11.16. Toss-and-Roll Game:
Step 1 Toss a fair coin. Define X by
X = 1, if result = H,
    0, if result = T.
Step 2 You have two dice, Dice 1 and Dice 2. Dice 1 is fair. Dice 2 is unfair with p(1) = p(2) = p(3) = 2/9 and p(4) = p(5) = p(6) = 1/9.
(i) If X = 0, roll Dice 1.
(ii) If X = 1, roll Dice 2.
Record the result as Y .
Find the joint pmf pX,Y (x, y) of X and Y .
Exercise 11.17 (F2011). Continue from Example 11.9. Random variables X and Y have the following joint pmf
pX,Y(x, y) = c(x + y),  x ∈ {1, 3} and y ∈ {2, 4},
             0,         otherwise.
(a) Find pX(x).
(b) Find EX.
(c) Find pY|X(y|1). Note that your answer should be of the form
pY|X(y|1) = ?,  y = 2,
            ?,  y = 4,
            0,  otherwise.
(d) Find pY |X (y|3).
Definition 11.18. The joint cdf of X and Y is defined by
FX,Y (x, y) = P [X ≤ x, Y ≤ y] .
Definition 11.19. Two random variables X and Y are said to be
identically distributed if, for every B, P [X ∈ B] = P [Y ∈ B].
Example 11.20. Let X ∼ Bernoulli(p). Let Y = X and Z = 1 −
X. Then, all of these random variables are identically distributed.
11.21. The following statements are equivalent:
(a) Random variables X and Y are identically distributed .
(b) For every B, P [X ∈ B] = P [Y ∈ B]
(c) pX (c) = pY (c) for all c
(d) FX (c) = FY (c) for all c
Definition 11.22. Two random variables X and Y are said to be
independent if the events [X ∈ B] and [Y ∈ C] are independent
for all sets B and C.
11.23. The following statements are equivalent:
(a) Random variables X and Y are independent.
(b) [X ∈ B] ⫫ [Y ∈ C] for all B, C.
(c) P [X ∈ B, Y ∈ C] = P [X ∈ B] × P [Y ∈ C] for all B, C.
(d) pX,Y (x, y) = pX (x) × pY (y) for all x, y.
(e) FX,Y (x, y) = FX (x) × FY (y) for all x, y.
Definition 11.24. Two random variables X and Y are said to be
independent and identically distributed (i.i.d.) if X and
Y are both independent and identically distributed.
11.25. Being identically distributed does not imply independence.
Similarly, being independent does not imply being identically distributed.
Example 11.26. Roll a dice. Let X be the result. Set Y = X.
Example 11.27. Suppose the pmf of a random variable X is given by
pX(x) = 1/4,  x = 3,
        α,    x = 4,
        0,    otherwise.
Let Y be another random variable. Assume that X and Y are
i.i.d.
Find
(a) α,
(b) the pmf of Y , and
(c) the joint pmf of X and Y .
Example 11.28. Consider a pair of random variables X and Y whose joint pmf is given by
pX,Y(x, y) = 1/15,  x = 3, y = 1,
             2/15,  x = 4, y = 1,
             4/15,  x = 3, y = 3,
             β,     x = 4, y = 3,
             0,     otherwise.
(a) Are X and Y identically distributed?
(b) Are X and Y independent?
11.2 Extending the Definitions to Multiple RVs
Definition 11.29. Joint pmf:
pX1 ,X2 ,...,Xn (x1 , x2 , . . . , xn ) = P [X1 = x1 , X2 = x2 , . . . , Xn = xn ] .
Joint cdf:
FX1 ,X2 ,...,Xn (x1 , x2 , . . . , xn ) = P [X1 ≤ x1 , X2 ≤ x2 , . . . , Xn ≤ xn ] .
11.30. Marginal pmf:
Definition 11.31. Identically distributed random variables:
The following statements are equivalent.
(a) Random variables X1 , X2 , . . . are identically distributed
(b) For every B, P [Xj ∈ B] does not depend on j.
(c) pXi (c) = pXj (c) for all c, i, j.
(d) FXi (c) = FXj (c) for all c, i, j.
Definition 11.32. Independence among finite number of random variables: The following statements are equivalent.
(a) X1 , X2 , . . . , Xn are independent
(b) [X1 ∈ B1 ], [X2 ∈ B2 ], . . . , [Xn ∈ Bn ] are independent, for all
B1 , B2 , . . . , Bn .
(c) P[Xi ∈ Bi, ∀i] = Π_{i=1}^{n} P[Xi ∈ Bi], for all B1, B2, . . . , Bn.
(d) pX1,X2,...,Xn(x1, x2, . . . , xn) = Π_{i=1}^{n} pXi(xi) for all x1, x2, . . . , xn.
(e) FX1,X2,...,Xn(x1, x2, . . . , xn) = Π_{i=1}^{n} FXi(xi) for all x1, x2, . . . , xn.
Example 11.33. Toss a coin n times. For the ith toss, let
Xi = 1, if H happens on the ith toss,
     0, if T happens on the ith toss.
We then have a collection of i.i.d. random variables X1 , X2 , X3 , . . . , Xn .
Example 11.34. Roll a dice n times. Let Ni be the result of the
ith roll. We then have another collection of i.i.d. random variables
N1 , N2 , N3 , . . . , Nn .
Example 11.35. Let X1 be the result of tossing a coin. Set X2 =
X3 = · · · = Xn = X1 .
11.36. If X1 , X2 , . . . , Xn are independent, then so is any subcollection of them.
11.37. For i.i.d. Xi ∼ Bernoulli(p), Y = X1 + X2 + · · · + Xn is
B(n, p).
Definition 11.38. A pairwise independent collection of random variables is a collection of random variables any two of which
are independent.
(a) Any collection of (mutually) independent random variables is
pairwise independent
(b) Some pairwise independent collections are not independent.
See Example (11.39).
Example 11.39. Suppose X, Y, and Z have the following joint probability distribution: pX,Y,Z(x, y, z) = 1/4 for (x, y, z) ∈ {(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)}. This, for example, can be constructed by starting with independent X and Y that are Bernoulli(1/2). Then set Z = X ⊕ Y = (X + Y) mod 2.
(a) X, Y, Z are pairwise independent.
(b) X, Y, Z are not independent.
11.3 Function of Discrete Random Variables
11.40. Recall that for discrete random variable X, the pmf of a derived random variable Y = g(X) is given by
pY(y) = Σ_{x: g(x)=y} pX(x).
Similarly, for discrete random variables X and Y, the pmf of a derived random variable Z = g(X, Y) is given by
pZ(z) = Σ_{(x,y): g(x,y)=z} pX,Y(x, y).
Example 11.41. Suppose the joint pmf of X and Y is given by
pX,Y(x, y) = 1/15,  x = 0, y = 0,
             2/15,  x = 1, y = 0,
             4/15,  x = 0, y = 1,
             8/15,  x = 1, y = 1,
             0,     otherwise.
Let Z = X + Y. Find the pmf of Z.
Exercise 11.42 (F2011). Continue from Example 11.9. Let Z = X + Y.
(a) Find the pmf of Z.
(b) Find EZ.
11.43. In general, when Z = X + Y,
pZ(z) = Σ_{(x,y): x+y=z} pX,Y(x, y) = Σ_y pX,Y(z − y, y) = Σ_x pX,Y(x, z − x).
Furthermore, if X and Y are independent,
pZ(z) = Σ_{(x,y): x+y=z} pX(x) pY(y)    (30)
      = Σ_y pX(z − y) pY(y) = Σ_x pX(x) pY(z − x).    (31)
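Equation (31) is a discrete convolution, so MATLAB's conv gives the pmf of an independent sum directly. A sketch for the sum of two independent fair dice:

  p = ones(1,6)/6;             % pmf of one fair die on {1,...,6}
  pZ = conv(p, p);             % pmf of Z = X + Y for independent X, Y
  z  = 2:12;                   % corresponding support of Z
  [z; pZ]                      % e.g., P[Z = 7] = 6/36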
Example 11.44. Suppose Λ1 ∼ P(λ1 ) and Λ2 ∼ P(λ2 ) are independent. Let Λ = Λ1 +Λ2 . Use (31) to show46 that Λ ∼ P(λ1 +λ2 ).
First, note that pΛ (`) would be positive only on nonnegative
integers because a sum of nonnegative integers (Λ1 and Λ2 ) is still
a nonnegative integer. So, the support of Λ is the same as the
support for Λ1 and Λ2 . Now, we know, from (31), that
X
P [Λ = `] = P [Λ1 + Λ2 = `] =
P [Λ1 = i] P [Λ2 = ` − i]
i
Of course, we are interested in ` that is a nonnegative integer.
The summation runs over i = 0, 1, 2, . . .. Other values of i would
make P [Λ1 = i] = 0. Note also that if i > `, then ` − i < 0 and
P [Λ2 = ` − i] = 0. Hence, we conclude that the index i can only
46
Remark: You may feel that simplifying the sum in this example (and in Exercise 11.45) is difficult and tedious. In Section 13, we will introduce another technique which will make
the answer obvious. The idea is to realize that (31) is a convolution and hence we can use
Fourier transform to work with a product in another domain.
be integers from 0 to ℓ:
P[Λ = ℓ] = Σ_{i=0}^{ℓ} e^(−λ1) (λ1^i / i!) · e^(−λ2) (λ2^(ℓ−i) / (ℓ − i)!)
 = e^(−(λ1+λ2)) (1/ℓ!) Σ_{i=0}^{ℓ} (ℓ! / (i! (ℓ − i)!)) λ1^i λ2^(ℓ−i)
 = e^(−(λ1+λ2)) (λ1 + λ2)^ℓ / ℓ!,
where the last equality is from the binomial theorem. Hence, the sum of two independent Poisson random variables is still Poisson!
pΛ(ℓ) = e^(−(λ1+λ2)) (λ1 + λ2)^ℓ / ℓ!,  ℓ ∈ {0, 1, 2, . . .},
        0,                              otherwise.
Exercise 11.45. Suppose B1 ∼ B(n1 , p) and B2 ∼ B(n2 , p) are
independent. Let B = B1 + B2 . Use (31) to show that B ∼
B(n1 + n2 , p).
11.4 Expectation of Function of Discrete Random Variables
11.46. Recall that the expected value of "any" function g of a discrete random variable X can be calculated from
E[g(X)] = Σ_x g(x) pX(x).
Similarly47, the expected value of "any" function g of two discrete random variables X and Y can be calculated from
E[g(X, Y)] = Σ_x Σ_y g(x, y) pX,Y(x, y).
47
Again, these are called the law/rule of the lazy statistician (LOTUS) [22, Thm 3.6
p 48],[9, p. 149] because it is so much easier to use the above formula than to first find the
pmf of g(X) or g(X, Y ). It is also called substitution rule [21, p 271].
P[X ∈ B] = Σ_{x∈B} pX(x)
P[(X, Y) ∈ R] = Σ_{(x,y): (x,y)∈R} pX,Y(x, y)
Joint to Marginal (Law of Total Prob.): pX(x) = Σ_y pX,Y(x, y);  pY(y) = Σ_x pX,Y(x, y)
P[X > Y] = Σ_x Σ_{y: y<x} pX,Y(x, y) = Σ_y Σ_{x: x>y} pX,Y(x, y)
P[X = Y] = Σ_x pX,Y(x, x)
X ⫫ Y: pX,Y(x, y) = pX(x) pY(y) for all x, y
Conditional: pX|Y(x|y) = pX,Y(x, y) / pY(y)
E[g(X, Y)] = Σ_x Σ_y g(x, y) pX,Y(x, y)

Table 5: Joint pmf: A Summary
11.47. E [·] is a linear operator: E [aX + bY ] = aEX + bEY .
(a) Homogeneous: E [cX] = cEX
(b) Additive: E [X + Y ] = EX + EY
(c) Extension: E[Σ_{i=1}^{n} ci gi(Xi)] = Σ_{i=1}^{n} ci E[gi(Xi)].
Example 11.48. Recall from 11.37 that when i.i.d. Xi ∼ Bernoulli(p), Y = X1 + X2 + · · · + Xn is B(n, p). Also, from Example 9.4, we have EXi = p. Hence,
EY = E[Σ_{i=1}^{n} Xi] = Σ_{i=1}^{n} E[Xi] = Σ_{i=1}^{n} p = np.
Therefore, the expectation of a binomial random variable with parameters n and p is np.
Example 11.49. A binary communication link has bit-error probability p. What is the expected number of bit errors in a transmission of n bits?
Theorem 11.50 (Expectation and Independence). Two random
variables X and Y are independent if and only if
E [h(X)g(Y )] = E [h(X)] E [g(Y )]
for all functions h and g.
• In other words, X and Y are independent if and only if for
every pair of functions h and g, the expectation of the product
h(X)g(Y ) is equal to the product of the individual expectations.
• One special case is that
X ⫫ Y implies E[XY] = EX × EY.    (32)
However, independence means more than this property. In other words, having E[XY] = (EX)(EY) does not necessarily imply X ⫫ Y. See Example 11.62.
11.51. Let's combine what we have just learned about independence into the definition/equivalent statements that we already have in 11.32.
The following statements are equivalent:
(a) Random variables X and Y are independent.
(b) [X ∈ B] ⫫ [Y ∈ C] for all B, C.
(c) P [X ∈ B, Y ∈ C] = P [X ∈ B] × P [Y ∈ C] for all B, C.
(d) pX,Y (x, y) = pX (x) × pY (y) for all x, y.
(e) FX,Y (x, y) = FX (x) × FY (y) for all x, y.
(f)
Exercise 11.52 (F2011). Suppose X and Y are i.i.d. with EX =
EY = 1 and Var X = Var Y = 2. Find Var[XY ].
11.53. To quantify the amount of dependence between two
random variables, we may calculate their mutual information.
This quantity is crucial in the study of digital communications
and information theory. However, in introductory probability class
(and introductory communication class), it is traditionally omitted.
11.5 Linear Dependence
Definition 11.54. Given two random variables X and Y , we may
calculate the following quantities:
(a) Correlation: E [XY ].
(b) Covariance: Cov [X, Y ] = E [(X − EX)(Y − EY )].
(c) Correlation coefficient: ρX,Y = Cov[X, Y] / (σX σY).
Exercise 11.55 (F2011). Continue from Example 11.9.
(a) Find E[XY].
(b) Check that Cov[X, Y] = −1/25.
11.56. Cov [X, Y ] = E [(X − EX)(Y − EY )] = E [XY ] − EXEY
• Note that Var X = Cov [X, X].
11.57. Var [X + Y ] = Var X + Var Y + 2Cov [X, Y ]
Definition 11.58. X and Y are said to be uncorrelated if and
only if Cov [X, Y ] = 0.
11.59. The following statements are equivalent:
(a) X and Y are uncorrelated.
(b) Cov [X, Y ] = 0.
(c) E [XY ] = EXEY .
(d)
11.60. Independence implies uncorrelatedness; that is, if X ⫫ Y, then Cov[X, Y] = 0.
The converse is not true. Uncorrelatedness does not imply independence. See Example 11.62.
11.61. The variance of the sum of uncorrelated (or independent)
random variables is the sum of their variances.
Example 11.62. Let X be uniform on {±1, ±2} and Y = |X|.
Exercise 11.63. Suppose two fair dice are tossed. Denote by the
random variable V1 the number appearing on the first dice and by
the random variable V2 the number appearing on the second dice.
Let X = V1 + V2 and Y = V1 − V2 .
(a) Show that X and Y are not independent.
(b) Show that E [XY ] = EXEY .
11.64. Cauchy-Schwartz Inequality: (Cov[X, Y])² ≤ σ²_X σ²_Y.
11.65. Cov [aX + b, cY + d] = acCov [X, Y ]
Cov [aX + b, cY + d] = E [((aX + b) − E [aX + b]) ((cY + d) − E [cY + d])]
= E [((aX + b) − (aEX + b)) ((cY + d) − (cEY + d))]
= E [(aX − aEX) (cY − cEY )]
= acE [(X − EX) (Y − EY )]
= acCov [X, Y ] .
Definition 11.66. Correlation coefficient:
ρX,Y = Cov[X, Y] / (σX σY) = E[((X − EX)/σX) ((Y − EY)/σY)] = (E[XY] − EX EY) / (σX σY).
• ρX,Y is dimensionless
• ρX,X = 1
• ρX,Y = 0 if and only if X and Y are uncorrelated.
11.67. Linear Dependence and Cauchy-Schwartz Inequality
(a) If Y = aX + b, then ρX,Y = sign(a) = 1 if a > 0, and −1 if a < 0.
• To be rigorous, we should also require that σX > 0 and a ≠ 0.
(b) Cauchy-Schwartz Inequality: |ρX,Y| ≤ 1. In other words, ρXY ∈ [−1, 1].
(c) When σY, σX > 0, equality occurs if and only if the following conditions hold:
∃ a ≠ 0 such that (X − EX) = a(Y − EY)
≡ ∃ a ≠ 0 and b ∈ R such that X = aY + b
≡ ∃ c ≠ 0 and d ∈ R such that Y = cX + d
≡ |ρXY| = 1.
In that case, |a| = σX/σY and ρXY = a/|a| = sgn a. Hence, ρXY is used to quantify linear dependence between X and Y. The closer |ρXY| is to 1, the higher the degree of linear dependence between X and Y.
Example 11.68. [21, Section 5.2.3] Consider an important fact
that investment experience supports: spreading investments over
a variety of funds (diversification) diminishes risk. To illustrate,
imagine that the random variable X is the return on every invested
dollar in a local fund, and random variable Y is the return on every
invested dollar in a foreign fund. Assume that random variables X
and Y are i.i.d. with expected value 0.15 and standard deviation
0.12.
If you invest all of your money, say c, in either the local or the
foreign fund, your return R would be cX or cY .
• The expected return is ER = cEX = cEY = 0.15c.
• The standard deviation is cσX = cσY = 0.12c
Now imagine that your money is equally distributed over the
two funds. Then, the return R is 21 cX + 12 cY . The expected return
is ER = 21 cEX + 12 cEY = 0.15c. Hence, the expected return
remains at 15%. However,
Var R = Var[(c/2)(X + Y)] = (c²/4) Var X + (c²/4) Var Y = (c²/2) × 0.12².
So, the standard deviation is (0.12/√2) c ≈ 0.0849c.
In comparison with the distributions of X and Y, the pmf of (1/2)(X + Y) is concentrated more around the expected value. The
centralization of the distribution as random variables are averaged
together is a manifestation of the central limit theorem.
11.69. [21, Section 5.2.3] Example 11.68 is based on the assumption that return rates X and Y are independent from each other.
In the world of investment, however, risks are more commonly
reduced by combining negatively correlated funds (two funds are
negatively correlated when one tends to go up as the other falls).
This becomes clear when one considers the following hypothetical situation. Suppose that two stock market outcomes ω1 and ω2 are possible, and that each outcome will occur with a probability of 1/2. Assume that domestic and foreign fund returns X and Y are determined by X(ω1) = Y(ω2) = 0.25 and X(ω2) = Y(ω1) = −0.10. Each of the two funds then has an expected return of 7.5%, with equal probability for actual returns of 25% and −10%. The random variable Z = (1/2)(X + Y) satisfies Z(ω1) = Z(ω2) = 0.075. In other words, Z is equal to 0.075 with certainty. This means that an investment that is equally divided between the domestic and foreign funds has a guaranteed return of 7.5%.
Exercise 11.70. The input X and output Y of a system subject to random perturbations are described probabilistically by the following joint pmf matrix (rows x ∈ {1, 3}, columns y ∈ {2, 4, 5}):

           y = 2   y = 4   y = 5
  x = 1     0.02    0.10    0.08
  x = 3     0.08    0.32    0.40
(a) Evaluate the following quantities.
(i) EX
(ii) P [X = Y ]
(iii) P [XY < 6]
(iv) E [(X − 3)(Y − 2)]
(v) E[X(Y³ − 11Y² + 38Y)]
(vi) Cov [X, Y ]
(vii) ρX,Y
(b) Calculate the following quantities using what you got from
part (a).
(i) Cov [3X + 4, 6Y − 7]
(ii) ρ3X+4,6Y −7
(iii) Cov [X, 6X − 7]
(iv) ρX,6X−7
Answers:
(a)
(i) EX = 2.6
(ii) P [X = Y ] = 0
(iii) P [XY < 6] = 0.2
(iv) E [(X − 3)(Y − 2)] = −0.88
(v) E[X(Y³ − 11Y² + 38Y)] = 104
(vi) Cov [X, Y ] = 0.032
(vii) ρX,Y = 0.0447
(b)
(i) Cov[3X + 4, 6Y − 7] = 3 × 6 × Cov[X, Y] = 18 × 0.032 = 0.576.
(ii) Note that
ρ_{aX+b,cY+d} = Cov[aX + b, cY + d] / (σ_{aX+b} σ_{cY+d}) = ac Cov[X, Y] / (|a| σX |c| σY) = (ac/|ac|) ρX,Y = sign(ac) × ρX,Y.
Hence, ρ_{3X+4,6Y−7} = sign(3 × 6) ρX,Y = ρX,Y = 0.0447.
(iii) Cov [X, 6X − 7] = 1 × 6 × Cov [X, X] = 6 × Var[X] ≈
3.84 .
(iv) ρX,6X−7 = sign(1 × 6) × ρX,X = 1 .
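The numbers in part (a) can also be checked directly from the joint pmf matrix; a sketch for a few of them:

  P = [0.02 0.10 0.08;          % joint pmf matrix of Exercise 11.70
       0.08 0.32 0.40];         % rows: x = 1, 3; columns: y = 2, 4, 5
  x = [1 3]; y = [2 4 5];
  [X, Y] = meshgrid(x, y);      % X, Y are 3x2, so pair them with P.' below
  EX  = sum(sum(P.' .* X))      % = 2.6
  EY  = sum(sum(P.' .* Y));
  EXY = sum(sum(P.' .* X .* Y));
  CovXY = EXY - EX*EY           % = 0.032
  VarX = sum(sum(P.' .* X.^2)) - EX^2;
  VarY = sum(sum(P.' .* Y.^2)) - EY^2;
  rho = CovXY/sqrt(VarX*VarY)   % = 0.0447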
11.6 Pairs of Continuous Random Variables
In this section, we start to look at more than one continuous random variable. You should find that many of the concepts and
formulas are similar if not the same as the ones for pairs of discrete
random variables which we have already studied. For discrete random variables, we use summations. Here, for continuous random
variables, we use integrations.
Recall that for a pair of discrete random variables, the joint
pmf pX,Y (x, y) completely characterizes the probability model of
two random variables X and Y. In particular, it not only captures the probability of X and the probability of Y individually, it also captures the relationship between them. For continuous
random variable, we replace the joint pmf by joint pdf.
Definition 11.71. We say that two random variables X and Y
are jointly continuous with joint pdf fX,Y (x, y) if48 for any
region R on the (x, y) plane
P[(X, Y) ∈ R] = ∬_{(x,y): (x,y)∈R} fX,Y(x, y) dx dy.    (33)
To understand where Definition 11.71 comes from, it is helpful
to take a careful look at Table 6.
48
Remark: If you want to check that a function f (x, y) is the joint pdf of a pair of random
variables (X, Y ) by using the above definition, you will need to check that (33) is true for any
region R. This is not an easy task. Hence, we do not usually use this definition for such kind
of test. There are some mathematical facts that can be derived from this definition. Such
facts produce easier condition(s) than (33) but we will not talk about them here.
P[X ∈ B]:
  Discrete: Σ_{x∈B} pX(x)
  Continuous: ∫_B fX(x) dx
P[(X, Y) ∈ R]:
  Discrete: Σ_{(x,y): (x,y)∈R} pX,Y(x, y)
  Continuous: ∬_{(x,y): (x,y)∈R} fX,Y(x, y) dx dy

Table 6: pmf vs. pdf
Example 11.72. Indicate (sketch) the region of integration for
each of the following probabilities
(a) P [1 < X < 2, −1 < Y < 1]
(b) P [X + Y < 3]
11.73. For us, Definition 11.71 is useful because if you know that
a function f (x, y) is a joint pdf of a pair of random variables, then
you can calculate countless possibilities of probabilities involving
these two random variables via (33). (See, e.g. Example 11.76.)
However, the actual calculation of probability from (33) can be
difficult if you have non-rectangular region R or if you have a
complicated joint pdf. In other words, the formula itself is straightforward and simple, but to carry it out may require that you review
some multi-variable integration technique from your calculus class.
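When R is not a rectangle, (33) can still be evaluated numerically, e.g., with MATLAB's integral2, by encoding the region in the integration limits. A sketch with a made-up joint pdf (that of two independent exponentials, not one of the examples here) and the event [X + Y < 1]:

  f = @(x,y) 6*exp(-2*x - 3*y);           % hypothetical joint pdf (X ~ E(2), Y ~ E(3), independent)
  % P[X + Y < 1]: integrate over the triangle 0 < x < 1, 0 < y < 1 - x
  p = integral2(f, 0, 1, 0, @(x) 1 - x)
  total = integral2(f, 0, Inf, 0, Inf)    % should be 1 (property 11.75(b))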
11.74. Intuition/Approximation: Note also that the joint
pdf’s definition extends the interpretation/approximation that we
previously discussed for one random variable.
Recall that for a single random variable, the pdf is a measure of
probability per unit length. In particular, if you want to find
the probability that the value of a random variable X falls inside
some small interval, say the interval [1.5, 1.6], then this probability
can be approximated by
P [1.5 ≤ X ≤ 1.6] ≈ fX (1.5) × 0.1.
More generally, for small value of interval length d, the probability
that the value of X falls within a small interval [x, x + d] can be
approximated by
P [x ≤ X ≤ x + d] ≈ fX (x) × d.
(34)
Usually, instead of using d, we use ∆x and hence
P [x ≤ X ≤ x + ∆x] ≈ fX (x) × ∆x.
(35)
For two random variables X and Y , the joint pdf fX,Y (x, y)
measures probability per unit area:
P [x ≤ X ≤ x + ∆x , y ≤ Y ≤ y + ∆y] ≈ fX,Y (x, y) × ∆x × ∆y.
(36)
Do not forget that the comma signifies the “and” (intersection)
operation.
11.75. There are two important characterizing properties of joint
pdf:
(a) fX,Y ≥ 0 a.e.
(b) ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} fX,Y(x, y) dx dy = 1
Example 11.76. Consider a probability model of a pair of random
variables uniformly distributed over a rectangle in the X-Y plane:
fX,Y(x, y) = c,  0 ≤ x ≤ 5, 0 ≤ y ≤ 3,
             0,  otherwise.
(a) Find the constant c.
(b) Evaluate P [2 ≤ X ≤ 3, 1 ≤ Y ≤ 3], and P [Y > X]
11.77. Other important properties and definitions for a pair of
continuous random variables are summarized in Table 7 along with
their “discrete counterparts”.
P[(X, Y) ∈ R]:
  Discrete: Σ_{(x,y): (x,y)∈R} pX,Y(x, y)
  Continuous: ∬_{(x,y): (x,y)∈R} fX,Y(x, y) dx dy
Joint to Marginal (Law of Total Prob.):
  Discrete: pX(x) = Σ_y pX,Y(x, y);  pY(y) = Σ_x pX,Y(x, y)
  Continuous: fX(x) = ∫_{−∞}^{+∞} fX,Y(x, y) dy;  fY(y) = ∫_{−∞}^{+∞} fX,Y(x, y) dx
P[X > Y]:
  Discrete: Σ_x Σ_{y: y<x} pX,Y(x, y) = Σ_y Σ_{x: x>y} pX,Y(x, y)
  Continuous: ∫_{−∞}^{+∞} ∫_{−∞}^{x} fX,Y(x, y) dy dx = ∫_{−∞}^{+∞} ∫_{y}^{∞} fX,Y(x, y) dx dy
P[X = Y]:
  Discrete: Σ_x pX,Y(x, x)
  Continuous: 0
X ⫫ Y:
  Discrete: pX,Y(x, y) = pX(x) pY(y)
  Continuous: fX,Y(x, y) = fX(x) fY(y)
Conditional:
  Discrete: pX|Y(x|y) = pX,Y(x, y)/pY(y)
  Continuous: fX|Y(x|y) = fX,Y(x, y)/fY(y)

Table 7: Important formulas for a pair of discrete RVs and a pair of continuous RVs
Exercise 11.78 (F2011). Random variables X and Y have joint
pdf
fX,Y(x, y) = c,  0 ≤ y ≤ x ≤ 1,
             0,  otherwise.
(a) Check that c = 2.
(b) In Figure 25, specify the region of nonzero pdf.
Figure 25: Figure for Exercise 11.78b.
(c) Find the marginal density fX (x).
(d) Check that EX = 2/3.
(e) Find the marginal density fY (y).
(f) Find EY
Definition 11.79. The joint cumulative distribution function (joint cdf ) of random variables X and Y (of any type(s))
is defined as
FX,Y (x, y) = P [X ≤ x, Y ≤ y] .
• Although its definition is simple, we rarely use the joint cdf
to study probability models. It is easier to work with a probability mass function when the random variables are discrete,
or a probability density function if they are continuous.
11.80. The joint cdf for a pair of random variables (of any type(s))
has the following properties49:
(a) 0 ≤ FX,Y (x, y) ≤ 1
(i) FX,Y (∞, ∞) = 1.
(ii) FX,Y (−∞, y) = FX,Y (x, −∞) = 0.
(b) Joint to Marginal: FX (x) = FX,Y (x, ∞) and FY (y) = FX,Y (∞, y).
In words, we obtain the marginal cdf FX and FY directly from
FX,Y by setting the unwanted variable to ∞.
(c) If x1 ≤ x2 and y1 ≤ y2 , then FX,Y (x1 , y1 ) ≤ FX,Y (x2 , y2 )
11.81. The joint cdf for a pair of continuous random variables
also has the following properties:
(a) FX,Y(x, y) = ∫_{−∞}^{x} ∫_{−∞}^{y} fX,Y(u, v) dv du.
(b) fX,Y(x, y) = ∂²/(∂x ∂y) FX,Y(x, y).
49 Note that when we write FX,Y(x, ∞), we mean lim_{y→∞} FX,Y(x, y). A similar limiting definition applies to FX,Y(∞, ∞), FX,Y(−∞, y), FX,Y(x, −∞), and FX,Y(∞, y).
11.82. Independence:
The following statements are equivalent:
(a) Random variables X and Y are independent.
(b) [X ∈ B] ⫫ [Y ∈ C] for all B, C.
(c) P [X ∈ B, Y ∈ C] = P [X ∈ B] × P [Y ∈ C] for all B, C.
(d) fX,Y (x, y) = fX (x) × fY (y) for all x, y.
(e) FX,Y (x, y) = FX (x) × FY (y) for all x, y.
Exercise 11.83 (F2011). Let X1 and X2 be i.i.d. E(1)
(a) Find P [X1 = X2 ].
(b) Find P[X1² + X2² = 13].
11.7 Function of a Pair of Continuous Random Variables: MISO
There are many situations in which we observe two random variables and use their values to compute a new random variable.
Example 11.84. Signal in additive noise: When we says that a
random signal X is transmitted over a channel subject to additive
noise N , we mean that at the receiver, the received signal Y will
be X + N. Usually, the noise is assumed to be zero-mean Gaussian noise; that is, N ∼ N(0, σ²_N) for some noise power σ²_N.
Example 11.85. In a wireless channel, the transmitted signal
X is corrupted by fading (multiplicative noise). More specifically,
the received signal Y at the receiver’s antenna is Y = H × X.
Remark : In the actual situation, the signal is further corrupted
by additive noise N and hence Y = HX + N . However, this
expression for Y involves more than two random variables and
hence we will not consider it here.
E[Z]:
  Discrete: Σ_x Σ_y g(x, y) pX,Y(x, y)
  Continuous: ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} g(x, y) fX,Y(x, y) dx dy
P[Z ∈ B]:
  Discrete: Σ_{(x,y): g(x,y)∈B} pX,Y(x, y)
  Continuous: ∬_{(x,y): g(x,y)∈B} fX,Y(x, y) dx dy
Z = X + Y:
  Discrete: pZ(z) = Σ_x pX,Y(x, z − x) = Σ_y pX,Y(z − y, y)
  Continuous: fZ(z) = ∫_{−∞}^{+∞} fX,Y(x, z − x) dx = ∫_{−∞}^{+∞} fX,Y(z − y, y) dy
If X ⫫ Y:
  Discrete: pX+Y = pX ∗ pY
  Continuous: fX+Y = fX ∗ fY

Table 8: Important formulas for function of a pair of RVs. Unless stated otherwise, the function is defined as Z = g(X, Y).
11.86. Consider a new random variable Z defined by
Z = g(X, Y ).
Table 8 summarizes the basic formulas involving this derived random variable.
11.87. When X and Y are continuous random variables, it may
be of interest to find the pdf of the derived random variable Z =
g(X, Y). It is usually helpful to divide this task into two steps:
(a) Find the cdf FZ(z) = P[Z ≤ z] = ∬_{(x,y): g(x,y)≤z} fX,Y(x, y) dx dy.
(b) fZ(z) = (d/dz) FZ(z).
Example 11.88. Suppose X and Y are i.i.d. E(3). Find the pdf
of W = Y /X.
Exercise 11.89 (F2011). Let X1 and X2 be i.i.d. E(1).
(a) Define Y = min {X1 , X2 }. (For example, when X1 = 6 and
X2 = 4, we have Y = 4.) Describe the random variable Y .
Does it belong to any known family of random variables? If
so, what is/are its parameters?
(b) Define Y = min{X1, X2} and Z = max{X1, X2}. Find fY,Z(2, 1).
(c) Define Y = min{X1, X2} and Z = max{X1, X2}. Find fY,Z(1, 2).
11.90. Observe that finding the pdf of Z = g(X, Y) is a time-consuming task. If your goal is to find E[Z], do not forget that it can be calculated directly from
E[g(X, Y)] = ∬ g(x, y) fX,Y(x, y) dx dy.
11.91. The following property is valid for any kind of random variables:
E[Σ_i Zi] = Σ_i E[Zi].
Furthermore,
E[Σ_i gi(X, Y)] = Σ_i E[gi(X, Y)].
P[X ∈ B]:
  Discrete: Σ_{x∈B} pX(x)
  Continuous: ∫_B fX(x) dx
P[(X, Y) ∈ R]:
  Discrete: Σ_{(x,y): (x,y)∈R} pX,Y(x, y)
  Continuous: ∬_{(x,y): (x,y)∈R} fX,Y(x, y) dx dy
Joint to Marginal (Law of Total Prob.):
  Discrete: pX(x) = Σ_y pX,Y(x, y);  pY(y) = Σ_x pX,Y(x, y)
  Continuous: fX(x) = ∫_{−∞}^{+∞} fX,Y(x, y) dy;  fY(y) = ∫_{−∞}^{+∞} fX,Y(x, y) dx
P[X > Y]:
  Discrete: Σ_x Σ_{y: y<x} pX,Y(x, y) = Σ_y Σ_{x: x>y} pX,Y(x, y)
  Continuous: ∫_{−∞}^{+∞} ∫_{−∞}^{x} fX,Y(x, y) dy dx = ∫_{−∞}^{+∞} ∫_{y}^{∞} fX,Y(x, y) dx dy
P[X = Y]:
  Discrete: Σ_x pX,Y(x, x)
  Continuous: 0
X ⫫ Y:
  Discrete: pX,Y(x, y) = pX(x) pY(y)
  Continuous: fX,Y(x, y) = fX(x) fY(y)
Conditional:
  Discrete: pX|Y(x|y) = pX,Y(x, y)/pY(y)
  Continuous: fX|Y(x|y) = fX,Y(x, y)/fY(y)
E[g(X, Y)]:
  Discrete: Σ_x Σ_y g(x, y) pX,Y(x, y)
  Continuous: ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} g(x, y) fX,Y(x, y) dx dy
P[g(X, Y) ∈ B]:
  Discrete: Σ_{(x,y): g(x,y)∈B} pX,Y(x, y)
  Continuous: ∬_{(x,y): g(x,y)∈B} fX,Y(x, y) dx dy
Z = X + Y:
  Discrete: pZ(z) = Σ_x pX,Y(x, z − x) = Σ_y pX,Y(z − y, y)
  Continuous: fZ(z) = ∫_{−∞}^{+∞} fX,Y(x, z − x) dx = ∫_{−∞}^{+∞} fX,Y(z − y, y) dy

Table 9: pmf vs. pdf
11.92. Independence: At this point, it is useful to summarize
what we know about independence. The following statements are
equivalent:
(a) Random variables X and Y are independent.
(b) [X ∈ B] ⫫ [Y ∈ C] for all B, C.
(c) P [X ∈ B, Y ∈ C] = P [X ∈ B] × P [Y ∈ C] for all B, C.
(d) For discrete RVs, pX,Y (x, y) = pX (x) × pY (y) for all x, y.
For continuous RVs, fX,Y (x, y) = fX (x) × fY (y) for all x, y.
(e) FX,Y (x, y) = FX (x) × FY (y) for all x, y.
(f) E [h(X)g(Y )] = E [h(X)] E [g(Y )] for all functions h and g.
Definition 11.93. All of the definitions involving expectation of
a function of two random variables are the same as in the discrete
case:
• Correlation between X and Y : E [XY ].
• Covariance between X and Y :
Cov [X, Y ] = E [(X − EX)(Y − EY )] = E [XY ] − EXEY.
• Var X = Cov [X, X].
• X and Y are said to be uncorrelated if and only if Cov [X, Y ] =
0.
• X and Y are said to be orthogonal if E [XY ] = 0.
• Correlation coefficient: ρXY = Cov[X, Y] / (σX σY)
Exercise 11.94 (F2011). Continue from Exercise 11.78. We found
that the joint pdf is given by
fX,Y(x, y) = 2,  0 ≤ y ≤ x ≤ 1,
             0,  otherwise.
Also recall that EX = 2/3 and EY = 1/3.
(a) Find E [XY ]
(b) Are X and Y uncorrelated?
(c) Are X and Y independent?
Example 11.95. The bivariate Gaussian or bivariate normal density is a generalization of the univariate N (m, σ 2 ) density. For bivariate normal, fX,Y (x, y) is
 2
2 
y−EY
y−EY


x−EX
x−EX


− 2ρ σX
+ σY
σX
σY
1
p
exp −
.
2)


2
(1
−
ρ
2πσX σY 1 − ρ2


(37)
Important properties:
(a) ρ = Cov[X, Y]/(σX σY) ∈ (−1, 1) [24, Thm. 4.31]
(b) [24, Thm. 4.28]
(c) "X and Y are independent" is equivalent to "X and Y are uncorrelated."
[Figure 26: Samples from bivariate Gaussian distributions with (σX, σY, ρ) = (1, 1, 0), (1, 2, 0), (1, 2, 0.5), (1, 2, 0.8), (3, 1, 0), and (1, 2, 0.99).]
[Figure 27: Effect of ρ on the bivariate Gaussian distribution: scatter plots and joint pdfs, 2000 samples each. Note that the marginal pdfs for both X and Y are all standard Gaussian.]
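Plots like those in Figures 26 and 27 can be reproduced with a few lines of MATLAB; the following sketch (the parameter values are just an example) generates correlated Gaussian samples from two independent N(0, 1) variables:

    % Generate n samples of a zero-mean bivariate Gaussian with given
    % standard deviations and correlation coefficient rho.
    n = 2000;  sigX = 1;  sigY = 2;  rho = 0.8;
    U = randn(n,1);  V = randn(n,1);             % i.i.d. N(0,1)
    X = sigX*U;
    Y = sigY*(rho*U + sqrt(1-rho^2)*V);          % gives Corr(X,Y) = rho
    plot(X, Y, '.');  xlabel('x');  ylabel('y');
    corrcoef(X, Y)                               % should be close to [1 rho; rho 1]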
Part VI

12  Three Types of Random Variables
12.1. Review: You may recall50 the following properties for the cdf of discrete random variables. These properties in fact hold for any kind of random variable.
(a) The cdf is defined as FX(x) = P[X ≤ x]. This is valid for any type of random variable.
(b) Moreover, the cdf for any kind of random variable must satisfy the three properties which we have discussed earlier:
CDF1 FX is non-decreasing.
CDF2 FX is right-continuous.
CDF3 lim_{x→−∞} FX(x) = 0 and lim_{x→∞} FX(x) = 1.
(c) P[X = x] = FX(x) − FX(x−) = the jump or saltus in FX at x.
Theorem 12.2. If you find a function F that satisfies CDF1,
CDF2, and CDF3 above, then F is a cdf of some random variable.
50
If you don’t know these properties by now, you should review them as soon as possible.
Example 12.3. Consider an input X to a device whose output Y
will be the same as the input if the input level does not exceed 5.
For input level that exceeds 5, the output will be saturated at 5.
Suppose X ∼ U(0, 6). Find FY (y).
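Before working out FY(y) analytically, a quick simulation sketch (the saturation level 5 and the U(0, 6) input are from the example; the rest is just an illustration) can reveal the mixed nature of Y:

    % Simulate the clipping device: Y = X when X <= 5, and Y = 5 otherwise.
    n = 1e6;
    X = 6*rand(n,1);          % X ~ U(0,6)
    Y = min(X, 5);            % output saturates at 5
    mean(Y == 5)              % point mass at 5; should be close to P[X >= 5] = 1/6
    % The empirical cdf of Y rises linearly on [0,5) and jumps to 1 at y = 5.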
12.4. We can categorize random variables into three types according to their cdfs:
(a) If FX(x) is piecewise flat with discontinuous jumps, then X is discrete.
(b) If FX(x) is a continuous function, then X is continuous.
(c) If FX(x) has discontinuous jumps but is not piecewise flat, then X is mixed.
[Figure 28: Typical cdfs: (a) a discrete random variable, (b) a continuous random variable, and (c) a mixed random variable [16, Fig. 3.2].]
For a discrete random variable, Fx (x) is a staircase function, whereas a random variable
is called continuous if Fx (x) is a continuous function. A random variable is called mixed
if it is neither discrete nor continuous. Typical cdfs for discrete, continuous, and mixed
random variables are shown in Figures 3.2(a), 3.2(b), and 3.2(c), respectively.
Rather than dealing with the cdf, it is more common to deal with the probability density function (pdf), which is defined as the derivative of Fx(x), i.e.,
fx(x) = dFx(x)/dx.    (3.11)
We have seen in Example 12.3 that some function can turn a
continuous random variable into a mixed random variable. Next,
we will work on an example where a continuous random variable
is turned into a discrete random variable.
Example 12.5. Let X ∼ U(0, 1) and Y = g(X) where
g(x) = 1 for x < 0.6, and g(x) = 0 for x ≥ 0.6.
Before going deeply into the math, it is helpful to think about the
nature of the derived random variable Y . The definition of g(x)
tells us that Y has only two possible values, Y = 0 and Y = 1.
Thus, Y is a discrete random variable.
Example 12.6. In MATLAB, we have the rand command to generate U(0, 1). If we want to generate a Bernoulli random variable
with success probability p, what can we do?
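One common answer (a sketch; p = 0.6 below is just an example value) is to compare the uniform sample with the target success probability p:

    % Generate Bernoulli(p) samples from U(0,1) samples.
    p = 0.6;                    % example success probability
    X = (rand < p);             % a single Bernoulli(p) sample
    Xvec = (rand(1,1e4) < p);   % 10^4 i.i.d. Bernoulli(p) samples
    mean(Xvec)                  % should be close to p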
Exercise 12.7. In MATLAB, how can we generate X ∼ binomial(2, 1/4)
from the rand command?
13  Transform methods: Characteristic Functions
Definition 13.1. The characteristic function of a random variable X is defined by
ϕX(v) = E[e^{jvX}].
Remarks:
(a) If X is a continuous random variable with density fX, then
ϕX(v) = ∫_{−∞}^{+∞} e^{jvx} fX(x) dx,
which is the Fourier transform of fX evaluated at −v. More precisely,
ϕX(v) = F{fX}(ω)|_{ω=−v}.    (38)
(b) Many references use u or t instead of v.
Example 13.2. You may have learned that the Fourier transform of a Gaussian waveform is a Gaussian waveform. In fact, when X ∼ N(m, σ²),
F{fX}(ω) = ∫_{−∞}^{∞} fX(x) e^{−jωx} dx = e^{−jωm − (1/2)ω²σ²}.
Using (38), we have
ϕX(v) = e^{jvm − (1/2)v²σ²}.
Example 13.3. For X ∼ E(λ), we have ϕX(v) = λ/(λ − jv).
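For completeness, one way to verify this (the integral converges because λ > 0):
ϕX(v) = ∫_{0}^{∞} e^{jvx} λe^{−λx} dx = λ ∫_{0}^{∞} e^{−(λ−jv)x} dx = λ/(λ − jv).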
As with the Fourier transform, we can build a large list of commonly used characteristic functions. (You probably remember that
rectangular function in time domain gives a sinc function in frequency domain.) When you see a random variable that has the
same form of characteristic function as the one that you know, you
can quickly make a conclusion about the family and parameters of
that random variable.
Example 13.4. Suppose a random variable X has the characteristic function ϕX(v) = 2/(2 − jv). You can quickly conclude that it is an exponential random variable with parameter 2.
For many random variables, it is easy to find the expected value or higher moments via the characteristic function. This can be done via the following result.
13.5. ϕX^(k)(v) = j^k E[X^k e^{jvX}] and hence ϕX^(k)(0) = j^k E[X^k].
Example 13.6. When X ∼ E(λ),
(a) EX = 1/λ.
(b) Var X = 1/λ².
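One way to obtain these via 13.5 (a verification sketch, not the only method): from Example 13.3,
ϕX(v) = λ(λ − jv)^{−1}, so ϕX'(v) = jλ(λ − jv)^{−2} and ϕX''(v) = −2λ(λ − jv)^{−3}.
Hence ϕX'(0) = j/λ = j EX, giving EX = 1/λ; and ϕX''(0) = −2/λ² = j² E[X²], giving E[X²] = 2/λ² and Var X = 2/λ² − 1/λ² = 1/λ².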
Exercise 13.7 (F2011). Continue from Example 13.2.
(a) Show that for X ∼ N(m, σ²), we have
(i) EX = m
(ii) E[X²] = σ² + m².
(b) For X ∼ N(3, 4), find E[X³].
One very important property of the characteristic function is that it makes it very easy to find the characteristic function of a sum of independent random variables.
13.8. Suppose X and Y are independent. Let Z = X + Y. Then, the characteristic function of Z is the product of the characteristic functions of X and Y:
ϕZ(v) = ϕX(v) ϕY(v).
Remark: Can you relate this property to a property of the Fourier transform?
Example 13.9. Use 13.8 to show that the sum of two independent
Gaussian random variables is still a Gaussian random variable:
Exercise 13.10. Continue from Example 11.44. Suppose Λ1 ∼
P(λ1 ) and Λ2 ∼ P(λ2 ) are independent. Let Λ = Λ1 + Λ2 . Use
13.8 to show that Λ ∼ P(λ1 + λ2 ).
Exercise 13.11. Continue from Example 11.45. Suppose B1 ∼
B(n1 , p) and B2 ∼ B(n2 , p) are independent. Let B = B1 + B2 .
Use 13.8 to show that B ∼ B(n1 + n2 , p).
Part VII

14  Limiting Theorems

14.1  Law of Large Numbers (LLN)
Definition 14.1. Let X1, X2, . . . , Xn be a collection of random variables with a common mean E[Xi] = m for all i. In practice, since we do not know m, we use the numerical average, or sample mean,
Mn = (1/n) Σ_{i=1}^{n} Xi,
in place of the true, but unknown, value m.
Q: Can this procedure of using Mn as an estimate of m be justified in some sense?
A: This can be done via the law of large numbers.
14.2. The law of large numbers basically says the following: if you have a sequence of i.i.d. random variables X1, X2, . . ., then the sample means Mn = (1/n) Σ_{i=1}^{n} Xi will converge to the actual mean as n → ∞.
14.3. The LLN is easy to see via the property of variance. Note that
E[Mn] = E[(1/n) Σ_{i=1}^{n} Xi] = (1/n) Σ_{i=1}^{n} EXi = m
and
Var[Mn] = Var[(1/n) Σ_{i=1}^{n} Xi] = (1/n²) Σ_{i=1}^{n} Var[Xi] = σ²/n.    (39)
Remarks:
(a) For (39) to hold, it is sufficient to have uncorrelated Xi's.
(b) From (39), we also have
σ_{Mn} = σ/√n.    (40)
In words, "when uncorrelated (or independent) random variables each having the same distribution are averaged together, the standard deviation is reduced according to the square root law." [21, p. 142]
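A small MATLAB experiment illustrating the square-root law (40); the choice Xi ∼ U(0, 1), so that σ = √(1/12), is just an example:

    % Estimate the standard deviation of the sample mean Mn for several n.
    nTrials = 10000;
    for n = [10 100 1000]
        Mn = mean(rand(n, nTrials), 1);     % one sample mean per column
        fprintf('n = %4d: std(Mn) = %.4f,  sigma/sqrt(n) = %.4f\n', ...
                n, std(Mn), sqrt(1/12)/sqrt(n));
    end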
Exercise 14.4 (F2011). Consider i.i.d. random variables X1, X2, . . . , X10. Define the sample mean M by
M = (1/10) Σ_{k=1}^{10} Xk.
Let
V1 = (1/10) Σ_{k=1}^{10} (Xk − E[Xk])²
and
V2 = (1/10) Σ_{j=1}^{10} (Xj − M)².
Suppose E[Xk] = 1 and Var[Xk] = 2.
(a) Find E[M].
(b) Find Var[M].
(c) Find E[V1].
(d) Find E[V2].
14.2  Central Limit Theorem (CLT)
In practice, there are many random variables that arise as a sum of many other random variables. In this section, we consider the sum
Sn = Σ_{i=1}^{n} Xi,    (41)
where the Xi are i.i.d. with common mean m and common variance σ².
• Note that when we say the Xi are i.i.d., the definition is that they are independent and identically distributed. It is then convenient to talk about a random variable X which shares the same distribution (pdf/pmf) with these Xi. This allows us to write
Xi ∼ X, i.i.d.,    (42)
which is much more compact than saying that the Xi are i.i.d. with the same distribution (pdf/pmf) as X. Moreover, we can then use EX and σ²_X for the common expected value and variance of the Xi.
Q: How does Sn behave?
For the Sn defined above, there are many cases for which we
know the pmf/pdf of Sn .
Example 14.5. When the Xi are i.i.d. Bernoulli(p),
Example 14.6. When the Xi are i.i.d. N (m, σ 2 ),
Note that it is not difficult to find the characteristic function of Sn if we know the common characteristic function ϕX(v) of the Xi:
ϕ_{Sn}(v) = (ϕX(v))^n.
If we are lucky, as in the case of the sum of Gaussian random variables in Example 14.6 above, we get a ϕ_{Sn}(v) of a form that we recognize. Usually, however, ϕ_{Sn}(v) will be something we have not seen before, or its inverse transform will be difficult to find. This is one of the reasons why having a way to approximate the sum Sn would be very useful.
There are also situations where the distribution of the Xi is unknown or difficult to find. In such cases, it would be amazing if we could still say something about the distribution of Sn.
In the previous section, we considered the sample mean of identically distributed random variables; more specifically, we considered the random variable Mn = (1/n) Sn. We found that Mn converges to m as n increases to ∞. Here, we do not rescale the sum Sn by the factor 1/n.
14.7 (Approximation of densities and pmfs using the CLT). The precise statement of the CLT is somewhat technical, so we first give the interpretation/insight from the CLT, which is very easy to remember and use:

For n large enough, we can approximate Sn by a Gaussian random variable with the same mean and variance as Sn.

Note that the mean and variance of Sn are nm and nσ², respectively. Hence, for n large enough, we can approximate Sn by N(nm, nσ²). In particular,
(a) F_{Sn}(s) ≈ Φ((s − nm)/(σ√n)).
(b) If the Xi are continuous random variables, then
f_{Sn}(s) ≈ (1/(√(2π) σ√n)) e^{−(1/2)((s−nm)/(σ√n))²}.
(c) If the Xi are integer-valued, then
P[Sn = k] = P[k − 1/2 < Sn ≤ k + 1/2] ≈ (1/(√(2π) σ√n)) e^{−(1/2)((k−nm)/(σ√n))²}
[9, eq. (5.14), p. 213]. The approximation is best for k near nm [9, p. 211].
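As an illustration of 14.7(b) (again with Xi ∼ U(0, 1) as an arbitrary example, so m = 1/2 and σ² = 1/12), the following MATLAB sketch compares a histogram of Sn with the approximating N(nm, nσ²) density:

    % Compare the distribution of Sn = X1 + ... + Xn with its Gaussian approximation.
    n = 30;  nTrials = 1e5;
    Sn = sum(rand(n, nTrials), 1);          % one realization of Sn per column
    histogram(Sn, 'Normalization', 'pdf');  hold on
    s = linspace(min(Sn), max(Sn), 200);
    m = 1/2;  sigma2 = 1/12;
    plot(s, exp(-(s - n*m).^2/(2*n*sigma2))/sqrt(2*pi*n*sigma2), 'LineWidth', 2);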
Example 14.8. Approximation for the Binomial Distribution: For X ∼ B(n, p), when n is large, the binomial pmf becomes difficult to compute directly because of the need to calculate factorial terms.
(a) When p is not close to either 0 or 1, so that the variance is also large, we can use the CLT to approximate
P[X = k] ≈ (1/√(2π Var X)) e^{−(k−EX)²/(2 Var X)}    (43)
= (1/√(2πnp(1−p))) e^{−(k−np)²/(2np(1−p))}.    (44)
This is called the Laplace approximation to the Binomial distribution [25, p. 282]. (A numerical comparison is sketched in MATLAB after this list.)
(b) When p is small, the binomial distribution can be approximated by P(np), as discussed in 8.45.
(c) If p is very close to 1, then n − X will behave approximately Poisson.
• Normal approximation to the Poisson distribution with large λ: Let X ∼ P(λ). X can be thought of as a sum of n i.i.d. random variables Xi ∼ P(λ0), i.e., X = Σ_{i=1}^{n} Xi, where nλ0 = λ. Hence X is approximately normal N(λ, λ) for λ large. Some say that the normal approximation is good when λ > 5.
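The following MATLAB sketch (n, p, and the range of k are arbitrary example values) compares the exact B(n, p) pmf with the approximation (44):

    % Exact binomial pmf via log-gamma (avoids huge factorials) vs. (44).
    n = 100;  p = 0.3;  k = 20:40;
    logPmf    = gammaln(n+1) - gammaln(k+1) - gammaln(n-k+1) ...
                + k*log(p) + (n-k)*log(1-p);
    pmfExact  = exp(logPmf);
    pmfApprox = exp(-(k - n*p).^2/(2*n*p*(1-p))) / sqrt(2*pi*n*p*(1-p));
    plot(k, pmfExact, 'o', k, pmfApprox, 'x');  legend('exact', 'Gaussian approx.');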
[Figure 29: Gaussian approximation to the Binomial, Poisson, and Gamma distributions. The panels compare (1) the Poisson pmf (at integer x), (2) the Gaussian density, (3) the Gamma density, and (4) the Binomial pmf, for p = 0.05 with n = 100 (λ = 5) and with n = 800 (λ = 40).]
• If g : Z+ → R is any bounded function and Λ ∼ P(λ), then E[λ g(Λ + 1) − Λ g(Λ)] = 0.
Proof.
Σ_{i=0}^{∞} (λ g(i+1) − i g(i)) e^{−λ} λ^i / i!
= e^{−λ} ( Σ_{i=0}^{∞} g(i+1) λ^{i+1}/i! − Σ_{i=1}^{∞} i g(i) λ^i/i! )
= e^{−λ} ( Σ_{i=0}^{∞} g(i+1) λ^{i+1}/i! − Σ_{m=0}^{∞} g(m+1) λ^{m+1}/m! )
= 0, since the two sums are identical.
Exercise 14.9 (F2011). Continue from Exercise 6.53. The stronger person (Kakashi) should win the competition if n is very large. (By the law of large numbers, the proportion of fights that Kakashi wins should be close to 55%.) However, because the results are random and n cannot be very large, we cannot guarantee that Kakashi will win. It may be good enough, though, if the probability that Kakashi wins the competition is greater than 0.85.
We want to find the minimal value of n such that the probability that Kakashi wins the competition is greater than 0.85.
Let N be the number of fights that Kakashi wins among the n fights. Then, we need
P[N > n/2] ≥ 0.85.    (45)
Use the central limit theorem and Table 3.1 or Table 3.2 from [Yates and Goodman] to approximate the minimal value of n such that (45) is satisfied.
14.10. A more precise statement of the CLT can be expressed via the convergence of the characteristic function. In particular, suppose that (Xk)_{k≥1} is a sequence of i.i.d. random variables with mean m and variance 0 < σ² < ∞. Let Sn = Σ_{k=1}^{n} Xk. It can be shown that
(a) the characteristic function of (Sn − mn)/(σ√n) converges pointwise to the characteristic function of N(0, 1), and that
(b) the characteristic function of (Sn − mn)/√n converges pointwise to the characteristic function of N(0, σ²).
To see this, let Zk = (Xk − m)/σ ∼ Z (i.i.d.) and Yn = (1/√n) Σ_{k=1}^{n} Zk. Then EZ = 0, Var Z = 1, and ϕ_{Yn}(t) = (ϕZ(t/√n))^n. By the approximation e^x ≈ 1 + x + (1/2)x², we have ϕX(t) ≈ 1 + jt EX − (1/2)t² E[X²] for any random variable X; applying this to Z gives
ϕ_{Yn}(t) ≈ (1 − t²/(2n))^n → e^{−t²/2},
which is the characteristic function of N(0, 1).
• The case of Bernoulli(1/2) was derived by Abraham de Moivre
around 1733. The case of Bernoulli(p) for 0 < p < 1 was
considered by Pierre-Simon Laplace [9, p. 208].