Discrete Random Variables (Last Updated 3/10/14)
Contents
6 Discrete Random Variables
  6.1 Random Variables
  6.2 The PMF and CDF
    6.2.1 The Probability Mass Function of a Discrete Random Variable
    6.2.2 The Cumulative Distribution Function
  6.3 Examples of Discrete Random Variables
    6.3.1 The Bernoulli Distribution
    Categorical and Discrete Uniform Random Variables
  6.4 The Algebra of Random Variables
    6.4.1 Joint Distributions
  6.5 Expectation
    6.5.1 Expected Value
    6.5.2 Expectation of Functions of a Random Variable
    6.5.3 Properties of Expectation
    6.5.4 Variance
    6.5.5 Properties of the Variance
  6.6 Exercises
Chapter 6
Discrete Random Variables
For the rest of the chapter unless otherwise noted, we will let U denote a universe with
finitely or countably infinitely many elements, and P denote a probability measure
defined on all events in U . The domain of P is the power set of U , that is, the set
of all possible subsets (including the empty set as well as U itself). The power set of
U is denoted P(U ).
Together, the triple (U, P(U), P) is called a probability space.
6.1 Random Variables
Most universes in real problems are large.
• U = {All lists of 16 birthdays}, with #(U) = 365^16
• U = {All subsets of 10 items drawn from a lot of 30}, with #(U) = (30 choose 10)
If the universe itself is large, the power set of the universe is extraordinarily huge:
We have #(P(U)) = 2^#(U) (to count the number of subsets, consider choosing for every single element whether it gets included or excluded from the subset).
Typically we aren't interested in every possible event, but only in those that pick out outcomes with some common "feature":
• E = Lists with a repeated birthday
• E = Subsets with no defective items
In principle, we could calculate the probability of these events by breaking them
down into smaller disjoint sets whose probabilities we can compute. But it is simpler
to introduce random variables, which let us work with these features directly.
Definition 6.1 (Random Variable). A random variable, X, is a function whose domain is a universe, U, and whose range is a set R which is a subset of the real numbers ℝ. That is, for every a ∈ U, X(a) ∈ R ⊂ ℝ.
Example If U is the set of all subsets of 3 items drawn from a lot of 30, we can
define a random variable X which returns the number of defective items.
The range of X would then be the set {0, 1, 2, 3}.
Definition 6.2 (Discrete Random Variable). A random variable is called discrete if its range is either a finite set or a countably infinite set (e.g. ℕ, ℤ, etc.).
The random variable X defined above (counting defective items) is discrete because
its range is a finite set.
In contrast, suppose U is the set of all possible Tucson days. We can define a random variable Y which, for each day, gives its temperature. But since temperatures are real numbers, the range of Y is uncountably infinite, and so Y does not qualify as a discrete random variable.
We are often interested in events of this form: E = {a ∈ U : X(a) ∈ A}. In
words, these are events that occur whenever X takes on one of a particular set of
values.
• As a shorthand, we can write E = {X ∈ A}.
• If A is a singleton set (i.e., A = {x} for some x ∈ ℝ), then we can write E as {X = x}.
• If A is an interval, for example, A = [a, b], we can write E = {a ≤ X ≤ b}. If the interval is unbounded on one end, such as A = (−∞, b], then we have E = {X ≤ b}.
The probability of such events follows from the probability measure we have defined
on events in U :
• P(X ∈ A) = P({a ∈ U : X(a) ∈ A})
• P(X = x) = P({a ∈ U : X(a) = x})
• P(X ≤ x) = P({a ∈ U : X(a) ≤ x})
As such, everything we have established so far about the probability of events, their
complements, unions, etc., still applies. All we are doing is providing a bit more
structure to the universe (which, after all, is what all mathematics is about, isn’t
it?).
Example 1. Two fair dice are thrown. What is the probability that the sum of the
numbers appearing on the two upturned faces equals 7?
Solution Note that the sample space consists of all possible ordered pairs of the
numbers 1 through 6:
U = {(1, 1), (1, 2), . . . , (6, 5), (6, 6)}
#(U ) = 6 × 6 = 36
(by the generalized counting theorem)
We can define a random variable X whose value is the sum of the scores of the two
dice. In terms of X, the event of interest is
E = {X = 7} = {a ∈ U : X(a) = 7}
= {(1, 6), (2, 5), (3, 4), (4, 3), (5, 2), (6, 1)}
=⇒ #(E) = 6
It seems reasonable to assume that the ordered pairs are exchangeable in this case,
as we would not expect the probabilities to change if we relabeled the dice, so we
can compute
P(E) = #(E)/#(U) = 6/36 = 1/6
If we repeat this process for other possible values of X, we will find that P (X = x) = 0
unless x ∈ {2, 3, 4, . . . , 11, 12}, and that for these values, we can express the probability that X = x with the following algebraic expression:
P(X = x) = (6 − |x − 7|)/36
Note that by restricting our attention to the sum of the dice, we have reduced the
number of fundamental events from 36 to 11.
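As a quick computational check, the following Python sketch (ours, not part of the original text) enumerates all 36 ordered pairs and confirms the formula above:

    from itertools import product
    from fractions import Fraction

    # Enumerate all 36 equally likely ordered pairs of dice faces.
    outcomes = list(product(range(1, 7), repeat=2))

    for x in range(2, 13):
        # Count the outcomes whose faces sum to x.
        count = sum(1 for a, b in outcomes if a + b == x)
        assert Fraction(count, 36) == Fraction(6 - abs(x - 7), 36)

    print("P(X = x) = (6 - |x - 7|)/36 holds for x = 2, ..., 12")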
6.2 The PMF and CDF
Notice that a random variable X defines a partition of the universe, whose "compartments" are the events {X = x} for each x in the range of X. For our dice example, the events {X = 2} through {X = 12} partition all the possible outcomes of the two dice.
6.2.1 The Probability Mass Function of a Discrete Random Variable
Definition 6.3 (Probability Mass Function). Let X be a discrete random variable
with range R. We define the probability mass function, f , associated with X to
be a function that takes in a valid value for X, and returns the probability that X
takes that value. Formally, the domain of f is the set R (the range of X), and the
range of f is the interval [0, 1], and f (x) is defined for each x ∈ R to be
f (x) = P (X = x)
Theorem 6.1. Let U be a universe, P a probability measure defined on events in U ,
let X be a random variable on U , and let f be the PMF of X as determined by the
probability measure P .
Then, treating R itself as a universe, there is a unique probability measure, P′, defined on all subsets of R, such that P′({x}) = f(x) for each x ∈ R. For any set S ⊂ R, we have

P′(S) = Σ_{x∈S} f(x)
In other words, if we know f (x) for every x ∈ R, then there is exactly one way to
extend these values to a probability measure which is defined on all subsets of R.
The way to do this is to take any subset S of R and set its probability to be the sum
of the atomic probabilities of the elements of S.
One reason this matters is that it says that if we know the PMF of a discrete random
variable, X, then we know everything that X is capable of telling us about.
Proof. There are two parts to this claim: an existence claim and a uniqueness claim. We need to check each part: first that the probability measure defined as stated obeys the Kolmogorov axioms, and second that no other probability measure on P(R) agrees with P′ on all elements of R.
Existence. The first Kolmogorov axiom says that all probabilities are nonnegative. Since f(x) is a probability, we know that P′({x}) is nonnegative for every x ∈ R. Since the value of P′ on any other set is obtained by summing these nonnegative numbers, P′ is always nonnegative.
Next, we need to check that P′(R) = 1. By definition, we have

P′(R) = Σ_{x∈R} f(x) = Σ_{x∈R} P(X = x)

Since the sets {X = x} form a partition of the universe, property (P7) of probability measures implies that Σ_{x∈R} P(X = x) = 1.
Finally we need to check disjoint additivity. Suppose S and T are disjoint subsets of R. Then

P′(S ∪ T) = Σ_{x∈(S∪T)} f(x) = Σ_{x∈S} f(x) + Σ_{x∈T} f(x) = P′(S) + P′(T)

where the second equality follows from the fact that S and T are disjoint, and therefore do not share any elements whose probabilities might be counted twice if we split up the sum.
This completes the part of the proof that says that P′ as defined is a probability measure, since we have checked all of the Kolmogorov axioms.
Uniqueness. It remains only to check that no other probability measure exists (say, P̃) that satisfies P̃({x}) = f(x). The key here is disjoint additivity. Suppose we define P̃({x}) = f(x) on every x ∈ R. Then any set S = {x_1, . . . , x_M} which is a subset of R can be written as

S = {x_1} ∪ {x_2} ∪ · · · ∪ {x_M}

which is a disjoint union. Hence by the inductive extension of the disjoint additivity axiom, we must have

P̃(S) = Σ_{i=1}^{M} P̃({x_i}) = Σ_{i=1}^{M} f(x_i) = P′(S)

and so P̃ is indistinguishable from P′.
Note the following important fact about probability mass functions, which has come out in the course of the above proof:

Σ_{x∈R} f(x) = 1    (6.1)
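To make Theorem 6.1 concrete, here is a small Python sketch (an illustration of ours, using the dice-sum PMF from Example 1) that extends a PMF to events by summing atomic probabilities:

    from fractions import Fraction

    # PMF of the sum of two fair dice: f(x) = (6 - |x - 7|)/36.
    f = {x: Fraction(6 - abs(x - 7), 36) for x in range(2, 13)}

    def prob(S):
        # P'(S): sum the atomic probabilities of the elements of S.
        return sum(f[x] for x in S)

    # Equation (6.1): the PMF sums to 1 over the whole range.
    assert prob(f.keys()) == 1

    # Probability that the sum of the dice is even.
    print(prob({2, 4, 6, 8, 10, 12}))  # 1/2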
In the case where the range of the random variable is a finite set, we can visualize the probability mass function f with a probability histogram (or "spike plot"), with a bar centered at each x_j, whose height is f(x_j). If we again consider the random variable whose value is equal to the sum of the faces of two six-sided dice, then the probability histogram looks like Fig. 6.1.
Figure 6.1: A probability histogram for the random variable representing the sum of two fair die rolls
6.2.2 The Cumulative Distribution Function
We often are interested in events of the form {X ≤ x} for some x ∈ R. For example,
we might ask “What’s the probability that there is at most one defective item in our
draw?”
We can record the probabilities of all events of this form using the cumulative distribution function.
Definition 6.4 (Cumulative Distribution Function). Let X be a discrete random
variable with range R.
The cumulative distribution function for X is a function, F, from R to [0, 1], defined for each x ∈ R as

F(x) = P(X ≤ x)
Note that if R has a smallest element, x_1, so that we can list the elements of R in increasing order as x_1, x_2, . . . , then

F(x_r) = Σ_{j=1}^{r} f(x_j)
where f is the probability mass function associated with X. This follows from the
fact that the event {X ≤ xr } is a disjoint union of the events {X = xj } for all j
ranging from 1 to r, and so the probability of the union is the sum of the probabilities
of these events, which are the f (x)s.
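Computationally, the CDF of a discrete random variable is just a running sum of the PMF. A short sketch of ours, continuing the dice example:

    from fractions import Fraction
    from itertools import accumulate

    xs = list(range(2, 13))                           # range of X, in increasing order
    pmf = [Fraction(6 - abs(x - 7), 36) for x in xs]  # f(x_1), ..., f(x_11)

    # F(x_r) is the running sum f(x_1) + ... + f(x_r).
    F = dict(zip(xs, accumulate(pmf)))

    print(F[7])   # P(X <= 7) = 21/36 = 7/12
    print(F[12])  # 1, since 12 is the largest element of the range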
Theorem 6.2 (Basic Properties of the CDF). Let F be the CDF of a random variable X, with range R indexed by integers, so that x_i < x_j whenever i < j (technically this rules out certain discrete ranges, like the rational numbers, which cannot be indexed in this way, but it covers all the cases that we will be interested in, which are usually subsets of the integers). Then

(i) F(x_r) − F(x_{r−1}) = f(x_r) for all r ≥ 2

(ii) F is a nondecreasing function. That is, whenever x_t > x_s,

F(x_t) ≥ F(x_s)

(iii) If R has a largest element, x_n, then F(x_n) = 1. (Otherwise lim_{r→∞} F(x_r) = 1.)
Proof.

(i) We can write

F(x_r) = P({X ≤ x_r})
       = P({X ≤ x_{r−1}} ∪ {X = x_r})
       = P({X ≤ x_{r−1}}) + P({X = x_r})    (Disjoint additivity)
       = F(x_{r−1}) + f(x_r)                (Defs of F and f)

Subtracting F(x_{r−1}) from the first and last expressions gives us (i).
(ii) We have, using (i) and the fact that f(x_r) ≥ 0 for all r:

F(x_r) = F(x_{r−1}) + f(x_r) ≥ F(x_{r−1})

(iii) We have

F(x_n) = Σ_{r=1}^{n} f(x_r)    (by (6.2.2))
       = 1                      (by (6.1))

where the last equality follows from the fact that we assume x_n to be the largest element of R, and so the sum is over every element of R.
The CDF tells us directly how to compute probabilities of events of the form {X ≤ x} for some x ∈ R. With barely any additional calculation, we can also use the CDF to compute probabilities for events of the form {X < x} and {x_r ≤ X ≤ x_s}, as the next theorem illustrates.

Theorem 6.3 (More Properties of the CDF). Let X and F be defined as in the previous two theorems. Then

(i) P(X > x_r) = 1 − F(x_r)

(ii) For r ≤ s, we have

P(x_r ≤ X ≤ x_s) = F(x_s) − F(x_{r−1})
Proof.

(i) We can write

P(X > x_r) = P({X ≤ x_r}^c)
           = 1 − P(X ≤ x_r)    (by the complement rule)
           = 1 − F(x_r)        (by the def. of F)
(ii) We have

P(x_r ≤ X ≤ x_s) = P({X ≤ x_s} − {X < x_r})
                 = P({X ≤ x_s}) − P({X < x_r})        ((P5), since {X < x_r} ⊂ {X ≤ x_s})
                 = P({X ≤ x_s}) − P({X ≤ x_{r−1}})    ({X < x_r} = {X ≤ x_{r−1}})
                 = F(x_s) − F(x_{r−1})
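A quick numerical check of Theorem 6.3(ii) on the dice-sum distribution (again a sketch of ours):

    from fractions import Fraction
    from itertools import accumulate

    xs = list(range(2, 13))
    F = dict(zip(xs, accumulate(Fraction(6 - abs(x - 7), 36) for x in xs)))

    # Theorem 6.3(ii): P(4 <= X <= 9) = F(9) - F(3).
    direct = sum(Fraction(6 - abs(x - 7), 36) for x in range(4, 10))
    assert F[9] - F[3] == direct
    print(F[9] - F[3])  # 3/4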
6.3 Examples of Discrete Random Variables
Every situation has a different sample space. Random variables, however, take these
sample spaces and create a partition, which, as we’ve seen, induces a simplified
domain of events, and a simplified probability measure. As a result, problems that
are very different at the level of sample space can give rise to random variables with
very similar structures. This similarity allows us to generalize several probabilistic
properties from one situation to another which is seemingly very different.
6.3.1 The Bernoulli Distribution
The simplest possible random variable takes the universe and divides it into two pieces. On one piece, the random variable takes the value 1. On the other, it takes the value 0. Any proposition which, for each outcome in U, is either true or false immediately defines such a random variable: set it to 1 for all outcomes that make the proposition true, and 0 elsewhere. A random variable with this structure is called a Bernoulli random variable.
• Question: What’s the range of a Bernoulli random variable?
Since we must have Σ_{x=0}^{1} f(x) = 1, the probability mass function of a Bernoulli random variable is determined entirely by setting one of its values, since the other value is obtained from the complement rule.
We define p, 0 ≤ p ≤ 1 to be equal to f (1), which implies that f (0) = 1 − p.
This value p is called a parameter of the random variable. If X is a Bernoulli
random variable with parameter p, you may see the notation
X ∼ Bern(p)
where the ∼ symbol is read “is distributed as”.
We can write the probability mass function concisely as

f(x) = p^x (1 − p)^{1−x}

The use of exponents here is a bit "clever": since x is always 0 or 1, one of the exponents is always 1, and the other is always 0. Whichever term has an exponent of zero disappears, since its value is just 1.
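A tiny sketch (ours, with an assumed parameter p = 0.3 purely for illustration) confirms that the closed form agrees with the case-by-case definition:

    def bern_pmf(x, p):
        # Bernoulli PMF: f(x) = p^x * (1 - p)^(1 - x) for x in {0, 1}.
        return p**x * (1 - p)**(1 - x)

    p = 0.3
    assert bern_pmf(1, p) == p       # f(1) = p
    assert bern_pmf(0, p) == 1 - p   # f(0) = 1 - p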
Categorical and Discrete Uniform Random Variables
We can generalize the notion of a Bernoulli random variable to one that takes more
than two (but still a finite number of) values.
A random variable whose range is a finite set, R = {x1 , x2 , . . . , xn }, and which has
no restrictions on its PMF, is called a categorical random variable.
In the most general case, we can’t simplify the probability mass function beyond just
stating all its values. We can denote
p_1 = f(x_1),  p_2 = f(x_2),  . . . ,  p_n = f(x_n)

and say

X ∼ Cat(p_1, p_2, . . . , p_n)

(Of course we do not have complete freedom in choosing the p_r s, as they must sum to 1.)
If we’re in a situation where the values of the random variable are exchangeable (that
is, the numbers assigned are basically arbitrary labels), then we can set
p_1 = p_2 = · · · = p_n
which, since p(·) sums to 1, implies, as we've seen before, that

p(x_r) = 1/n    for all r ∈ {1, 2, . . . , n}
Then, provided x_1 < x_2 < · · · < x_n, by (6.2.2) we get

F(x_r) = r/n

In this case, the random variable has a discrete uniform distribution and we might write

X ∼ DU(n)
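As a minimal sketch (ours, assuming n = 6 values, e.g. one fair die), we can check that the discrete uniform CDF climbs in steps of 1/n:

    from fractions import Fraction

    n = 6
    p = [Fraction(1, n)] * n  # p_r = 1/n for every r

    # F(x_r) = r/n for the discrete uniform distribution.
    for r in range(1, n + 1):
        assert sum(p[:r]) == Fraction(r, n)
    print("F(x_r) = r/n holds for r = 1, ...,", n)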
6.4 The Algebra of Random Variables
Remember that random variables are functions from U to ℝ. As such, we can manipulate random variables in the same way that we manipulate arbitrary real-valued functions.
If X and Y are random variables, α is a real number, and g is a function from ℝ to ℝ (for example g(x) = x² or g(x) = sin(x)), then the following are all random variables:
(i) X + Y
(ii) αX
(iii) X + α
(iv) XY
(v) g(X)
They are defined in the natural way. For any a ∈ U :
(i) (X + Y )(a) = X(a) + Y (a)
(ii) (αX)(a) = αX(a)
(iii) (X + α)(a) = X(a) + α
(iv) (XY )(a) = X(a)Y (a)
(v) (g(X))(a) = g(X(a))
For example:
(i) If X is the number of boys born in Tucson in a given year, and Y is the number
of girls, then X + Y is the total number of babies.
(ii) If X is the number of items of a certain kind that a store sells in a given day,
and β is the profit margin per unit, then βX is a random variable representing
the gross profit in a given day.
(iii) If α is the fixed operating cost of running the store for a day, then βX − α is
the daily profit.
(iv) A machine produces window panes to target dimensions, but with some error. If X is the width of a given window, and Y is its height, then XY describes the area of the window.
(v) If X is the amplitude of a sound wave, then (roughly) log(X) is the decibel
level.
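Since random variables are just functions on the universe, this algebra can be mirrored directly in code. A small sketch of ours, using the two-dice sample space (all names are our own choices):

    from itertools import product

    # Sample space: ordered pairs of two dice.
    U = list(product(range(1, 7), repeat=2))

    def X(a): return a[0]  # value of the first die
    def Y(a): return a[1]  # value of the second die

    # Pointwise definitions, exactly as in the list above.
    def S(a): return X(a) + Y(a)        # (X + Y)(a) = X(a) + Y(a)
    def twice_X(a): return 2 * X(a)     # (alpha X)(a) with alpha = 2
    def X_squared(a): return X(a) ** 2  # (g(X))(a) with g(x) = x^2

    print(S((3, 4)), twice_X((3, 4)), X_squared((3, 4)))  # 7 6 9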
6.4.1 Joint Distributions
If X and Y are two random variables on the same probability space (U, P(U ), P ),
then any given sample element will have an X value and a Y value. For example,
we might take U to be the set of students in this class, and let X be major (for
simplicity, say ISTA vs. not ISTA) and Y be handedness. Technically we should
assign numbers to each category — say 0 and 1. Using RX to denote the range of X
and RY to denote the range of Y , we have
R_X = {0, 1} (= {Non-ISTA, ISTA})
R_Y = {0, 1} (= {Left, Right})
Then we can talk about the probability of a particular combination of values (say, a
left-handed ISTA major).
Definition 6.5 (Joint Distribution). Let X and Y be discrete random variables with
ranges RX = {x1 , x2 , . . . } and RY = {y1 , y2 , . . . }, and probability mass functions fX
and fY , respectively. We define the joint distribution (or joint mass function)
to be a function fX,Y from RX × RY (the set of all ordered pairs (x, y), where x ∈ RX
and y ∈ RY ) to [0, 1], which sets
fX,Y (x, y) = P ({X = x} ∩ {Y = y})
In a sense we can think of a "two-dimensional" random variable, or random vector, (X, Y), whose range is R_X × R_Y (i.e., all ordered pairs whose first element is in R_X and whose second element is in R_Y), and whose probability mass function is f_{X,Y}.
In our example, we can think of the joint mass function as applying to the cells of a 2 × 2 table (see Table 6.1).

                                x (Major)
                     0 (non-ISTA)    1 (ISTA)
y (Hand)  0 (Left)   f_{X,Y}(0, 0)   f_{X,Y}(1, 0)
          1 (Right)  f_{X,Y}(0, 1)   f_{X,Y}(1, 1)

Table 6.1: A Simple Joint Mass Function
Then, for example, the probability of selecting a left-handed ISTA major is given by
fX,Y (1, 0).
It should be easy to see that the joint distribution function has the following properties:
Lemma 6.4 (Marginalization).

(i) For each y ∈ R_Y, Σ_{x∈R_X} f_{X,Y}(x, y) = f_Y(y)

(ii) For each x ∈ R_X, Σ_{y∈R_Y} f_{X,Y}(x, y) = f_X(x)

(iii) Σ_{x∈R_X} Σ_{y∈R_Y} f_{X,Y}(x, y) = 1
Proof. Every element of the sample space gets mapped to some cell of the distribution
table. The table therefore gives us three different ways of constructing a partition:
one by taking each row as a block, one by taking each column as a block, and the
third by taking each cell as a block.
Part (i) says that when we sum over the cells in a particular column, we get the total
(“marginal”) probability assigned to that column.
Part (ii) says that when we sum over the cells in a particular row, we get the total
(“marginal”) probability assigned to that row.
Part (iii) says that when we sum over all the cells, we get the probability of the entire
sample space, which is 1.
More formally:
(i) For any y ∈ R_Y, we have

Σ_{x∈R_X} f_{X,Y}(x, y) = Σ_{x∈R_X} P({X = x} ∩ {Y = y})
    = P(∪_{x∈R_X} ({X = x} ∩ {Y = y}))    (Disjoint additivity)
    = P((∪_{x∈R_X} {X = x}) ∩ {Y = y})    (Distributive)
    = P(U ∩ {Y = y})                      ({X = x}s form a partition)
    = P(Y = y)
    = f_Y(y)                              (Def. of PMF)

(ii) Exactly the same as (i), with roles reversed.

(iii) We can either use (6.1), applied to (i) or (ii), or we can note that the sets {X = x_r} ∩ {Y = y_s}, taking every combination of r and s, form a partition of U, and hence (P7) applies.
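In code, marginalizing a joint PMF is just summing rows or columns of the table. A short sketch of ours, with made-up illustrative probabilities (not data from the text) that sum to 1:

    from fractions import Fraction as F

    # Hypothetical joint PMF f_{X,Y}(x, y): x = major, y = handedness.
    joint = {
        (0, 0): F(8, 100), (0, 1): F(72, 100),   # non-ISTA: left, right
        (1, 0): F(2, 100), (1, 1): F(18, 100),   # ISTA: left, right
    }

    # Lemma 6.4(ii): f_X(x) = sum over y of f_{X,Y}(x, y).
    f_X = {x: sum(p for (xx, y), p in joint.items() if xx == x) for x in (0, 1)}
    # Lemma 6.4(i): f_Y(y) = sum over x of f_{X,Y}(x, y).
    f_Y = {y: sum(p for (x, yy), p in joint.items() if yy == y) for y in (0, 1)}

    print(f_X)  # {0: Fraction(4, 5), 1: Fraction(1, 5)}
    print(f_Y)  # {0: Fraction(1, 10), 1: Fraction(9, 10)}
    assert sum(joint.values()) == 1  # Lemma 6.4(iii)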
6.5 Expectation
So far, we have characterized random variables mainly in terms of their PMF and
CDF. Both of these give us a list of probabilities that we can use to calculate the
probability of any event we want.
Often we want more concise information, telling us
• What’s a central, “typical” value of the feature represented by the random
variable?
• How much variability is there in this feature?
Both of these questions (along with lots of others) can be addressed by taking
expectations of functions of the random variable.
Analogy Suppose I own an electronics store. On 20% of the days my shop is open,
I don’t sell any laptops. 40% of the time I sell one, 30% of the time I sell 2, and 10%
of the time I sell 3. How many laptops do I expect to sell tomorrow?
Our expectation is a weighted average of all the possible things that can happen, where more likely outcomes are weighted more heavily.
6.5.1 Expected Value
Definition 6.6 (Expected Value/Mean of a Random Variable). Let X be a random variable with range R = {x_1, x_2, . . . } and probability mass function f. We define the expectation (or mean) of X, written E[X], to be

E[X] = Σ_{x∈R} x · f(x)
We will sometimes use the Greek letter µ (pronounced “mew”, like a cat) to stand
for E [X]. This is the Greek version of “m” for “mean”.
In the laptop example, we have R = {0, 1, 2, 3}, so I have:

µ = E[X] = Σ_{x=0}^{3} x · f(x)
         = 0 · 0.2 + 1 · 0.4 + 2 · 0.3 + 3 · 0.1
         = 0 + 0.4 + 0.6 + 0.3
         = 1.3
On average, I sell 1.3 laptops a day. Notice that the average is not a possible
value for X. This is often the case for discrete random variables. One reason to
define expected value this way is that, under a relative frequency interpretation of
probability, it represents the “long run average” if we repeatedly observe the random
variable. Consider repeating a random experiment N times, and observing the value of the random variable X each time. Letting x^{(i)} denote the value of X on the i-th run of the experiment, the sample mean of the runs will be

(1/N) Σ_{i=1}^{N} x^{(i)}

We can simplify this sum by grouping together identical terms, to get

(1/N) Σ_{x∈R} Count(X = x) · x
In a sample of size N, we expect x_1 to occur about N · f(x_1) times (on average), x_2 to occur N · f(x_2) times, and so on, so that if we compute the sample mean, it will tend toward

(1/N) Σ_{x∈R} N · f(x) · x = Σ_{x∈R} f(x) · x

which is exactly how we've defined expected value.
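A quick simulation sketch (ours) illustrates this long-run-average interpretation for the laptop example; the sample mean drifts toward E[X] = 1.3 as N grows:

    import random

    random.seed(0)  # fixed seed so the illustration is reproducible

    values = [0, 1, 2, 3]
    probs = [0.2, 0.4, 0.3, 0.1]  # the laptop PMF

    # Simulate N days of sales; the sample mean should approach 1.3.
    N = 100_000
    sales = random.choices(values, weights=probs, k=N)
    print(sum(sales) / N)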
6.5.2 Expectation of Functions of a Random Variable
Suppose that, in the laptop example above, I charge $1000 for each laptop I sell.
How much do I expect to bring in tomorrow?
The logic is the same as when I determine the expected number of laptops sold; I
just have to apply the income function first.
Let h(x) = 1000x represent my gross income in a day. Then I can define

E[h(X)] = Σ_{x=0}^{3} h(x) f(x)
        = 1000 · 0 · 0.2 + 1000 · 1 · 0.4 + 1000 · 2 · 0.3 + 1000 · 3 · 0.1
        = 0 + 400 + 600 + 300
        = 1300
So I expect to bring in $1300 on average in a given day.
Suppose I buy my laptops for $800 each. If I want to compute my profit, as opposed to my raw income, I can set g(x) = 1000x − 800x = 200x and compute my expected profit the same way.

Or I can take into account my fixed cost of operating the store; say, $100 a day.
Then my daily profit function is g(x) = 200x − 100, and I get

E[g(X)] = Σ_{x=0}^{3} (200x − 100) f(x)
        = Σ_{x=0}^{3} 200x f(x) − Σ_{x=0}^{3} 100 f(x)
        = 200 Σ_{x=0}^{3} x f(x) − 100 Σ_{x=0}^{3} f(x)
        = 200 · E[X] − 100 · 1
        = 260 − 100
        = 160
So, I turn a profit of $160 a day, on average.
Notice the general pattern:
Definition 6.7 (Expectation of a Function of a Random Variable). Let X be a
random variable with range R, and let h be any function which is defined on R, and
which returns real numbers. Remember, we showed that h(X) is a random variable
in its own right. Then we have
E[h(X)] = Σ_{x∈R} h(x) f(x)
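Once the PMF is in hand, each of the calculations above is a one-liner. A sketch of ours, reusing the laptop PMF:

    values = [0, 1, 2, 3]
    probs = [0.2, 0.4, 0.3, 0.1]

    def expect(h):
        # E[h(X)] = sum of h(x) * f(x) over the range of X.
        return sum(h(x) * p for x, p in zip(values, probs))

    print(expect(lambda x: x))              # E[X] = 1.3 (up to float rounding)
    print(expect(lambda x: 1000 * x))       # expected gross income, 1300
    print(expect(lambda x: 200 * x - 100))  # expected daily profit, 160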
6.5.3 Properties of Expectation
Recall from our discussion of the algebra of random variables that if X and Y are both
random variables, then X + Y is also a random variable. Therefore, the expectation
of X + Y is a well-defined quantity.
We could compute it by looking at the range of X +Y and computing a PMF for this
random variable, which we would use to find the expectation. However, it’s often
easier to stick with pairs (x, y), and compute the weighted average of the numbers
x + y, where the weights come from the joint distribution of X and Y . This will give
us the same result; just with potentially more terms in the sum.
Definition 6.8. Let X have range RX and PMF fX , let Y have range RY and PMF
fY , and let fX,Y be the joint PMF of X and Y . Then
E[X + Y] = Σ_{(x,y)∈R_X×R_Y} (x + y) f_{X,Y}(x, y)
It turns out that we never have to use this definition directly, due to the following
Theorem.
Theorem 6.5 (a). For any two random variables, X and Y , on a common probability
space,
E [X + Y ] = E [X] + E [Y ]
Proof. This is a matter of some algebra, and using the Marginalization Lemma.

E[X + Y] = Σ_{(x,y)∈R_X×R_Y} (x + y) f_{X,Y}(x, y)    (Def. 6.8)
         = Σ_{x∈R_X} Σ_{y∈R_Y} (x + y) f_{X,Y}(x, y)
         = Σ_{x∈R_X} Σ_{y∈R_Y} (x f_{X,Y}(x, y) + y f_{X,Y}(x, y))
         = Σ_{x∈R_X} Σ_{y∈R_Y} x f_{X,Y}(x, y) + Σ_{x∈R_X} Σ_{y∈R_Y} y f_{X,Y}(x, y)
         = Σ_{x∈R_X} x Σ_{y∈R_Y} f_{X,Y}(x, y) + Σ_{y∈R_Y} y Σ_{x∈R_X} f_{X,Y}(x, y)
         = Σ_{x∈R_X} x f_X(x) + Σ_{y∈R_Y} y f_Y(y)    (Marginalization Lemma)
         = E[X] + E[Y]                                 (Def. of E)
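A numerical check of Theorem 6.5(a) (a sketch of ours, using the two-dice sample space, where each of the 36 outcomes is equally likely):

    from fractions import Fraction
    from itertools import product

    U = list(product(range(1, 7), repeat=2))
    P = Fraction(1, len(U))  # each outcome is equally likely

    E_X = sum(a * P for a, b in U)          # E[X] for the first die: 7/2
    E_Y = sum(b * P for a, b in U)          # E[Y] for the second die: 7/2
    E_sum = sum((a + b) * P for a, b in U)  # E[X + Y], computed directly

    assert E_sum == E_X + E_Y  # 7 = 7/2 + 7/2
    print(E_sum)               # 7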
Theorem 6.5 (b). If the function h is a linear function (that is, h(x) = ax + b for some choice of constants a and b), then we can exchange the order of E and h. That is:

E[h(X)] = h(E[X]), that is to say, E[aX + b] = aE[X] + b
Proof. This comes straight from the definition. Suppose X has range R and PMF f. Then

E[aX + b] = Σ_{x∈R} (ax + b) f(x)
          = Σ_{x∈R} (ax f(x) + b f(x))
          = Σ_{x∈R} ax f(x) + Σ_{x∈R} b f(x)
          = a Σ_{x∈R} x f(x) + b Σ_{x∈R} f(x)
          = aE[X] + b

where in the last line we've used the definition of expectation in the first term, and the fact that the PMF sums to 1 in the second term.
Note that we're free to set b = 0 or a = 1 in the above, in which case we get the following two statements:

E[aX] = aE[X]    (6.2)
E[X + b] = E[X] + b    (6.3)

6.5.4 Variance
Suppose we have already calculated µ = E [X]. Then this is just a number. If we
now set h(x) = (x − µ)2 , then E [h(X)] is called the variance of X. It gives us a
way of measuring how far away X tends to be from its mean. It’s also a measure of
how much uncertainty is associated with our expectation of the value of X itself: if
the variance is high, we “expect” to be off by a greater amount.
Definition 6.9 (Variance of a Random Variable). Let X be defined as above, and
let µ = E [X]. We define the variance of X to be
Var[X] = E[(X − µ)²] = Σ_{x∈R} (x − µ)² f(x)
Example In the case of the laptops, I have

Var[X] = E[(X − µ)²] = E[(X − 1.3)²]
       = Σ_{x=0}^{3} (x − 1.3)² f(x)
       = (0 − 1.3)² · 0.2 + (1 − 1.3)² · 0.4 + (2 − 1.3)² · 0.3 + (3 − 1.3)² · 0.1
       = 1.69 · 0.2 + 0.09 · 0.4 + 0.49 · 0.3 + 2.89 · 0.1
       = 0.338 + 0.036 + 0.147 + 0.289
       = 0.810
The variance has many useful and elegant mathematical properties. One of its limitations, however, is the fact that it's expressed in different units from those of X itself: namely, squared units. This fact is captured in the alternative symbol for variance, σ², where σ is the Greek letter "sigma".
We can return to the original units by taking the square root, leaving us with σ.
This is called the standard deviation of X.
Definition 6.10 (Standard Deviation of a Random Variable). Let X be defined as
above, with mean µ and variance Var [X]. We define the standard deviation of X
to be the square root of the variance.
σ = √Var[X]
Example The standard deviation in the daily number of laptops is σ = √σ² = √0.81 = 0.9.
Notice that there’s no function h that we can pick (in general) so that σ = E [h(X)].
We need to compute the variance first in order to get the standard deviation.
6.5.5 Properties of the Variance
The following theorem gives us an easier way to compute variance than using the
direct definition.
Theorem 6.6 (a).

Var[X] = E[X²] − E[X]²
Proof. Just follow the algebra, and apply Thm 6.5. Denote µ = E[X].

Var[X] = E[(X − µ)²]
       = E[X² − 2µX + µ²]
       = E[X²] − 2µE[X] + µ²          (Thm. 6.5(b); µ is a number)
       = E[X²] − 2E[X]E[X] + E[X]²    (Def. of µ)
       = E[X²] − 2E[X]² + E[X]²
       = E[X²] − E[X]²
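Checking Theorem 6.6(a) numerically on the laptop example (a sketch of ours; exact fractions avoid float rounding):

    from fractions import Fraction as F

    pmf = {0: F(2, 10), 1: F(4, 10), 2: F(3, 10), 3: F(1, 10)}

    E_X = sum(x * p for x, p in pmf.items())      # 13/10
    E_X2 = sum(x**2 * p for x, p in pmf.items())  # 5/2

    var_def = sum((x - E_X) ** 2 * p for x, p in pmf.items())  # definition
    var_shortcut = E_X2 - E_X**2                               # Thm 6.6(a)

    assert var_def == var_shortcut == F(81, 100)
    print(var_shortcut)  # 81/100 = 0.81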
If we rescale X by multiplying by a constant, the variance changes in a simple way;
but not just by the constant the way the mean does. Remember, variance is in
squared units.
Theorem 6.6 (b). Let a ∈ R be a constant. We already know that aX is then a
random variable, and so it has a variance. In particular,
Var [aX] = a2 Var [X]
Proof. This is just algebra, and we could get the result directly from the definition of variance, but it's easier to use the alternative expression for variance from Thm 6.6(a), and then apply Thm 6.5(b).

Var[aX] = E[(aX)²] − E[aX]²         (Thm. 6.6(a))
        = E[a²X²] − E[aX]²
        = a²E[X²] − (aE[X])²         (Thm. 6.5(b))
        = a²E[X²] − a²E[X]²
        = a²(E[X²] − E[X]²)
        = a²Var[X]                   (Thm. 6.6(a))
What should happen to the variance if we add a constant to a random variable?
We saw that when we add (respectively, subtract) a constant, the mean shifts up
(respectively, down) by that exact amount. But the variance is a measure of how far
away the variable tends to be from its mean. Since both the mean and the individual
values are moving the same way, the variance should not change. Indeed this is the
case:
Theorem 6.7. Let X be a random variable with a defined variance, and let c be a
real constant. Then
Var [X + c] = Var [X]
Proof. Homework
6.6 Exercises
1. (Adapted from Applebaum) A discrete random variable has range R = {0, 1, 2, 3, 4, 5} and cumulative distribution whose first five values are

   F(0) = 0, F(1) = 1/9, F(2) = 1/6, F(3) = 1/3, F(4) = 1/2
(a) Find f (x), the value of the PMF, for each x ∈ R.
(b) Find E [X] and Var [X].
2. (Adapted from Applebaum) Three fair coins are tossed. Find the PMF of the
random variable Y whose value is given by the total number of heads.
3. A bucket has a red balls and b blue ones. You draw k balls at random (1 ≤ k ≤ a + b) without replacement from the bucket. Let Y be a random variable representing the number of red balls in your sample. Find an expression representing the probability that Y = y.
4. Let X be a discrete random variable with range R = {x_1, . . . , x_n}, where the elements are listed in increasing order. Show that the following probabilities can be computed from the CDF as follows:

   (a) P(x_r < X ≤ x_s) = F(x_s) − F(x_r)
   (b) P(x_r < X < x_s) = F(x_{s−1}) − F(x_r)
   (c) P(x_r ≤ X < x_s) = F(x_{s−1}) − F(x_{r−1})
   (d) P(X ≥ x_r) = 1 − F(x_{r−1})
5. A game show allows contestants to win money by spinning a wheel. The wheel
is divided into four equally sized spaces, three of which contain dollar amounts
and one of which says BUST. The contestant can spin as many times as she
wants, adding any money won to her total, but if she lands on BUST, the game
ends and she loses everything won so far.
(a) Suppose for the moment that the contestant keeps going until she goes
bust. Let W be the random variable that represents the number of successful spins before going bust. Indicate the valid range for W , and find
an expression for its CDF.
(b) Use the CDF you found above to derive an expression for the PMF of W .
(c) The dollar amounts on the non-BUST spaces are $100, $500, and $5000. Let X_n be the random variable denoting the amount of money won or lost on the n-th spin. Find E[X_1] (note that there is nothing to lose on the first spin).

(d) Let S_n be the cumulative amount won after n spins. Find E[X_n | S_{n−1}], as a function of S_{n−1}. How large does S_{n−1} have to get before the expected gain by spinning again is negative?
6. Use induction to extend Theorem 6.5(a), showing that if X_1, . . . , X_n are arbitrary random variables, then

   E[X_1 + · · · + X_n] = E[X_1] + · · · + E[X_n]
7. Let X be a random variable and c ∈ R a constant. Show that
Var [X + c] = Var [X]
8. (Adapted from Bolstad) A discrete random variable, Y, has a probability mass function given by the following table.

   y     0    1    2    3    4
   f(y)  0.2  0.3  0.1  0.1  0.1
(a) Calculate P (1 < Y ≤ 3).
(b) Calculate E [Y ].
(c) Calculate Var [Y ].
(d) Let W = 2Y + 3. Calculate E [W ].
(e) Calculate Var [W ].
9. (Adapted from Bolstad) A discrete random variable, Y, has a probability mass function given by the following table.

   y     0    1    2    5
   f(y)  0.1  0.2  0.3  0.4
(a) Calculate P (0 < Y < 5).
(b) Calculate E [Y ].
(c) Calculate Var [Y ].
(d) Let W = 3Y − 1. Calculate E [W ].
(e) Calculate Var [W ].
10. (Adapted from Bolstad) Let X and Y be jointly distributed discrete random variables. Their joint probability distribution is given in the following table:

                        y
    x      1      2      3      4      5
    1    0.02   0.04   0.06   0.08   0.05
    2    0.08   0.02   0.10   0.02   0.03
    3    0.05   0.05   0.03   0.02   0.10
    4    0.10   0.04   0.05   0.03   0.03
(a) Find the marginal PMF of X.
(b) Find the marginal PMF of Y .
(c) Calculate the conditional probability P (X = 3|Y = 1).