Chapter 2
Bernoulli Random Variables in n Dimensions
1. Introduction
This chapter is dedicated to my STAT 305C students at Iowa State University in the Fall 2006 semester. Their thoughtful questions throughout the course, especially in relation to histogram uncertainty, convinced me to address the issues in this chapter in a rigorous way, and in a format that I believe is accessible to those who have a general interest in randomness.
There are many phenomena that involve only two possible recordable or measurable outcomes. Decisions ranging from the yes/no type to the success/failure type abound in everyday life. Will I get to work on time today, or won't I? Will I pass my exam, or won't I? Will the candidate get elected, or not? Will my friend succeed in her business, or won't she? Will my house withstand an earthquake of 6+ magnitude, or won't it? Will I meet an interesting woman at the club tonight, or won't I? Will my sister's cancer go into remission, or won't it? The list of examples could go on for volumes. They all entail an element of uncertainty; else why would one ask the question? With enough knowledge, this uncertainty can be captured by an assigned probability for one of the outcomes. It does not matter which outcome is assigned that probability, since the other outcome will then have a probability that is one minus the assigned probability. The act of asking any of the above questions, and then recording the outcome, is the essence of what in the realm of probability and statistics is termed a Bernoulli random variable, as now defined.
Definition 1.1 Let X denote a random variable (i.e. an action, operation, observation, etc.)
the result of which is a recorded zero or one. Let the probability that the recorded
outcome is one be specified as p. Then X is said to be a Bernoulli(p) random variable.
This definition specifically avoids the use of any real mathematical notation, so that the reader is not unduly distracted from the conceptual meaning of a Ber(p) random variable. While this works for a single random variable, when we address larger collections of them it is extremely helpful to have a more compact notation. For this reason, we now give a more mathematical version of the above definition.
Definition 1.2 Let X be a random variable whose sample space is $S_X = \{0,1\}$, and let p denote the probability of the set {1}. In compact notation, this is often written as $\Pr[X = 1] = p$. Then X is said to be a Bernoulli(p), or, simply, a Ber(p), random variable.
Since this author feels that many people grasp concepts better with visuals, the
probability structure of a Ber(p) random variable is shown in Figure 1.
At one level, Figure 1 is very simple. The values that X can take on are marked on the horizontal axis, and the probabilities associated with them are marked on the vertical axis. However, conceptually, the implications of Figure 1 are deep.
[Figure 1 plots $\Pr[X = x]$ on the vertical axis against the horizontal axis that includes the sample space of X, with a lump of probability 0.3 at x = 0 and 0.7 at x = 1.]
Figure 1. The probability structure for a Ber(p=0.7) random variable.
X is a 1-dimensional (1-D) random variable, since the values that it can take on constitute its sample space $S_X = \{0,1\}$, which contains simply numbers, or scalars. So, these numbers can be identified as a subset of the real line, which in Figure 1 is the horizontal axis. Since probabilities are also just numbers, they require only one axis, which in Figure 1 is the vertical line. But what if X were a 2-D random variable; that is, what if its sample space were a collection of ordered pairs? As we will see presently, we would then need to use a plane (i.e. an area associated with, say, a horizontal line and a vertical line). In that case, the probabilities would have to be associated with a third line (e.g. a line coming out of the page). To summarize this concept: the probability description for any random variable requires that one first identify its sample space. In the case of Figure 1, that entailed drawing a line, and then marking the values zero and one on that line. Second, one then attaches probability information to the sample space. In the case of Figure 1, that entailed drawing a line perpendicular to the first line, and including numerical probabilities associated with zero and one.
Another conceptually deep element of Figure 1 is one that Figure 1 (like almost any probability figure in any textbook in the area) fails to highlight. It is the fact that, in Figure 1, the probability 0.7 is not, I repeat, NOT the probability associated with the number 1. Rather, it is the probability associated with the set {1}. While many might argue that this distinction is overly pedantic, I can assure you that ignoring it is, in my opinion, one of the most significant sources of confusion for students taking a first course in probability and statistics (and even for some students in graduate level courses I have taught). Ignoring this distinction in the 1-D case shown in Figure 1 might well cause no problems. But ignoring it in higher dimensional cases can result in big problems. So, let's get it straight here and now.
Definition 1.3 The probability entity Pr(•) is a measure of the size of a set.
In view of this definition, Pr(1) makes no sense, since 1 is a number, not a set. However, Pr({1}) makes perfect sense, since {1} is a set (as denoted by the braces { }), and this set contains only the number 1. Since Pr(A) measures the "size" of a set A, we can immediately apply natural reasoning to arrive at what some books term "axioms of probability". These include the following:
Axiom 1. $\Pr(S_X) = 1$.
Axiom 2. $\Pr(\emptyset) = 0$, where $\emptyset = \{\,\}$; that is, $\emptyset$ is the empty set.
Axiom 3. Let A and B be two subsets of $S_X$. Then $\Pr(A \cup B) = \Pr(A) + \Pr(B) - \Pr(A \cap B)$.
The first axiom simply says that when one performs the action and records a resulting number, the probability that the number is in $S_X$ must equal one. When you think about it, by definition, it cannot be a number that is not in $S_X$. The second axiom simply states that the probability that you get no number at all when you perform the action must be zero. To appreciate the reasonableness of the third axiom, we will use the visual aid of the Venn diagram shown in Figure 2.
Figure 2. The yellow rectangle corresponds to the entire sample space, $S_X$. The "size" (i.e. probability) of this set equals one. The blue and red circles are clearly subsets of $S_X$. The probability of A is the area in blue. The probability of B is the area in red. The black area where A and B intersect is equal to $\Pr(A \cap B)$.
Since Pr(•) is a measure of size, it can be visualized as area, as is done in Figure 2. Imagining the sample space, $S_X$, to be the interior of the rectangle, it follows that the area shown in yellow must be assigned a value of one. The circle in blue has an area whose size is Pr(A), and the circle in red has a size that is Pr(B). These two circles have a common area, as shown in black, and that area has a size that is $\Pr(A \cap B)$. Finally, it should be mentioned that the union of two sets is, itself, a set. And that set includes all the elements that are in either set. If there are elements that are common to both of those sets, it is a mistake to interpret that to mean that those elements are repeated twice (once in each set). They are not repeated. They are simply common to both sets. Clearly, if sets A and B have no common elements, then $A \cap B = \emptyset$. Hence, from Axiom 2, the rightmost term in Axiom 3 is zero. In relation to Figure 2 above, that would mean that the blue and red circles did not intersect. Hence, the area associated with their union would simply be the sum of their areas. We will encounter this situation often in this chapter. For this reason, we now formally state this as a special case of Axiom 3.
Axiom 3' - A Special Case: Let A and B be two subsets of $S_X$. If $A \cap B = \emptyset$, then $\Pr(A \cup B) = \Pr(A) + \Pr(B)$.
We are now in a position to apply the above axioms and underlying concepts to the Ber(p) random variable, X, whose sample space is $S_X = \{0,1\}$. To this end, let's begin by identifying all the possible subsets of $S_X$. Since $S_X$ has only two elements in it, there are four possible subsets of this set. These include {0}, {1}, $S_X$ and $\emptyset$. The first two sets here are clearly subsets of $S_X$. The set $S_X$ is also, formally speaking, a subset of itself. However, since this subset is, in fact, the set itself, it is sometimes called an improper subset. Nonetheless, it is a subset of $S_X$. The last subset of $S_X$, namely the empty set, $\emptyset$, is, by definition, a subset of any set. Even so, it has a real significance, as we will presently describe. And so, the collection of all the possible subsets of $S_X = \{0,1\}$ is the following set:
$\mathcal{A}_X = \{\{0\}, \{1\}, S_X, \emptyset\}$.
It is crucially important to understand that $\mathcal{A}_X$ is, itself, a set, and the elements of this set are, themselves, sets. Why is this of such conceptual importance? It is because Pr(•) is a measure of the "size" of a set. Hence, Pr(•) measures the size of the elements of $\mathcal{A}_X$. It does not measure the size of the elements of $S_X = \{0,1\}$, since the elements of this set are numbers, and not sets.
In relation to Figure 1, we have the following results:
(i) $\Pr(\{0\}) = 0.3$;
(ii) $\Pr(\{1\}) = 0.7$;
(iii) Since $\{0\} \cap \{1\} = \emptyset$, we have $\Pr(\{0\} \cup \{1\}) = \Pr(\{0\}) + \Pr(\{1\}) = 0.3 + 0.7 = 1.0$;
(iv) Since $\{0\} \cup \{1\} = \{0,1\} = S_X$, we could also arrive at the rightmost value, 1.0, in (iii) via Axiom 1; namely, $\Pr(\{0,1\}) = \Pr(S_X) = 1.0$.
The practical beauty of the set $\mathcal{A}_X$ is that any question one could fathom in relation to X can be identified with one of the elements of $\mathcal{A}_X$. Here are some examples:
Question 1: What is the probability that you either fail ({0}) or succeed ({1}) in understanding this material? Well, since "or" represents a union set operation, the "event" that you either fail or succeed is simply $S_X$, which is an element of $\mathcal{A}_X$.
Question 2: What is the probability that you fail? Since here "failure" has been identified with the number 0, the "event" that you fail is the set that includes only the number 0; that is, {0}. And, of course, this set is in $\mathcal{A}_X$.
Question 3: What is the probability that you only partially succeed in understanding this material? Well, our chosen sample space does not recognize partial success. It has only two elements in it: 0 = failure, and 1 = success. And so, while this is a valid question for one to ask, the element in $\mathcal{A}_X$ that corresponds to this event of partial success is the empty set, $\emptyset$. So, the probability that you partially succeed in this setting is zero.
2. Two-Dimensional Bernoulli Random Variables.
It might seem to some (especially those who have some background in probability and
statistics) that the developments in the last section were belabored and overly pedantic or
complicated. If that is the case, wonderful! Those individuals should then have no trouble
in following this and subsequent sections. If, on the other hand, some troubles are
encountered, then it is suggested that these individuals return to the last section and
review it. For, all of the basic concepts covered there are simply repeated in this and future sections, albeit now in two dimensions. However, in fairness, it should be
mentioned that the richness of this topic is most readily exposed in the context of not one,
but two random variables. It is far more common to encounter situations where the
relationship between two variables is of primary interest; as opposed to the nature of a
single variable. In this respect, this section is distinct from the last. It requires that the
reader take a different perspective on the material.
Definition 2.1. Let $X_1 \sim \mathrm{Ber}(p_1)$ and $X_2 \sim \mathrm{Ber}(p_2)$ be Bernoulli random variables. Then the 2-dimensional (2-D) random variable $X = (X_1, X_2)$ is said to be a 2-D Bernoulli random variable.
The first item to address in relation to any random variable is its sample space. The possible values that the 2-D variable $X = (X_1, X_2)$ can take on are not numbers, but, rather, ordered pairs of numbers. Hence, the sample space for X is
$S_X = \{(0,0), (1,0), (0,1), (1,1)\}$.    (2.1)
Key things to note here include the fact that since X is 2-D, its sample space is contained in the plane, and not the line. Hence, to visualize its probability description will require three dimensions. Also, since $S_X$ now has 4 elements (as opposed to 2 elements for the 1-D case), its probability description will require the specification of 3 probabilities (not only one, as in the 1-D case). Define the following probabilities:
$p_{ij} = \Pr(\{(i,j)\})$ ;  $i, j \in \{0,1\}$.    (2.2)
Even though (2.2) defines four probabilities ($p_{00}$, $p_{10}$, $p_{01}$, $p_{11}$), in view of Axioms 1 and 3' above, only three of these four quantities need be specified, since the fourth must be one minus the sum of the other three.
[Figure 3 shows the four points of $S_X$ in the $(x_1, x_2)$ plane, with a lump of probability $p_{00}$, $p_{10}$, $p_{01}$, or $p_{11}$ rising out of the page at the corresponding point.]
Figure 3. Visual description of the probability structure of a 2-D Bernoulli random variable.
Having defined the sample space for X, and having a general idea of what its probability description is, the next natural step is to identify all the possible subsets of (2.1). Why? Because, remember, any question one can fathom to ask in relation to X corresponds to one of these subsets. And so, having all possible subsets of $S_X$ in hand can give confidence in answering any question that one might pose. It can also illuminate questions that one might not otherwise contemplate asking. Since this set contains 4 elements, the total number of subsets of this set will be $2^4 = 16$. Let's carefully develop this collection, since it will include a procedure that can be used for higher dimensional variables, as well.
A procedure for determining the collection, $\mathcal{A}_X$, of all the subsets of (2.1):
(i) All sets containing only a single element: {(0,0)}, {(1,0)}, {(0,1)}, {(1,1)}
(ii) All sets containing two elements:
- pair (0,0) with each of the 3 elements to its right: {00, 10}, {00, 01}, {00, 11}
- pair (1,0) with each of the two elements to its right: {10, 01}, {10, 11}
- pair (0,1) with the one remaining element to its right: {01, 11}
[Notation: for simplicity we use 10 to mean the element (1,0), etc.]
(iii) All sets containing 3 elements:
- pair {00, 10} with the first element to the right: {00, 10, 01}
- pair {00, 10} with the second element to the right: {00, 10, 11}
- pair {00, 01} with the element to the right of 01: {00, 01, 11}
- pair {10, 01} with the element to the right: {10, 01, 11}
(iv) $S_X$ and $\emptyset$
If you count the total number of sets in (i)-(iv) you will find there are 16. Specifically,
$\mathcal{A}_X = \{\, \{00\}, \{10\}, \{01\}, \{11\}, \{00,10\}, \{00,01\}, \{00,11\}, \{10,01\}, \{10,11\}, \{01,11\}, \{00,10,01\}, \{00,10,11\}, \{00,01,11\}, \{10,01,11\}, S_X, \emptyset \,\}$    (2.3)
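For readers who would like to verify this count by machine, the following Matlab sketch (an aside, not part of the original development) generates all $2^4 = 16$ subsets by encoding each subset as a 4-bit inclusion pattern over the elements of (2.1).
% List the elements of S_X for the 2-D Bernoulli random variable.
elements = {'(0,0)', '(1,0)', '(0,1)', '(1,1)'};
nElem = length(elements);
% Each integer 0..2^4-1 encodes one subset via its binary digits.
for k = 0:2^nElem - 1
    pattern = bitget(k, 1:nElem);          % inclusion indicator for each element
    members = elements(pattern == 1);
    if isempty(members)
        fprintf('{ }   (the empty set)\n');
    else
        fprintf('{ %s }\n', strjoin(members, ', '));
    end
end
Running it prints 16 lines, one per subset, matching the enumeration in (2.3).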
It is important to note that the four singleton sets {(0,0)}, {(1,0)}, {(0,1)} and {(1,1)}
have no elements in common with one another. Since they are each a 1-element set, to
say that two of them have an element in common would be to say that they each have one
and the same element. While the ordered pairs (0,0) and (0,1) do, indeed, have the same
first coordinate, their second coordinates are different. As shown in Figure 3, they are
two distinctly separate points in the plane. Thus, the intersection of the sets {(0,0)} and
{(0,1)} is the empty set.
A second point to note is that any element (i.e. set) in the collection (2.3) can be expressed as a union of two or more of these disjoint singleton sets. For example,
$\{(0,0), (1,1)\} = \{(0,0)\} \cup \{(1,1)\}$.
Hence, from Axiom 3' above,
$\Pr(\{(0,0), (1,1)\}) = \Pr(\{(0,0)\} \cup \{(1,1)\}) = \Pr(\{(0,0)\}) + \Pr(\{(1,1)\}) = p_{00} + p_{11}$.
It follows that if we know the probabilities of the singleton sets, then we can compute the probability of any set in $\mathcal{A}_X$. We now state this in a formal way.
Fact: The probability structure of a 2-D Bernoulli random variable is completely specified when 3 of the 4 probabilities $\{p_{00}, p_{10}, p_{01}, p_{11}\}$ are specified.
In view of this fact, and the above Definition 2.1, it should be apparent that Definition 2.1 is incomplete, in the sense that it does not define a unique 2-D Bernoulli random variable. This is because in that definition only two parameters were specified; namely, $p_1$ and $p_2$. Even so, the given definition is a natural extension of the definition of a 1-D Bernoulli random variable. We now offer an alternative to Definition 2.1 that does completely and unambiguously define a 2-D Bernoulli random variable.
Definition 2.1' The random variable $X = (X_1, X_2)$ is said to be a completely defined 2-D Bernoulli random variable if its sample space is $S_X = \{(0,0), (1,0), (0,1), (1,1)\}$ and if any three of the four singleton set probabilities $\{p_{00}, p_{10}, p_{01}, p_{11}\}$ are specified.
This alternative definition eliminates the lack of complete specification of the 2-D Bernoulli random variable, but at the expense of not seeming to be a natural extension of the 1-D random variable.
Now, let's address the question of how the specification of $\{p_{00}, p_{10}, p_{01}, p_{11}\}$ leads to the specification of $p_1$ and $p_2$. To this end, it is of crucial conceptual importance to understand what is meant when one refers to "the event that $X_1$ equals one" within the 2-D framework. Remember: ANY question one can ask in relation to $X = (X_1, X_2)$ can be identified with one unique set in the collection of sets given by (2.3). This includes questions such as: what is the probability that $X_1$ equals one? In the 2-D sample space for X, this event is:
"The event that $X_1$ equals one" (often written as $[X_1 = 1]$) is the set $\{(1,0), (1,1)\}$.
This set includes all elements whose first coordinate is 1, but whose second coordinate can be anything. Why? Because there was no mention of $X_2$; only $X_1$. If you are having difficulty with this, then consider when you were first learning about x, y and graphing in high school math. If there is no y, then you would identify the relation x = 1 as just the point 1.0 on the x-axis. However, in the x-y plane, the relation x = 1 is a vertical line that intersects the x-axis at the location 1.0. You are allowing y to be anything, because no information about y was given.
And so, we have the following relation between $p_1 = \Pr[X_1 = 1]$ and $\{p_{00}, p_{10}, p_{01}, p_{11}\}$:
$p_1 = \Pr[X_1 = 1] = \Pr(\{(1,0), (1,1)\}) = \Pr(\{(1,0)\}) + \Pr(\{(1,1)\}) = p_{10} + p_{11} = \sum_{j=0}^{1} p_{1j}$    (2.4a)
Similarly,
$p_2 = \Pr[X_2 = 1] = \Pr(\{(0,1), (1,1)\}) = \Pr(\{(0,1)\}) + \Pr(\{(1,1)\}) = p_{01} + p_{11} = \sum_{i=0}^{1} p_{i1}$    (2.4b)
From (2.4) we observe more of the missing details when one specifies only $p_1$ and $p_2$ in relation to a 2-D Bernoulli random variable. If these parameters are specified, then one still needs to specify one of the four parameters $\{p_{00}, p_{10}, p_{01}, p_{11}\}$ for a complete, unambiguous description of the probability structure of $X = (X_1, X_2)$.
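To make the above Fact and (2.4) concrete, here is a small Matlab sketch (an illustration only; the numerical values are assumptions, not taken from the text): given any three singleton probabilities, it recovers the fourth and then the two p-values.
% Three specified singleton probabilities (assumed values for illustration).
p00 = 0.4;  p10 = 0.2;  p01 = 0.1;
p11 = 1 - (p00 + p10 + p01)     % the fourth follows from Axioms 1 and 3'
p1 = p10 + p11                  % Pr[X1 = 1], per (2.4a)
p2 = p01 + p11                  % Pr[X2 = 1], per (2.4b)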
There is one common situation where specification of only $p_1$ and $p_2$ is sufficient to completely specify the probability structure of $X = (X_1, X_2)$. It is the situation where $X_1 \sim \mathrm{Ber}(p_1)$ and $X_2 \sim \mathrm{Ber}(p_2)$ are statistically independent. In simpler terms, this situation is one wherein knowledge of the value of $X_1$ has no influence on the probability that $X_2$ equals any specified value. For example, if you toss a coin and you read "heads" (map "heads" to 1), then that result in no way changes the probability that you will get a "heads" on a second flip, does it? If we agree that it does not, then the 2-coin flip is an example of a 2-D Bernoulli random variable where the two components of $X = (X_1, X_2)$ are statistically independent. As another example, consider parts inspection at the delivery dock of a company. If a randomly selected part passes inspection, it is natural to assume that the probability that the next selected part passes is not influenced by knowledge that the first part passed. While this is a natural assumption in parts inspection protocols, it may not necessarily be true. If the parts were manufactured in the presence of an unknown systematic manufacturing malfunction, then the fact that the first part passed may well influence the probability that the second part passes; for example, if there is only one good part in the lot, then the probability that the second part passes, given that the first part passed, will be zero.
We will presently address the mathematical details of what is required for $X_1$ and $X_2$ to be assumed to be statistically independent. However, in order to expedite that investigation, it is appropriate to first address yet another source of major confusion to novices in this area.
Unions and Intersections, And's and Or's, and Commas and Commas
It should be apparent by now that probability is intimately related to sets. As noted above, it is, in fact, a well-defined measure of the "size" of a set. Yet, as was also noted above, the vast majority of textbooks dealing with probability and statistics use a notation that, at the very least, de-emphasizes the notion of sets. For example, in the case of the 2-D Bernoulli random variable $X = (X_1, X_2)$, most books will use notation such as
$\Pr[X_1 = 1] = p_1$    (2.5a)
If one realizes that Pr(A) measures the "size" (i.e. probability) of the set A, then it must be that, depending on how you read (2.5a), either $X_1 = 1$ is a set, or $[X_1 = 1]$ is a set. In either case, it is very likely that a student who has had some prior exposure to sets has never seen either one of the above expressions for a set. To this point, we have been using a more common notation for a set; namely the {•} notation. Let's rewrite (2.5a) in this more common notation.
$p_1 = \Pr(\{(1,0), (1,1)\})$    (2.5b)
There is nothing ambiguous or vague about (2.5b). The set in question has two elements in it; namely the element (1,0) and the element (1,1). In particular, (1,0) is an element, and not a set. Whereas $\{(1,0)\}$ is a set, and that set contains the element (1,0). One might argue that (2.5a) is clear, and that (2.5b) involves too many unnecessary symbols that can cause confusion. However, let's consider the following probability:
$\Pr[X_1 = 1 \cap X_2 = 1]$    (2.5c)
This expression includes a set operation symbol, namely the intersection symbol $\cap$. This suggests that $X_1 = 1$ and $X_2 = 1$ are sets. Moreover, (2.5c) suggests that these two sets may have elements in common. But what exactly is the set $X_1 = 1$? Well, if we ignore $X_2$, then we have only a 1-D Bernoulli random variable, whose sample space is {0,1}. In that case, the expression $X_1 = 1$ means the set {1}. However, if we include $X_2$ in our investigation, then the expression $X_1 = 1$ means $\{(1,0), (1,1)\}$. These are two distinctly different sets, and yet each set is expressed as $X_1 = 1$. A seasoned student of probability might argue that one must keep in mind the entire setting when interpreting the meaning of $X_1 = 1$. However, for a student who has no prior background in the field, it is often not so easy to keep the entire setting in mind. Before we can reconcile this ambiguity, we need to first address the set notion of a union ($\cup$).
The union of two sets is a set whose elements include all of the elements of each set; but where elements common to the two sets are not counted twice. For example,
$\{(1,0), (1,1)\} \cup \{(0,1), (1,1)\} = \{(1,0), (0,1), (1,1)\}$    (2.5d)
Notice that the common element (1,1) is not counted twice in this union of the sets. Unfortunately, the same type of notation used in (2.5c) for intersections is commonly also used for unions. Specifically, we have the expression
$\Pr[X_1 = 1 \cup X_2 = 1]$    (2.5e)
Since (2.5c) and (2.5e) each involve a set operation, the above discussion related to the ambiguity and vagueness of expressions such as $X_1 = 1$ applies to (2.5e) as well.
Now let's reconcile these two types of expressions. In doing so, we will discover that there are commas, and then there are commas. To this end, we will express (2.5c) in the unambiguous notation associated with a 2-D Bernoulli random variable.
$\Pr[X_1 = 1 \cap X_2 = 1] = \Pr(\{(1,0), (1,1)\} \cap \{(0,1), (1,1)\}) = \Pr(\{(1,1)\}) = p_{11}$    (2.5f)
The leftmost expression in (2.5f) is ambiguous when not accompanied by a note that we are dealing only with a 2-D random variable. (What if we actually had a 3-D random variable?) The middle expression is unambiguous. Furthermore, any student with even a cursory exposure to sets would be able to identify the single element, (1,1), that is common to both sets. The equality of the leftmost and rightmost expressions reveals that in this 2-D framework, we can refer to the element (1,1) as "the element whose first component is one and whose second component is one." Hence the comma that separates these two components of the element (1,1) may be read as an and comma.
Similarly, rewriting (2.5e) and referring to (2.5d) gives
$\Pr[X_1 = 1 \cup X_2 = 1] = \Pr(\{(1,0)\} \cup \{(0,1)\} \cup \{(1,1)\}) = \Pr(\{(1,0), (0,1), (1,1)\})$    (2.5g)
Hence, the commas that separate the elements (1,0), (0,1) and (1,1) may be read as or commas. After a bit of reflection, the reader may find all of this to be obvious. In that event, this brief digression will have served its purpose. Let's now proceed to the following examples to further assess the reader's grasp of this topic.
Example 2.1 Let $X = (X_1, X_2) \sim \mathrm{Ber}(\{p_{00}, p_{10}, p_{01}, p_{11}\})$. Notice that there is no assumption of independence here.
(a) Clearly state the sets corresponding to the following events:
(i) $[X_1 < X_2]$: Answer: $\{(x_1, x_2) \in S_X \mid x_1 < x_2\} = \{(0,1)\}$.
(ii) $[X_1 \le X_2]$: Answer: $\{(x_1, x_2) \in S_X \mid x_1 \le x_2\} = \{(0,0), (0,1), (1,1)\}$.
(iii) $[X_1 = X_2]$: Answer: $\{(x_1, x_2) \in S_X \mid x_1 = x_2\} = \{(0,0), (1,1)\}$.
(iv) $[\,|X_1 - X_2| = 1\,]$: Answer: $\{(x_1, x_2) \in S_X \mid |x_1 - x_2| = 1\} = \{(0,1), (1,0)\}$.
(b) Compute the probabilities of the events in (a), in terms of $\{p_{00}, p_{10}, p_{01}, p_{11}\}$.
(i) $\Pr[X_1 < X_2]$: Answer: $\Pr(\{(0,1)\}) = p_{01}$.
(ii) $\Pr[X_1 \le X_2]$: Answer: $\Pr(\{(0,0), (0,1), (1,1)\}) = p_{00} + p_{01} + p_{11}$.
(iii) $\Pr[X_1 = X_2]$: Answer: $\Pr(\{(0,0), (1,1)\}) = p_{00} + p_{11}$.
(iv) $\Pr[\,|X_1 - X_2| = 1\,]$: Answer: $\Pr(\{(0,1), (1,0)\}) = p_{01} + p_{10}$.
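As a sanity check, the event probabilities in Example 2.1 can also be obtained by brute-force enumeration over the four sample points. The Matlab sketch below is illustrative only; the joint probability values are assumptions chosen for the example.
% Assumed joint probabilities (any nonnegative values summing to 1 will do).
p00 = 0.4; p10 = 0.2; p01 = 0.1; p11 = 0.3;
pts  = [0 0; 1 0; 0 1; 1 1];            % rows are the points (x1, x2)
pJnt = [p00; p10; p01; p11];            % matching singleton probabilities
x1 = pts(:,1);  x2 = pts(:,2);
PrLess  = sum(pJnt(x1 <  x2))           % event (i):   [X1 <  X2]
PrLeq   = sum(pJnt(x1 <= x2))           % event (ii):  [X1 <= X2]
PrEqual = sum(pJnt(x1 == x2))           % event (iii): [X1 =  X2]
PrDiff1 = sum(pJnt(abs(x1 - x2) == 1))  % event (iv):  [|X1 - X2| = 1]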
Hopefully, the reader felt that the answers in the above example were self-evident, once
the sets in question were clearly described as such. The next example is similar to the last
one. However, it extends the conceptual understanding of this topic to arrive at a very
important and useful quantity; namely the cumulative distribution function (CDF).
Example 2.2 Again, let $X = (X_1, X_2) \sim \mathrm{Ber}(\{p_{00}, p_{10}, p_{01}, p_{11}\})$. Now, let $(x_1, x_2)$ be any pair of real numbers (i.e. any point in the plane). Notice here that $(x_1, x_2)$ is not constrained to be an element of $S_X$.
(a) Develop an expression for $\Pr[X_1 \le x_1] \triangleq F_{X_1}(x_1)$ as a function of $x_1$ while ignoring $X_2$.
Solution: If we want to, we can approach this problem in exactly the same manner as in the above example; namely, by clearly describing the set corresponding to the expressed event $[X_1 \le x_1]$; namely,
$[X_1 \le x_1] = \{(u, v) \in S_X \mid u \le x_1\}$.
However, since our interest here is only in the random variable $X_1$, whose sample space is extremely simple (i.e. {0,1}), we will work directly with $X_1$ and its own sample space. The p-value for this random variable in terms of the 2-D probabilities is given above in (2.4a). The probability description for $X_1 \sim \mathrm{Ber}(p_1 = 0.7)$ was shown in Figure 1 above. But that figure only utilized x-values in the range [0, 2]. The expression we are to develop here should consider any value of $x_1$. The following expression is hopefully clear from Figure 1:
$F_{X_1}(x_1) = \begin{cases} 0 & \text{for } x_1 < 0 \\ 1 - p_1 & \text{for } 0 \le x_1 < 1 \\ 1 & \text{for } x_1 \ge 1 \end{cases}$    (2.6)
This expression is plotted below for the value $p_1 = 0.7$.
[Figure 4 plots $F_{X_1}(x_1)$ against $x_1$ over the range $-2 \le x_1 \le 2$, with a step of height $1 - p_1 = 0.3$ at $x_1 = 0$ and a step up to 1.0 at $x_1 = 1$.]
Figure 4. Graph of $\Pr[X_1 \le x_1] \triangleq F_{X_1}(x_1)$ given by (2.6).
So, how exactly did Figure 4 arise from Figure 1? Well, Figure 1 shows where the "lumps" of probability are, and also gives the values of these "lumps". For example, the "lump" at location $x_1 = 0$ is the probability $\Pr[X_1 = 0] = \Pr(\{(0,0), (0,1)\}) = 0.3$. If the reader is confused by the fact that Figure 1 is for the Ber(p) random variable, X, while Figure 4 is for the Ber($p_1$) random variable, $X_1$, it should be remembered that when we discussed Figure 1 there was only one random variable. However, now there are two. And so, now, we need to have some way to distinguish one from the other. Nonetheless, both are Bernoulli random variables. And so, both will have the general probability structure illustrated in Figure 1; albeit with possibly differing p-values.
So, again: how did Figure 4 arise from Figure 1? The key to answering this question is to observe that $\Pr[X_1 \le x_1] \triangleq F_{X_1}(x_1)$ is the totality of the probability associated with the interval $(-\infty, x_1]$. Hence, as long as $x_1 < 0$, the value of $F_{X_1}(x_1)$ will be zero, since the first "lump" of probability is at $x_1 = 0$. At that location, $F_{X_1}(x_1)$ experiences an increase, or jump, in the amount $\Pr[X_1 = 0] = 1 - p_1$. This jump in $F_{X_1}(x_1)$ is shown in Figure 4. As we allow $x_1$ to continue its travel to the right of zero, since there are no lumps of probability in the interval (0,1), the value of $F_{X_1}(x_1)$ will remain at the value $1 - p_1$ throughout this region. When $x_1 = 1$, the value of $F_{X_1}(x_1)$ will increase by an amount $p_1$, since that is the value of the "lump" of probability at this location: $\Pr[X_1 = 1] = p_1$. Hence, when $x_1 = 1$ we have $F_{X_1}(1) = 1.0$. In words, the probability that the random variable $X_1$ equals any number less than or equal to one is 1.0. There are no more "lumps" of probability to be "accumulated" as $x_1$ continues its travel to the right beyond the number 1. This is the reason that $F_{X_1}(x_1)$ remains flat to the right of $x_1 = 1$ in Figure 4. It is this "accumulating" feature of $F_{X_1}(x_1)$ as $x_1$ "travels" from left to right that is responsible for the following definition.
Definition 2.3 Let X be any 1-D random variable. Then $\Pr[X \le x] \triangleq F_X(x)$ is called the cumulative distribution function (CDF) for X.
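As an aside (not part of the original text), the CDF in (2.6) is easy to evaluate numerically. Here is a minimal Matlab sketch, assuming $p_1 = 0.7$ as in Figure 1; it simply adds the jump at 0 and then the jump at 1.
p1 = 0.7;                                        % assumed p-value, as in Figure 1
F  = @(x) (x >= 0).*(1 - p1) + (x >= 1).*p1;     % CDF of a Ber(p1) random variable
F([-0.5, 0, 0.5, 1, 1.5])                        % returns 0, 0.3, 0.3, 1.0, 1.0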
Example 2.2 continued:
(b) Develop an expression for $\Pr[X_1 \le x_1] \triangleq F_{X_1}(x_1)$ as a function of $x_1$ while not ignoring $X_2$.
Solution: Again, as in (a), we write
$[X_1 \le x_1] = \{(u, v) \in S_X \mid u \le x_1\} = \{(u,0), (u,1) \mid u \le x_1\}$.
As we compute the probability of this set, let's actually identify the actual set that corresponds to a given value of $x_1$:
(i) for $x_1 < 0$: $[X_1 \le x_1] = \emptyset \Rightarrow \Pr[X_1 \le x_1] = 0$,
(ii) for $0 \le x_1 < 1$: $[X_1 \le x_1] = \{(0,0), (0,1)\} \Rightarrow \Pr[X_1 \le x_1] = \sum_{j=0}^{1} p_{0j}$,
(iii) for $x_1 \ge 1$: $[X_1 \le x_1] = \{(0,0), (0,1), (1,0), (1,1)\} \Rightarrow \Pr[X_1 \le x_1] = \sum_{j=0}^{1} p_{0j} + \sum_{j=0}^{1} p_{1j}$.
Notice that the rightmost expressions in (ii) and (iii) above are summations. It is fair to argue that the summation notation is unduly heavy, in the sense that (ii), for example, could have been written more simply as $p_{00} + p_{01}$. Not only is this a fair argument, it points to yet another example where the biggest stumbling block for a novice might be the notation, and not the concept. However, in this particular situation the summation notation was chosen (at the risk of frustrating some novices) in order to highlight a concept that is central to dealing with two (or more) random variables. We now state this concept for the more general case of two random variables, say, X and Y, whose joint probability structure is specified by a collection of joint probabilities, say, $\{p_{ij} \text{ where } i \in S_X \text{ and } j \in S_Y\}$.
Fact 2.1 Consider a 2-D random variable, (X, Y), having a discrete 2-D sample space $S_{(X,Y)} = \{(x, y) \mid x \in S_X \text{ and } y \in S_Y\} \triangleq S_X \times S_Y$, and corresponding joint probabilities $\{p_{xy} \text{ where } x \in S_X \text{ and } y \in S_Y\}$. Then
$\Pr[X = x] = \sum_{y \in S_Y} p_{xy}$.
In many books on the subject, Fact 2.1 is stated as a theorem, and often it is accompanied by a proof. However, we do not believe that this fact is worthy of the theorem label. It is an immediate consequence of the realization that the set $[X = x] = \{(x, y) \mid y \in S_Y\}$ is the union of the disjoint singleton sets $\{(x, y)\}$ for $y \in S_Y$.
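In computational terms, Fact 2.1 amounts to summing a joint probability table along one of its dimensions. The following Matlab sketch is an illustration added here (the table values are assumptions), with the joint probabilities stored in a matrix whose (i, j) entry is the probability of the (x, y) pair.
% Assumed joint probability table for a 2-D Bernoulli pair:
% rows index x in {0,1}, columns index y in {0,1}.
P = [0.4 0.1;    % p00  p01
     0.2 0.3];   % p10  p11
PrX = sum(P, 2)   % Pr[X = 0] and Pr[X = 1]: sum over y (across columns)
PrY = sum(P, 1)   % Pr[Y = 0] and Pr[Y = 1]: sum over x (down rows)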
A reader who has had a course in integral calculus might recognize that integration is synonymous with accumulation. The above Fact 2.1 says, in words: to obtain $\Pr[X = x]$, "integrate" (i.e. sum) the joint probabilities over the values of y. For the benefit of such readers, consider the following example.
Example 2.3 Consider a random variable, say, U, whose sample space is $S_U = [0,1]$ (i.e. the closed interval with left endpoint 0 and right endpoint 1). Furthermore, assume that U has a uniform probability distribution on this interval. Call this distribution $f_U(u)$. The meaning of the term uniform here is that the probability of any sub-interval of $S_U = [0,1]$ depends only on the width of that interval, and not on its location. For a sub-interval of width 0.1 (be it the interval (0, 0.1), or (0.2, 0.3), or [0.8, 0.9]), the probability that U falls in the interval is 0.1. This distribution is shown in Figure 5 below. It follows that the probability that U falls in the interval [0, u] is equal to u. Another way of expressing this is $\Pr[U \le u] = u$. But this is exactly the definition of the CDF for U. And so $\Pr[U \le u] \triangleq F_U(u) = u$. This CDF is also shown in Figure 5 below. Notice that this CDF is linear in u, and has a slope equal to 1.0. The derivative of this CDF is, therefore, just its slope, which is exactly $f_U(u)$. Hence, here, we can conclude that $f_U(u)$ is the derivative of $F_U(u)$; or, equivalently, $F_U(u)$ is the integral of $f_U(u)$. □
[Figure 5 plots $F_U(u)$ and $f_U(u)$ against u over roughly $-0.5 \le u \le 1.5$.]
Figure 5. Graphs of the CDF, $F_U(u)$ (thick line), and its derivative $f_U(u)$ (thin line).
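For readers who like to verify such statements by simulation, here is a small Matlab check (an illustration added here, not part of the original example): the fraction of uniform draws that fall at or below a cutoff c should be approximately c.
u = rand(1, 1e6);                 % one million draws from Uniform[0,1]
for c = [0.1 0.25 0.5 0.9]
    fprintf('Pr[U <= %.2f] is approximately %.4f\n', c, mean(u <= c));
end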
The above example is a demonstration of the following general definition that holds for
any random variable.
Definition 2.4 Let W be any random variable, and let $F_W(w) \triangleq \Pr[W \le w]$ be its cumulative distribution function (CDF). Then the (possibly generalized) derivative of $F_W(w)$ is $f_W(w)$, which is called the (possibly generalized) probability density function (PDF) for W.
In Example 2.3 above, indeed, the derivative of the CDF $F_U(u) = u$ is exactly the PDF $f_U(u) = 1$. However, in the case of $X \sim \mathrm{Ber}(p)$ with a CDF having the general structure illustrated in Figure 4 above, we see that the CDF has a slope equal to zero, except at the jump locations. And at these locations the slope is infinite (or, if you like, undefined). What is the derivative of such a function? Well, properly speaking, the derivative does not exist at the jump locations. Hence, properly speaking, $X \sim \mathrm{Ber}(p)$ does not have a PDF. However, "generally speaking" (i.e. in the generalized sense) we can say that its derivative has the form illustrated in Figure 1 above. Specifically, the PDF is identically zero, except at the jump locations, where it contains "lumps" of probability. [For those readers who are familiar with Dirac δ-functions, these lumps are, in fact, weighted δ-functions, whose weights are the probability values.] The key points here are two:
Key Point #1: Every random variable has a well-defined CDF, and
Key Point #2: If the CDF is not differentiable, then, properly speaking, the PDF does not exist. Nonetheless, if we allow generalized derivatives, then it does exist everywhere, except at a discrete number of locations.
In the next chapter we will discuss the relation between the CDF and PDF of a wide variety of random variables. However, for the time being, let's return to Bernoulli random variables. In particular, there are two topics that still need to be addressed before we move on to n-D Bernoulli random variables. One is the topic of statistical independence, and the second is the topic of conditional probability. As we shall see shortly, these two topics are strongly connected.
Definition 2.5 Let (X, Y) be a 2-D random variable with sample space $S_{(X,Y)}$. Let A be a subset of this space that relates only to X, and let B be a subset that relates only to Y. Then the subsets (i.e. events) A and B are said to be (statistically) independent events if $\Pr(A \cap B) = \Pr(A)\Pr(B)$. If all events relating to X are independent of all events relating to Y, then the random variables X and Y are said to be (statistically) independent.
Before we investigate just exactly how the notion of statistical independence relates to a 2-D Bernoulli random variable, let's demonstrate its practical implications in an example.
Example 2.4 Consider the act of tossing a fair coin twice. Let $X_k$ correspond to the action that is the kth toss, and let a "heads" correspond to one, and a "tails" correspond to a zero. Then $X = (X_1, X_2)$ is a 2-D Bernoulli random variable. Since the coin is assumed to be a fair coin, we have
$p_1 = \Pr[X_1 = 1] = \Pr(\{(1,0), (1,1)\}) = 0.5$ and $p_2 = \Pr[X_2 = 1] = \Pr(\{(0,1), (1,1)\}) = 0.5$.
But because the coin is fair, each of the four possible outcomes, {(0,0)}, {(1,0)}, {(0,1)}, {(1,1)}, should have the same probability. Hence, $\Pr(\{(1,1)\}) = 0.25$. Rewriting this probability in the usual notation gives
$\Pr(\{(1,1)\}) = 0.25 = \Pr[X_1 = 1 \cap X_2 = 1] = 0.5 \times 0.5 = \Pr[X_1 = 1]\Pr[X_2 = 1]$.
So, we see that the events $[X_1 = 1]$ and $[X_2 = 1]$ are statistically independent. In exactly the same manner, one can show that all of the events related to $X_1$ (i.e. $[X_1 = 0]$ and $[X_1 = 1]$) are independent of all the events related to $X_2$ (i.e. $[X_2 = 0]$ and $[X_2 = 1]$). We can conclude that the assumption of a fair coin, and in particular, that the above four outcomes have equal probability, is equivalent to the assumption that $X_1$ and $X_2$ are statistically independent. □
Now, let's look more closely at a 2-D Bernoulli random variable $X = (X_1, X_2)$ with specified probabilities $\{p_{00}, p_{10}, p_{01}, p_{11}\}$. Without loss of generality, let's assume the first three probabilities have been specified. Then $p_{11} = 1 - (p_{00} + p_{10} + p_{01})$. We now address the question:
UNDER WHAT CONDITIONS ARE THE EVENTS $[X_1 = 1]$ AND $[X_2 = 1]$ INDEPENDENT?
ANSWER: Let's first express the condition for independence in terms of the usual notation. Then we will translate the condition in terms of sets. These events are independent if:
$\Pr[X_1 = 1 \cap X_2 = 1] = \Pr[X_1 = 1]\Pr[X_2 = 1]$.    (2.7a)
In terms of sets, (2.7a) becomes
$\Pr(\{(1,1)\}) = \Pr(\{(1,0), (1,1)\}) \cdot \Pr(\{(0,1), (1,1)\})$.    (2.7b)
In terms of the specified probabilities, (2.7b) becomes
$p_{11} = (p_{10} + p_{11})(p_{01} + p_{11})$.    (2.7c)
Even though (2.7c) is the condition on the specified probabilities for these events to be independent, we can arrive at a simpler expression by using the fact that $p_{11} = 1 - (p_{00} + p_{10} + p_{01})$. First, let's expand (2.7c) as
$p_{11} = p_{10} p_{01} + p_{11}(p_{10} + p_{01} + p_{11}) = p_{10} p_{01} + p_{11}(1 - p_{00})$.    (2.7d)
Subtracting $p_{11}(1 - p_{00})$ from each side of (2.7d), and rearranging terms, gives
$p_{00} p_{11} = p_{10} p_{01}$.    (2.7e)
Equation (2.7e) is the condition needed for the events $[X_1 = 1]$ and $[X_2 = 1]$ to be independent.
Using exactly the same procedure, one can show that (2.7e) is also the condition for every other pair of events relating to $X_1$ and $X_2$ to be independent. We state this formally in the following fact.
Fact 2.2 The components of the 2-D Bernoulli random variable $X = (X_1, X_2)$ with specified probabilities $\{p_{00}, p_{10}, p_{01}, p_{11}\}$ are statistically independent if and only if the condition $p_{00} p_{11} = p_{10} p_{01}$ holds.
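A quick Matlab check of Fact 2.2 (a hypothetical illustration; the numerical values are assumptions): it tests whether a given set of joint probabilities satisfies $p_{00}p_{11} = p_{10}p_{01}$, and also compares the joint table with the product of its marginals.
% Assumed joint probabilities for illustration.
p00 = 0.15; p10 = 0.35; p01 = 0.15; p11 = 0.35;
tol = 1e-12;
isIndep = abs(p00*p11 - p10*p01) < tol      % Fact 2.2 condition
% Equivalent check: joint table equals the outer product of its marginals.
P  = [p00 p01; p10 p11];                    % rows: x1 = 0,1; columns: x2 = 0,1
Px = sum(P, 2);  Py = sum(P, 1);
max(max(abs(P - Px*Py)))                    % (near) zero exactly when independent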
Example 2.4 above is a special case of this fact. Since we assumed $p_{00} = p_{10} = p_{01} = p_{11} = 0.25$, clearly, the above condition holds. This equality of the elemental probabilities is a sufficient condition for independence, but it is not necessary. Consider the following example.
Example 2.5. Suppose the person tossing the coin has very good control over the number of rotations the coin makes while in the air. In particular, suppose the following numerical probabilities: $p_{10} = p_{01} = 0.2$. Now, we need to find the numerical value of $p_{00}$ (if there is one) such that the relation $p_{00} p_{11} = p_{10} p_{01}$ holds. To this end, express this condition as:
$p_{00} p_{11} = p_{00}[\,1 - (p_{00} + p_{10} + p_{01})\,] = p_{10} p_{01}$.    (2.8a)
This equation can be rewritten as a quadratic equation in the unknown $p_{00}$:
$p_{00}^2 - [\,1 - (p_{10} + p_{01})\,]\,p_{00} + p_{10}\,p_{01} = 0$.    (2.8b)
Applying the quadratic formula to (2.8b) gives
$p_{00} = \tfrac{1}{2}\Big[\, 1 - (p_{10} + p_{01}) \pm \sqrt{[\,1 - (p_{10} + p_{01})\,]^2 - 4\,p_{10}\,p_{01}} \,\Big]$.    (2.8c)
Inserting the above numerical information into (2.8c) gives
$p_{00} = \tfrac{1}{2}\big[\, 0.6 \pm \sqrt{0.36 - 0.16} \,\big] = \tfrac{1}{2}\big[\, 0.6 \pm 0.4472 \,\big] = 0.5236 \text{ or } 0.0764.$    (2.8d)
Notice that for the chosen values $p_{10} = p_{01} = 0.2$ there are two possible choices for $p_{00}$. Furthermore, they add up to 0.6. Hence, if we choose the first for $p_{00}$, then the second is exactly $p_{11}$. It should also be noted that (2.8c) indicates that for certain choices of $p_{10}$ and $p_{01}$ there will be no value of $p_{00}$ that makes the components of X independent. Specifically, if $p_{10}$ and $p_{01}$ are large enough that the term inside the square root is negative, then there is no real-valued solution for $p_{00}$. □
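The quadratic (2.8b) can also be solved numerically; here is a short Matlab check of Example 2.5 (added as an illustration) using the built-in roots function.
% Numerical check of Example 2.5 using Matlab's roots().
p10 = 0.2;  p01 = 0.2;
coeffs = [1, -(1 - (p10 + p01)), p10*p01];   % p00^2 - [1-(p10+p01)]*p00 + p10*p01
p00candidates = roots(coeffs)                % returns 0.5236 and 0.0764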
We now address the concept of conditional probability in relation to a 2-D Bernoulli random variable. First, however, we give the following definition of conditional probability in the general setting.
Definition 2.6 Let A and B be two subsets of a sample space, $S_X$, and suppose that $\Pr(B) > 0$. Then the probability of A given B, written as $\Pr(A \mid B)$, is defined as
$\Pr(A \mid B) \triangleq \dfrac{\Pr(A \cap B)}{\Pr(B)}$.    (2.9)
To understand (2.9) we refer to the Venn diagram in Figure 2. What "given B" means is that our sample space is now restricted to the set B. Stated another way, nothing outside of the set B exists. So, in Figure 2, only the red circle exists now. Equation (2.9) is the probability of that portion of the set A that is in the set B. The quantity $\Pr(A \cap B)$, which is the black area in Figure 2, is the "size" of the intersection relative to the entire sample space. Since our new sample space is the smaller one, B, obtaining the probability of this intersection relative to B demands that we "scale" it, by dividing that probability by the probability of B, as is done in (2.9).
Now that we have Definition 2.6, we can give an alternative to the definition of statistical independence in Definition 2.5. Specifically,
Definition 2.5' Events A and B (where it is assumed that $\Pr(B) > 0$) are said to be statistically independent events if $\Pr(A \mid B) = \Pr(A)$.
Remark In relation to Figure 2, this means that if restricting our sample space to B does not alter the probability of A, then A and B are statistically independent. Note that this is not a matter of A being contained in B. Referring again to Figure 2, all that is necessary is that the overlap of A and B be just enough so that the black intersection area equals the product of the blue and red areas; that is, $\Pr(A \cap B) = \Pr(A)\Pr(B)$.
We now proceed to relate the concept of conditional probability to a 2-D Bernoulli
random variable. Because the sample space for this random variable is so simple, it offers
a clear picture of both the meaning and value of conditional probability.
Example 2.6 Again, let $X = (X_1, X_2) \sim \mathrm{Ber}(\{p_{00}, p_{10}, p_{01}, p_{11}\})$. Develop the expression for $\Pr[X_2 = j \mid X_1 = i]$.
Solution:
$\Pr[X_2 = j \mid X_1 = i] = \dfrac{\Pr[X_2 = j \cap X_1 = i]}{\Pr[X_1 = i]} = \dfrac{p_{ij}}{p_{i0} + p_{i1}}$.    (2.10a)
In particular,
$\Pr[X_2 = 1 \mid X_1 = 0] = \dfrac{p_{01}}{p_{00} + p_{01}} \triangleq p_{2 \mid X_1 = 0}$    (2.10b)
and
$\Pr[X_2 = 1 \mid X_1 = 1] = \dfrac{p_{11}}{p_{10} + p_{11}} \triangleq p_{2 \mid X_1 = 1}$.    (2.10c)
The probabilities (2.10b) and (2.10c) are the p-values for $X_2$, conditioned on the events $[X_1 = 0]$ and $[X_1 = 1]$, respectively. □
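A brief numerical illustration of (2.10) in Matlab (the joint probability values below are assumptions chosen purely for the sake of example):
p00 = 0.6; p10 = 0.1; p01 = 0.1; p11 = 0.2;   % assumed joint probabilities
p2_given_X1is0 = p01/(p00 + p01)              % Pr[X2 = 1 | X1 = 0], per (2.10b)
p2_given_X1is1 = p11/(p10 + p11)              % Pr[X2 = 1 | X1 = 1], per (2.10c)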
As simple as it was to obtain (2.10), it can be an extremely valuable tool. Specifically, if one has reliable numerical values for $\{p_{00}, p_{10}, p_{01}, p_{11}\}$, then (2.10) is a prediction model, in the sense that, once we have obtained a numerical value for $X_1$, it allows us to predict the probability that $X_2$ will equal zero or one. Remember, if $X_1$ and $X_2$ are independent, then the numerical information associated with $X_1$ is irrelevant, in the sense that it does not alter the probability that $X_2$ will equal zero or one. But there are many situations where these random variables are not independent.
3. n-Dimensional Bernoulli Random Variables
Definition 2.7 Let $X = [X_1, \ldots, X_n]$, where each $X_k \sim \mathrm{Ber}(p_k)$. Then X is said to be an n-D Bernoulli random variable.
The p-values $\{p_k\}_{k=1}^{n}$ in the above definition are not generally sufficient to describe X unambiguously. The reason lies in the fact that the sample space for X includes $2^n$ distinct elements. Hence, to completely describe the probability structure of X requires the specification of $2^n - 1$ probabilities. Specifically, we need to specify all but one of $\{\Pr(\{(x_1, \ldots, x_n)\}) \triangleq p_{x_1 x_2 \cdots x_n} ;\ x_k \in \{0,1\}\}$. There is one situation wherein the n p-values $\{p_k\}_{k=1}^{n}$ are sufficient to completely describe X; namely, when the n random variables comprising X are mutually independent. In demonstrating this in what follows, we will also demonstrate the value of the uniform random variable considered in Example 2.3 above.
Using a Random Number Generator to Simulate n-D iid Bernoulli Random Variables
In this section we address the problem of simulating data associated with a Bernoulli random variable. This simulation will utilize a uniform random number generator. And so, first, we will formally define what we mean by a uniform random number generator.
Definition 2.8 A uniform random number generator is a program that, when called, produces a "random" number that lies in the interval [0,1].
In fact, the above definition is not very formal. But it describes in simple terms the gist of a uniform random number generator. The following definition is formal, and allows the generation of n numbers at a time.
Definition 2.8' Define the n-D random variable $U = [U_1, U_2, \ldots, U_n]$, where each $U_k$ is a random variable that has a uniform distribution on the interval [0,1], and where these n random variables are mutually independent. The two assumptions, that these variables each have the same distribution and that they are mutually independent, are typically phrased as the assumption that they are independent and identically distributed (iid). Then U is an n-D uniform random number generator.
The following example uses the uniform random number generator in Matlab to demonstrate this definition.
Example 2.7 Here, we give examples of an n-D uniform random variable, U, using the Matlab command "rand", for n = 1, 2 and 25:
(i) U = rand(1,1) is a 1-D uniform random variable. Each time this command is executed, the result is a "randomly selected" number in the interval [0,1]. For example:
>> rand(1,1)
ans =
    0.9501
(ii) U = rand(1,2) is a 2-D uniform random variable. For example,
>> rand(1,2)
ans =
    0.2311    0.6068
(iii) U = rand(5,5) is a 25-D uniform random variable. For example,
>> rand(5,5)
ans =
    0.3340    0.5298    0.6808    0.6029    0.0150
    0.4329    0.6405    0.4611    0.0503    0.7680
    0.2259    0.2091    0.5678    0.4154    0.9708
    0.5798    0.3798    0.7942    0.3050    0.9901
    0.7604    0.7833    0.0592    0.8744    0.7889
It is important to note that the command rand(m,n) is the $m \cdot n$-D random variable. The numbers shown above are a result of the command. They are not random variables. They are numbers. A random variable is an action, algorithm, or operation that, when conducted, yields numbers. □
We now proceed to show how the uniform random number generator can be used to simulate measurements of a Bernoulli random variable. Let's begin with a 1-D random variable. Again, we will use Matlab commands to this end.
Using U to arrive at $X \sim \mathrm{Ber}(p)$: For $U \sim \mathrm{Uniform}[0,1]$, define the random variable, X, in the following way: map the event $[U \le 1 - p]$ to the event [X = 0], and map the event $[1 - p < U \le 1]$ to the event [X = 1]. Recall from Example 2.3 above that $\Pr[U \le 1 - p] = 1 - p$. Hence, it follows that $\Pr[X = 0] = 1 - p$. Therefore, since X can take on only the value zero or one, we have $\Pr[X = 1] = p$; that is, X is a Ber(p) random variable. Here is Matlab code that corresponds to $X \sim \mathrm{Ber}(p)$:
p=0.7;
u=rand(1,1);
if u <= 1-p
    x=0
else
    x=1
end
For example:
>> p=0.7;
u=rand(1,1);
if u <= 1-p
    x=0
else
    x=1
end
x =
     1
Now, suppose that we want to simulate multiple measurements associated with this random variable $X \sim \mathrm{Ber}(p)$ associated with the above code. Well, we could simply embed the code in a DO loop, and repeat the above operation the desired number of times. It turns out, however, that Matlab is a programming language that is not well-suited to DO loops. If the loop count is small, it works fine. But if you wanted to simulate, say, one million values associated with X, then it would take a long time. In fact, Matlab was designed in a way that makes it very fast for batch or vector operations. With this in mind, the code below is offered. It includes no IF/ELSE commands, and it requires no DO loop for multiple measurements. We will give the code for the case of one million values associated with X.
p=0.7;
m=1000000;
u=rand(1,m);
u=u-(1-p);
x=ceil(u);
The command u=rand(1,m) results in a 1x1000000 vector of numbers between 0 and 1. The command u=u-(1-p) shifts every number to the left by an amount 1-p. Thus, since here p=0.7, every number that was in the interval [0, 0.3] has been shifted to a number in the interval [-0.3, 0]. In particular, not only is every such number now non-positive, but the closest integer at or to the right of it is zero. The ceil command rounds numbers up; the command ceil is short for ceiling, or "round up to the nearest integer". Similarly, numbers originally in the interval (0.3, 1] are moved to the interval (0, 0.7]. Since they are still positive, the next integer at or above them is one.
Here is an example of running the above code. Rather than showing x, which contains
one million zeros/ones, we included a command that adds these numbers. This sum is the
number of ones, since zeros contribute nothing to a sum.
>> p=0.7;
m=1000000;
u=rand(1,m);
u=u-(1-p);
x=ceil(u);
>> sum(x)
ans =
700202
Notice that the relative frequency of ones is 700202/1000000, which is pretty close to the
0.7 p-value for X. In fact, if we were to pretend that these numbers were collected from
an experiment, then we would estimate the p-value for X by this relative frequency value.
The value of running a simulation is that you know the truth. The truth in the simulation
is that the p-value is 0.7. And so, the simulation using 1000000 measurements appears to
give a pretty accurate estimate of the true p-value. We will next pursue more carefully
what this example has just demonstrated.
Using $U = [U_1, \ldots, U_n]$ to Simulate $X = [X_1, \ldots, X_n]$ Independent and Identically Distributed (iid) Ber(p) Random Variables, and then, from These, Investigating the Probability Structure of the Random Variable $\hat{p} = \frac{1}{n}\sum_{k=1}^{n} X_k \triangleq \overline{X}$
In this subsection we are interested in using simulations to gain some idea of how many subjects, n, would be required to obtain a good estimate of the p-value of a typical subject. The experiment is based on the question: What is the probability that a typical American believes that we should withdraw from Iraq? We will identify the set {1} with an affirmative, and the set {0} with opposition. We will ask this question of n independent subjects and record their responses. Let $X_k \sim \mathrm{Ber}(p)$ be the response of the kth subject. Notice that we are assuming that each subject has the same probability, p, of believing that we should withdraw. Thus, $X = [X_1, \ldots, X_n]$ is an n-D random variable whose components are iid Ber(p) random variables. After we conduct the survey, our next action will be to estimate p using the estimator
$\hat{p} = \dfrac{1}{n}\sum_{k=1}^{n} X_k \triangleq \overline{X}$.    (2.11)
Notice that (2.11) is a random variable that is a composite action: it includes first recording the responses of n subjects, and then taking the average of these responses.
But suppose we were considering conducting an experiment where only 100 measurements were being considered. Well, if we run the above code for this value of m, we get a sum equal to 74. Running it a second time gives a sum equal to 67. And if we run the code 500 times, we can plot a histogram of the sum data, to get a better idea of the amount of uncertainty of the p-value estimator for m = 100. Here is the Matlab code that allows us to conduct this investigation of how good an estimate of the true p-value we can expect:
p=0.7;
n=500;
m=100;
u=rand(m,n);
u=u-(1-p);
x=ceil(u);
phat=0.01*sum(x);
hist(phat)
[Figure 6 is a histogram of the 500 simulated estimator values, spread roughly between 0.55 and 0.9 and centered near 0.7.]
Figure 6. Histogram of the p-value estimator (2.11) associated with m = 100 subjects, using 500 simulations.
Notice that the histogram is reasonably well centered about the true p-value of 0.7. Based on the 500 estimates of (2.11) for m = 100, the sample mean and standard deviation of the estimator (2.11) are 0.7001 and 0.0442, respectively. Were we to use a 2-σ reporting error for our estimate of (2.11) for m = 100, it would be ~±0.09 (or 9%).
To get an idea of how the reporting error may be influenced by the number of subjects used in (2.11) (denoted m in the code below), we embedded the above code in a DO loop over the values m = 100, 1000, and 10,000. For each value of m we computed the sample standard deviation of the 500 estimates. The code and results are given below.
>> %PROGRAM NAME: phatstd
m=[100 1000 10000];
phatstdvec=[];
p=0.7;
n=500;
for i=1:3
u=rand(m(i),n);
u=u-(1-p);
x=ceil(u);
phat=(1/m(i))*sum(x);
phatstd=std(phat);
phatstdvec=[phatstdvec phatstd];
end
phatstdvec
phatstdvec =
    0.0469    0.0140    0.0046
Closer examination of these 3 numbers associated with the chosen 3 values of m would reveal that the standard deviation of (2.11) appears to be inversely proportional to $\sqrt{m}$.
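This observation agrees with a standard result (stated here for reference, and not derived in this chapter): for the average of m iid Ber(p) random variables, the standard deviation of the estimator is
$\sqrt{\operatorname{Var}(\hat{p})} = \sqrt{\dfrac{p(1-p)}{m}}$,
which for p = 0.7 gives approximately 0.0458, 0.0145 and 0.0046 for m = 100, 1000 and 10,000, in good agreement with the simulated values above.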
In the next example, we demonstrate the power of knowledge of the conditional PDF of a 2-D Bernoulli random variable, in relation to a process that is a time-indexed collection of random variables. In general, such a process is known as a random process:
Definition 2.9 A time-indexed collection of random variables, $\{X_t\}$, is known as a random process. If the joint PDF of any subset $\{X_k\}_{t \le k \le t+m}$ does not depend on t, then the process is said to be a stationary random process.
The universe is rife with time-dependent variables that take on only one of two possible values. Consider just a few such processes from a wide range of settings:
• Whether a person is breathing normally or not.
• Whether a drop in air temperature causes a chemical phase change or not.
• Whether farmers will get more than 2 inches of rain in July or not.
• Whether your cell phone receives a correct bit of information or not.
• Whether a cooling pump performs as designed or not.
• Whether you get married in any given year or not.
• Whether a black hole exists in a sector of the galaxy or not.
All of these examples are time-dependent. In the following example we address what might be termed a Bernoulli, or binary, random process.
Example 2.8 Samples of a mixing process are taken once every hour. If the chemical composition is not within required limits, a value of one is entered into the data log. Otherwise, a value of zero is entered. Figure 7 below shows two randomly selected 200-hour segments of the data log for a process that is deemed to be pretty much under control.
From these data, we see that, for the most part, the process is in control. However, when it goes out of control, there seems to be a tendency to remain out of control for more than one hour.
(a) Under federal regulations, the mixture associated with an out-of-control period must be discarded. Management would like to have a computer model for simulating this control data log. It should be a random model that captures key information, such as the mean and standard deviation of a simulated data log. Pursue the design of such a model.
27
Well, having had Professor Sherman’s STAT 305C course, you immediately recall
the notion of a Bernoulli random variable. And so, your first thought is to define the
events [X=0] and [X=1] to correspond to “in” and “out of” control, respectively. To
estimate the p-value for X, you add up all the ‘ones’ in the lower segment given in
Figure 7, and divide this number by 200. This yields the p-value, p=12/200=0.06.
You then proceed to simulate a data log segment by using the following Matlab
commands:
>> u = rand(1,200);   % 200 uniform(0,1) numbers
>> u = u - 0.94;      % shift so that Pr[u > 0] = 0.06
>> y = ceil(u);       % 200 simulated Ber(0.06) values
>> stem(y)            % plot the simulated data log
The stem plot is shown in Figure 8 below.
[Figure 7: two stem plots titled "Time Series for Process Control State", each showing y(t) versus Time over hours 0 to 200.]
Figure 7. Two 200-hour segments of the control data log for a mixing process.
Even though Figure 8 has general similarities to Figure 7, it lacks the “grouping”
tendency of the ‘ones’. Hence, management feels the model is inadequate.
[Figure 8: a stem plot titled "Simulation of Process Data Log Using a Ber(0.06) random variable", showing Control State versus Time over hours 0 to 200.]
Figure 8. Simulation of a 200-hour data log segment using a Ber(0.06) random variable.
(b) Use the concept of a 2-D Bernoulli random variable whose components are not
assumed to be statistically independent, as the basis for your model. Specifically,
X1 is the process control state at any time, t, and X2 is the state at time t+1.
To this end, you need to configure the data to correspond to X = (X1, X2). You do this
in the following way. For simplicity, consider the following measurements
associated with a 10-hour segment: [0 0 0 1 0 0 0 0 1 0]. This array represents 10
measurements of X1. Now, for each measurement of X1, the measurement that
follows it is the corresponding measurement of X2. Since you have no measurement
following the 10th measurement of X1, this means that you have only 9
measurements of X = (X1, X2); namely

X1: 0 0 0 1 0 0 0 0 1
X2: 0 0 1 0 0 0 0 1 0

Of these 9 ordered pairs, you have 5 (0,0) elements. And so, your estimate for p00 would
be 5/9.
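The same pairing procedure is easy to carry out in Matlab. The following is a minimal sketch, using the 10-hour segment above as the data log x; the variable names are choices made here for illustration.

% Estimate the 2-D Bernoulli probabilities (p00,p10,p01,p11) from a 0/1 data log
x = [0 0 0 1 0 0 0 0 1 0];    % the 10-hour example segment above
x1 = x(1:end-1);              % measurements of X1
x2 = x(2:end);                % measurements of X2 (one hour later)
npairs = length(x1);
p00 = sum(x1==0 & x2==0)/npairs;
p10 = sum(x1==1 & x2==0)/npairs;
p01 = sum(x1==0 & x2==1)/npairs;
p11 = sum(x1==1 & x2==1)/npairs;
[p00 p10 p01 p11]             % for this segment: 5/9, 2/9, 2/9, 0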
Using this procedure on the second data set in Figure 7, you arrive at the following
numerical estimates: p10 = p01 = 0.01 and p11 = 0.05. It follows that p00 = 0.93.
Since p1 = Pr[X1 = 1] = p10 + p11 = 0.06, the first measurement of your 200-hour
simulation is that of a Ber(0.06) random variable. You simulate a numerical value for this
variable in exactly the way you did in part (a). If that number is 0, the p-value for
simulating the second number is the conditional probability Pr[X2 = 1 | X1 = 0], and if your
first number was a 1, then you use Pr[X2 = 1 | X1 = 1]. Specifically,
Pr[X2 = 1 | X1 = 0] = p01 / (p00 + p01) = 0.01 / (0.93 + 0.01) ≈ 0.0106 ≜ p2|[X1=0]

Pr[X2 = 1 | X1 = 1] = p11 / (p10 + p11) = 0.05 / (0.01 + 0.05) ≈ 0.8333 ≜ p2|[X1=1] .
The Matlab code for this simulation is shown below.
%PROGRAM NAME: berprocess.m
% This program generates a realization
% of a Ber(p00,p10,p01,p11) correlated process
npts = 200;
y = zeros(npts+1,1);
% Stationary joint probabilities between Y(k-1) and Y(k)
p01 = 0.01;                     % estimated from the Figure 7 data
p10 = p01;
p11 = 0.05;
p00 = 1 - (p11 + p10 + p01);    % = 0.93
pvec = [p00 p10 p01 p11]
% Marginal p for any Y(k)
p = p11 + p10                   % = 0.06
% ------------------------------
x = rand(npts+1,1);
y(1) = ceil(x(1) - (1-p));      % initial condition: a Ber(p) value
for k = 2:npts+1
    if y(k-1) == 0
        pk = p01/(p00 + p01);   % Pr[X2 = 1 | X1 = 0]
    else
        pk = p11/(p10 + p11);   % Pr[X2 = 1 | X1 = 1]
    end
    y(k) = ceil(x(k) - (1-pk));
end
stem(y(1:200))
xlabel('Time')
ylabel('y(t)')
title('Time Series for Process Control State')
Running this code twice gives the two simulation segments shown in Figure 9 below.
[Figure 9: two stem plots titled "Time Series for Process Control State", each showing y(t) versus Time over hours 0 to 200.]
Figure 9. Two 200-hour data log simulations using a 2-D Bernoulli random variable with
probabilities p10 = p01 = 0.01, p11 = 0.05, and hence, p00 = 0.93.
Management feels that this model captures the grouping tendency of the ones, and so
your model is approved. Congratulations!!!
Before we leave this example, let’s think about the reasonableness of the ‘ones’ grouping
tendency. What this says is that when the process does go out of control, it has a tendency
to remain out of control for more than one hour. In fact, the above conditional probability
Pr[X2 = 1 | X1 = 1] = p11 / (p10 + p11) = 0.05 / (0.01 + 0.05) ≈ 0.8333 ≜ p2|[X1=1]

states that if the process is out of control during one hour, then there is an 83% chance that it will
remain out of control the next hour. This can point to a number of possible sources
remain out of control the next hour. This can point to a number of possible sources
responsible for the process going out of control. Specifically, if the time constant
associated with either transient nonhomogeneities in the chemicals, or with a partially
blocked mixing valve is on the order of an hour, then one might have reason to
investigate these sources. If either of these sources has a time constant on the order of
hours, then the above model can be used for early detection of the source. Specifically,
we can use a sliding window to collect overlapping data segments, and estimate the
probabilities associated with X = (X1, X2). If a blockage in the mixing valve takes hours
to dissolve, then one might expect the above probability value 0.8333 to increase. We can
use this logic to construct a hypothesis test for determining whether we think the valve is
blocked or not. We will discuss hypothesis testing presently. Perhaps this commentary
will help to motivate the reader to look forward to that topic. □
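To make the sliding-window idea concrete, the following is a minimal sketch of how one might track the estimate of Pr[X2 = 1 | X1 = 1] over overlapping segments of a 0/1 data log. The 200-hour window length, the placeholder data log, and the variable names are choices made here only for illustration.

% Sliding-window estimates of Pr[X2=1 | X1=1] from a 0/1 data log x
x = ceil(rand(1,1000) - 0.94);          % placeholder log: 1000 hours of Ber(0.06) data
w = 200;                                % window length in hours (an illustrative choice)
p11cond = zeros(1, length(x)-w);
for t = 1:length(x)-w
    seg = x(t:t+w);                     % current overlapping window
    x1 = seg(1:end-1);  x2 = seg(2:end);
    n1 = sum(x1==1);                    % hours in which the process was out of control
    if n1 > 0
        p11cond(t) = sum(x1==1 & x2==1)/n1;   % estimate of Pr[X2=1 | X1=1]
    end
end
plot(p11cond), xlabel('Window start (hour)'), ylabel('Estimated Pr[X2=1 | X1=1]')

Applied to the actual data log, a sustained rise of this estimate well above its nominal value of about 0.83 could then serve as the statistic for the hypothesis test alluded to above.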
4. Functions of n-D Bernoulli Random Variables
Having a solid grasp of the sample space and probability description of an n-D Bernoulli
random variable is crucial in order to appreciate the simplicity of the material in this
section. If the reader finds this material difficult, then it is suggested that the previous
sections be reviewed. Again, the concepts are (i) a random variable, which is an action
that, if repeated, could lead to different results, (ii) the sample space associated with the
variable, which is the set of all measurable values that the variable could take on, and (iii)
the probabilities associated with subsets of the sample space. The reader should place the
primary focus on the nature of the action and on the set of measurable values that could
result. In a sense, the computation of probabilities is “after the fact”; that is, once the
events of concern have been clearly identified as subsets of the sample space, the
probability of those events is almost trivial to compute. If the reader can accept and
appreciate this view, then this section will be simple. Furthermore, as we arrive at some
of the more classic random variables in most textbooks, the reader will not only
understand their origins better, but will be able to readily relax the assumptions upon
which they are based, if need be.
4.1 Functions of a 1-D Bernoulli Random Variable
A function is an operation, an algorithm, or an action. Due to the extremely simple nature
of X ~ Ber(p), there are not many operations that one can perform on X. One is the
following:
Example 2.9 For X ~ Ber(p), perform the following operation on X:

Y = aX + b.
Since X is a random variable, and Y is a function of X, it follows that Y is also a random
variable. In this case, it is the operation of ‘multiplying X by the constant a, and then
adding the constant b to it.’ The first step in understanding Y is to identify its sample
space. To this end, perform the above operation on each element in S_X = {0,1}, and the
reader should arrive at S_Y = {b, a+b}. It should be apparent that the following sets are
equivalent: {0} ≡ {b} and {1} ≡ {a+b}. Since they are equivalent, they must have the
same probabilities: Pr[Y = a+b] = Pr{a+b} = Pr[X = 1] = p and Pr[Y = b] = 1 − p.
Since a and b are any constants the reader chooses, it follows that any random variable Y
that can take on only one of two possible numerical values is, basically, just a veiled
Bernoulli random variable. □
Example 2.10 For X ~ Ber(p), perform the following operation on X:

Y = aX² + cX + b.

Even though this operation is more complicated than that of the last example, the reader
should not feel intimidated. Simply proceed to identify the sample space for Y, in exactly
the same manner as was done in the last example: {0} ≡ {b} and {1} ≡ {a+c+b}.
Hence, S_Y = {b, a+c+b}, and so, again, we see that Y is simply a veiled Bernoulli
random variable. □
4.2 Functions of a 2-D Bernoulli Random Variable
Let X ~ Ber(p00, p10, p01, p11). Since X is completely and unambiguously defined by its
sample space, S_X = {(0,0), (1,0), (0,1), (1,1)}, and the associated probabilities,
{p00, p10, p01, p11}, the reader should feel confident that he/she can easily accommodate
any function of X. Consider the following examples.
Example 2.11 Perform the following operation on X:

Y = aX1 + bX2.

We then have the following equivalent sets, or events, and their associated probabilities:

{(0,0)} ≡ {0} ⇒ Pr[Y = 0] = p00
{(1,0)} ≡ {a} ⇒ Pr[Y = a] = p10
{(0,1)} ≡ {b} ⇒ Pr[Y = b] = p01
{(1,1)} ≡ {a+b} ⇒ Pr[Y = a+b] = p11
Hence, the sample space for Y is S_Y = {0, a, b, a+b}, and the elemental subsets of this set
have the above probabilities. Armed with this complete description of Y, the reader
should feel competent and unafraid to answer any question one might pose in relation to
Y.
For chosen numerical values a = b = 1, we have S_Y = {0, 1, 2}. Notice that we do not have
S_Y = {0, 1, 1, 2}. Why? Because S_Y is the set of possible values that Y can take on. And so,
it makes no sense to include the number 1 twice. In this case, the subset {1} of S_Y is
equivalent to the subset {(1,0), (0,1)} of S_X. With this awareness of the equivalence of
sets, it is almost trivial to compute the probability

Pr[Y = 1] = Pr{(1,0), (0,1)} = Pr({(1,0)} ∪ {(0,1)}) = Pr{(1,0)} + Pr{(0,1)} = p10 + p01
If the reader feels that the above equation is unduly belabored, good. Then the material is
becoming so conceptually clear and simple that we are succeeding in conveyance of the
same. If the reader is confused or unsure as to the reasons for the equalities in the above
equation, then the reader should return to the previous sections and fill in any gaps in
understanding.
Before we leave this example, consider the application where X = [X1, X2] corresponds to
the measurement of significant (1) versus insignificant (0) rainfall on two consecutive
days. Suppose that on any given day, the probability of significant rainfall is p. Then
Pr[X1 = 1] = Pr[X2 = 1] = p. If we assume that X1 and X2 are independent, then we arrive
at the following probabilities for Y:

Pr[Y = 0] = Pr{(0,0)} = p00 = (1 − p)(1 − p)
Pr[Y = 1] = Pr{(1,0), (0,1)} = p10 + p01 = 2p(1 − p)
Pr[Y = 2] = Pr{(1,1)} = p11 = p²
Notice that the rightmost quantities in these three equations are a consequence of the
assumption of independence; whereas the middle quantities entail no such assumption.
This leads to the question: Is it reasonable to assume that if a region experiences a
significant amount of rainfall on any given day, then it might be more likely to
experience a significant amount on the next day? If the region is prone to experiencing
longer weather fronts, where storms linger for more than one day, then the answer to this
question would be yes. In that case, X1 and X2 are not independent. Hence, the rightmost
expressions in the above equations are wrong, whereas the middle expressions are still
correct. The caveat here is that one must have reliable numerical estimates of these
probabilities. If one only has information about any given day, and not about two
consecutive days, then one might resort to assuming the variables are independent. This
is not necessarily a wrong assumption. But it is one that should be clearly noted in
presenting probability information to farmers. □
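To see how much the independence assumption can matter numerically, one can compare the two sets of expressions above for a hypothetical joint description. The daily p-value and the joint probabilities below are invented purely for illustration.

% Compare Pr[Y=y] under independence vs. under a (hypothetical) dependent joint model
p = 0.3;                           % hypothetical daily chance of significant rain
indep = [(1-p)^2, 2*p*(1-p), p^2]  % Pr[Y=0], Pr[Y=1], Pr[Y=2] assuming independence
p11 = 0.20;                        % hypothetical joint value reflecting lingering storms
p10 = p - p11;  p01 = p10;         % keeps Pr[X1=1] = Pr[X2=1] = p
p00 = 1 - (p10 + p01 + p11);
dep = [p00, p10+p01, p11]          % Pr[Y=0], Pr[Y=1], Pr[Y=2] from the joint probabilities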
Example 2.12 Perform the following operation on X:
Y = |X1 − X2| .
This operation is not as fabricated as it might seem. Consider sending a text message to
your friend. Most communications networks convert text messages into a sequence of
zeros and ones. Each 0/1 is called an information bit. Now, let the event that you send a 1
correspond to [X1 = 1], and let the event that your friend correctly receives it be [X2 = 1].
Then, here, the event [Y = 1] corresponds to a bit error in the transmission. We then have
the following equivalent sets, or events, and their associated probabilities:
{(0,0), (1,1)} ≡ {0} ⇒ Pr[Y = 0] = p00 + p11
{(1,0), (0,1)} ≡ {1} ⇒ Pr[Y = 1] = p10 + p01 .

Hence, Y ~ Ber(p = p10 + p01). Even though we have the joint probability information in
the parameters { p00 , p10 , p01, p11} , it is more useful to compute the conditional probability
information. After all, what you are really concerned with is the event that your friend
correctly receives a 0/1, given that you sent a 0/1. As in Example 2.8, these conditional
probabilities are given by
Pr[X2 = 1 | X1 = 0] = p01 / (p00 + p01)

Pr[X2 = 0 | X1 = 1] = p10 / (p10 + p11)
Usually, it is presumed that each bit you send to your friend is as likely to be a zero as it
is a one; that is, X1 ~ Ber(p = 0.5). In this case, the above error probabilities become

Pr[X2 = 1 | X1 = 0] = p01 / 0.5

Pr[X2 = 0 | X1 = 1] = p10 / 0.5
If it is further assumed that p10 = p01, then we arrive at the usual expression for the bit
transmission error:

e_bit ≜ Pr[X2 = 1 | X1 = 0] = Pr[X2 = 0 | X1 = 1] = 2 p01 . □
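As a sanity check on these expressions, one can simulate a large number of transmitted/received bit pairs from an assumed joint description and compare the empirical conditional error rates to the formulas above. The joint probabilities below are chosen here only for illustration.

% Simulate (X1,X2) pairs from assumed joint probabilities and estimate the bit-error rates
p00 = 0.48; p01 = 0.02; p10 = 0.02; p11 = 0.48;   % hypothetical values with p10 = p01
m = 1e5;
u = rand(1,m);
% Partition [0,1): [0,p00) -> (0,0), next p01 -> (0,1), next p10 -> (1,0), the rest -> (1,1)
x1 = (u >= p00 + p01);
x2 = (u >= p00 & u < p00 + p01) | (u >= p00 + p01 + p10);
err01 = sum(x1==0 & x2==1)/sum(x1==0)   % estimate of Pr[X2=1 | X1=0]; should be near 2*p01 = 0.04
err10 = sum(x1==1 & x2==0)/sum(x1==1)   % estimate of Pr[X2=0 | X1=1]; should be near 2*p10 = 0.04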
4.3 Functions of an n-D Bernoulli Random Variable
Recall that a complete and unambiguous description of an n-D Bernoulli random variable
requires specification of the probabilities associated with the 2^n elemental sets in the
sample space for X. This sample space can be expressed as:

S_X = {(x1, x2, ..., xn) | every xk ∈ {0,1}} .
Denote the probability associated with the elemental set {(x1, x2, ..., xn)} as p_{x1 x2 ... xn}. If
one has access to m numerical measurements {(x1^(j), x2^(j), ..., xn^(j))}_{j=1}^{m} of X, then one can
estimate p_{x1 x2 ... xn} by the relative number of occurrences of the element (x1, x2, ..., xn) in
relation to the number of measurements, m. In this section we will restrict our attention to
the more common setting, wherein the components of X are mutually independent. It
follows that

p_{x1 x2 ... xn} = Pr[X1 = x1 ∩ X2 = x2 ∩ ... ∩ Xn = xn] = ∏_{k=1}^{n} Pr[Xk = xk] .   (2.8a)
Now, since Xk ~ Ber(pk), we have

Pr[Xk = xk] = 1 − pk  for xk = 0,   and   Pr[Xk = xk] = pk  for xk = 1.   (2.8b)
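Equations (2.8a) and (2.8b) translate directly into a single line of Matlab. The following is a minimal sketch; the p-values and the particular element of S_X are chosen here only for illustration.

% Probability of one elemental set of S_X under independence, via (2.8a)-(2.8b)
pk = [0.7 0.5 0.9 0.2];                 % hypothetical p-values for X1,...,X4
x  = [1 0 1 1];                         % one element of S_X
px = prod( pk.^x .* (1-pk).^(1-x) )     % p_{1011} = 0.7*0.5*0.9*0.2 = 0.063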
We are now in a position to consider some classic functions of X = [X1, ..., Xn].
Example 2.13 Define the random variable Y, which is the smallest index, k, such that
xk = 1. For example, suppose n = 5. Then, in relation to the elements (0,1,0,0,1) and
(0,0,1,1,0), the associated values for Y are 2 and 3, respectively. More generally, the
sample space for Y is S_Y = {1, 2, ..., n}. Before we compute the probabilities associated
with the elemental subsets of this sample space, let's find the equivalent events in S_X.
Specifically,

[Y = k] ≡ [X1 = 0 ∩ X2 = 0 ∩ ... ∩ X_{k−1} = 0 ∩ Xk = 1] .
Hence, in view of the assumption that the components of X = [X1, ..., Xn] are mutually
independent, we have

Pr[Y = k] = Pr[X1 = 0 ∩ X2 = 0 ∩ ... ∩ X_{k−1} = 0 ∩ Xk = 1] = [∏_{j=1}^{k−1} (1 − pj)] pk .   (2.9a)
If we assume that, not only are the components of X mutually independent, but that they
all have exactly the same p-value, then we obtain the following well known geometric
probability model for Y:
Pr[Y = k] = [∏_{j=1}^{k−1} (1 − p)] p = p(1 − p)^{k−1} ,   1 ≤ k ≤ n.   (2.9b)
Of course, the expression (2.9b) is simpler than (2.9a). It should be, since all the p-values
are assumed to be the same. However, (2.9a) is often a more realistic situation than
(2.9b). All too often, the assumption of equal p-values is born of convenience or of lack
of understanding. We will see this same assumption in the next example.
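To illustrate the difference, the short sketch below evaluates (2.9a) for a set of unequal p-values, chosen here only for illustration, and (2.9b) using their average as the common p-value.

% Compare the probabilities (2.9a) and (2.9b) for Y, the index of the first 1
pj = [0.1 0.3 0.2 0.5 0.4];                  % hypothetical unequal p-values, n = 5
n = length(pj);
p = mean(pj);                                % common p-value used in (2.9b)
PrY_a = zeros(1,n);  PrY_b = zeros(1,n);
for k = 1:n
    PrY_a(k) = prod(1 - pj(1:k-1)) * pj(k);  % (2.9a)
    PrY_b(k) = p*(1-p)^(k-1);                % (2.9b)
end
[PrY_a; PrY_b]                               % the two rows can differ noticeably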
Example 2.14 In Example 2.7 we used Matlab to investigate the probability description
for the p-value estimator associated with the assumedly independent and identically
distributed (iid) Ber(p) components of X = [X1, ..., Xn], given by (2.7), and repeated
here:

p̂ = (1/n) ∑_{k=1}^{n} Xk = X̄ .
In this example, we will obtain the actual probability model for p̂ . Furthermore, we will
obtain it in the more general (and often more realistic) case, where the components are
independent, but they do not have one and the same p-value. To this end, we first give the
sample space for the following random variable:
Y = ∑_{k=1}^{n} Xk .

The sample space for Y is

S_Y = {0, 1, 2, ..., n−1, n} .
Notice that this sample space has n+1 elements in it. With this, we are in a position to
identify the subset of S_X that corresponds to the elemental subset {k} of S_Y. We begin
with the two simplest subsets:

[Y = 0] ≡ {(0,0,...,0)}   and   [Y = n] ≡ {(1,1,...,1)} .
The key point here is that the only way that Y can take on the value zero (n) is if every
component of X takes on the value zero (one). In the more general setting wherein the
components are assumed to be mutually independent, but with not necessarily the same
p-value, the probabilities of these two events are simply
Pr[Y = 0] = Pr{(0,0,...,0)} = Pr[X1 = 0 ∩ ... ∩ Xn = 0] = ∏_{k=1}^{n} Pr[Xk = 0] = ∏_{k=1}^{n} (1 − pk)

and

Pr[Y = n] = Pr{(1,1,...,1)} = Pr[X1 = 1 ∩ ... ∩ Xn = 1] = ∏_{k=1}^{n} Pr[Xk = 1] = ∏_{k=1}^{n} pk .
Next, we consider the slightly more challenging events [Y = 1] and [Y = n−1]. In
particular, the only way that Y can take on the value one is if one and only one of the
components of X equals one. Similarly, the only way that Y can take on the value n−1 is
if one and only one of the components of X equals zero. Hence, we have the following
equivalent events:

[Y = 1] ≡ "only one component of X equals 1" ≡ {(1,0,...,0), (0,1,0,...,0), ..., (0,...,0,1)}

and

[Y = n−1] ≡ "only one component of X equals 0" ≡ {(0,1,...,1), (1,0,1,...,1), ..., (1,...,1,0)} .
Notice that the elements making up these sets are distinct. For example, the element
(1,0,...,0) is a point in n-D space that is distinctly different than the point (0,1,0,...,0).
Sure, many of the components of these two elements are the same. But the elements are
distinctly separate points in n-D space. Their intersection is, therefore, the empty set.
Hence, the probability of the event [Y = 1] is simply the sum of the elemental probabilities
associated with the set {(1,0,...,0), (0,1,0,...,0), ..., (0,...,0,1)}; that is:
Pr[Y = 1] = Pr{(1,0,...,0), (0,1,0,...,0), ..., (0,...,0,1)} = Pr{(1,0,...,0)} + ... + Pr{(0,...,0,1)}
          = p1 ∏_{j=2}^{n} (1 − pj) + ... + [∏_{j=1}^{n−1} (1 − pj)] pn
          = ∑_{k=1}^{n} [ pk ∏_{j=1, j≠k}^{n} (1 − pj) ] .
Similarly,
Pr[Y = n−1] = Pr{(0,1,...,1), (1,0,1,...,1), ..., (1,...,1,0)} = Pr{(0,1,...,1)} + ... + Pr{(1,...,1,0)}
            = (1 − p1) ∏_{j=2}^{n} pj + ... + [∏_{j=1}^{n−1} pj] (1 − pn)
            = ∑_{k=1}^{n} [ (1 − pk) ∏_{j=1, j≠k}^{n} pj ] .
While to many readers these expressions may seem formidable, if not downright ugly,
such readers should carefully assess whether their queasiness is due to the unfamiliar
notation, or to a lack of conceptual understanding. The first portion of each of these
equations is a "long hand" expression, wherein each probability is noted individually.
The second portions use summation (∑) and product (∏) notation, in order to make the
expressions more compact. The reader who is unfamiliar with this type of notation should
not mistake discomfort with the notation for a lack of conceptual understanding.
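For readers who would rather see the sums and products evaluated numerically, the following minimal sketch computes Pr[Y = 1] and Pr[Y = n−1] from the expressions above, for a set of p-values chosen purely for illustration.

% Evaluate Pr[Y=1] and Pr[Y=n-1] for independent, non-identical Ber(pk) components
pk = [0.2 0.5 0.7 0.4];                             % hypothetical p-values, n = 4
n = length(pk);
PrY1 = 0;  PrYnm1 = 0;
for k = 1:n
    others = [1:k-1, k+1:n];                        % indices j not equal to k
    PrY1   = PrY1   + pk(k)*prod(1 - pk(others));   % only component k equals 1
    PrYnm1 = PrYnm1 + (1 - pk(k))*prod(pk(others)); % only component k equals 0
end
[PrY1 PrYnm1]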
Now, let's consider the most general event [Y = k]. This is the event that k of the n
components of X take on the value one, while the others take on the value zero, right?
Well, one such way corresponds to the event {(1_1, 1_2, ..., 1_k, 0_{k+1}, ..., 0_n)}. Notice that we have
included subscripts on the 0's and 1's simply to make clear their positions in the
ordering of the n components of this element. It should be clear that the probability of
this event is simply

Pr{(1_1, 1_2, ..., 1_k, 0_{k+1}, ..., 0_n)} = ∏_{j=1}^{k} pj · ∏_{j=k+1}^{n} (1 − pj) .   (2.10)
The question now is: How many ways can one position k ones and (n-k) zeros in n slots?
Each way will have a corresponding probability, just as the way of positioning all the
ones first (followed by all the zeros) resulted in the above probability. The answer to
this question begins by answering a similar question:
How many ways can one order n distinctly different objects in n slots?
Well, in the first slot we can place any one of the n distinct objects. Once we have chosen
one of them, then in the second slot we have only n − 1 distinct objects to choose from.
Having chosen one of them, we are left to choose from n − 2 distinct objects to place in
the third slot, and so on. Think of this as a tree, where each branch has branches, and each
of those branches has branches, and so on. The question is to figure out how many total
branches there are. Let's identify the first slot with the biggest diameter branches. Then
we have n of these, corresponding to the n distinct objects. Now, each one of these main
branches has n − 1 branches of slightly smaller diameter. And each one of those branches
has n − 2 slightly smaller branches, and so on. So, the total number of branches is

n · (n − 1) · (n − 2) · ... · 2 · 1 = n!   (read as "n factorial")
Now, if each of the k ones and the n − k zeros were distinctly different (say, different
colors, or different aromas!), then there would be n! different possible ways to order
them. However, all the ones look alike, and all the zeros look alike. And so there will be
fewer than n! ways to order them. How many fewer? Well, how many ways can one
order the k ones, assuming they are distinct? The answer is k! ways. And how many ways
can one order n − k zeros, assuming they are distinct? The answer is (n − k)! ways. But
since the ones and zeros are not distinct, the number of ways we can position them into
the n slots is:

C(n, k) = n! / [k! (n − k)!]   (read as "n choose k")   (2.11)
Now, each way we position the k ones in the n slots has a corresponding probability, as
demonstrated above. And so, there can be a lot of probabilities to compute, even for a
modest value of n. For example, if n=10, then for k=5 we have to compute
10! / (5! 5!) = (10·9·8·7·6) / (5·4·3·2·1) = 252

different probabilities, and then add them up to get Pr[Y = 5]. And so, even if we were to
now proceed to develop a general expression for this probability, it would be really ugly!
Furthermore, given specified p-values {pj}_{j=1}^{n} corresponding to the n Ber(pj) components
of X, we would still need a calculator, if not a computer, to compute this probability for
the events [Y=k] for all of the n+1 possible values of k: k = 0, 1, ..., n. For these reasons,
we will focus our attention on an algorithm for computing the elemental probabilities
associated first with S_X. Then, having numerical values for these probabilities, we will
develop an algorithm that uses them to compute the probability of each [Y=k] event. We
do this in the Appendix to this chapter, since the development is mainly one of writing a
program. We will use Matlab as our programming language. However, those familiar
with other languages (e.g. C++) may prefer to write their own code.
Recall that the only assumption, thus far, has been that the n components of X are
independent. We will now make the further assumption that they all have one and the
same p-value. In this case (2.10) becomes

Pr{(1_1, 1_2, ..., 1_k, 0_{k+1}, ..., 0_n)} = ∏_{j=1}^{k} pj · ∏_{j=k+1}^{n} (1 − pj) = p^k (1 − p)^{n−k} .   (2.12)
However, since the p-values are identical, then so long as an element of S_X has k 1's
(and, consequently, n−k 0's), that element has the probability (2.12). And the number of
distinct elements of this type is given by (2.11). Hence, under the assumption that the
components of X are iid, we have the following probability description for their sum, Y:

Pr[Y = k] = C(n, k) p^k (1 − p)^{n−k} ,   k = 0, 1, ..., n.   (2.13)
The probability model (2.13) is known as the Binomial probability model. Notice that it
entails two parameters: n, the number of Bernoulli random variables being added, and p,
the p-value associated with each and every Bernoulli variable.
Having (2.13), it is trivial to arrive at the probability model for our p-value estimator

p̂ = (1/n) ∑_{k=1}^{n} Xk = X̄ = (1/n) Y .
The sample space for p̂ is obtained by dividing each element in S_Y by n. Hence, we have
the following equivalent elemental events:

[p̂ = k/n] ≡ [Y = k] .

It follows that the probability description for p̂ is

Pr[p̂ = k/n] = Pr[Y = k] = C(n, k) p^k (1 − p)^{n−k} .   (2.14)
The following Matlab code computes this theoretical probability model for p̂:
n = 100;
p = 0.7;
xvec = 0;               % sample-space values k/n, beginning with k = 0
pvec = (1-p)^n;         % Pr[phat = 0], i.e. (2.14) with k = 0
for k = 1:n
    xvec = [xvec , k/n];
    pval = ( factorial(n)/(factorial(k)*factorial(n-k)) ) * p^k * (1-p)^(n-k);
    pvec = [pvec , pval ];
end
stem(xvec, pvec)
This code gives the following figure.
[Stem plot of Pr(phat = x) versus x, for x between 0 and 1.]
Figure 10. Graph of the probability model (2.14), i.e. the Binomial model (2.13) plotted against x = k/n, for n=100 and p=0.7.
A comparison of the empirical results for the 500 estimates obtained in Example 2.7 with
Figure 10 shows a reasonably similar shape.
APPENDIX
To begin, we should figure out just how many elemental events are associated with S_X.
Well, in the first slot there can be a zero or a one. Then for each of these "branches", the
second slot can be a zero or a one, and so on. It turns out that there are a total of
2^n "branches"; that is, there are 2^n elements in S_X. For example, if n=10, then S_X has
2^10 = 1024 elements in it. Now let's compute the following numbers:
n(0) ≜ "the number of elements that contain no 1's" = 1

n(1) ≜ "the number of elements that contain exactly one 1" = C(n, 1) = n

n(2) ≜ "the number of elements that contain exactly two 1's" = C(n, 2) = n! / [(n − 2)! 2!]

In general,

n(k) ≜ "the number of elements that contain exactly k 1's" = C(n, k) = n! / [(n − k)! k!] .
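As a starting point for the algorithm described above, the following is a minimal sketch, not the finished program, that enumerates the 2^n elements of S_X for a small n, computes each elemental probability via (2.8a)-(2.8b), and accumulates them into Pr[Y = k] for the sum Y = X1 + ... + Xn. The p-values are chosen here only for illustration.

% Enumerate S_X, compute the elemental probabilities under independence,
% and accumulate them into Pr[Y=k], k = 0,...,n
pk = [0.7 0.2 0.5 0.9];                  % hypothetical p-values; n = 4, so 2^4 = 16 elements
n = length(pk);
PrY = zeros(1, n+1);                     % PrY(k+1) will hold Pr[Y = k]
for idx = 0:2^n-1
    x = bitget(idx, 1:n);                % one element of S_X, written as a 0/1 vector
    px = prod( pk.^x .* (1-pk).^(1-x) ); % its probability, via (2.8a)-(2.8b)
    k = sum(x);                          % the value of Y for this element
    PrY(k+1) = PrY(k+1) + px;
end
PrY                                      % Pr[Y=0], ..., Pr[Y=n]; these sum to 1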