GV01 – Mathematical Methods, Algorithmics and Implementations:
Axioms and Probability by Counting
Dan Stoyanov,
Francois Chadebecq, Francisco Vasconcelos
Department of Computer Science, UCL
2016
http://moodle.ucl.ac.uk/course/view.php?id=11547
Axioms and Theorems of Probability
• Introduction
• Probability Space
• Axioms and Theorems of Probability
• Counting to Compute Probabilities
• Dependence and Independence
• Entropy
2
Axioms of Probability
• Only three axioms are needed to describe how all of probability works:
1. Probabilities are non-negative
2. The probability that something happens is 1
3. The probability of the union of several disjoint events is the sum of their individual probabilities
• We can illustrate these most easily by looking at the indicator-function interpretation
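As a minimal sketch, the three axioms can be checked numerically through the relative-frequency (indicator-function) interpretation. The fair die, the events A and B, and the trial count here are all illustrative assumptions, not part of the original slides:

```python
import random

# Simulate rolling a fair die and estimate probabilities via
# relative frequencies (averages of indicator functions).
random.seed(0)
trials = [random.randint(1, 6) for _ in range(100_000)]

def rel_freq(event):
    """Relative frequency: mean of the indicator function of `event`."""
    return sum(1 for x in trials if x in event) / len(trials)

A = {1, 2}                     # an event (assumed for illustration)
B = {5}                        # an event disjoint from A
omega = {1, 2, 3, 4, 5, 6}     # the sample space

# Axiom 1: relative frequencies are non-negative
assert rel_freq(A) >= 0
# Axiom 2: the certain event has relative frequency exactly 1
assert rel_freq(omega) == 1.0
# Axiom 3: for disjoint events the frequencies add
assert abs(rel_freq(A | B) - (rel_freq(A) + rel_freq(B))) < 1e-12
```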
3
First Axiom of Probability
• The probability of an event A occurring is a non-negative real number,
P(A) ≥ 0
4
Demonstration with Relative Frequency
• Recall that the relative frequency of A over N trials is the average of its indicator function,
frel(A) = (1/N) Σ_i I_A(ω_i)
• Since the indicator function is either 0 or 1, this clearly has to be a non-negative number
5
Second Axiom of Probability
• The probability that some elementary event somewhere in the sample space will occur is 1,
P(Ω) = 1
• Another way to view this is that there are no
elementary events outside of the sample space
• It is known as the certain event
6
Demonstration with Relative Frequency
• The relative frequency of the certain event is
frel(Ω) = (1/N) Σ_i I_Ω(ω_i) = 1
• Since every outcome lies in the sample space, the indicator must always be 1
7
Third Axiom of Probability
• For mutually exclusive events A and B, the probability of their union is
P(A ∪ B) = P(A) + P(B)
8
Mutually Exclusive Events
W
A
B
9
Demonstration with Relative Frequency
• The relative frequency is
frel(A ∪ B) = (1/N) Σ_i I_{A∪B}(ω_i)
• Since the events are disjoint, we know that
I_{A∪B}(ω) = I_A(ω) + I_B(ω)
10
Illustration with Relative Frequency
• Therefore, the sum is given by
frel(A ∪ B) = frel(A) + frel(B)
11
Example with Mutually Exclusive Events
• Suppose
• Therefore,
12
Third Axiom and Non-Overlapping Events
• Why isn’t it always the case that
P(A ∪ B) = P(A) + P(B)?
13
Die Example
• Consider the events
• The event which is the union of A and B, together with
its probability, are
14
Die Example
• However, the naïve solution of adding just frel(A) and
frel(B) together gives us
• This violates the second axiom, and so it’s not a valid
probability measure
15
Overlapping Events
W
A
B
16
Overlapping Events
W
17
Computing Probabilities of Overlapping Events
• Therefore, the general expression for the relative frequency of the union of two events is
frel(A ∪ B) = frel(A) + frel(B) − frel(A ∩ B)
18
Computing Probabilities of Disjoint Events
• When the events are disjoint,
frel(A ∩ B) = 0
• Therefore, the intersection term drops out of the general expression
19
Computing Probabilities of Disjoint Events
• Substituting into our general expression for computing the probability of the union,
frel(A ∪ B) = frel(A) + frel(B)
• In other words, we simply add the probabilities together according to the third axiom
20
Computing Probabilities of Overlapping Events
• When the events overlap,
frel(A ∩ B) > 0
21
Computing Probabilities of Overlapping Events
• Therefore, it turns out that another way to compute the probability of the union of overlapping events is to add the probabilities naïvely and subtract off the intersection region,
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
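A minimal sketch of this inclusion-exclusion rule on a fair die; the events A (even numbers) and B ({1, 2, 3, 4}) are assumed for illustration and also show how the naïve sum can exceed 1:

```python
from fractions import Fraction

# Inclusion-exclusion for a fair die: P(A ∪ B) = P(A) + P(B) - P(A ∩ B).
# Events assumed for illustration: A = even numbers, B = {1, 2, 3, 4}.
omega = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}
B = {1, 2, 3, 4}

def P(event):
    return Fraction(len(event & omega), len(omega))

naive = P(A) + P(B)                  # 7/6 > 1: violates the second axiom
correct = P(A) + P(B) - P(A & B)     # subtract the overlap once
assert naive == Fraction(7, 6)
assert correct == P(A | B) == Fraction(5, 6)
```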
22
Illustration
• For the die example,
23
Complementation
• The third axiom also leads to complementation,
P(Ā) = 1 − P(A)
• To prove this, recall that
A ∪ Ā = Ω and A ∩ Ā = ∅
• Therefore,
P(A) + P(Ā) = P(Ω) = 1
24
Summary
• All probability can be described by three axioms:
– Non-negative
– The probability that something happens is 1
– The probability of the union of disjoint events is the sum of
those probabilities
• Given these rules, how do we go about computing
probabilities?
• If we assume that the probability of each elemental event is the same, we can compute these by counting
25
Counting to Compute Probabilities
• Introduction
• Basic Definitions
• Axioms of Probability
• Counting to Compute Probabilities
• Dependence and Independence
• Entropy
26
Building the Events
• We are interested in defining events which are the
union of elemental events
• For example, the event E can be defined as the union of |E| elemental events,
E = {ω_1} ∪ {ω_2} ∪ … ∪ {ω_|E|}
• We assume that the probability that each elemental event can happen is the same,
P({ω_i}) = 1 / |Ω|
27
Building the Events
• Therefore, the probability of the event E is
P(E) = |E| / |Ω|
• Therefore, for any situation in which the elemental
events have equal probabilities of occurring, we can
compute the probabilities simply by counting the
number of ways that E can occur
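The counting rule P(E) = |E| / |Ω| can be sketched directly; the fair die and the event "roll at least a 5" are assumed purely for illustration:

```python
from fractions import Fraction

# When all elemental events are equally likely, P(E) = |E| / |Omega|.
# Sketch with a fair die: the event "roll at least a 5".
omega = {1, 2, 3, 4, 5, 6}
E = {w for w in omega if w >= 5}

P_E = Fraction(len(E), len(omega))   # count the ways E can occur
assert P_E == Fraction(1, 3)
```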
28
Ball Drawing Example
• A box contains r red balls and b
blue balls
• The experiments involve
drawing several balls from the
box
• Variations include:
– How many balls are taken out?
– Is each ball replaced after it’s withdrawn?
– If so, when?
[Figure: a box containing balls numbered 1–9]
29
Probability of Drawing a Red Ball First Time
• We take a single ball from the box and we’d like to
know the probability that it’s red
• Our sample space is
Ω = {1, 2, …, r+b}
• For simplicity, we’ll say this is equivalent to drawing a
number equal to the ball number i,
30
Probability of Drawing a Red Ball First Time
• The events of interest are (numbering the red balls 1 to r and the blue balls r+1 to r+b)
E_red = {1, …, r}, E_blue = {r+1, …, r+b}
31
Probability of Drawing a Red Ball First Time
• Therefore, the probability of drawing a red ball is simply
P(E_red) = |E_red| / |Ω| = r / (r+b)
32
Drawing Two Red Balls With Replacement
• Suppose we now make two draws from the box, and
each time we put the ball back and then shake the box
up
• We’d now like to compute the probability that both
balls are red
• The sample space is expanded to include the balls
which were pulled out from both draws (i1,i2):
33
Drawing Two Red Balls With Replacement
• We want to compute the probability
• One way we can compute this is to visualise the entire
sample space (which gives us |W|) and count the
number of entries in Eboth red
34
Graphical Representation of Sample Space
[Figure: the (i1, i2) grid from (1,1) to (r+b, r+b), divided into four regions: both red (i1 ≤ r, i2 ≤ r), red then blue, blue then red, and both blue]
35
Drawing Two Red Balls With Replacement
• From this table, we can see that
|E_both red| = r² and |Ω| = (r+b)²
• Therefore,
P(E_both red) = r² / (r+b)²
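The table-counting argument can be sketched by enumerating the whole sample space; the ball counts r = 3, b = 2 are hypothetical numbers chosen for the sketch:

```python
from fractions import Fraction
from itertools import product

# Enumerate the full sample space (i1, i2) for two draws WITH replacement
# and count the outcomes in E_both_red. Hypothetical numbers: r = 3, b = 2.
r, b = 3, 2
balls = range(1, r + b + 1)          # balls 1..r are red, r+1..r+b are blue
omega = list(product(balls, repeat=2))
both_red = [(i1, i2) for i1, i2 in omega if i1 <= r and i2 <= r]

P_both_red = Fraction(len(both_red), len(omega))
assert P_both_red == Fraction(r, r + b) ** 2   # matches the product rule
```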
36
Drawing Two Red Balls With Replacement
• Manually writing all of this out can be done, but it gets more and more complicated
• We can simplify the problem by looking more closely at its structure
• From the first example, we know that the probability that the first ball is red is
r / (r+b)
37
Drawing Two Red Balls With Replacement
• Because we put the ball back after we have drawn it
and shake the box up, the conditions for drawing the
second ball are exactly the same as those for the first
• Therefore, the probability that the second ball we pull from the box is red is also
r / (r+b)
38
Drawing Two Red Balls With Replacement
• Therefore,
P(both red) = [r / (r+b)]²
• This is an example of something called independence
• We’ll look at this in more detail later, but in many cases
it does not hold true
• For example….
39
Drawing Two Red Balls Without Replacement
• Suppose we now make two draws from the box, but
we do not put the ball back
• We’d now like to compute the probability that both
balls are red
• As before, the sample space includes the balls which
were pulled out from both draws (i1,i2):
40
Graphical Representation of the Sample Space
[Figure: the (i1, i2) grid with the diagonal i1 = i2 excluded (marked XX), divided into regions: both red, red then blue, blue then red, and both blue]
41
Drawing Two Red Balls Without Replacement
• As before, the probability that the first ball is red is
r / (r+b)
• Now, to compute the probability that the second ball is red, we know that we have r−1 red balls out of r+b−1
• Therefore, the probability that the second ball is red is
(r−1) / (r+b−1)
42
Drawing Two Red Balls Without Replacement
• Therefore, the probability that both balls are red is now
P(both red) = [r / (r+b)] × [(r−1) / (r+b−1)]
43
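The without-replacement case can be sketched the same way, by enumerating ordered pairs of distinct balls; r = 3, b = 2 are again hypothetical numbers:

```python
from fractions import Fraction
from itertools import permutations

# Two draws WITHOUT replacement: enumerate ordered pairs of distinct balls.
# Hypothetical numbers: r = 3 red, b = 2 blue; balls 1..r are red.
r, b = 3, 2
balls = range(1, r + b + 1)
omega = list(permutations(balls, 2))
both_red = [(i1, i2) for i1, i2 in omega if i1 <= r and i2 <= r]

P_both_red = Fraction(len(both_red), len(omega))
# Matches the product of the two conditional draws
assert P_both_red == Fraction(r, r + b) * Fraction(r - 1, r + b - 1)
```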
Drawing Two Red Balls Without Replacement
[Figure: P(both red) versus b, with and without replacement; for constant r = 2, the probability of selecting two red balls declines as b increases]
44
Repeated Ball Drawing Example
• Suppose we now generalise the problem:
– We keep going back to the box and taking more balls out,
one at a time, until an event is satisfied
• We would like to compute the probability of the event
45
Repeated Ball Drawing Example
• Consider the ways in which this can happen:
– R1: A single way:
• The first red ball is drawn on the first draw
– R2: Three ways:
• One blue ball is drawn, and then a red ball
• A red ball is drawn, and then a blue ball
• Two red balls are drawn
46
Repeated Ball Drawing Example
– R3: Seven ways:
• Two blue balls are drawn, and then a red ball
• The first blue ball is drawn, then a red ball, then the
second blue ball
• The red ball is drawn, then two blue balls
• A red ball is drawn, then a blue ball, then a red ball
• A blue ball is drawn, then two red balls
• Two red balls are drawn and then a blue ball
• Three red balls are drawn
– Rn: There are 2^n − 1 ways
47
Repeated Ball Drawing Example
• The sample space consists of all possible sequences
of draws we could make
• We could now construct an (r+b)-dimensional hypercube and count the number of entries which show how each event can turn up
• However, this is a complete mess and would take
forever to sort out
• One idea would be to decompose the problem into a
set of disjoint events, because we can then just add
the probabilities up
48
Repeated Ball Drawing with ME Events
• Consider the following events:
– S0: The first ball drawn is red
– S1: The first ball is blue, the second is red
– S2: The first two are blue, and the third is red
– … etc. …
• These events are disjoint
49
Repeated Ball Drawing Example
• We can relate Rk and Sm in the following way:
Rk = S0 ∪ S1 ∪ … ∪ S(k−1)
50
Repeated Ball Drawing Example
• Since the S events are disjoint, the probability of Rk becomes
P(Rk) = P(S0) + P(S1) + … + P(S(k−1))
• All we have to do is work out P(Sm)
51
Repeated Ball Drawing Example
• Now, the probability that the first ball is red is simply
P(S0) = r / (r+b)
• The probability that the first ball is blue and the second
ball is red is
52
Repeated Ball Drawing Example
• Expanding the multiplication out,
• Therefore, the probability is given by adding all the
terms together
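The disjoint-event decomposition can be sketched as follows. Drawing without replacement is an assumption for this sketch (one reading of the example); the check against 1 − P(all blue) mirrors the negation approach discussed on the following slides:

```python
from fractions import Fraction

# S_m = "first m draws are blue, draw m+1 is red", WITHOUT replacement.
def P_S(m, r, b):
    p = Fraction(1)
    for j in range(m):                       # m blue draws in a row
        p *= Fraction(b - j, r + b - j)
    return p * Fraction(r, r + b - m)        # then a red one

def P_R(k, r, b):
    """P(a red ball within k draws) = sum of the disjoint P(S_m)."""
    return sum(P_S(m, r, b) for m in range(k))

def P_all_blue(k, r, b):
    p = Fraction(1)
    for j in range(k):
        p *= Fraction(b - j, r + b - j)
    return p

r, b = 5, 20                                 # the numbers used in the plots
for k in range(1, 10):
    # Agrees with the negation approach: P(R_k) = 1 - P(all blue in k draws)
    assert P_R(k, r, b) == 1 - P_all_blue(k, r, b)
```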
53
Ball Drawing Example
b=20, r=5
54
Ball Drawing Example
b=20, r=5
55
Negation Approach
• The previous solution works by summing up all the
different ways in which the event we are interested in
can occur
• It’s a lot easier than building a big hypercube
• However, it becomes cumbersome when the number
of ways in which an event can occur grows
• An alternative is to apply the negation approach and
compute the complement of what we actually want to
know
56
Negation Approach
W
A
57
Negation in the Ball Drawing Example
• Consider the event
Bk: all of the first k balls drawn are blue
• This is the complement of the event Rk that we actually care about
58
Negation Approaches
• Therefore,
P(Rk) = 1 − P(Bk)
59
Negation Approaches
• For example,
60
Negation Approaches
• More generally
61
Negation Approach
• The events Sm are mutually exclusive,
Sm ∩ Sn = ∅ for m ≠ n
• However, the events Bm are subsets of one another,
Bk ⊆ B(k−1) ⊆ … ⊆ B1
• Therefore, we only need to evaluate the last one to figure out what we need
62
Combinations and Permutations
• Clever rearrangements of the problem can reduce the
complexity of the maths
• However, these can introduce some subtleties in the
way in which we have to count to add the events
together
• Two operators are often used to make life easier:
– Combinations
– Permutations
63
Combinations
• Suppose we have a collection of elements,
• A new collection is to be created by drawing r
elements from it,
• With combinations we don’t care about the order of the
elements:
64
Example with Combinations
• Consider drawing balls again and suppose we sample with replacement
• We wish to compute the probability of the event
• We only care about totals; we don’t care about the
order in which the balls were drawn
65
Example with Combinations
• First consider the sequence in which k-s red balls are
drawn, followed by s blue balls
• The probability is
66
Example with Combinations
• Now consider the sequence in which we drew k−s−1 red balls followed by s blue balls and then a red ball,
• This is the same as before
67
Example with Combinations
• In general, the probability arises from the combination of all possible ways to draw s blue balls
• Therefore,
P(s blue in k draws) = C(k, s) [b/(r+b)]^s [r/(r+b)]^(k−s)
• This result is general for any “Bernoulli trial”, in which a
decision is independently sampled multiple times
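A minimal sketch of this binomial (Bernoulli-trial) probability with exact fractions; the numbers r = 3, b = 2, k = 4 are hypothetical, and the sum-to-one check is just the second axiom:

```python
from fractions import Fraction
from math import comb

# k draws WITH replacement:
# P(exactly s blue among k draws) = C(k, s) * p_blue^s * p_red^(k - s)
def p_exactly_s_blue(k, s, r, b):
    p_blue = Fraction(b, r + b)
    p_red = Fraction(r, r + b)
    return comb(k, s) * p_blue**s * p_red**(k - s)

r, b, k = 3, 2, 4
# The probabilities over s = 0..k must sum to 1 (second axiom)
assert sum(p_exactly_s_blue(k, s, r, b) for s in range(k + 1)) == 1
```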
68
Permutations
• Consider a collection of elements,
• We want to draw r from it to make a new collection,
• With permutations, we care about the order of the
elements:
69
Permutations with Replacement
• Suppose, first, that we allow replacement
– The indexing variables B1, B2, etc. do not have to be unique
• In this case, the total number of ways to form B is
n × n × … × n = n^r
(n choices for the first element, n for the second, …, n for the rth element)
70
Permutations without Replacement
• If replacement is not allowed, the indexing variables B1, B2, etc. must be unique
• In this case, the total number of ways to form B is
n × (n−1) × … × (n−r+1)
(n choices for the first element, n−1 for the second, …, n−r+1 for the rth element)
71
Permutations without Replacement
• The total is given by the permutation operator,
P(n, r) = n! / (n−r)!
• Note that if we choose no elements, there is a single permutation,
P(n, 0) = 1
72
Permutations and Combinations
• Each combination consists of r elements of A
• The number of permutations of this combination is given by
r!
• Therefore, the permutations “overcount” the number of combinations by a factor of r!
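The three counting operators, and the "overcount by r!" relationship between them, can be sketched with the standard library; n = 5, r = 3 are illustrative numbers:

```python
from math import comb, factorial, perm

# Permutations with replacement (n**r), without replacement
# (perm(n, r) = n!/(n-r)!), and combinations
# (comb(n, r) = perm(n, r) / r!, the "overcount by r!" relationship).
n, r = 5, 3
assert n**r == 125                              # with replacement
assert perm(n, r) == factorial(n) // factorial(n - r) == 60
assert comb(n, r) == perm(n, r) // factorial(r) == 10
assert perm(n, 0) == 1                          # choosing nothing: one way
```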
73
Relationship between Combinations and Permutations
• More generally,
C(n, r) = P(n, r) / r! = n! / [r! (n−r)!]
• These are also known as binomial coefficients
74
Conditional Probability and Independence
• Introduction
• Probability Space
• Axioms and Theorems of Probability
• Counting to Compute Probabilities
• Conditional Probability and Independence
• Entropy
75
Conditional Probability and Independence
• Conditional probability and independence are arguably
the most important aspects of probability
• They are used to combine what we can see with what
we can’t see
• They are a cornerstone to probabilistic inference,
which we’ll consider next
76
Dependency Between Events
• Consider two events A and B
• The joint probability that both events happen at the same time is
P(A ∩ B)
77
Conditional Probability
• The conditional probability of an event A given event B is defined as the ratio
P(A|B) = P(A ∩ B) / P(B)
where P(B) > 0
78
Relative Frequency Interpretation
• Substituting relative frequencies for probabilities,
P(A|B) ≈ frel(A ∩ B) / frel(B)
• We can interpret this as follows:
– Conduct a set of trials
– Discard all the outcomes where B did not occur
– From the subset in which B occurred, count the number of times when A occurred
79
Visualising Conditional Probabilities
• It can be viewed as the ratio of the “area” of the intersection region to the “area” of the event being conditioned on
80
Dice Example of Conditional Probability
• Consider the die example again and suppose there are two events,
A: the roll is a 2, B: the roll is an even number
• We’d like to compute the conditional probability,
P(A|B)
81
Dice Example of Conditional Probability
• Now,
P(B) = 3/6 = 1/2
• Furthermore,
P(A ∩ B) = P({2}) = 1/6
82
Dice Example of Conditional Probability
• Therefore,
P(A|B) = P(A ∩ B) / P(B) = (1/6) / (1/2) = 1/3
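A minimal sketch of conditional probability on a fair die, with the events assumed to be A = "roll a 2" and B = "roll an even number" (consistent with the ordering example later in this section):

```python
from fractions import Fraction

# Conditional probability on a fair die.
omega = {1, 2, 3, 4, 5, 6}
A = {2}             # "roll a 2" (assumed event)
B = {2, 4, 6}       # "roll an even number" (assumed event)

def P(event):
    return Fraction(len(event & omega), len(omega))

def P_cond(X, Y):
    """P(X | Y) = P(X ∩ Y) / P(Y)."""
    return P(X & Y) / P(Y)

assert P_cond(A, B) == Fraction(1, 3)   # P(2 | even) = 1/3
assert P_cond(B, A) == 1                # P(even | 2) = 1 — order matters
```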
83
The Ball Example
• We (implicitly) computed conditional probabilities
earlier when we considered drawing two balls from a
box without replacement
• Recall that we said that the event of two red balls is
the intersection of the event of drawing one red ball,
and then drawing a second red ball,
A
B
84
The Ball Example
• The probability of this event was
P(A ∩ B) = [r / (r+b)] × [(r−1) / (r+b−1)]
• Therefore,
P(B|A) = P(A ∩ B) / P(A) = (r−1) / (r+b−1)
85
The Ball Example Mini-Quiz
• Consider the events
• Compute the probabilities
86
A Property of Conditional Probabilities
• A general property of conditional probabilities is that
P(A|B) ≥ P(A ∩ B)
• The reason is that
P(A|B) = P(A ∩ B) / P(B)
and
P(B) ≤ 1
87
Conditional Probabilities Are Probabilities
• Since the conditional probability is at least as big as the joint probability, we might be worried that conditional probabilities could violate the axioms of probability
• However, conditional probabilities do satisfy all three axioms
88
First and Second Axioms
• Since all the probabilities are non-negative, conditional
probabilities are non-negative as well
• The second axiom is equivalent to
P(Ω|B) = 1
• Now,
P(Ω|B) = P(Ω ∩ B) / P(B) = P(B) / P(B) = 1
89
Third Axiom
• The third axiom can be written as
P(A ∪ B | C) = P(A|C) + P(B|C)
when
A ∩ B = ∅
• How this situation can arise can be seen in a Venn
Diagram
90
Third Axiom
W
91
Third Axiom
• Therefore, if A and B are mutually exclusive, A ∩ C and B ∩ C are as well
• Therefore,
P(A ∪ B | C) = [P(A ∩ C) + P(B ∩ C)] / P(C) = P(A|C) + P(B|C)
92
Ordering in Conditional Probability Matters
• The fact that, in general,
P(A|B) ≠ P(B|A)
is one of the main causes why probabilistic reasoning can be very confusing
• It can lead to an error known as the prosecutor’s
fallacy
• The correct way to handle this is Bayes Rule (later)
93
Ordering in Conditional Probability Matters
• The order of conditional probabilities matters
• Suppose we want to compute the conditional
probability of A given B,
• This looks very similar – we’ve just swapped around B
and A, but the results can be very different
94
Ordering in the Die Example
• Recall the two events in the die example,
A: the roll is a 2, B: the roll is an even number
• We’d like to compute the conditional probability,
P(B|A)
95
Ordering in the Die Example
• Substituting relative frequencies for probabilities,
P(B|A) = P(A ∩ B) / P(A) = (1/6) / (1/6) = 1
96
Selection Bias
• A second error is to assume that
• However,
and so the value depends on the joint probability
97
Selection Bias
• Recall again that
• We have seen that
• However,
98
Selection Bias
• Suppose we now have the events
• In this case,
even though
99
Total Probability Theorem and Marginalisation
• The proper way to compute P(A) from P(A|B) is through a process known as marginalisation
• The result is often known as the total probability theorem
• The idea is that we recover P(A) by adding up the contributions P(A|Bi)P(Bi) over a partition of events Bi
100
Total Probability Theorem and Marginalisation
W
101
Total Probability Theorem and Marginalisation
• By looking at the union of the events, we can see that
A = (A ∩ B1) ∪ (A ∩ B2) ∪ … ∪ (A ∩ Bn)
• Because the events are disjoint, we know that
P(A) = P(A ∩ B1) + P(A ∩ B2) + … + P(A ∩ Bn)
102
Total Probability Theorem and Marginalisation
• From the definition of conditional probability,
P(A ∩ Bi) = P(A|Bi) P(Bi)
• Therefore,
P(A) = Σ_i P(A|Bi) P(Bi)
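A minimal sketch of marginalisation on a fair die; the partition {1,2}, {3,4}, {5,6} and the event A = "roll an even number" are assumed for illustration:

```python
from fractions import Fraction

# Total probability: P(A) = sum_i P(A | B_i) * P(B_i) over a partition.
omega = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}                           # "roll an even number" (assumed)
partition = [{1, 2}, {3, 4}, {5, 6}]    # an assumed partition of omega

def P(event):
    return Fraction(len(event & omega), len(omega))

# Each term P(A | B) * P(B) collapses to P(A ∩ B)
total = sum(P(A & B) / P(B) * P(B) for B in partition)
assert total == P(A) == Fraction(1, 2)
```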
103
Example of Total Probability
• Suppose we have the events
104
Example of Total Probability
• It can be shown (though I won’t do it here) that
105
Example of Total Probability
• Therefore,
106
Example of Total Probability
• Suppose we now have the events
• What do we get this time?
107
Independence
• We have shown that conditional probability can be
used to quantify relationships between events
• However, it’s quite possible that the occurrence of one
event makes no difference in the probability that
another event will occur
• In this situation,
P(A|B) = P(A)
108
Multiplication Rule for Independence
• From the definition of conditional probability,
P(A|B) = P(A ∩ B) / P(B)
• Therefore, the multiplication rule for independent events is
P(A ∩ B) = P(A) P(B)
109
Independence and the Ball Example
• We saw a case of independent events earlier with the
ball drawing example with replacement
• In this case,
• We saw that
110
Independence and the Ball Example
• Dividing through, we find
111
Independence Does Not Mean Lack of Causality
• It is tempting to assume that:
– Dependent events must be causally linked in some way
– Independent events must not be causally linked in some way
• However, independence between events A and B is only defined by the property
P(A ∩ B) = P(A) P(B)
• This has nothing to do with causality
112
Independence Does Not Mean Lack of Causality
• A counterexample: consider the event A paired with itself. Since A ∩ A = A,
P(A ∩ A) = P(A)
• Therefore, if A were somehow “independent of itself”, then we would be able to write the equation
P(A) = P(A) P(A) = P(A)²
113
Independence Does Not Mean Lack of Causality
• This can be satisfied in two cases,
P(A) = 0 or P(A) = 1
• Therefore, if event A never occurs or always occurs, it is mathematically defined to be “independent of itself”!
114
Independence of Multiple Events
• Consider the question of whether three events A1, A2
and A3 are independent
• We might think that we can just use the multiplication rule,
P(A1 ∩ A2 ∩ A3) = P(A1) P(A2) P(A3)
• This condition is necessary but it is not sufficient to
guarantee independence
115
Counter Example to Simple Independence Proof
• Consider a dice game in which two independent dice,
X and Y, are thrown and the sample space is (X, Y)
• The events are
116
Counter Example to Simple Independence Proof
• We can show that
117
Counter Example to Simple Independence Proof
• The outcomes compatible with the various events are
118
Counter Example to Simple Independence Proof
• Now,
• This suggests that the events are independent
119
Counter Example to Simple Independence Proof
• However,
• These are not compatible with pairwise independence
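The original slides' specific events are not preserved in this transcript, so here is a standard substitute counterexample (assumed events over two fair dice) showing that the triple product rule can hold while pairwise independence fails:

```python
from fractions import Fraction
from itertools import product

# Two fair dice (X, Y); events assumed for illustration:
# A = "X in {1,2,3}", B = "X in {3,4,5}", C = "X + Y = 9".
omega = list(product(range(1, 7), repeat=2))

A = {(x, y) for x, y in omega if x in {1, 2, 3}}
B = {(x, y) for x, y in omega if x in {3, 4, 5}}
C = {(x, y) for x, y in omega if x + y == 9}

def P(E):
    return Fraction(len(E), len(omega))

# The naive multiplication rule holds for all three together...
assert P(A & B & C) == P(A) * P(B) * P(C)
# ...but A and C are NOT pairwise independent
assert P(A & C) != P(A) * P(C)
```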
120
Independence of Multiple Events
• Therefore, the “proper” test is as follows
• A set of n events is independent if
– Every subset of k ≤ n events is independent
– This is proved by showing that the multiplication rule holds for every such subset
• This definition works inductively from the case n = 2
121
Independence is Not Transitive
• One of the consequences of the previous result is that
independence is not, in general, transitive
• This means:
– Events A and C can be independent of one another
– Events B and C can be independent of one another
– But events A and B are not independent of one another
122
A Non-Transitive Example
• Consider the two-die example again, but suppose we
now have events
123
A Non-Transitive Example
• We can show that
• In other words, A1 and A3 are independent, A3 and A2
are independent, but A2 and A1 are not independent
124
Conditional Independence
• It is possible to have a situation in which A and B depend on one another
• However, their probabilities, conditioned on some event C, are independent of one another
• In this situation, A and B are said to be conditionally independent given C
125
Conditional Independence Example
• Consider a dice game in which three independent dice
X, Y, Z are thrown
• We are interested in computing the probabilities of
three types of events
126
Conditional Independence Example
• In this example, we’ll show that the distribution
is conditionally independent
• However, the joint distribution is not and so
127
Conditional Independence Example
• First consider how to compute the conditional
probability
• Because we are conditioning on Ck, we are
conditioning on the event that we know that X=k
• Therefore,
128
Conditional Independence Example
• Assuming that the dice are six-sided and fair, the probability of rolling a number r is
P(r) = 1/6
129
Conditional Independence Example
• Similarly,
where
130
Conditional Independence Example
• The joint distribution considers both dice,
• Recall that
where Y, Z are independent of one another
131
Conditional Independence Example
• Therefore,
• In other words,
• This is conditionally independent, as promised
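A minimal sketch of conditional independence with three fair dice X, Y, Z, keeping the slides' conditioning events Ck = "X = k" but using hypothetical events A = "X + Y ≥ 10" and B = "X + Z ≥ 10" (the original slides' events are not preserved in this transcript):

```python
from fractions import Fraction
from itertools import product

# Three fair dice X, Y, Z.
omega = list(product(range(1, 7), repeat=3))

A = {w for w in omega if w[0] + w[1] >= 10}   # depends on X and Y (assumed)
B = {w for w in omega if w[0] + w[2] >= 10}   # depends on X and Z (assumed)

def P(E):
    return Fraction(len(E), len(omega))

for k in range(1, 7):
    Ck = {w for w in omega if w[0] == k}      # condition on X = k
    # Given X = k, A and B depend only on Y and Z, so they factor:
    assert P(A & B & Ck) / P(Ck) == (P(A & Ck) / P(Ck)) * (P(B & Ck) / P(Ck))

# Unconditionally, however, A and B are linked through X and do NOT factor:
assert P(A & B) != P(A) * P(B)
```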
132
Conditionally Independent Distribution
133
Conditionally Independent Distribution
134
Conditionally Independent Distribution
135
The Joint Distribution
• To compute the joint distribution
we would like to apply the total probability theory and
marginalise out the Ck events from
136
The Joint Distribution
• Because each value of Ck represents one value on X,
each event Ck is disjoint from all the other events
• Therefore,
137
Example
• Substituting,
138
Computing the Results
139
Dependent Probability Result
140
Summary
• Because events can overlap one another, the
probability of one event happening can change if we
know that another event has already happened
• This “degree of influence” is measured by the
conditional probability
• A special case is when the events are independent
• Independence is just a property of the numerical
values of probability; it does not say anything about
causality
• Independence can be conditional
141
Entropy
•
•
•
•
•
•
Introduction
Probability Space
Axioms and Theorems of Probability
Counting to Compute Probabilities
Conditional Probability and Independence
Entropy
142
Entropy and Randomness
• Consider again the view that we have a system which
produces a set of outcomes which leads to a set of
events
• In some cases we would like to know how random or how uncertain the system is
– Random means that we cannot accurately predict which event will occur next
• This is quantified through entropy
143
Entropy
• Entropy is defined in information theory and is based on analogies with statistical thermodynamics
– A system is observed and is seen to occupy a macrostate (e.g., temperature, pressure, …)
– However, the same macrostate can be generated by many microstates of the system (locations of individual molecules)
– Entropy expresses the amount of ambiguity (the number of feasible microstates which satisfy the observed macrostate)
144
Entropy and Partitions
• Entropy is defined in terms of a set of events U={Ai}
which partition the sample space
• A partition of a set S is a collection of events:
– That are mutually exclusive
– And whose union covers the entire sample space
145
Partition of a Sample Space
W
A1
A3
A2
146
Another Partition of the Same Sample Space
W
A1
A8
A2
A3
A4
A6
A5
A7
147
Entropy of a Partition
• The entropy is defined to be
H(U) = −Σ_i P(Ai) log P(Ai)
• If the log is base 2, the units of entropy are bits (same
as in a computer)
• The greater the entropy, the more random the
distribution is
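The definition can be sketched directly; the particular distributions below (a fair coin-like split, a deterministic event, and a fair die) are illustrative:

```python
from math import log2

# Entropy of a partition: H(U) = -sum_i P(A_i) * log2 P(A_i), in bits.
def entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

assert entropy([0.5, 0.5]) == 1.0            # two equal outcomes: one bit
assert entropy([1.0]) == 0.0                 # deterministic: zero entropy
assert entropy([0.9, 0.1]) < 1.0             # less uniform, less random

# Fair die: six equally likely elementary events give log2(6) bits
assert abs(entropy([1/6] * 6) - log2(6)) < 1e-12
```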
148
Two Outcome Example
• Suppose there are two possible events,
U = {A1, A2}
• Let
P(A1) = p, P(A2) = 1 − p
• The entropy is
H(U) = −p log p − (1 − p) log(1 − p)
149
Entropy of the Two Outcome Example
150
Entropy for a Die Example
• Suppose we throw our single die X
• The partition of events is just the elementary events, which are the values of the die rolls,
U = {{1}, {2}, {3}, {4}, {5}, {6}}
• Therefore,
H(U) = −6 × (1/6) log₂(1/6) = log₂ 6 ≈ 2.585 bits
151
Properties of Entropy
• The entropy is non-negative
• We can see this from the definition,
H(U) = −Σ_i P(Ai) log P(Ai)
• Note that
0 ≤ P(Ai) ≤ 1, so log P(Ai) ≤ 0 and each term −P(Ai) log P(Ai) ≥ 0
152
Properties of Entropy
• The entropy is zero if the partition contains a single event,
U = {Ω}
• In this case we have a single event and, from our axioms of probability,
P(Ω) = 1, so H(U) = −1 × log 1 = 0
• This occurs if our system is deterministic
153
Properties of Entropy
• The maximum value of entropy occurs when the probabilities of all events are the same,
P(Ai) = 1/n for all i, giving H(U) = log n
• This can be proved using calculus of variations
• However, an intuitive argument is that the system is
“most random” when the next event could be any
permissible event with the same probability
154
Properties of Entropy
• If a new partition V is formed where V is a subdivision of U,
H(V) ≥ H(U)
• An intuitive explanation is that redefining your system to have more events gives it more possible states, so there is more to be uncertain about
155
Partition U
W
A1
A3
A2
156
Subdivided Partition V
[Figure: the original events A1, A2, A3 (solid black lines) have been subdivided into “smaller” events A11, A12, A21, A22, A31, A32]
157
Subdivision in the Die Example
• Suppose the partition U is
• The entropy is
158
Subdivision in the Die Example
• If we now subdivide the partition into
• The entropy is now
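The subdivision property can be sketched on the die; the particular partitions (U = even/odd versus V = the elementary events) are an assumption for this sketch:

```python
from math import log2

# Subdividing a partition cannot decrease entropy.
def entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

H_U = entropy([0.5, 0.5])        # U: {1,3,5} vs {2,4,6} (assumed partition)
H_V = entropy([1/6] * 6)         # V: each face on its own (a subdivision)
assert H_V >= H_U                # log2(6) bits vs 1 bit
```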
159
Conditional Entropy
• Entropy can also be computed from the conditional distribution given an event B,
H(U|B) = −Σ_i P(Ai|B) log P(Ai|B)
160
Example of Conditional Entropy
• For example, if we stick with our elementary partitions
but condition on the event
• We know that
161
Example of Conditional Entropy
• Therefore,
• In most cases, conditioning on an event causes the
entropy to go down
• However, we’ll see a counterexample to this later
162
Summary
• Entropy is used to quantify how “random” the system is
• It is defined in terms of the partition (set of events)
used
• The smallest value of entropy is 0 and occurs when
only a single event can ever happen
• The largest value of entropy is given when the
probability of each event is the same as all other
events
163
Summary
• If you use a more “fine-grained” event structure,
entropy cannot decrease
• Entropy can be conditional on events happening
164
Where Next?
• So far, we’ve looked at outcomes and events and their properties
• However, many systems do not deal with events directly, but with numerical quantities
• We handle these through random variables, and they
are the subject of the next set of slides
165