GV01 – Mathematical Methods, Algorithmics and Implementations
Axioms and Probability by Counting
Dan Stoyanov, Francois Chadebecq, Francisco Vasconcelos
Department of Computer Science, UCL, 2016
http://moodle.ucl.ac.uk/course/view.php?id=11547

Axioms and Theorems of Probability
• Introduction
• Probability Space
• Axioms and Theorems of Probability
• Counting to Compute Probabilities
• Dependence and Independence
• Entropy

Axioms of Probability
• Only three axioms are needed to describe how all of probability works:
  1. Probabilities are non-negative
  2. The probability that something happens is 1
  3. The probability of the union of several disjoint events is the sum of their individual probabilities
• We can illustrate these most easily by looking at the indicator-function interpretation

First Axiom of Probability
• The probability of an event A occurring is a non-negative real number: P(A) ≥ 0

Demonstration with Relative Frequency
• Recall that frel(A) = (1/N) Σ_n I_A(ω_n)
• Since the indicator function is either 0 or 1, this clearly has to be a non-negative number

Second Axiom of Probability
• The probability that some elementary event somewhere in the sample space will occur is 1: P(Ω) = 1
• Another way to view this is that there are no elementary events outside of the sample space
• Ω is known as the certain event

Demonstration with Relative Frequency
• The relative frequency of the certain event is frel(Ω) = (1/N) Σ_n I_Ω(ω_n) = 1, since the indicator must always be 1

Third Axiom of Probability
• For mutually exclusive events A and B, the probability of their union is P(A ∪ B) = P(A) + P(B)

Mutually Exclusive Events
[Figure: Venn diagram of two disjoint regions A and B inside the sample space Ω]

Demonstration with Relative Frequency
• The relative frequency is frel(A ∪ B) = (1/N) Σ_n I_{A∪B}(ω_n)
• Since the events are disjoint, we know that I_{A∪B}(ω) = I_A(ω) + I_B(ω)
• Therefore, the sum is given by frel(A ∪ B) = frel(A) + frel(B)

Example with Mutually Exclusive Events
• Suppose A and B are mutually exclusive events with known probabilities
• Therefore, the probability of their union is simply P(A) + P(B)

Third Axiom and Non-Overlapping Events
• Why isn't it always the case that P(A ∪ B) = P(A) + P(B)?

Die Example
• Consider two overlapping events A and B on the roll of a die
• The event which is the union of A and B, together with its probability, can be computed directly by counting the outcomes it contains

Die Example
• However, the naïve solution of adding just frel(A) and frel(B) together gives us a value greater than 1
• This violates the second axiom, and so it's not a valid probability measure

Overlapping Events
[Figure: Venn diagram of two overlapping regions A and B inside Ω; the intersection is counted twice by the naïve sum]

Computing Probabilities of Overlapping Events
• For overlapping events, I_{A∪B}(ω) = I_A(ω) + I_B(ω) − I_{A∩B}(ω)
• Therefore, the general expression for the relative frequency of the union of a pair of events is frel(A ∪ B) = frel(A) + frel(B) − frel(A ∩ B)

Computing Probabilities of Disjoint Events
• When the events are disjoint, I_{A∩B}(ω) = 0 for every outcome, and therefore frel(A ∩ B) = 0
• Substituting into our general expression for computing the probability of the union, frel(A ∪ B) = frel(A) + frel(B)
• In other words, we simply add the probabilities together according to the third axiom

Computing Probabilities of Overlapping Events
• When the events overlap, I_{A∩B}(ω) = 1 for some outcomes
• Therefore, it turns out that another way to compute overlapping events is to compute the probabilities naïvely and subtract off the intersection region: P(A ∪ B) = P(A) + P(B) − P(A ∩ B)

Illustration
• For the die example, applying this correction to the naïve sum recovers a valid probability for A ∪ B

Complementation
• The third axiom also leads to complementation: P(Aᶜ) = 1 − P(A)
• To prove this, recall that A ∪ Aᶜ = Ω and A ∩ Aᶜ = ∅
• Therefore, P(A) + P(Aᶜ) = P(Ω) = 1

Summary
• All probability can be described by three axioms:
  – Probabilities are non-negative
  – The probability that something happens is 1
  – The probability of the union of disjoint events is the sum of those probabilities
• Given these rules, how do we go about computing probabilities?
• If we assume that the probability of each elementary event is the same, we can compute these by counting, as the sketch below and the next section illustrate
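To make the third axiom and the inclusion-exclusion correction concrete, here is a minimal Python sketch (an added illustration, not part of the original slides) that estimates relative frequencies by simulating a fair six-sided die. The events A = "even roll" and B = "roll of at least 3" are illustrative assumptions, chosen so that the naïve sum frel(A) + frel(B) exceeds 1.

```python
import random

# Minimal sketch (illustrative events, not from the slides): estimate relative
# frequencies for a fair six-sided die and check the inclusion-exclusion
# identity  P(A u B) = P(A) + P(B) - P(A n B).
N = 100_000
A = {2, 4, 6}       # even rolls
B = {3, 4, 5, 6}    # rolls of at least 3

n_A = n_B = n_union = n_inter = 0
for _ in range(N):
    w = random.randint(1, 6)        # one elementary outcome
    in_A, in_B = w in A, w in B
    n_A += in_A
    n_B += in_B
    n_union += in_A or in_B
    n_inter += in_A and in_B

f_A, f_B = n_A / N, n_B / N
print(f"f(A) = {f_A:.3f}, f(B) = {f_B:.3f}, naive sum = {f_A + f_B:.3f}")
print(f"f(A u B)               = {n_union / N:.3f}")
print(f"f(A) + f(B) - f(A n B) = {f_A + f_B - n_inter / N:.3f}")
```

The naïve sum comes out near 7/6, violating the second axiom, while the corrected expression matches the directly counted frequency of A ∪ B (near 5/6).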
Counting to Compute Probabilities
• Introduction
• Basic Definitions
• Axioms of Probability
• Counting to Compute Probabilities
• Dependence and Independence
• Entropy

Building the Events
• We are interested in defining events which are the union of elementary events
• For example, the event E can be defined as the union of |E| elementary events
• We assume that the probability that each elementary event can happen is the same: P(ω_i) = 1/|Ω|

Building the Events
• Therefore, the probability of the event E is P(E) = |E| / |Ω|
• Therefore, for any situation in which the elementary events have equal probabilities of occurring, we can compute the probabilities simply by counting the number of ways that E can occur

Ball Drawing Example
• A box contains r red balls and b blue balls
• The experiments involve drawing several balls from the box
• Variations include:
  – How many balls are taken out?
  – Is each ball replaced after it's withdrawn?
  – If so, when?
[Figure: a box of numbered balls]

Probability of Drawing a Red Ball First Time
• We take a single ball from the box and we'd like to know the probability that it's red
• Our sample space is the set of balls, Ω = {1, 2, …, r+b}
• For simplicity, we'll say this is equivalent to drawing a number equal to the ball number i, with balls 1, …, r red and balls r+1, …, r+b blue

Probability of Drawing a Red Ball First Time
• The event of interest is E_red = {1, …, r}
• Therefore, the probability of drawing a red ball is simply P(E_red) = r / (r+b)

Drawing Two Red Balls With Replacement
• Suppose we now make two draws from the box, and each time we put the ball back and then shake the box up
• We'd now like to compute the probability that both balls are red
• The sample space is expanded to include the balls which were pulled out from both draws: Ω = {(i_1, i_2) : i_1, i_2 ∈ {1, …, r+b}}

Drawing Two Red Balls With Replacement
• We want to compute the probability P(E_both red), where E_both red = {(i_1, i_2) : i_1 ≤ r and i_2 ≤ r}
• One way we can compute this is to visualise the entire sample space (which gives us |Ω|) and count the number of entries in E_both red

[Figure: (r+b) × (r+b) grid of ordered pairs (i_1, i_2), partitioned into the regions "both red", "red then blue", "blue then red" and "both blue"]

Drawing Two Red Balls With Replacement
• From this table, we can see that |E_both red| = r² and |Ω| = (r+b)²
• Therefore, P(E_both red) = r² / (r+b)²

Drawing Two Red Balls With Replacement
• Manually writing all of this out can be done, but it gets more and more complicated
• We can simplify the problem by looking more closely at its structure
• From the first example, we know that the probability that the first ball is red is r / (r+b)

Drawing Two Red Balls With Replacement
• Because we put the ball back after we have drawn it and shake the box up, the conditions for drawing the second ball are exactly the same as those for the first
• Therefore, the probability that the second ball we pull from the box is red is also r / (r+b)

Drawing Two Red Balls With Replacement
• Therefore, P(E_both red) = [r / (r+b)] × [r / (r+b)] = r² / (r+b)²
• This is an example of something called independence, checked numerically in the sketch below
• We'll look at this in more detail later, but in many cases it does not hold true; for example, when we draw without replacement…
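Here is a minimal Python sketch (an added illustration) that enumerates the full two-draw sample space with replacement and confirms that counting gives the same answer as the product r²/(r+b)². The values r = 3 and b = 4 are arbitrary.

```python
from itertools import product

# Minimal sketch: enumerate the full sample space for two draws *with*
# replacement and count the outcomes where both balls are red.
# Balls 1..r are red and r+1..r+b are blue; r = 3, b = 4 are arbitrary values.
r, b = 3, 4
balls = range(1, r + b + 1)

outcomes = list(product(balls, repeat=2))       # all ordered pairs (i1, i2)
both_red = [(i1, i2) for i1, i2 in outcomes if i1 <= r and i2 <= r]

print(len(both_red) / len(outcomes))            # probability by counting
print((r / (r + b)) ** 2)                       # (r/(r+b))^2, the same value
```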
Drawing Two Red Balls Without Replacement
• Suppose we now make two draws from the box, but we do not put the ball back
• We'd now like to compute the probability that both balls are red
• As before, the sample space includes the balls which were pulled out from both draws, but now the two draws must differ: Ω = {(i_1, i_2) : i_1 ≠ i_2}

[Figure: the same grid of ordered pairs with the diagonal i_1 = i_2 crossed out, partitioned into "both red", "red then blue", "blue then red" and "both blue"]

Drawing Two Red Balls Without Replacement
• As before, the probability that the first ball is red is r / (r+b)
• Now, to compute the probability that the second ball is red, we know that we have r−1 red balls out of r+b−1
• Therefore, the probability that the second ball is red is (r−1) / (r+b−1)

Drawing Two Red Balls Without Replacement
• Therefore, the probability that both balls are red is now P = [r / (r+b)] × [(r−1) / (r+b−1)]

[Plot: probability of two red balls with and without replacement; for constant r = 2, the probability of selecting two red balls declines as b increases]

Repeated Ball Drawing Example
• Suppose we now generalise the problem:
  – We keep going back to the box and taking more balls out, one at a time, until an event is satisfied
• We would like to compute the probability of the event R_n that a red ball is drawn within the first n draws

Repeated Ball Drawing Example
• Consider the ways in which this can happen:
  – R_1: a single way:
    • A red ball is drawn on the first draw
  – R_2: three ways:
    • One blue ball is drawn, and then a red ball
    • A red ball is drawn, and then a blue ball
    • Two red balls are drawn

Repeated Ball Drawing Example
  – R_3: seven ways:
    • Two blue balls are drawn, and then a red ball
    • The first blue ball is drawn, then a red ball, then the second blue ball
    • A red ball is drawn, then two blue balls
    • A red ball is drawn, then a blue ball, then a red ball
    • A blue ball is drawn, then two red balls
    • Two red balls are drawn and then a blue ball
    • Three red balls are drawn
  – R_n: there are 2ⁿ − 1 ways

Repeated Ball Drawing Example
• The sample space consists of all possible sequences of draws we could make
• We could now construct an (r+b)-dimensional hypercube and count the number of entries which show how each event can turn up
• However, this is a complete mess and would take forever to sort out
• One idea would be to decompose the problem into a set of disjoint events, because we can then just add the probabilities up

Repeated Ball Drawing with Mutually Exclusive Events
• Consider the following event: S_m, in which the first m balls drawn are blue and the (m+1)th is red
• Therefore,
  – S_0: the first ball drawn is red
  – S_1: the first ball is blue, the second is red
  – S_2: the first two are blue, and the third is red
  – … etc. …
• These events are disjoint; the sketch below computes their probabilities directly
Repeated Ball Drawing Example
• We can relate R_k and S_m in the following way: R_k = S_0 ∪ S_1 ∪ … ∪ S_{k−1}

Repeated Ball Drawing Example
• Since the S events are disjoint, the probability of R becomes P(R_k) = Σ_{m=0}^{k−1} P(S_m)
• All we have to do is work out P(S_m)

Repeated Ball Drawing Example
• Now, the probability that the first ball is red is simply P(S_0) = r / (r+b)
• The probability that the first ball is blue and the second ball is red is P(S_1) = [b / (r+b)] × [r / (r+b−1)]

Repeated Ball Drawing Example
• Expanding the multiplication out, P(S_m) = [b / (r+b)] × [(b−1) / (r+b−1)] × … × [(b−m+1) / (r+b−m+1)] × [r / (r+b−m)]
• Therefore, the probability is given by adding all the terms together

Ball Drawing Example
[Plots: P(S_m) and P(R_k) for b = 20, r = 5]

Negation Approach
• The previous solution works by summing up all the different ways in which the event we are interested in can occur
• It's a lot easier than building a big hypercube
• However, it becomes cumbersome when the number of ways in which an event can occur grows
• An alternative is to apply the negation approach and compute the complement of what we actually want to know

[Figure: Venn diagram of an event A and its complement inside Ω]

Negation in the Ball Drawing Example
• Consider the event B_m, in which all of the first m balls drawn are blue
• This is the complement of what we care about

Negation Approach
• Therefore, P(R_m) = 1 − P(B_m)

Negation Approach
• For example, P(R_1) = 1 − P(B_1) = 1 − b / (r+b) = r / (r+b)

Negation Approach
• More generally, P(B_m) = [b / (r+b)] × [(b−1) / (r+b−1)] × … × [(b−m+1) / (r+b−m+1)]

Negation Approach
• The events S_m are mutually exclusive: S_i ∩ S_j = ∅ for i ≠ j
• However, the events B_m are subsets of one another: B_{m+1} ⊆ B_m
• Therefore, we only need to evaluate the last one to figure out what we need

Combinations and Permutations
• Clever rearrangements of the problem can reduce the complexity of the maths
• However, these can introduce some subtleties in the way in which we have to count to add the events together
• Two operators are often used to make life easier:
  – Combinations
  – Permutations

Combinations
• Suppose we have a collection of elements, A = {A_1, A_2, …, A_n}
• A new collection B is to be created by drawing r elements from it
• With combinations we don't care about the order of the elements: two draws containing the same elements count as the same collection

Example with Combinations
• Consider drawing balls again and suppose we sample with replacement
• We wish to compute the probability of the event that exactly s of the k balls drawn are blue
• We only care about totals; we don't care about the order in which the balls were drawn

Example with Combinations
• First consider the sequence in which k−s red balls are drawn, followed by s blue balls
• The probability is [r / (r+b)]^(k−s) × [b / (r+b)]^s

Example with Combinations
• Now consider the sequence in which we drew k−s−1 red balls, followed by s blue balls, and then a red ball
• This is the same as before: the product just picks up its factors in a different order

Example with Combinations
• In general, the probability arises from the combination of all possible ways to draw s blue balls
• Therefore, P(s blue in k draws) = C(k, s) × [b / (r+b)]^s × [r / (r+b)]^(k−s)
• This result is general for any "Bernoulli trial", in which a decision is independently sampled multiple times; the sketch below verifies it by enumeration
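The binomial result above is easy to check numerically. Here is a minimal Python sketch (an added illustration) that enumerates every ordered colour sequence and sums the probabilities of those with exactly s blue draws; the parameter values are arbitrary.

```python
from itertools import product
from math import comb

# Added illustration: for k draws with replacement, sum the probabilities of
# all colour sequences containing exactly s blue balls, and compare with
#   C(k, s) * (b/(r+b))^s * (r/(r+b))^(k-s).
r, b, k, s = 3, 4, 5, 2          # arbitrary values
p_blue = b / (r + b)

total = 0.0
for seq in product("RB", repeat=k):     # every ordered colour sequence
    if seq.count("B") == s:
        total += p_blue ** s * (1 - p_blue) ** (k - s)

print(total)                                                # summed by hand
print(comb(k, s) * p_blue ** s * (1 - p_blue) ** (k - s))   # closed form
```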
Permutations
• Consider a collection of elements, A = {A_1, A_2, …, A_n}
• We want to draw r from it to make a new collection B
• With permutations, we care about the order of the elements

Permutations with Replacement
• Suppose, first, that we allow replacement
  – The indexing variables B_1, B_2, etc. do not have to be unique
• In this case, the total number of ways to form B is n × n × … × n = n^r (n choices for the first element, n for the second, …, n for the rth)

Permutations without Replacement
• If replacement is not allowed, the indexing variables B_1, B_2, etc. must be unique
• In this case, the total number of ways to form B is n × (n−1) × … × (n−r+1) (n choices for the first element, n−1 for the second, …, n−r+1 for the rth)

Permutations without Replacement
• The total is given by the permutation operator, P(n, r) = n! / (n−r)!
• Note that if we choose no elements, there is a single permutation: P(n, 0) = 1

Permutations and Combinations
• Each combination consists of r elements of A
• The number of permutations of this combination is r!
• Therefore, the permutations "overcount" the number of combinations by a factor r!

Relationship between Combinations and Permutations
• More generally, C(n, r) = P(n, r) / r! = n! / [r! (n−r)!]
• These are also known as binomial coefficients

Conditional Probability and Independence
• Introduction
• Probability Space
• Axioms and Theorems of Probability
• Counting to Compute Probabilities
• Conditional Probability and Independence
• Entropy

Conditional Probability and Independence
• Conditional probability and independence are arguably the most important aspects of probability
• They are used to combine what we can see with what we can't see
• They are a cornerstone of probabilistic inference, which we'll consider next

Dependency Between Events
• Consider two events A and B
• The joint probability that both events happen at the same time is P(A ∩ B)

Conditional Probability
• The conditional probability of an event A given event B is defined as the ratio P(A | B) = P(A ∩ B) / P(B), where P(B) > 0

Relative Frequency Interpretation
• Substituting relative frequencies for probabilities, frel(A | B) = frel(A ∩ B) / frel(B)
• We can interpret this as follows:
  – Conduct a set of trials
  – Discard all the outcomes where B did not occur
  – From the subset in which B occurred, count the proportion of times when A also occurred

Visualising Conditional Probabilities
• It can be viewed as the ratio of the "area" of the intersection region over the area of the conditioning event

Dice Example of Conditional Probability
• Consider the die example again and suppose there are two events: A, the roll is a 2, and B, the roll is even
• We'd like to compute the conditional probability P(A | B)

Dice Example of Conditional Probability
• Now, P(A ∩ B) = P({2}) = 1/6
• Furthermore, P(B) = P({2, 4, 6}) = 1/2

Dice Example of Conditional Probability
• Therefore, P(A | B) = (1/6) / (1/2) = 1/3

The Ball Example
• We (implicitly) computed conditional probabilities earlier when we considered drawing two balls from a box without replacement
• Recall that we said that the event of two red balls is the intersection of the event A of drawing one red ball, and the event B of then drawing a second red ball

The Ball Example
• The probability of this event was P(A ∩ B) = [r / (r+b)] × [(r−1) / (r+b−1)]
• Therefore, P(B | A) = P(A ∩ B) / P(A) = (r−1) / (r+b−1)

The Ball Example Mini-Quiz
• Consider the events from the ball drawing experiments
• Compute the corresponding conditional probabilities

A Property of Conditional Probabilities
• A general property of conditional probabilities is that P(A | B) ≥ P(A ∩ B)
• The reason is that P(A | B) = P(A ∩ B) / P(B) and P(B) ≤ 1

Conditional Probabilities Are Probabilities
• Since the conditional probability is at least as big as the joint probability, we might be worried that conditional probabilities could violate the axioms of probability
• However, conditional probabilities do satisfy all three axioms of probability

First and Second Axioms
• Since all the probabilities are non-negative, conditional probabilities are non-negative as well
• The second axiom is equivalent to P(Ω | B) = 1
• Now, P(Ω | B) = P(Ω ∩ B) / P(B) = P(B) / P(B) = 1

Third Axiom
• The third axiom can be written as P(A ∪ B | C) = P(A | C) + P(B | C) when A ∩ B = ∅
• How this situation can arise can be seen in a Venn diagram

[Figure: Venn diagram of disjoint events A and B overlapping a conditioning event C inside Ω]

Third Axiom
• Therefore, if A and B are mutually exclusive, A ∩ C and B ∩ C are as well
• Therefore, P(A ∪ B | C) = P((A ∪ B) ∩ C) / P(C) = [P(A ∩ C) + P(B ∩ C)] / P(C) = P(A | C) + P(B | C)

Ordering in Conditional Probability Matters
• The fact that, in general, P(A | B) ≠ P(B | A) is one of the main causes why probabilistic reasoning can be very confusing
• It can lead to an error known as the prosecutor's fallacy
• The correct way to handle this is Bayes' rule (covered later); the sketch below shows the asymmetry numerically
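Here is a minimal Python sketch (an added illustration) of the relative frequency interpretation: estimate a conditional probability by discarding the trials where the conditioning event did not occur. It uses the die events A = "the roll is a 2" and B = "the roll is even" from above, and shows that P(A | B) and P(B | A) differ.

```python
import random

# Added illustration: estimate conditional probabilities by filtering trials.
# Note the asymmetry: P(A|B) = 1/3 while P(B|A) = 1.
N = 100_000
rolls = [random.randint(1, 6) for _ in range(N)]

given_B = [w for w in rolls if w % 2 == 0]   # keep only trials where B occurred
given_A = [w for w in rolls if w == 2]       # keep only trials where A occurred

print(sum(w == 2 for w in given_B) / len(given_B))       # ~ P(A|B) = 1/3
print(sum(w % 2 == 0 for w in given_A) / len(given_A))   # P(B|A) = 1 exactly
```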
Ordering in Conditional Probability Matters
• The order of conditional probabilities matters
• Suppose that, instead of P(A | B), we want to compute P(B | A) = P(A ∩ B) / P(A)
• This looks very similar – we've just swapped around B and A – but the results can be very different

Ordering in the Die Example
• Recall the two events in the die example: A, the roll is a 2, and B, the roll is even
• We'd like to compute the conditional probability P(B | A)

Ordering in the Die Example
• Substituting relative frequencies for probabilities, P(B | A) = (1/6) / (1/6) = 1
• In other words, 2 is an even number with probability 1

Selection Bias
• A second error is to assume that P(A | B) is the same as P(A)
• However, P(A | B) = P(A ∩ B) / P(B), and so the value depends on the joint probability

Selection Bias
• Recall again that, in the die example, P(A) = 1/6
• We have seen that P(A | B) = 1/3
• However, the two are not equal: conditioning on B has changed the probability

Selection Bias
• Suppose we now have a different pair of events
• In this case it can happen that P(A | B) = P(A), even though the events overlap; this foreshadows independence, discussed below

Total Probability Theorem and Marginalisation
• The proper way to compute P(A) from P(A | B) is through a process known as marginalisation
• The result is often known as the total probability theorem
• The idea is that we essentially recover P(A) by adding up, over a partition {B_k} of the sample space, all the conditional probabilities with which A can occur

[Figure: Venn diagram of an event A cut into pieces by a partition {B_k} of Ω]

Total Probability Theorem and Marginalisation
• By looking at the union of the events, we can see that A = ∪_k (A ∩ B_k)
• Because the events are disjoint, we know that P(A) = Σ_k P(A ∩ B_k)

Total Probability Theorem and Marginalisation
• From the definition of conditional probability, P(A ∩ B_k) = P(A | B_k) P(B_k)
• Therefore, P(A) = Σ_k P(A | B_k) P(B_k)

Example of Total Probability
• Suppose we have an event A and a partition of events {B_k}
• It can be shown (= I'm not going to) what the conditional probabilities P(A | B_k) are
• Therefore, marginalising over the partition recovers P(A)
• Suppose we now choose a different partition; what do we get this time? (The same P(A): marginalisation does not depend on the choice of partition, as the sketch at the end of this subsection confirms)

Independence
• We have shown that conditional probability can be used to quantify relationships between events
• However, it's quite possible that the occurrence of one event makes no difference to the probability that another event will occur
• In this situation, P(A | B) = P(A)

Multiplication Rule for Independence
• From the definition of conditional probability, P(A ∩ B) = P(A | B) P(B)
• Therefore, the multiplication rule for independent events is P(A ∩ B) = P(A) P(B)

Independence and the Ball Example
• We saw a case of independent events earlier with the ball drawing example with replacement
• In this case, A is "the first ball is red" and B is "the second ball is red"
• We saw that P(A ∩ B) = r² / (r+b)²

Independence and the Ball Example
• Dividing through, we find P(B | A) = P(A ∩ B) / P(A) = r / (r+b) = P(B)

Independence Does Not Mean Lack of Causality
• It is tempting to assume that:
  – Dependent events must be causally linked in some way
  – Independent events must not be causally linked in some way
• However, independence between events A and B is only defined by the property P(A ∩ B) = P(A) P(B)
• This has nothing to do with causality

Independence Does Not Mean Lack of Causality
• A counter example: consider an event A and itself
• Therefore, P(A ∩ A) = P(A)
• If A were somehow "independent of itself", then we would be able to write the equation P(A ∩ A) = P(A) P(A) = P(A)²

Independence Does Not Mean Lack of Causality
• The equation P(A) = P(A)² can be satisfied in two cases: P(A) = 0 and P(A) = 1
• Therefore, if event A never occurs or always occurs, it is mathematically defined to be "independent of itself"!
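To close this subsection, here is a minimal Python sketch (an added illustration) that verifies the total probability theorem exactly on a fair die; the event A = "roll of at least 3" and the partition into odd and even rolls are illustrative choices.

```python
from fractions import Fraction

# Added illustration: verify the total probability theorem
#   P(A) = sum_k P(A|B_k) P(B_k)
# on a fair die, with A = "roll >= 3" and the partition B_1 = odd, B_2 = even.
omega = range(1, 7)
P = Fraction(1, 6)                      # each elementary event is equally likely

A  = {w for w in omega if w >= 3}
B1 = {w for w in omega if w % 2 == 1}   # odd rolls
B2 = {w for w in omega if w % 2 == 0}   # even rolls

def prob(E):
    return P * len(E)

total = sum(prob(A & Bk) / prob(Bk) * prob(Bk) for Bk in (B1, B2))
print(prob(A), total)                   # both are exactly 2/3
```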
Independence of Multiple Events
• Consider the question of whether three events A_1, A_2 and A_3 are independent
• We might think that we can just use the multiplication rule, P(A_1 ∩ A_2 ∩ A_3) = P(A_1) P(A_2) P(A_3)
• This condition is necessary but it is not sufficient to guarantee independence

Counter Example to Simple Independence Proof
• Consider a dice game in which two independent dice, X and Y, are thrown and the sample space is (X, Y)
• Three events A_1, A_2 and A_3 are defined over the outcomes of the two dice

Counter Example to Simple Independence Proof
• We can show that the triple product rule holds for these events
• Listing the outcomes compatible with the various events makes this easy to verify by counting

Counter Example to Simple Independence Proof
• Now, P(A_1 ∩ A_2 ∩ A_3) = P(A_1) P(A_2) P(A_3)
• This suggests that the events are independent

Counter Example to Simple Independence Proof
• However, the pairwise products fail: for example, P(A_1 ∩ A_2) ≠ P(A_1) P(A_2)
• These are not compatible with pairwise independence

Independence of Multiple Events
• Therefore, the "proper" test is as follows
• A set of n events is independent if:
  – Any set of k < n of the events is independent
  – And P(A_1 ∩ A_2 ∩ … ∩ A_n) = P(A_1) P(A_2) … P(A_n)
• This definition works inductively from the case n = 2

Independence is Not Transitive
• One of the consequences of the previous result is that independence is not, in general, transitive
• This means:
  – Events A and C can be independent of one another
  – Events B and C can be independent of one another
  – But events A and B are not independent of one another

A Non-Transitive Example
• Consider the two-die example again, but suppose we now have three different events A_1, A_2 and A_3

A Non-Transitive Example
• We can show that A_1 and A_3 are independent, A_3 and A_2 are independent, but A_2 and A_1 are not independent

Conditional Independence
• It is possible to have a situation in which A and B depend on one another
• However, their probabilities, conditioned on some event C, are independent of one another
• In this situation, A and B are said to be conditionally independent given C

Conditional Independence Example
• Consider a dice game in which three independent dice X, Y, Z are thrown
• We are interested in computing the probabilities of three types of events: A_i, the sum X + Y equals i; B_j, the sum X + Z equals j; and C_k, the first die shows X = k

Conditional Independence Example
• In this example, we'll show that the distribution is conditionally independent: P(A_i ∩ B_j | C_k) = P(A_i | C_k) P(B_j | C_k)
• However, the joint distribution is not, and so P(A_i ∩ B_j) ≠ P(A_i) P(B_j)

Conditional Independence Example
• First consider how to compute the conditional probability P(A_i | C_k)
• Because we are conditioning on C_k, we are conditioning on the event that we know that X = k
• Therefore, A_i occurs exactly when Y = i − k

Conditional Independence Example
• Assuming that the dice are six-sided and fair, the probability of rolling a number r is 1/6
• Therefore, P(A_i | C_k) = P(Y = i − k) = 1/6 when 1 ≤ i − k ≤ 6, and 0 otherwise

Conditional Independence Example
• Similarly, P(B_j | C_k) = P(Z = j − k), where Z is the third die

Conditional Independence Example
• The joint distribution considers both dice: P(A_i ∩ B_j | C_k) = P(Y = i − k, Z = j − k)
• Recall that P(Y = i − k, Z = j − k) = P(Y = i − k) P(Z = j − k), where Y, Z are independent of one another

Conditional Independence Example
• Therefore, P(A_i ∩ B_j | C_k) = P(A_i | C_k) P(B_j | C_k)
• In other words, the conditional joint distribution factorises
• This is conditionally independent, as promised

[Plots: the conditionally independent distributions P(A_i | C_k), P(B_j | C_k) and their product]

The Joint Distribution
• To compute the joint distribution we would like to apply the total probability theorem and marginalise out the C_k events: P(A_i ∩ B_j) = Σ_k P(A_i ∩ B_j | C_k) P(C_k)

The Joint Distribution
• Because each value of C_k represents one value of X, each event C_k is disjoint from all the other events
• Therefore, the marginalisation is a straightforward sum over k

Example
• Substituting the conditional probabilities and P(C_k) = 1/6 gives the joint distribution

[Plots: the computed joint distribution P(A_i ∩ B_j), which does not equal the product P(A_i) P(B_j)]

Summary
• Because events can overlap one another, the probability of one event happening can change if we know that another event has already happened
• This "degree of influence" is measured by the conditional probability
• A special case is when the events are independent
• Independence is just a property of the numerical values of probability; it does not say anything about causality
• Independence can be conditional; the sketch below checks the three-dice example exactly
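Here is a minimal Python sketch (an added illustration) that checks the example above by exact enumeration of all 216 outcomes. The event definitions A = {X+Y = i}, B = {X+Z = j}, C = {X = k} follow the structure described above, and the particular values i = j = 4 and k = 3 are illustrative choices.

```python
from itertools import product
from fractions import Fraction

# Added illustration: exact check of conditional independence for the
# three-dice example; i = j = 4 and k = 3 are illustrative values.
i, j, k = 4, 4, 3
outcomes = list(product(range(1, 7), repeat=3))   # all (X, Y, Z), each 1/216

def prob(pred):
    return Fraction(sum(1 for o in outcomes if pred(o)), len(outcomes))

def A(o): return o[0] + o[1] == i   # X + Y = i
def B(o): return o[0] + o[2] == j   # X + Z = j
def C(o): return o[0] == k          # X = k

pC = prob(C)
pA_given_C  = prob(lambda o: A(o) and C(o)) / pC
pB_given_C  = prob(lambda o: B(o) and C(o)) / pC
pAB_given_C = prob(lambda o: A(o) and B(o) and C(o)) / pC

print(pAB_given_C == pA_given_C * pB_given_C)              # True: conditionally independent
print(prob(lambda o: A(o) and B(o)) == prob(A) * prob(B))  # False: jointly dependent
```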
Entropy
• Introduction
• Probability Space
• Axioms and Theorems of Probability
• Counting to Compute Probabilities
• Conditional Probability and Independence
• Entropy

Entropy and Randomness
• Consider again the view that we have a system which produces a set of outcomes which leads to a set of events
• In some cases we would like to know how random or how uncertain the system is
  – Random means that we can't predict terribly accurately which event will occur next
• This is quantified through entropy

Entropy
• Entropy is defined in information theory and is based on analogies with statistical thermodynamics
  – A system is observed and is seen to occupy a macrostate (e.g., temperature, pressure, …)
  – However, the same macrostate can be generated by many microstates of the system (the locations of individual molecules)
  – Entropy expresses the amount of ambiguity (the number of feasible microstates which satisfy the observed macrostate)

Entropy and Partitions
• Entropy is defined in terms of a set of events U = {A_i} which partition the sample space
• A partition of a set S is a collection of events:
  – That are mutually exclusive
  – And whose union covers the entire sample space

[Figures: two partitions of the same sample space Ω – one coarse (A_1, A_2, A_3) and one fine (A_1, …, A_8)]

Entropy of a Partition
• The entropy is defined to be H(U) = −Σ_i P(A_i) log P(A_i)
• If the log is base 2, the units of entropy are bits (same as in a computer)
• The greater the entropy, the more random the distribution is

Two Outcome Example
• Suppose there are two possible events, A_1 and A_2
• Let P(A_1) = p, so that P(A_2) = 1 − p
• The entropy is H = −p log₂ p − (1 − p) log₂(1 − p)

[Plot: the entropy of the two-outcome example as a function of p, peaking at 1 bit when p = 1/2]

Entropy for a Die Example
• Suppose we throw our single die X
• The partition of events is just the elementary events, which are the values of the die rolls: A_i = {i}, i = 1, …, 6
• Therefore, H = −Σ_{i=1}^{6} (1/6) log₂(1/6) = log₂ 6 ≈ 2.585 bits

Properties of Entropy
• The entropy is non-negative
• We can see this from the definition: since 0 ≤ P(A_i) ≤ 1, each term −P(A_i) log₂ P(A_i) is non-negative
• Note that, by convention, 0 log 0 is taken to be 0

Properties of Entropy
• The entropy is zero if there is a single partition, U = {A_1}
• In this case we have a single event, and from our axioms of probability P(A_1) = 1
• This occurs if our system is deterministic

Properties of Entropy
• The maximum value of entropy is attained when the probabilities of all events are the same, P(A_i) = 1/|U|
• This can be proved using the calculus of variations
• However, an intuitive argument is that the system is "most random" when the next event could be any permissible event with the same probability

Properties of Entropy
• If a new partition V is formed where V is a subdivision of U, then H(V) ≥ H(U)
• An intuitive explanation is that if you redefine your system to have more events, the number of possible states has to go up

[Figure: a partition U of Ω (A_1, A_2, A_3) and a subdivided partition V in which the original events (solid black lines) have been subdivided into "smaller" events]

Subdivision in the Die Example
• Suppose the partition U groups the die outcomes into a few coarse events
• If we now subdivide the partition into the individual outcomes, the entropy increases towards its maximum of log₂ 6

Conditional Entropy
• Entropy can be computed based on the conditional distribution of an event occurring: H(U | B) = −Σ_i P(A_i | B) log₂ P(A_i | B)

Example of Conditional Entropy
• For example, if we stick with our elementary partitions but condition on an event such as "the roll is even", only some elementary events remain possible, and we know their conditional probabilities
• Therefore, the conditional entropy is smaller than the unconditional one; the sketch below computes these quantities
• In most cases, conditioning on an event causes the entropy to go down
• However, we'll see a counterexample to this later
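Here is a minimal Python sketch (an added illustration) of these entropy computations; the coarse partition {odd, even} and the conditioning event "the roll is even" are illustrative choices.

```python
from math import log2

# Added illustration: entropy of a partition, H(U) = -sum_i P(A_i) log2 P(A_i),
# computed for a fair six-sided die.
def entropy(probs):
    # Convention: terms with zero probability contribute 0 (0 log 0 = 0).
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([1/2, 1/2]))    # coarse partition {odd, even}: 1 bit
print(entropy([1/6] * 6))     # elementary partition: log2(6) ~ 2.585 bits
print(entropy([1.0]))         # single-event (deterministic) partition: 0 bits

# Conditional entropy: condition the elementary partition on "the roll is
# even"; only {2, 4, 6} remain possible, each with conditional probability 1/3.
print(entropy([1/3] * 3))     # ~1.585 bits: conditioning lowered the entropy
```

Note how subdividing the coarse {odd, even} partition into the six elementary events raises the entropy from 1 bit to about 2.585 bits, as the subdivision property requires.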
Summary
• Entropy is used to quantify how "random" the system is
• It is defined in terms of the partition (the set of events) used
• The smallest value of entropy is 0 and occurs when only a single event can ever happen
• The largest value of entropy occurs when the probability of each event is the same as that of every other event

Summary
• If you use a more "fine-grained" event structure, entropy cannot decrease
• Entropy can be conditional on events happening

Where Next?
• So far, we've looked at outcomes and events and their properties
• However, many systems do not deal with events directly, but with numerical quantities
• We handle these through random variables, and they are the subject of the next set of slides