Genomic Data Manipulation
BIO508 Spring 2011
Handout 01

Probably Things Go Wrong

I think the title pretty much sums it up. If nothing else, I'm pretty clumsy, and I mess things up. I have no idea how many times I've tripped over my own feet, dropped chalk, walked into lampposts, or gotten locked out of my garage1. Fortunately, most people have this sort of problem, which is why there are whole branches of science devoted to the study of things going wrong. In the real world, things can go wrong during software development and execution, elections, manufacturing processes, or even (gasp) laboratory experiments. The only reason we can really function as a large-scale society is by minimizing the collective wrongness - by figuring out how likely we are to make mistakes, where they're most likely to be made, and what we can change to most effectively remove the sources of error. And as you may have guessed by now - especially if you've been called by a telephone survey company lately - the area of science that deals with this in particular is that of probability and statistics.

In general, you can think of the mathematics of probability as describing the numerical chances of things happening: how you can solve problems and figure out the likelihoods of various outcomes in various situations. Statistics, on the other hand, tends to be the application of probability: correcting for errors in a phone survey, for example (since you can't survey infinitely many people), or keeping track of the actual results of some card games (rather than just theoretically predicting their results). Different people's definitions will vary, but this is the general idea.

Do You Speak Jive?

Assuming I've convinced you that this is at least marginally interesting, there are (as usual) a bunch of words to which we'll assign special meanings when discussing probability and statistics.
On the bright side, most of these are pretty simple (as opposed to, oh, linear algebra, where none of the terminology is particularly obvious, at least to me). So to get the basics out of the way:

An experiment is something that we can do to produce a random result. It can really be pretty much anything; "flip a coin" and "roll a die" are popular examples, but they're not the only options. We could "see how long it takes for a cell culture to double in density," or "count the number of pipette tips discarded in a single day." All of these things have well defined outcomes that vary with some degree of randomness.

A sample space is the set of all possible outcomes for a particular experiment. This is a set in the formal sense! So for the two canonical examples, flipping a coin and rolling a die, we might have the sample spaces:

Scoin = {H, T} for Heads and Tails
Sdie = {1, 2, 3, 4, 5, 6}

But what about our more complicated experiments? Sample spaces don't have to be finite:

Sdouble = {0, 0.01, 0.0001, 9, 9.0005, ...} or the non-negative real numbers (assuming hours)
Stips = {0, 1, 2, 3, 4, ...} or the natural numbers

And we can have pretty non-intuitive sample spaces, too. Suppose we roll two dice instead of one die. Then the following are all valid, essentially equivalent sample spaces for this experiment.

1 Well, actually, I have a hard count on that last one. It's one. It was memorable, not my fault (house painters; I do not recommend CertaPro), and involved excising one of my window panes with an X-ACTO knife. I had no idea the proper spelling was "X-ACTO" until I just looked it up...
The one we'd choose would depend on the specific application (for example, if we cared about order, we wouldn't choose the set-based version):

Sdice = {(1, 1), (1, 2), ..., (1, 6), (2, 1), (2, 2), ..., (2, 6), (3, 1), ..., (6, 6)}
Sdice = {2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}
Sdice = {{1, 1}, {1, 2}, ..., {1, 6}, {2, 2}, {2, 3}, ..., {3, 3}, ..., {6, 6}}

For one more, suppose we're running a test over and over until we get a positive result. Then our sample space is still infinite, but it isn't clearly numbers or sets or anything like that:

Stest = {P, NP, NNP, NNNP, NNNNP, ...}

An event is any subset of a sample space. This sort of almost corresponds to the intuitive idea we'd have of an event as a probabilistic outcome, but you have to be careful. Suppose we have two dice again. Then the verbal event "the sum of the two dice equals five" would be described by the following subsets of the three equivalent sample spaces:

E5 = {(1, 4), (2, 3), (3, 2), (4, 1)}
E5 = {5}
E5 = {{1, 4}, {2, 3}}

We can get pretty esoteric in our verbal descriptions, as long as they still describe some subset (not necessarily a strict subset) of our sample space. So suppose we're observing the sample space of "gender distributions in a group of mice." Then our space might be described as:

Sgender = {(M, F), (M, M, F), (M, F, F), (M, F, F, F), (M, M, F, F), (M, F, M, F), (F, M, M, M), ...}

The description "groups containing the same number of males as females" would then be an event represented by the subset:

Eeven = {(M, F), (M, M, F, F), (M, F, M, F), (M, M, M, F, F, F), ...}

Note that both of these - the sample space and the event - assume that there are infinitely many mice available. While it may sometimes feel that way when you're working on an experimental design, your funding agency will probably frown on such a thing2. We can often make simplifying assumptions to regularize or, well, simplify our models.
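To make the "event as subset" idea concrete, here's a small Python sketch (not from the handout; the names S_dice and E5 just echo the notation above) that enumerates the ordered-pair sample space for two dice and picks out the sum-equals-five event:

```python
from itertools import product

# Ordered-pair sample space for rolling two dice: all 36 outcomes.
S_dice = set(product(range(1, 7), repeat=2))

# The event "the sum of the two dice equals five" is just a subset of S_dice.
E5 = {outcome for outcome in S_dice if sum(outcome) == 5}

print(len(S_dice))  # 36
print(sorted(E5))   # [(1, 4), (2, 3), (3, 2), (4, 1)]
```

The other two equivalent sample spaces could be built the same way (sums with a comprehension, unordered pairs with frozensets); which representation you enumerate is exactly the modeling choice discussed above.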
So we say that an event occurs if any one of its elements is actually what we observe. For example, suppose our sample space is "the results of three coin flips in no particular order." We're interested in the event "at least one head is generated." Then we have:

S = {{H, H, H}, {H, H, T}, {H, T, T}, {T, T, T}}
E = {{H, H, H}, {H, H, T}, {H, T, T}}

If we actually run this experiment once, we might generate the result {H, T, T}. This is a member of our event E, so in this case, E has occurred. If we run the experiment again and generate the result {T, T, T}, E has not occurred.

Now for the biggie. Notice that we haven't yet defined the probability of an event, which is really what we're ultimately interested in. Suppose we have some event E over a sample space. We run n experiments. Then we can use n(E) to denote "the number of those n times in which E occurred." For example, using S and E from directly above, let's run three experiments. Then we know n=3, and we might generate the results:

{H, T, T}
{H, H, T}
{T, T, T}

Of these three results, E occurred for two of them, so n(E)=2. This lets us finally define the probability3 of some particular event as follows: for any E, if we run an experiment n times, then the probability of the event (written as P(E)) is equal to the value n(E)/n as n grows arbitrarily large (the limit as n grows to infinity). This sounds a little complicated, but fortunately it tends to fit our intuitive definition of probability. Let's go back to the simpler "single coin flip" sample space:

Scoin = {H, T}

Suppose our event is "the coin lands heads up." Then E={H}, pretty obviously, and we can run experiment after experiment to see what happens. This might generate the results:

H, H, T, H, T, T, H, T, T, T, H, T, H, T, T, H, H, T, T, H

Here, n=20, and our event has occurred 9 times, so our probability n(E)/n = 9/20 = 0.45.

2 Not to mention the ethics committee...
Suppose we do another 20 experiments and get the results:

H, H, T, H, T, T, H, T, H, H, T, T, H, T, H, T, H, T, H, T

Now n=40, and n(E)=19, making the probability 19/40 = 0.475. I bet I could probably convince you that as we flip the coin more and more and more and more times, this will creep towards exactly 0.5. It might meander around a bit on the way, but in the limit as n gets big, it'll eventually be exactly one half.

Need Input

So now that we have some words to use, we need to say things with them. The first thing we need to do is to encode our intuitions about probability using our new definitions. We know basically how we want "the chance of something happening" to work in the real world - we just need to make sure we write down the math so that it means the same thing.

First, remember that we write "the probability of the event E" as just plain P(E). Knowing this, I hereby decree, declare, proclaim, and state that the following three things pretty much sum up my expectations about probability:

P(E) must be greater than or equal to zero for any event E. Negative probabilities just don't make sense! Especially since we defined probability using the idea of counting, and it's awfully hard to count negatively many objects.

P(S) must be exactly one for any sample space S. This basically means that given all possible outcomes, at least one of them has to happen every time we run the experiment. Or in other words, we're working with a closed universe; once we've defined a sample space S, no experiment can generate a result that's not included in S.

This one's the tricky one. For two events E and F, we require the probability of either E or F happening, written as P(E∪F), to equal exactly P(E)+P(F) if E and F share no common members. More succinctly, if E∩F=∅, then P(E∪F)=P(E)+P(F).

3 At least in the frequentist sense. If you want more than that, I'd have to recommend a real statistics course...
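The creeping-toward-0.5 behavior is easy to watch in a quick simulation. This is just an illustrative sketch (not part of the handout), with the seed chosen arbitrarily so the run is reproducible:

```python
import random

random.seed(508)  # arbitrary seed, just to make the run reproducible

def estimate_p_heads(n):
    """Flip a simulated fair coin n times and return n(E)/n for E = 'heads'."""
    heads = sum(1 for _ in range(n) if random.random() < 0.5)
    return heads / n

# The estimate wanders around for small n and settles toward 0.5 as n grows.
for n in (20, 200, 2000, 200000):
    print(n, estimate_p_heads(n))
```

Small runs can easily come out 0.45 or 0.55, just like the hand-flipped sequences above; it's only in the limit that the ratio pins down the probability.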
In plain English, this means that if E and F are completely different results that have nothing to do with each other - say getting a one on a die versus getting a six - then the probability of either happening is the sum of the probabilities of one or the other happening.

These are called the axioms of probability, and they're a bit more profound than we care to think about. For our purposes, they're just a way of writing down in mathese the intuitions we already have about probabilities: they can't be negative, they can't be bigger than one, and events that share no outcomes have disjoint (non-overlapping) probabilities. It's surprising what we can prove using just these three ideas. For that matter, think about all of the things we know about probabilities - or at least things you suspect. If they're true, you can prove them with just these three facts! It's pretty neat.

Consider, for example, that individual members of S are always non-overlapping (as per the non-overlappingness we talked about in the third axiom above). This means that we can break the probability of a whole event E down into the individual probabilities of its members ei. Let's write this out. Suppose we have some event E - the one we used above was good:

S = {{H, H, H}, {H, H, T}, {H, T, T}, {T, T, T}}
E = {{H, H, H}, {H, H, T}, {H, T, T}}

We can be lazy and abbreviate these as follows:

S = {HHH, HHT, HTT, TTT}
E = {HHH, HHT, HTT}

Then the elements of E are e1=HHH, e2=HHT, and e3=HTT. Because these three things are disjoint - no single experimental result can be more than one of them at once - we know that they must fit the requirements of our third axiom. This means that we can break down P(E) into...
P(E) = P(e1 ∪ e2 ∪ e3) = P(e1) + P(e2) + P(e3) = P(HHH) + P(HHT) + P(HTT)

And in general, for any E composed of n different members ei, we write:

P(E) = P(e1) + P(e2) + ... + P(en), or in summation notation, P(E) = Σ(i=1 to n) P(ei)

Similarly, when we have P(E∪F) and some of their members do overlap, we need to subtract out the extra stuff we would overcount using this method. You've almost certainly seen this before for the rules of set cardinality, though:

|E∪F| = |E| + |F| - |E∩F|

The size of the union of E and F is the count of all of their elements minus the stuff that's in both of them and thus counted twice. Likewise, if two events share elements, we can define:

P(E∪F) = P(E) + P(F) - P(E∩F)

And if E and F don't share any elements, E∩F is empty, making its probability zero and reducing this to the simpler summation we saw above.

Since events are always just sets, we can use other tricks from set notation to talk about them, too. Remember complementation? Well, when we're dealing with probabilities, the current sample space is our universe4. That means that we can easily define the complement of an event E as ~E=S-E. And this immediately means that:

P(~E) = 1 - P(E)

The probability of something not happening is one minus the probability of it happening - just like we'd expect! And we can prove this using the rules we've seen so far; since E and ~E share no elements, they are disjoint, making:

P(S) = 1 by our second axiom
S = E ∪ ~E by the definition of complement
P(E∪~E) = P(E) + P(~E) by our third axiom, so...
P(S) = P(E∪~E) = P(E) + P(~E) = 1, and thus...
P(~E) = 1 - P(E)

That proves our intuition mathematically; surprisingly enough, math works. At least for now...

What If

If you combine some combinatorics (ha ha) with these definitions, it provides pretty much everything you need to probabilistically answer a lot of important questions about important things. Like poker5.
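Since dice outcomes are equally likely, P(E) = |E|/|S|, and both inclusion-exclusion and the complement rule can be verified by brute-force counting. A short sketch (the specific events E and F here are illustrative choices, not from the handout):

```python
from fractions import Fraction
from itertools import product

# Two fair dice: all 36 ordered outcomes equally likely, so P(E) = |E|/|S|.
S = set(product(range(1, 7), repeat=2))

def P(event):
    return Fraction(len(event), len(S))

E = {o for o in S if sum(o) == 5}  # "the sum is five"
F = {o for o in S if o[0] == 1}    # "the first die shows a one"

# Inclusion-exclusion: P(E ∪ F) = P(E) + P(F) - P(E ∩ F)
print(P(E | F) == P(E) + P(F) - P(E & F))  # True

# Complement: P(~E) = 1 - P(E), where ~E = S - E
print(P(S - E) == 1 - P(E))                # True
```

Here E and F overlap in exactly one outcome, (1, 4), which is what the P(E∩F) term subtracts back out.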
Suppose, for example, that our sample space is "one card drawn from a deck of playing cards," making:

S = {2H, 3H, ..., JH, QH, KH, AH, 2D, ..., 2S, ..., 2C, ..., AC}

Our event of interest will be "we draw a jack", making:

E = {JH, JS, JD, JC}

I bet that you could believe P(E)=4/52=1/13 pretty easily. But what if I change my event slightly: given that I've drawn a face card, what is the probability of it being a jack?

Any time you have extra information about your event, you've generated a slightly different situation known as a conditional probability. Rather than treating the whole sample space as your universe, you've restricted it based on some conditions for a particular event. So if our conditional event is "I've drawn a face card", we can write:

C = {JH, QH, KH, JS, QS, KS, JC, QC, KC, JD, QD, KD}

We then write P(E|C), pronounced, "The probability of E given C," or in this case, "The probability of drawing a jack given that I've drawn a face card." And what this really means intuitively, of course, is that we've restricted our universe from S down to C - we're only considering the chances of drawing E out of C rather than out of S.

We need to do this in terms of the math, though - as usual. We want to find the probability that both E and C occur, and we want to scale it by the original probability of C. We can't just scale back our universe from S to C for free! The first part of this is easy; just like we write P(A∪B) to mean, "The probability of A or B," we write P(A∩B) to mean, "The probability of A and B."

4 At least for the set-theoretic definition of "universe." Most of the time, the lab is our universe instead...
5 I'm more of a bridge, spades, and hearts person myself, although I think the number of years since I last played bridge may be entering the double digits...
This lets us write the mathematical definition:

P(E|C) = P(E∩C)/P(C)

This means that, "The chance of E occurring given that C has already occurred is equal to the chance of E and C both occurring scaled by the chance of just C occurring." Note that this should again, thankfully, also fit our intuition of how probability works. If you're considering any two events C and E occurring, one has to go first; the probability of them both occurring must be equal to the chance of one happening first, and then the other happening afterward:

P(C∩E) = P(E∩C) = P(E|C)P(C) = P(C|E)P(E)

In English, "The probability of both C and E happening is equal to the probability of C happening first, and then E happening after C has." Alternatively, if you happen to be more of an E-first sort of person, "The probability of E happening first, and then C happening after E has." More on this below...

Anyhow, to finish off our example, the chance of C occurring (us drawing a face card) is pretty clearly 12/52: we've got 52 possibilities, 12 of which satisfy the requirements of C. Since E is already a subset of C, we know E∩C=E, making P(E∩C)=P(E)=4/52, just like we wrote above. This means that:

The probability of drawing a jack given that we've drawn a face card = P(E|C) = P(E∩C)/P(C) = (4/52)/(12/52) = (4/52)(52/12) = 4/12 = 1/3

And this fits our intuition again, thank goodness. If we've drawn a face card, we've either drawn a jack, a queen, or a king; one third of those possibilities are jacks. And that's the number we got! Our mathematical definition of probability and conditional probability still seems to match up with the real world, which is pretty much what math is all about. Often. Not quite always, I suppose.

If What

There's one more important formula I want to get written down here, because not even a single class (or two) on probability would be complete without at least seeing it.
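The whole card calculation can be checked by brute force. This sketch (not part of the handout) enumerates a deck as (rank, suit) pairs and computes P(E|C) straight from the definition:

```python
from fractions import Fraction

# Build the 52-card deck as (rank, suit) pairs, all equally likely.
ranks = ['2', '3', '4', '5', '6', '7', '8', '9', '10', 'J', 'Q', 'K', 'A']
suits = ['H', 'D', 'S', 'C']
S = {(r, s) for r in ranks for s in suits}

def P(event):
    return Fraction(len(event), len(S))

E = {c for c in S if c[0] == 'J'}              # "we draw a jack"
C = {c for c in S if c[0] in ('J', 'Q', 'K')}  # "we draw a face card"

# P(E|C) = P(E ∩ C) / P(C)
print(P(E & C) / P(C))  # 1/3
```

Using Fraction instead of floats keeps the arithmetic exact, so the answer comes out as the same 1/3 derived above rather than 0.3333....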
I already gave it away above, but to lay at least a bit more of a mathematical foundation, consider the Multiplication Principle from combinatorics. In terms of counting, it states that, "The number of ways to do task A followed by task B is the product of the number of ways to do just A with the number of ways to do just B." It turns out that this has an (unsurprising) analog in probability. We've been talking about "the probability of doing both A and B" up above, and I claimed that P(A∩B) can be rewritten using the multiplication principle as "the probability of doing just A followed by just B."

The reason this is important is that doing A can change B, and of course vice versa. If we're talking about P(C∩E) again, we know that if we first reduce the deck from all 52 cards down to just the face cards (that's what C stood for, remember), then the probability of E (drawing a jack) is very, very different. Likewise P(C) if E has already occurred (intuitively, what's the probability of holding a face card if you're already holding a jack?)

But! We just defined notation for this! We know that P(E|C) means "the chances of achieving E given that we've already done C." So... since doing C and E basically means "doing C first and then E", I propose that we rewrite things as follows:

P(C∩E) = P(C)P(E|C)

This is identical to the Multiplication Principle - it can be read as, "The probability of doing task A followed by task B is the product of the probability of doing just A with the probability of doing B after A." Given that you believe this6, remember our original definition of conditional probability?
It looked pretty much like:

P(A|B) = P(A∩B)/P(B)

But now we know how to rewrite P(A∩B), so we get:

P(A|B) = P(B|A)P(A)/P(B)

Or to make it bigger and clearer and stuff:

P(A|B) = P(B|A) P(A) / P(B)

This is one of several ways to derive Bayes' Rule or Bayes' Theorem, and it's the foundation of A) pretty much all probabilistic algorithm design and B) a good chunk of all machine learning. There are lots of ways to prove it even more rigorously, none of which we'll do... but it's something that should be floating around the back of your head, informing your interpretation and analysis of data. It's useful; anything that can help a computer learn how to do anything right is important.

Congratulations - you've completed pretty much the first month of any respectable probability and statistics course. In a few hours. Consider this an accomplishment - because, well, it is!

6 Weakest. Joke. Ever.
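As a sanity check on the derivation, here's a small sketch (not part of the handout) that plugs the jack/face-card numbers into Bayes' Rule to recover P(C|E), the probability of holding a face card given that you're holding a jack:

```python
from fractions import Fraction

# Numbers from the jack/face-card example above.
P_C = Fraction(12, 52)        # P(C): draw a face card
P_E = Fraction(4, 52)         # P(E): draw a jack
P_E_given_C = Fraction(1, 3)  # P(E|C), computed earlier

# Bayes' Rule: P(C|E) = P(E|C) P(C) / P(E)
P_C_given_E = P_E_given_C * P_C / P_E
print(P_C_given_E)  # 1
```

The answer is exactly 1, matching the intuition mentioned earlier: every jack is a face card, so once you're holding a jack, C is certain.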