Genomic Data Manipulation
BIO508 Spring 2015
Handout 01
Things Go Wrong
I think the title pretty much sums it up. If nothing else, I'm pretty clumsy, and I mess things up. I have no idea
how many times I've tripped over my own feet, dropped chalk, walked into lampposts, or gotten locked out of
my garage1. Fortunately, most people have this sort of problem, which is why there are whole branches of science
devoted to the study of things going wrong.
In the real world, things can go wrong during software development and execution, elections, manufacturing
processes, or even (gasp) laboratory experiments. The only reason we can really function as a large-scale society
is by minimizing the collective wrongness - by figuring out how likely we are to make mistakes, where they're
most likely to be made, and what we can change to most effectively remove the sources of error. And as you may
have guessed by now - especially if you've been called by a telephone survey company lately - the area of science
that deals with this in particular is that of probability and statistics.
In general, you can think of the mathematics of probability as describing the numerical chances of things
happening: how you can solve problems and figure out the likelihoods of various outcomes in various situations.
Statistics, on the other hand, tends to be the application of probability: correcting for errors in a phone survey, for
example (since you can't survey infinitely many people), or keeping track of the actual results of some card games
(rather than just theoretically predicting their results). Different people's definitions will vary, but this is the
general idea.
Do You Speak Jive?
Assuming I've convinced you that this is at least marginally interesting, there are (as usual) a bunch of words to
which we'll assign special meanings when discussing probability and statistics. On the bright side, most of these
are pretty simple (as opposed to, oh, linear algebra, where none of the terminology is particularly obvious, at least
to me). So to get the basics out of the way:
An experiment is something that we can do to produce a random result. It can really be pretty much anything;
"flip a coin" and "roll a die" are popular examples, but they're not the only options. We could "see how long it
takes for a cell culture to double in density," or "count the number of pipette tips discarded in a single day." All of
these things have well defined outcomes that vary with some degree of randomness.
A sample space is the set of all possible outcomes for a particular experiment. This is a set in the formal sense! So
for the two canonical examples, flipping a coin and rolling a die, we might have the sample spaces:
Scoin = {H, T} for Heads and Tails
Sdie = {1, 2, 3, 4, 5, 6}
But what about our more complicated experiments? Sample spaces don't have to be finite:
Sdouble = {0, 0.01, 0.0001, 9, 9.0005, ...} or the non-negative real numbers (assuming hours)
Stips = {0, 1, 2, 3, 4, ...} or the natural numbers
1 Well, actually, I have a hard count on that last one. It's one. It was memorable, not my fault (house painters; I do not recommend CertaPro), and involved excising one of my window panes with an X-ACTO knife. I had no idea the proper spelling was "X-ACTO" until I just looked it up.
And we can have pretty non-intuitive sample spaces, too. Suppose we roll two dice instead of one die. Then the
following are all valid, essentially equivalent sample spaces for this experiment. The one we'd choose would
depend on the specific application (for example, if we cared about order, we wouldn't choose the set-based version):
Sdice = {(1, 1), (1, 2), ..., (1, 6), (2, 1), (2, 2), ..., (2, 6), (3, 1), ..., (6, 6)}
Sdice = {2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}
Sdice = {{1, 1}, {1, 2}, ..., {1, 6}, {2, 2}, {2, 3}, ..., {3, 3}, ..., {6, 6}}
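These three equivalent sample spaces are easy to build mechanically. Here's a quick sketch of my own (not part of the original handout) using Python's itertools; since a Python set would collapse {1, 1} to {1}, sorted tuples stand in for the unordered pairs in the third space:

```python
# Build the three equivalent two-dice sample spaces.
from itertools import product

faces = range(1, 7)
pairs = list(product(faces, faces))

s_ordered = set(pairs)                           # ordered pairs: (1,1)...(6,6)
s_sums = {a + b for a, b in pairs}               # sums: 2 through 12
s_unordered = {tuple(sorted(p)) for p in pairs}  # unordered pairs as sorted tuples

print(len(s_ordered), len(s_sums), len(s_unordered))  # 36 11 21
```

The three sizes (36, 11, and 21) make the point concrete: equivalent descriptions of the same experiment can have very different numbers of elements.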
For one more, suppose we're running a test over and over until we get a positive result. Then our sample space is
still infinite, but it isn't clearly numbers or sets or anything like that:
Stest = {P, NP, NNP, NNNP, NNNNP, ...}
An event is any subset of a sample space. This sort of almost corresponds to the intuitive idea we'd have of an
event as a probabilistic outcome, but you have to be careful. Suppose we have two dice again. Then the verbal
event "the sum of the two dice equals five" would be described by the following subsets of the three equivalent
sample spaces:
E5 = {(1, 4), (2, 3), (3, 2), (4, 1)}
E5 = {5}
E5 = {{1, 4}, {2, 3}}
We can get pretty esoteric in our verbal descriptions, as long as they still describe some subset (not necessarily a
strict subset) of our sample space. So suppose we're observing the sample space of "gender distributions in a
group of mice." Then our space might be described as:
Sgender = {(M, F), (M, M, F), (M, F, F), (M, F, F, F), (M, M, F, F), (M, F, M, F), (F, M, M, M), ...}
The description "groups containing the same number of males as females" would then be an event represented by
the subset:
Eeven = {(M, F), (M, M, F, F), (M, F, M, F), (M, M, M, F, F, F), ...}
Note that both of these - the sample space and the event - assume that there are infinitely many mice available.
While it may sometimes feel that way when you're working on an experimental design, your funding agency will
probably frown on such a thing2. We can often make simplifying assumptions to regularize or, well, simplify our sample spaces.
So we say that an event occurs if any one of its elements is actually what we observe. For example, suppose our
sample space is "the results of three coin flips in no particular order." We're interested in the event "at least one
head is generated." Then we have:
S = {{H, H, H}, {H, H, T}, {H, T, T}, {T, T, T}}
E = {{H, H, H}, {H, H, T}, {H, T, T}}
2 Not to mention the ethics committee...
If we actually run this experiment once, we might generate the result {H, T, T}. This is a member of our event E,
so in this case, E has occurred. If we run the experiment again and generate the result {T, T, T}, E has not occurred.
Now for the biggie. Notice that we haven't yet defined the probability of an event, which is really what we're
ultimately interested in. Suppose we have some event E over a sample space. We run n experiments. Then we
can use n(E) to denote "the number of those n times in which E occurred." For example, using S and E from
directly above, let's run three experiments. Then we know n=3, and we might generate the results:
{H, T, T}
{H, H, T}
{T, T, T}
Of these three results, E occurred for two of them, so n(E)=2.
This lets us finally define the probability3 of some particular event as follows: for any E, if we run an experiment n
times, then the probability of the event (written as P(E)) is equal to the value
P(E) = n(E)/n
as n grows arbitrarily large (the limit as n grows to infinity). This sounds a little complicated, but fortunately it
tends to fit our intuitive definition of probability. Let's go back to the simpler "single coin flip" sample space:
tends to fit our intuitive definition of probability. Let's go back to the simpler "single coin flip" sample space:
Scoin = {H, T}
Suppose our event is "the coin lands heads up." Then E={H}, pretty obviously, and we can run experiment after
experiment to see what happens. This might generate the results:
H, H, T, H, T, T, H, T, T, T, H, T, H, T, T, H, H, T, T, H
Here, n=20, and our event has occurred 9 times, so our probability n(E)/n = 9/20 = 0.45. Suppose we do another 20
experiments and get the results:
H, H, T, H, T, T, H, T, H, H, T, T, H, T, H, T, H, T, H, T
Now n=40, and n(E)=19, making the probability 19/40 = 0.475. I bet I could probably convince you that as we flip
the coin more and more and more and more times, this will creep towards exactly 0.5. It might meander around a
bit on the way, but in the limit as n gets big, it'll eventually be exactly one half.
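This creeping-toward-one-half behavior is easy to watch for yourself. Here's a simulation sketch of my own (assumed Python, not from the handout) that flips a fair virtual coin and reports the running ratio n(E)/n for E = {H}:

```python
# Watch n(E)/n creep toward 0.5 as n grows (the frequentist definition).
import random

random.seed(508)  # arbitrary seed, just so the run is reproducible

def estimate_p_heads(n):
    """Flip a fair virtual coin n times; return n(E)/n for E = 'heads'."""
    n_e = sum(1 for _ in range(n) if random.random() < 0.5)
    return n_e / n

for n in (20, 200, 2000, 200000):
    print(n, estimate_p_heads(n))
```

For small n the estimate meanders, just as our 9/20 and 19/40 runs did above; by the time n is in the hundreds of thousands it sits very close to 0.5.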
Need Input
So now that we have some words to use, we need to say things with them. The first thing we need to do is to
encode our intuitions about probability using our newly defined definitions. We know basically how we want
"the chance of something happening" to work in the real world - we just need to make sure we write down the
math so that it means the same thing.
3 At least in the frequentist sense. If you want more than that, I'd have to recommend a real statistics course...
First, remember that we write "the probability of the event E" as just plain P(E). Knowing this, I hereby decree,
declare, proclaim, and state that the following three things pretty much sum up my expectations about probability:
- P(E) must be greater than or equal to zero for any event E. Negative probabilities just don't make sense!
Especially since we defined probability using the idea of counting, and it's awfully hard to count
negatively many objects.
- P(S) must be exactly one for any sample space S. This basically means that given all possible outcomes, at
least one of them has to happen every time we run the experiment. Or in other words, we're working
with a closed universe; once we've defined a sample space S, no experiment can generate a result that's
not included in S.
- This one's the tricky one. For two events E and F, we require the probability of either E or F happening,
written as P(E∪F), to equal exactly P(E)+P(F) if E and F share no common members. More succinctly, if
E∩F=∅, then P(E∪F)=P(E)+P(F). In plain English, this means that if E and F are completely different
results that have nothing to do with each other - say getting a one on a die versus getting a six - then the
probability of either happening is the sum of the probabilities of one or the other happening.
These are called the axioms of probability, and they're a bit more profound than we care to think about. For our
purposes, they're just a way of writing down in mathese the intuitions we already have about probabilities: they
can't be negative, they can't be bigger than one, and things that don't affect each other have disjoint (non-overlapping) probabilities.
It's surprising what we can prove using just these three ideas. For that matter, think about all of the things we
know about probabilities - or at least things you suspect. If they're true, you can prove them with just these three
facts! It's pretty neat. Consider, for example, that individual members of S are always non-overlapping (as per
the non-overlappingness we talked about in the third axiom above). This means that we can break the
probability of a whole event E down into the individual probabilities of its members ei.
Let's write this out. Suppose we have some event E - the one we used above was good:
S = {{H, H, H}, {H, H, T}, {H, T, T}, {T, T, T}}
E = {{H, H, H}, {H, H, T}, {H, T, T}}
We can be lazy and abbreviate these as follows:
S = {HHH, HHT, HTT, TTT}
E = {HHH, HHT, HTT}
Then the elements of E are e1=HHH, e2=HHT, and e3=HTT. Because these three things are disjoint - they don't
overlap with each other, depend on each other, or otherwise influence each other - we know that they must fit the
requirements of our third axiom. This means that we can break down P(E) into...
P(E) = P(e1 ∪ e2 ∪ e3) = P(e1) + P(e2) + P(e3) = P(HHH) + P(HHT) + P(HTT)
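We can check this decomposition directly in code. Below is a sketch of my own (assumed Python, not from the handout); note that the unordered three-flip outcomes are not equally likely, so each member's probability is counted over the 8 equally likely ordered flip sequences:

```python
# Check P(E) = P(HHH) + P(HHT) + P(HTT) for three coin flips.
from fractions import Fraction
from itertools import product

ordered = list(product("HT", repeat=3))  # 8 equally likely ordered outcomes

def p(member):
    """P of one unordered outcome such as 'HHT', counted over ordered flips."""
    target = tuple(sorted(member))
    hits = sum(1 for o in ordered if tuple(sorted(o)) == target)
    return Fraction(hits, len(ordered))

p_e = p("HHH") + p("HHT") + p("HTT")
print(p_e)  # 7/8
```

And 7/8 is exactly what we'd expect for "at least one head in three flips": everything except TTT, which has probability 1/8.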
And in general, for any E composed of n different members ei, we write:
P(E) = Σi=1..n P(ei)
Similarly, when we have P(E∪F) and some of their members do overlap, we need to subtract out the extra stuff we
would overcount using this method. You've almost certainly seen this before for the rules of set cardinality,
|E∪F| = |E| + |F| - |E∩F|
The size of the union of E and F is the count of all of their elements minus the stuff that's in both of them and thus
counted twice. Likewise, if two sample spaces share elements, we can define:
P(E∪F) = P(E) + P(F) - P(E∩F)
And if E and F don't share any elements, E∩F is empty, making its probability zero and reducing this to the
simpler summation we saw above.
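Inclusion-exclusion is also easy to verify numerically. A small check of my own (not from the handout), using one die roll with two overlapping events:

```python
# Check P(E∪F) = P(E) + P(F) - P(E∩F) on a single die roll.
from fractions import Fraction

s = {1, 2, 3, 4, 5, 6}  # sample space: one die roll
e = {2, 4, 6}           # event: roll is even
f = {4, 5, 6}           # event: roll is at least four

def p(event):
    return Fraction(len(event), len(s))  # all outcomes equally likely

print(p(e | f))                 # 2/3
print(p(e) + p(f) - p(e & f))   # 2/3, same by inclusion-exclusion
```

Without the subtraction we'd count 4 and 6 twice and get 6/6 = 1, which is clearly wrong: rolling a 1 or a 3 satisfies neither event.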
Since events are always just sets, we can use other tricks from set notation to talk about them, too. Remember
complementation? Well, when we're dealing with probabilities, the current sample space is our universe4. That
means that we can easily define the complement of an event E as ~E=S-E. And this immediately means that:
P(~E) = 1 - P(E)
The probability of something happening is one minus the probability of it happening - just like we'd expect! And
we can prove this using the rules we've seen so far; since E and ~E share no elements, they are disjoint, making:
P(S) = 1 by our second axiom
S = E ∪ ~E by the definition of complement
P(E∪~E) = P(E) + P(~E) by our third axiom, so...
P(S) = P(E∪~E) = P(E) + P(~E) = 1, and thus...
P(~E) = 1 - P(E)
That proves our intuition mathematically; surprisingly enough, math works. At least for now...
What If
If you combine some combinatorics (ha ha) with these definitions, it provides pretty much everything you need to
probabilistically answer a lot of important questions about important things. Like poker5. Suppose, for example,
that our sample space is "one card drawn from a deck of playing cards," making:
S = {2H, 3H, ..., JH, QH, KH, AH, 2D, ..., 2S, ..., 2C, ..., AC}
Our event of interest will be "we draw a jack", making:
E = {JH, JS, JD, JC}
I bet that you could believe P(E)=4/52=1/13 pretty easily. But what if I change my event slightly: given that I've
drawn a face card, what is the probability of it being a jack?
4 At least for the set-theoretic definition of "universe." Most of the time, the lab is our universe instead...
5 I'm more of a bridge, spades, and hearts person myself, although I think the number of years since I last played bridge may be entering the
double digits...
Any time you have extra information about your event, you've generated a slightly different situation known as a
conditional probability. Rather than treating the whole sample space as your universe, you've restricted it based
on some conditions for a particular event. So if our conditional event is "I've drawn a face card", we can write:
C = {JH, QH, KH, JS, QS, KS, JC, QC, KC, JD, QD, KD}
We then write P(E|C), pronounced, "The probability of E given C," or in this case, "The probability of drawing a
jack given that I've drawn a face card." And what this really means intuitively, of course, is that we've restricted
our universe from S down to C - we're only considering the chances of drawing E out of C rather than out of S.
We need to do this in terms of the math, though - as usual. We want to find the probability that both E and C
occur, and we want to scale it by the original probability of C. We can't just scale back our universe from S to C
for free! The first part of this is easy; just like we write P(A∪B) to mean, "The probability of A or B," we write
P(A∩B) to mean, "The probability of A and B."
This lets us write the mathematical definition:
P(E|C) = P(E∩C)/P(C)
This means that, "The chance of E occurring given that C has already occurred is equal to the chance of E and C
both occurring scaled by the chance of just C occurring." Note that this should again, thankfully, also fit our
intuition of how probability works. If you're considering any two events C and E occurring, one has to go first;
the probability of them both occurring must be equal to the chance of one happening first, and then the other
happening afterward:
P(C∩E) = P(E∩C) = P(E|C)P(C) = P(C|E)P(E)
In English, "The probability of both C and E happening is equal to the probability of C happening first, and then E
happening after C has." Alternatively, if you happen to be more of an E-first sort of person, "The probability of E
happening first, and then C happening after E has." More on this below...
Anyhow, to finish off our example, the chance of C occurring (us drawing a face card) is pretty clearly 12/52 -
we've got 52 possibilities, 12 of which satisfy the requirements of C. Since E is already a subset of C, we know
E∩C=E, making P(E∩C)=P(E)=4/52, just like we wrote above. This means that:
The probability of drawing a jack given that we've drawn a face card =
P(E|C) =
P(E∩C)/P(C) =
(4/52)/(12/52) =
(4/52)(52/12) =
4/12 =
1/3
And this fits our intuition again, thank goodness. If we've drawn a face card, we've either drawn a jack, a queen,
or a king; one third of those possibilities are jacks. And that's the number we got! Our mathematical definition of
probability and conditional probability still seems to match up with the real world, which is pretty much what
math is all about. Often. Not quite always, I suppose.
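The same computation can be written out in code. Here's a sketch of my own (assumed Python, not from the handout), building the deck and applying P(E|C) = P(E∩C)/P(C) directly:

```python
# The jack-given-face-card computation: P(E|C) = P(E∩C)/P(C).
from fractions import Fraction
from itertools import product

ranks = ["2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K", "A"]
suits = "HDSC"
deck = [r + s for r, s in product(ranks, suits)]   # 52 cards, e.g. 'JH'

e = {card for card in deck if card[:-1] == "J"}    # event: drew a jack
c = {card for card in deck if card[:-1] in "JQK"}  # condition: drew a face card

def p(event):
    return Fraction(len(event), len(deck))

print(p(e & c) / p(c))  # 1/3
```

Using Fraction keeps the arithmetic exact, so the answer comes out as the clean 1/3 we derived by hand rather than 0.3333....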
If What
There's one more important formula I want to get written down here, because not even a single class (or two) on
probability would be complete without at least seeing it. I already gave it away above, but to lay at least a bit
more of a mathematical foundation, consider the Multiplication Principle from combinatorics. In terms of
counting, it states that, "The number of ways to do task A followed by task B is the product of the number of ways
to do just A with the number of ways to do just B."
It turns out that this has an (unsurprising) analog in probability. We've been talking about "the probability of
doing both A and B" up above, and I claimed that P(A∩B) can be rewritten using the multiplication principle as
"the probability of doing just A followed by just B." The reason this is important is that doing A can change B,
and of course vice versa. If we're talking about P(C∩E) again, we know that if we first reduce the deck from all 52
cards down to just the face cards (that's what C stood for, remember), then the probability of E (drawing a jack) is
very, very different. Likewise P(C) if E has already occurred (intuitively, what's the probability of holding a face
card if you're already holding a jack?)
But! We just defined notation for this! We know that P(E|C) means "the chances of achieving E given that we've
already done C." So... since doing C and E basically means "doing C first and then E", I propose that we rewrite
things as follows:
P(C∩E) = P(C)P(E|C)
This is identical to the Multiplication Principle - it can be read as, "The probability of doing task A followed by
task B is the product of the probability of doing just A with the probability of doing B after A."
Given that you believe this6, remember our original definition of conditional probability? It looked pretty much like this:
P(A|B) = P(A∩B)/P(B)
But now we know how to rewrite P(A∩B), so we get:
P(A|B) = P(B|A)P(A)/P(B)
Or to make it bigger and clearer and stuff:
P(A|B) = P(B|A)P(A) / P(B)
This is one of several ways to derive Bayes' Rule or Bayes' Theorem, and it's the foundation of A) pretty much all
probabilistic algorithm design and B) a good chunk of all machine learning. There are lots of ways to prove it
even more rigorously, none of which we'll do... but it's something that should be floating around the back of your
head, informing your interpretation and analysis of data. It's useful; anything that can help a computer learn how
to do anything right is important.
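To see Bayes' Rule earn its keep, here's a sketch of my own with entirely made-up numbers for a screening test, where A = "has condition" and B = "tests positive." Every rate below is hypothetical, chosen only to exercise the formula:

```python
# Apply Bayes' Rule: P(A|B) = P(B|A)P(A)/P(B). All rates are hypothetical.
p_a = 0.01              # P(A): prior prevalence of the condition
p_b_given_a = 0.95      # P(B|A): chance of a positive test if A holds
p_b_given_not_a = 0.05  # P(B|~A): false-positive rate

# P(B) by splitting over A and ~A (using the complement rule from above)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 3))  # 0.161
```

Note the punch line: even with a test that's right 95% of the time, a positive result here only means about a 16% chance of actually having the condition, because the condition itself is rare. That counterintuitive flip is exactly why Bayes' Rule should be floating around the back of your head.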
Congratulations - you've completed pretty much the first month of any respectable probability and statistics
course. In a few hours. Consider this an accomplishment - because, well, it is!
6 Weakest. Joke. Ever.