20.560: Bayesian statistics & machine learning
Scott Olesen
15 January 2016
Part I
Bayesian statistics
1 What does "Bayesian" mean?
It's in opposition to "frequentist". In frequentist statistics, probabilities are
strictly understood as frequencies. For a frequentist to say "an outcome has a
50% probability" means that, if that test were repeated very many times, the
outcome would occur 50% of the time.
“Bayesian” means many things, but an essential idea is that probabilities are
understood as statements of ignorance, expectation, and prediction, the things
we would call “probability” in everyday speech and thought. For a Bayesian
to say "an outcome has a 50% probability" means that, based on everything she
knows, she believes that outcome is as likely to occur as not.
2 The easy: Bayes rule where everyone agrees

2.1 Where Bayes rule comes from
Bayes rule (or Bayes' or Bayes's) is an easy-to-derive mathematical fact that
can get used in really different ways. Bayes rule relates the probability of some
state of nature given some data to a model that produces the data.
For example, imagine a pregnancy test. (I chose this example because actual
pregnancy rates are easy to manipulate.) The test is 99.9% accurate (i.e., 1 in
1,000 pregnant women test negative and 1 in 1,000 non-pregnant women test
positive). It's easy to say, given that a woman is pregnant (or not), what the
probability is that she'll test positive (or not):

P(tests positive | pregnant) = 999/1000.

This isn't the interesting question, however. We want to know the probability
that someone is pregnant given that they tested positive:

P(pregnant | tests positive) = ?
Are they the same? Imagine some scenarios.
• A 30-year-old woman tests positive. What is the probability she’s pregnant? 99.9% seems high but not absurd. (But is actually too high, spoiler
alert.)
• A 14-year-old girl (randomly selected from somewhere in the US) tests
positive. What is the probability she is pregnant? Here 99.9% feels too high.
It seems more likely that this is a false positive.
• An 85-year-old woman tests positive. What is the probability she is pregnant? Here 99.9% is definitely wrong: we are certain that this is a false
positive.
These cases show that, when predicting the probability that someone is pregnant, it’s important to integrate your prior knowledge about the probability
that that person is pregnant.
Bayes rule is the mathematical tool for incorporating this knowledge.
2.2 Derivation of the rule
To do the derivation, I'll use the standard notation from probability textbooks
and Wikipedia: θ is a state of nature (or "parameter") and X is the data. In our
example, θ is “person is pregnant” and X is the test result.
As we said before, the thing we know about the test is the probability that a
person tests positive given that she is pregnant, i.e., P(X|θ), but we want to
know the probability that she is pregnant given that she tested positive, i.e.,
P(θ|X). The examples we showed before, with the different ages of women,
were made to emphasize that P(X|θ) ≠ P(θ|X).
The reason we believe that, using the same pregnancy test, P (θ|X) is much
larger for a 30-year-old woman than for an 85-year-old woman is that we know
that, in general, pregnant women are much more common among 30-year-olds
than among 85-year-olds. In other words, our belief that this woman is pregnant
after we have the test results, i.e., P (θ|X), is somehow related to our belief that
this woman is pregnant before we get the test results, i.e., P (θ). In fact, P (θ)
is called the prior probability (or just “prior”) and P (θ|X) is called the posterior
probability (or just "posterior"). Bayes' rule relates these quantities.
To derive Bayes’ rule, you need only one probability theory definition: the
probability of θ given X is the probability of θ and X divided by the probability
of X. Mathematically,
P (θ|X) ≡ P (θ ∩ X)/P (X)
From here it's easy to see that

P(θ|X) = P(θ ∩ X) / P(X)
       = [P(θ ∩ X) / P(X)] × [P(θ) / P(θ)]
       = [P(X ∩ θ) / P(θ)] × [P(θ) / P(X)]
       = P(X|θ) P(θ) / P(X).
That's Bayes rule! It's often written with a proportionality sign,

P(θ|X) ∝ P(θ) × P(X|θ),
and pronounced “posterior is prior times likelihood ”. In probability theory, unlike in normal speech, “likelihood” is not a synonym for “probability”. Instead,
likelihood is the probability of the data given some state of nature, i.e., P (X|θ).
You might say, “Why drop the denominator? You need the denominator
for the actual number; I don’t just want something that’s proportional to the
number!” Well, we dropped the denominator P (X) because it’s difficult to
calculate directly, and often people are interested in the ratios of posterior
probabilities.
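
For a discrete state of nature, though, the dropped denominator is just the sum
of prior times likelihood over all the states, so normalizing is easy. Here is a
minimal sketch in Python (mine, not from the notes; it uses the 1-in-1,000 prior
that the next subsection justifies):

    # Bayes rule on a two-state problem. The denominator P(X) is recovered by
    # summing prior-times-likelihood over all states of nature, which is why
    # the proportional form loses nothing.
    prior = {"pregnant": 1 / 1000, "not pregnant": 999 / 1000}
    likelihood = {"pregnant": 999 / 1000,    # P(pos | pregnant)
                  "not pregnant": 1 / 1000}  # P(pos | not pregnant)

    unnormalized = {theta: prior[theta] * likelihood[theta] for theta in prior}
    p_x = sum(unnormalized.values())  # P(X), the dropped denominator
    posterior = {theta: u / p_x for theta, u in unnormalized.items()}
    print(posterior)  # {'pregnant': 0.5, 'not pregnant': 0.5}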
2.3 Using Bayes rule for our problem
The CDC reports that about 1 in every 100 women aged 15-44 is pregnant.
Let’s say 1 in 10 pregnant women doesn’t know she’s pregnant, so there is a 1 in
1000 chance that a random woman aged 15-44 doesn’t know she’s pregnant but
actually is pregnant. We can use Bayes rule to compute the probability that a
30-year-old is pregnant given that she tested positive:
P(is pregnant | tested positive) = P(preg) × P(pos|preg) / P(pos).

Note that the two things in the numerator are the things we already know: the
prior P(preg) = 1/1000 and the likelihood P(pos|preg) = 999/1000.
As we mentioned, it’s difficult to compute the denominator, so let’s take the
ratio of probabilities:
P(preg|pos) / P(not preg|pos) = [P(preg) × P(pos|preg)] / [P(not preg) × P(pos|not preg)]
                              = (1/1000 × 999/1000) / (999/1000 × 1/1000)
                              = 1,

that is, if she tested positive, there's a 50/50 chance she's actually pregnant.
This is much smaller than the 99.9% you might have naively assumed by equating
P(preg|pos) with P(pos|preg).
Now recompute this ratio for 14-year-old girls (say, 1 in 10,000 is pregnant)
and for 85-year-old women (say, none are pregnant). (Answers: 999/9999 ≈ 0.1,
and 0.)
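
A minimal Python sketch of these computations (the helper function is mine, not
from the notes) reproduces all three answers:

    # Posterior odds of pregnancy given a positive test, for a given prior.
    def posterior_odds(p_preg, sens=999 / 1000, false_pos=1 / 1000):
        """P(preg|pos) / P(not preg|pos) = prior odds times likelihood ratio."""
        return (p_preg * sens) / ((1 - p_preg) * false_pos)

    for label, p_preg in [("30-year-old", 1 / 1000),
                          ("14-year-old", 1 / 10_000),
                          ("85-year-old", 0.0)]:
        odds = posterior_odds(p_preg)
        print(f"{label}: odds = {odds:.3f}, P(preg|pos) = {odds / (1 + odds):.3f}")
    # 30-year-old: odds = 1.000; 14-year-old: odds = 0.100; 85-year-old: odds = 0.000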
2.4 Interpretations of these results
A frequentist interpretation says that this is just the appropriate way to apply
conditional probabilities. The quantity P (preg|pos) still means frequency: of all
people from such-and-such a population who have such-and-such test results,
what fraction are actually pregnant?
The Bayesian interpretation is more subtle but can feel more natural. The
quantity P (preg|pos) means our belief in the probability that this particular
person is pregnant. In a frequentist approach, the probability that this particular person is pregnant is either 0 or 1: either she is or she isn’t. It doesn’t make
sense to ask about frequencies of individual cases.
The Bayesian approach gives different names to the quantities. P (preg) is
called the prior probability that this person is pregnant. The idea is that we use
the data to “update” the prior to make the posterior probability P (preg|pos).
The link between prior and posterior is the likelihood of the data P (pos|preg). In
normal speech, “likelihood” and “probability” mean the same thing. In statistics,
"likelihood" always means the probability of the data given some state of nature,
i.e., P (X|θ).
3 How Bayesian and frequentist approaches differ
Bayesian statistics is a whole set of mathematical and theoretical approaches to
Bayes-rule-type problems. Here’s one example of how the two approaches differ.
3.1 How do we know the prior?
Recall the 99.9%-accurate pregnancy test. We said that, among 30-year-old
women, 1 in 1,000 is unknowingly pregnant. We used this exact proportion,
1/1000, to compute our results. Where did that 1 in 1,000 come from? What if
I told you I followed a random selection of 1,000 thirty-year-old women in a
prospective clinical trial and 1 of them turned out to be pregnant? Would you
feel differently if I told you I had followed 1 million women and 1,000 of them
were pregnant?
In a frequentist interpretation, there is some true rate of pregnancy, and we
use the best guess we have (technically, the maximum likelihood estimator) for
that rate. Whether I ultrasounded 1,000 women or 1 million, the frequentist
computation would proceed the same way. The frequentist approach is called
point estimation, and it's the kind of thing that we learn, as scientists and
engineers, to be skeptical about: when a UROP says that the PCR product is
200 base pairs, your first question is "Plus-minus what?" Oddly, the industry-standard
frequentist statistical approaches do exactly this kind of no-error thing
with your data.
In a Bayesian approach, however, we can account for our uncertainty about
the pregnancy rate. Unlike the frequentist view, we can treat the pregnancy
rate itself as a random variable, and we are allowed to have uncertainty about
it. The mathematics of the Bayesian approach aren’t difficult, but they are
beyond the scope of this lecture. Nature Biotechnology has a Primer from 2004
called “What is Bayesian statistics?” that introduces this kind of mathematics.
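
As a flavor of that mathematics, here is a minimal sketch (the flat Beta(1, 1)
prior on the rate and the use of scipy are my assumptions, not the Primer's).
The small study leaves far more uncertainty about the rate than the large one:

    # Treat the pregnancy rate itself as a random variable with a Beta posterior.
    from scipy.stats import beta

    for pregnant, n in [(1, 1_000), (1_000, 1_000_000)]:
        posterior = beta(1 + pregnant, 1 + n - pregnant)
        lo, hi = posterior.interval(0.95)
        print(f"n={n}: mean={posterior.mean():.5f}, "
              f"95% interval=({lo:.6f}, {hi:.6f})")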
3.2 What about when you don't know the prior?
The criticisms of the Bayesian approach mostly come down to the use of priors.
The example with pregnancy rates is mostly above reproach, since we have a
very well-supported prior, and we used the frequentist approach of making P(θ)
just a point estimate.
Imagine, instead, that you’re doing an experiment. You have a hypothesis
about a biological system, saying that the state of nature θ is true (e.g., θ is the
event where the expression of your favorite gene is elevated under your favorite
condition). You take some data X. The chance that your hypothesis is true,
i.e., P (θ|X), has to do with your prior on θ. How do you get a prior P (θ)?
Unlike the pregnancy example where we could appeal to some gold standard
measurement, here your measurements X are the best knowledge you have.
In short, the Bayesian approach requires you to guess. You cobble together
what you can from the literature, from similar biological systems, whatever,
and then you guess. A cynic could say that, since you can pick whatever prior
you want, you can produce whatever probabilities you want. This is technically
true: selecting strong, weird priors can give you basically any probability you
want (the sketch after the list below makes this concrete). The Bayesian
responses to this criticism take a few forms.
• Every statistical analysis involves argumentation and arbitrariness. Frequentist analysis requires picking an analytical method, discarding outliers, and all sorts of other subjective decisions. A Bayesian prior is one
of those subjective decisions; it's just very up-front and in-your-face. It's
therefore harder to hide.
• In the case of a decision analysis like our pregnancy test, the important
thing is whether this person is actually pregnant or not. In this case, it’s
foolish to discard what we already know. There is existing information
about pregnancy rates, and we know for sure that the rate isn’t 0% and
isn’t 100%. When a plane crashes in the ocean, the real-life search pattern
used to search for it is Bayesian, since we have some ideas about where it
probably is.
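
To make the prior-sensitivity point concrete, here is a minimal sketch (mine,
not from the notes) holding the data fixed and sweeping the prior:

    # The same likelihood ratio, combined with different priors, yields very
    # different posterior probabilities.
    likelihood_ratio = 999.0  # P(X | theta) / P(X | not theta), fixed by the data

    for prior in [0.001, 0.01, 0.1, 0.5, 0.9]:
        post_odds = (prior / (1 - prior)) * likelihood_ratio
        print(f"prior = {prior}: posterior = {post_odds / (1 + post_odds):.4f}")
    # the posterior runs from 0.5 up to 0.9999 as the prior grows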
The Bayesian approach makes a lot of sense, so what does that say about the
frequentist approach? We’ll do one example in the next section.
4 What clarity Bayesian stats gives us regarding frequentist stats
4.1 p-values
Frequentist statistics use a p-value, the probability that these data would have
resulted from the null hypothesis (i.e., the likelihood of the data). A common and catastrophic misunderstanding of the p-value is that it is equal to the
probability that the null hypothesis is true given the data (i.e., the posterior
probability). In fact, the two things can be quite different.
To illustrate, let's examine the ratio of the posterior probabilities of the null
hypothesis H0 and alternative hypothesis HA:

P(HA|X) / P(H0|X) = [P(HA) / P(H0)] × [P(X|HA) / P(X|H0)]
                  = [prior on alt. hypo. / (1 − prior on alt. hypo.)] × [P(X|HA) / p-value].
As we expected, as the p-value goes down, the posterior probability on the
alternative hypothesis goes up. (Presumably also, the weird term P (X|HA )
will go up as the p-value goes down, and vice versa.) Importantly, the posterior
probability on the alternative hypothesis also depends on our prior belief that
the alternative hypothesis was true.
This fits with the way we as humans (and scientists) reason: if you make
an easy-to-believe claim (e.g., it's snowing in winter), I only need a little data
to believe it (e.g., someone says it’s snowing). If you make an outrageous claim
(e.g., it’s snowing in July), I need a lot of data to believe it. We intuitively
use this kind of logic when reading papers: even if someone computes a 10^-100
p-value, we believe the assertion or not based on other information we have.
A feature of Bayesian statistics is that it allows you to make the prior probability of your hypothesis explicit. You can use that value to make a reasoned
statement about the probability that your hypothesis is true. Again, a p-value
is not the probability that the null hypothesis is true: it’s a likelihood of the
data given the null hypothesis.
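
A minimal sketch (the toy numbers are mine, not from the notes) shows the same
p-value implying very different posterior probabilities depending on the prior:

    # Posterior odds = prior odds times P(X|HA)/p-value, as derived above.
    p_x_given_ha, p_value = 0.8, 0.01  # assumed values for illustration

    for prior_ha in [0.5, 0.05, 0.001]:  # snow in winter ... snow in July
        post_odds = (prior_ha / (1 - prior_ha)) * (p_x_given_ha / p_value)
        print(f"prior = {prior_ha}: P(HA|X) = {post_odds / (1 + post_odds):.3f}")
    # prior = 0.5: 0.988; prior = 0.05: 0.808; prior = 0.001: 0.074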
4.2 Confidence intervals
If you thought the p-value thing was scary, brace yourself for confidence intervals.
In frequentist analysis, a 95% confidence interval is a method for computing
a range of values which, in 95% of many cases, would include the true value.
Say you are measuring some variable that you think is distributed according
to some distribution, and you want to know, say, the mean. You look up how
to compute confidence intervals for the mean for that distribution. If you did
the experiment 100 times, you could expect that the confidence interval would
contain the true mean about 95 times.
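
Here is a minimal simulation of that coverage claim (my sketch, assuming numpy
is installed):

    # Repeat a normal-mean experiment many times; about 95% of the computed
    # intervals should contain the true mean.
    import numpy as np

    rng = np.random.default_rng(0)
    true_mean, n, trials, hits = 10.0, 30, 1000, 0

    for _ in range(trials):
        x = rng.normal(true_mean, 2.0, size=n)
        half_width = 1.96 * x.std(ddof=1) / np.sqrt(n)  # approximate 95% CI
        hits += abs(x.mean() - true_mean) <= half_width

    print(f"coverage: {hits / trials:.3f}")  # close to 0.95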
Critically, the confidence interval is not the thing that feels natural. For me,
the natural thing would be some interval that, with 95% probability, includes
the true mean for that one experiment. This, however, is not a frequentist
measure, since it is essentially a posterior probability. In Bayesian statistics,
credible interval is the name for exactly the thing I want. If θ is the true mean,
then y1 and y2 are the limits of a 95% credible interval if
0.95 = ∫_{y1}^{y2} P(θ = Y | X) dY.
As an aside: I said “a” credible interval because there are usually many
(y1, y2) that make this equation true. Depending on your purposes, you might
want, say, the narrowest credible interval (i.e., where y2 − y1 is minimized) or
a symmetric credible interval, where the probability of landing below y1 is the
same as the probability of landing above y2 .
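
Here is a minimal sketch of a symmetric credible interval (the Beta posterior
is an arbitrary toy choice of mine, and scipy is assumed):

    # Put 2.5% of the posterior probability below y1 and 2.5% above y2.
    from scipy.stats import beta

    posterior = beta(5, 20)  # assumed toy posterior on theta
    y1, y2 = posterior.ppf(0.025), posterior.ppf(0.975)
    print(f"95% symmetric credible interval: ({y1:.3f}, {y2:.3f})")
    print(posterior.cdf(y2) - posterior.cdf(y1))  # 0.95, the integral above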
5 When should I use Bayesian statistics?
Maybe when:
• It’s important to have a clear interpretation of your results (i.e., none of
the weird frequentist mumbo-jumbo definitions of p-values and confidence
intervals).
• You are able to make a statement about your prior belief that something
is true.
• You want to update your interpretation of later data using earlier data.
• You want to train a model on some data and use it to predict something
about other data.
• You have some parameter that you can’t measure and you’re worried about
pulling a single point estimate out of thin air.
• You know there is uncertainty in some measure and your frequentist approach will only let you put in a single point estimate.
• You want to make a prediction or decide an action based on prior knowledge.
Part II
Machine learning
6 What does "machine learning" mean?
Machine learning refers to many different kinds of methods. It’s an entire area
of research in computer science and statistics, and it has growing applications
to the biological sciences. Because there are so many methods, I’ll just do a
quick overview of what kinds of things machine learning can do.
7 Taxonomies of machine learning
There is a little jargon that will clarify discussions about machine learning. The
data comes in the form of samples or examples, which are usually individual
sampling or measurement events. Every sample consists of some number of features. For example, one biological replicate would be a sample, and RNA-seq will
give you the abundances of many transcripts; the abundances of the transcripts
are the features. Some samples might be training examples used to train the
machines.
Classification. Say I want to predict some discrete output variable (or label
or class) from some input samples. For example, can I predict which patients
will get sick (or are already sick) based on their microbiome data?
Regression. Say I want to predict some continuous output variable from some
input data. For example, can I predict a cell's division time based on its gene
expression data?
Clustering. Say I just want to know which samples are more similar to one
another. Do they form natural groups?
Dimensionality reduction. Say I have samples with many features and I
just want to make my data more “compact” by removing some dimensions or
features. For example, is it important to know the abundance of all the transcripts, or can we focus on a smaller number of very informative features?
8 Machine learning techniques to be familiar with
• Wikipedia's "List of machine learning concepts"
• scikit-learn's flowchart