20.560: Bayesian statistics & machine learning
Scott Olesen
15 January 2016

Part I
Bayesian statistics

1 What does "Bayesian" mean?

It's in opposition to "frequentist". In frequentist statistics, probabilities are strictly understood as frequencies. For a frequentist to say "an outcome has a 50% probability" means that, if that test were repeated very many times, the outcome would occur 50% of the time. "Bayesian" means many things, but an essential idea is that probabilities are understood as statements of ignorance, expectation, and prediction, the things we would call "probability" in everyday speech and thought. For a Bayesian to say "an outcome has 50% probability" means that, based on everything she knows, she believes that outcome is as likely to occur as not.

2 The easy part: Bayes rule, where everyone agrees

2.1 Where Bayes rule comes from

Bayes rule (or Bayes' or Bayes's) is an easy-to-derive mathematical fact that can be used in really different ways. Bayes rule relates the probability of some state of nature given some data to a model that produces the data.

For example, imagine a pregnancy test. (I chose this example because actual pregnancy rates are easy to manipulate.) The test is 99.9% accurate (i.e., 1 in 1,000 pregnant women test negative and 1 in 1,000 non-pregnant women test positive). It's easy to say, given that a woman is pregnant (or not), what's the probability that she'll test positive (or not):

    P(tests positive | pregnant) = 999/1000.

This isn't the interesting question, however. We want to know the probability that someone is pregnant given that she tested positive:

    P(pregnant | tests positive) = ?

Are they the same? Imagine some scenarios.

• A 30-year-old woman tests positive. What is the probability she's pregnant? 99.9% seems high but not absurd. (But it is actually too high, spoiler alert.)
• A 14-year-old girl (randomly selected from somewhere in the US) tests positive. What is the probability she is pregnant? Here 99.9% feels high. It seems more likely that this is a false positive.
• An 85-year-old woman tests positive. What is the probability she is pregnant? Here 99.9% is definitely wrong: we are certain that this is a false positive.

These cases show that, when predicting the probability that someone is pregnant, it's important to integrate your prior knowledge about the probability that that person is pregnant. Bayes rule is the mathematical tool for incorporating this knowledge.
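Before deriving the rule, a quick simulation can build intuition. This is a minimal sketch, not part of the original lecture; the population size and the base rates are illustrative assumptions. It applies the 99.9%-accurate test to populations with different pregnancy rates and reports what fraction of positive tests come from women who are actually pregnant:

```python
import random

def fraction_of_positives_pregnant(pregnancy_rate, n=1_000_000, accuracy=0.999):
    """Simulate testing n women whose true pregnancy rate is pregnancy_rate.

    The test is 'accuracy' accurate in both directions: pregnant women test
    positive with probability 0.999, and non-pregnant women test negative
    with probability 0.999.
    """
    true_positives = 0
    positives = 0
    for _ in range(n):
        pregnant = random.random() < pregnancy_rate
        # A pregnant woman tests positive w.p. accuracy;
        # a non-pregnant woman tests positive w.p. 1 - accuracy.
        positive = random.random() < (accuracy if pregnant else 1 - accuracy)
        if positive:
            positives += 1
            true_positives += pregnant
    return true_positives / positives

# Same test, three different base rates (stand-ins for the three ages above):
for rate in [1/1000, 1/10000, 0.0]:
    print(rate, fraction_of_positives_pregnant(rate))
# Roughly 0.5 for 1/1000, roughly 0.09 for 1/10000, and exactly 0.0
# when no one in the population is pregnant.
```

The same test gives wildly different answers depending on the base rate, which is exactly what the three scenarios suggested.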
2.2 Derivation of the rule

To do the derivation, I'll use the standard notation from probability textbooks and Wikipedia: θ is a state of nature (or "parameter") and X is the data. In our example, θ is "person is pregnant" and X is the test result. As we said before, the thing we know about the test is the probability that a person tests positive given that she is pregnant, i.e., P(X|θ), but we want to know the probability that she is pregnant given that she tested positive, i.e., P(θ|X). The examples we showed before, with the different ages of women, were made to emphasize that P(X|θ) ≠ P(θ|X). The reason we believe that, using the same pregnancy test, P(θ|X) is much larger for a 30-year-old woman than for an 85-year-old woman is that we know that, in general, pregnant women are much more common among 30-year-olds than among 85-year-olds. In other words, our belief that this woman is pregnant after we have the test results, i.e., P(θ|X), is somehow related to our belief that this woman is pregnant before we get the test results, i.e., P(θ).

In fact, P(θ) is called the prior probability (or just "prior") and P(θ|X) is called the posterior probability (or just "posterior"). Bayes' rule relates these quantities.

To derive Bayes' rule, you need only one probability theory definition: the probability of θ given X is the probability of θ and X divided by the probability of X. Mathematically,

    P(θ|X) ≡ P(θ ∩ X) / P(X).

From here it's easy to see that

    P(θ|X) = P(θ ∩ X) / P(X)
           = [P(θ ∩ X) / P(θ)] × [P(θ) / P(X)]
           = [P(X ∩ θ) / P(θ)] × [P(θ) / P(X)]
           = P(X|θ) P(θ) / P(X).

That's Bayes rule! It's often written with a proportionality sign,

    P(θ|X) ∝ P(θ) × P(X|θ),

and pronounced "posterior is prior times likelihood". In probability theory, unlike in normal speech, "likelihood" is not a synonym for "probability". Instead, likelihood is the probability of the data given some state of nature, i.e., P(X|θ). You might say, "Why drop the denominator? You need the denominator for the actual number; I don't just want something that's proportional to the number!" Well, we dropped the denominator P(X) because it's difficult to calculate directly, and often people are interested in the ratios of posterior probabilities.

2.3 Using Bayes rule for our problem

The CDC reports that about 1 in every 100 women aged 15-44 is pregnant. Let's say 1 in 10 pregnant women doesn't know she's pregnant, so there is a 1 in 1,000 chance that a random woman aged 15-44 doesn't know she's pregnant but actually is. We can use Bayes rule to compute the probability that a 30-year-old is pregnant given that she tested positive:

    P(is pregnant | tested positive) = P(preg) × P(pos|preg) / P(pos).

Note that the two things in the numerator are the things we already know: the prior P(preg) = 1/1000 and the likelihood P(pos|preg) = 999/1000. As we mentioned, it's difficult to compute the denominator, so let's take the ratio of probabilities:

    P(preg|pos) / P(not preg|pos) = [P(preg) P(pos|preg)] / [P(not preg) P(pos|not preg)]
                                  = [(1/1000) × (999/1000)] / [(999/1000) × (1/1000)]
                                  = 1,

that is, if she tested positive, there's a 50/50 chance she's actually pregnant. This is much smaller than the 99.9% you might have naively assumed by equating P(preg|pos) with P(pos|preg). Now recompute this ratio for 14-year-old girls (say, 1 in 10,000 is pregnant) and for 85-year-old women (say, none are pregnant). (Answers: 999/9999 ≈ 0.1 and 0.)

2.4 Interpretations of these results

A frequentist interpretation says that this is just the appropriate way to apply conditional probabilities. The quantity P(preg|pos) still means a frequency: of all people from such-and-such a population who have such-and-such test results, what fraction are actually pregnant?

The Bayesian interpretation is more subtle but can feel more natural. The quantity P(preg|pos) means our belief in the probability that this particular person is pregnant. In a frequentist approach, the probability that this particular person is pregnant is either 0 or 1: either she is or she isn't. It doesn't make sense to ask about frequencies of individual cases.

The Bayesian approach gives different names to the quantities. P(preg) is called the prior probability that this person is pregnant. The idea is that we use the data to "update" the prior to make the posterior probability P(preg|pos). The link between prior and posterior is the likelihood of the data, P(pos|preg). In normal speech, "likelihood" and "probability" mean the same thing; in statistics, "likelihood" always means the probability of the data given some state of nature, i.e., P(X|θ).
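To make the arithmetic of section 2.3 concrete, here is a minimal Python sketch (the three priors are the illustrative numbers used above). For a two-state problem, the denominator P(pos) is actually easy to write down by summing over the pregnant and non-pregnant cases, so the sketch computes the posterior exactly instead of as a ratio:

```python
def posterior_pregnant(prior, accuracy=0.999):
    """P(pregnant | positive test) by Bayes rule.

    Numerator: prior times the likelihood of a positive test if pregnant.
    Denominator: the total probability of a positive test, expanding
    P(pos) over both the pregnant and non-pregnant cases.
    """
    p_pos = prior * accuracy + (1 - prior) * (1 - accuracy)
    return prior * accuracy / p_pos

print(posterior_pregnant(1/1000))   # 30-year-old: 0.5
print(posterior_pregnant(1/10000))  # 14-year-old: ~0.09 (odds ~0.1)
print(posterior_pregnant(0.0))      # 85-year-old: 0.0
```

Writing out the denominator by total probability is a design choice that works here because there are only two states of nature; with many (or continuous) states, the proportional form of Bayes rule is the practical one.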
3 How Bayesian and frequentist approaches differ

Bayesian statistics is a whole set of mathematical and theoretical approaches to Bayes-rule-type problems. Here's one example of how the two approaches differ.

3.1 How do we know the prior?

Recall the 99.9%-accurate pregnancy test. We said that, among 30-year-old women, 1 in 1,000 is unknowingly pregnant. We used this exact proportion 1/1000 to compute our results. Where did that 1 in 1,000 come from? What if I told you I followed a random selection of 1,000 thirty-year-old women in a prospective clinical trial and 1 of them turned out to be pregnant? Would you feel differently if I told you I had followed 1 million women and 1,000 of them were pregnant?

In a frequentist interpretation, there is some true rate of pregnancy, and we use the best guess we have (technically, the maximum likelihood estimator) for that rate. Whether I ultrasounded 1,000 women or 1 million, the frequentist computation would proceed the same way. The frequentist approach is called point estimation, and it's the kind of thing that we learn, as scientists and engineers, to be skeptical about: when a UROP says that the PCR product is 200 base pairs, your first question is "Plus or minus what?" Oddly, the industry-standard frequentist statistical approaches do exactly this kind of no-error thing with your data.

In a Bayesian approach, however, we can account for our uncertainty about the pregnancy rate. Unlike in the frequentist view, we can treat the pregnancy rate itself as a random variable, and we are allowed to have uncertainty about it. The mathematics of the Bayesian approach aren't difficult, but they are beyond the scope of this lecture. Nature Biotechnology has a Primer from 2004 called "What is Bayesian statistics?" that introduces this kind of mathematics.
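As a small taste of that mathematics, here is a hedged sketch of one standard approach: put a Beta prior on the rate and update it with binomial data. The uniform Beta(1, 1) prior is my assumption, not something from the lecture. The sketch shows that 1 pregnancy in 1,000 women and 1,000 pregnancies in 1 million women peak at the same rate but carry very different uncertainty:

```python
from scipy.stats import beta

def rate_posterior(pregnant, n, a=1, b=1):
    """Posterior over the pregnancy rate with a Beta(a, b) prior.

    For binomial data the Beta prior is conjugate: observing `pregnant`
    successes in `n` trials gives a Beta(a + pregnant, b + n - pregnant)
    posterior over the rate.
    """
    return beta(a + pregnant, b + n - pregnant)

small = rate_posterior(1, 1_000)
big = rate_posterior(1_000, 1_000_000)

# Both posteriors peak at a rate of 1/1000, but the small study leaves
# far more uncertainty about the true rate:
print(small.interval(0.95))  # wide 95% interval, roughly (0.0002, 0.006)
print(big.interval(0.95))    # narrow 95% interval, roughly (0.0009, 0.0011)
```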
3.2 What about when you don't know the prior?

The criticisms of the Bayesian approach mostly come down to the use of priors. The example with pregnancy rates is mostly above reproach, since we have a very well-supported prior, and we used the frequentist approach of making P(θ) just a point estimate. Imagine, instead, that you're doing an experiment. You have a hypothesis about a biological system, saying that the state of nature θ is true (e.g., θ is the event where the expression of your favorite gene is elevated under your favorite condition). You take some data X. The chance that your hypothesis is true, i.e., P(θ|X), has to do with your prior on θ. How do you get a prior P(θ)? Unlike in the pregnancy example, where we could appeal to some gold-standard measurement, here your measurements X are the best knowledge you have.

In short, the Bayesian approach requires you to guess. You cobble together what you can from the literature, from similar biological systems, whatever, and then you guess. You could say cynically that, since someone can pick whatever prior they want, you can produce whatever probabilities you want. This is technically true: selecting strong, weird priors can give you basically any probability you want. The Bayesian responses to this criticism take a few forms.

• Every statistical analysis involves argumentation and arbitrariness. Frequentist analysis requires picking an analytical method, discarding outliers, and all sorts of other subjective decisions. A Bayesian prior is one of those subjective decisions; it's just very up-front and in-your-face, and therefore harder to hide.
• In the case of a decision analysis like our pregnancy test, the important thing is whether this person is actually pregnant or not. In this case, it's foolish to discard what we already know. There is existing information about pregnancy rates, and we know for sure that the rate isn't 0% and isn't 100%. When a plane crashes in the ocean, the real-life search pattern used to look for it is Bayesian, since we have some ideas about where it probably is.

The Bayesian approach makes a lot of sense, so what does that say about the frequentist approach? We'll do one example in the next section.

4 What clarity Bayesian stats gives us regarding frequentist stats

4.1 p-values

Frequentist statistics use a p-value, the probability that these data would have resulted from the null hypothesis (i.e., the likelihood of the data). A common and catastrophic misunderstanding of the p-value is that it is equal to the probability that the null hypothesis is true given the data (i.e., the posterior probability). In fact, the two things can be quite different. To illustrate, let's examine the ratio of the posterior probabilities of the null hypothesis H0 and the alternative hypothesis HA:

    P(HA|X) / P(H0|X) = [P(HA) / P(H0)] × [P(X|HA) / P(X|H0)]
                      = [prior on alt. hypo. / (1 − prior on alt. hypo.)] × [P(X|HA) / p-value].

As we expected, as the p-value goes down, the posterior probability of the alternative hypothesis goes up. (Presumably also, the weird term P(X|HA) will go up as the p-value goes down, and vice versa.) Importantly, the posterior probability of the alternative hypothesis also depends on our prior belief that the alternative hypothesis was true. This fits with the way we as humans (and scientists) reason: if you make an easy-to-believe claim (e.g., it's snowing in winter), I only need a little data to believe it (e.g., someone says it's snowing). If you make an outrageous claim (e.g., it's snowing in July), I need a lot of data to believe it. We intuitively use this kind of logic when reading papers: even if someone computes a 10^−100 p-value, we believe the assertion or not based on other information we have.

A feature of Bayesian statistics is that it allows you to make the prior probability of your hypothesis explicit. You can use that value to make a reasoned statement about the probability that your hypothesis is true. Again, a p-value is not the probability that the null hypothesis is true: it's a likelihood of the data given the null hypothesis.
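To see how far apart the p-value and the posterior can be, here is a sketch with made-up numbers; the likelihoods and priors are illustrative assumptions, not real data. Even when the data are 20 times more likely under HA than under H0, a skeptical prior keeps the posterior probability of HA modest:

```python
def posterior_alt(prior_alt, lik_alt, lik_null):
    """P(HA | X) for two exhaustive hypotheses.

    Posterior odds = prior odds * likelihood ratio; then convert
    the odds back into a probability.
    """
    odds = (prior_alt / (1 - prior_alt)) * (lik_alt / lik_null)
    return odds / (1 + odds)

# Data with a p-value-like likelihood of 0.01 under H0 and a
# 20x higher likelihood under HA:
for prior in [0.5, 0.1, 0.01]:
    print(prior, posterior_alt(prior, lik_alt=0.2, lik_null=0.01))
# prior 0.5  -> ~0.95
# prior 0.1  -> ~0.69
# prior 0.01 -> ~0.17: same data, very different conclusions.
```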
4.2 Confidence intervals

If you thought the p-value thing was scary, brace yourself for confidence intervals. In frequentist analysis, a 95% confidence interval is a method for computing a range of values which, in 95% of many cases, would include the true value. Say you are measuring some variable that you think is distributed according to some distribution, and you want to know, say, the mean. You look up how to compute confidence intervals for the mean for that distribution. If you did the experiment 100 times, you could expect that the confidence interval would contain the true mean about 95 times.

Critically, the confidence interval is not the thing that feels natural. For me, the natural thing would be some interval that, with 95% probability, includes the true mean for that one experiment. This, however, is not a frequentist measure, since it is essentially a posterior probability. In Bayesian statistics, the credible interval is the name for exactly the thing I want. If θ is the true mean, then y1 and y2 are the limits of a 95% credible interval if

    0.95 = ∫_{y1}^{y2} P(θ = Y | X) dY.

As an aside: I said "a" credible interval because there are usually many (y1, y2) that make this equation true. Depending on your purposes, you might want, say, the narrowest credible interval (i.e., where y2 − y1 is minimized) or a symmetric credible interval, where the probability of landing below y1 is the same as the probability of landing above y2.

5 When should I use Bayesian statistics?

Maybe when:

• It's important to have a clear interpretation of your results (i.e., none of the weird frequentist mumbo-jumbo definitions of p-values and confidence intervals).
• You are able to make a statement about your prior belief that something is true.
• You want to update your interpretation of later data using earlier data.
• You want to train a model on some data and use it to predict something about other data.
• You have some parameter that you can't measure, and you're worried about pulling a single point estimate out of thin air.
• You know there is uncertainty in some measure, and your frequentist approach will only let you put in a single point estimate.
• You want to make a prediction or decide an action based on prior knowledge.

Part II
Machine learning

6 What does "machine learning" mean?

Machine learning refers to many different kinds of methods. It's an entire area of research in computer science and statistics, and it has growing applications to the biological sciences. Because there are so many methods, I'll just give a quick overview of what kinds of things machine learning can do.

7 Taxonomies of machine learning

There is a little jargon that will clarify discussions about machine learning. The data come in the form of samples or examples, which are usually individual sampling or measurement events. Every sample consists of some number of features. For example, one biological replicate would be a sample, and RNA-seq will give you the abundances of many transcripts; the abundances of the transcripts are the features. Some samples might be training examples used to train the machines. (A minimal sketch of this workflow appears after the list in section 8.)

Classification. Say I want to predict some discrete output variable (or label or class) from some input samples. For example, can I predict which patients will get sick (or are already sick) based on their microbiome data?

Regression. Say I want to predict some continuous output variable from some input data. For example, can I predict a cell's division time based on its gene expression data?

Clustering. Say I just want to know which samples are more similar to one another. Do they form natural groups?

Dimensionality reduction. Say I have samples with many features, and I just want to make my data more "compact" by removing some dimensions or features. For example, is it important to know the abundance of all the transcripts, or can we focus on a smaller number of very informative features?

8 Machine learning techniques to be familiar with

• Wikipedia's "List of machine learning concepts"
• scikit-learn's flowchart
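As a concrete taste of the classification task from section 7, here is a minimal scikit-learn sketch on synthetic data; the data, the train/test split, and the choice of a naive Bayes classifier are all illustrative assumptions, not recommendations from the lecture:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Synthetic stand-in for, e.g., microbiome features (columns) measured on
# patients (rows), with a sick/healthy label for each patient.
X, y = make_classification(n_samples=200, n_features=50, random_state=0)

# Hold out some samples so we can test on data the machine never saw:
# the held-in samples are the training examples.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Naive Bayes is itself an application of Bayes rule: it computes
# P(class | features) from P(features | class) and a prior P(class).
model = GaussianNB().fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))
```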