Notes on Bayesian Statistics Coursera class
Created 10/07/16. Updated 10/09/16, 10/12/16, 10/25/16, 11/13/16.

Introduction
https://www.coursera.org/learn/bayesian/home/welcome
This course describes Bayesian statistics, in which one's inferences about parameters or hypotheses are updated as evidence accumulates. You will learn to use Bayes' rule to transform prior probabilities into posterior probabilities, and be introduced to the underlying theory and perspective of the Bayesian paradigm. The course will apply Bayesian methods to several practical problems, to show end-to-end Bayesian analyses that move from framing the question, to building models, to eliciting prior probabilities, to implementing the final posterior distribution in R (free statistical software). Additionally, the course will introduce credible regions, Bayesian comparisons of means and proportions, Bayesian regression, inference using multiple models, and a discussion of Bayesian prediction.

Course Mechanics
Welcome to Bayesian Statistics! You're joining thousands of learners currently enrolled in the course. I'm excited to have you in the class and look forward to your contributions to the learning community. To begin, I recommend taking a few minutes to explore the course site. Review the material we'll cover each week, and preview the assignments you'll need to complete to pass the course. Click Discussions to see forums where you can discuss the course material with fellow students taking the class. Be sure to introduce yourself to everyone in the Meet and Greet forum. If you have questions about course content, please post them in the forums to get help from others in the course community. For technical problems with the Coursera platform, visit the Learner Help Center.
Do I need to use "R"? Yes. You will need R and RStudio to do the labs and the project. Both are free and publicly available. You will need administrator access to your computer to install this software.
There are step-by-step video instructions and additional help under "Resources."
How will I be graded? There are four graded quizzes (weeks 1-4), four lab exercises (weeks 1-4), and a peer-reviewed assignment (week 5) -- the data analysis project in this course. You will need to pass each graded assignment in order to pass the whole course. Your final grade for the course will be: 10% for each quiz + 10% for each lab + 20% for the data analysis project.

Book Summary
Gelman, Andrew, with Donald B. Rubin (2013). Bayesian Data Analysis, Third Edition (Chapman & Hall/CRC Texts in Statistical Science). Bought this on Amazon.com. Andrew Gelman is Professor of Statistics and Professor of Political Science at Columbia University. He has published over 150 articles in statistical theory, methods, and computation, and in application areas including decision analysis, survey sampling, political science, public health, and policy. Donald B. Rubin is John L. Loeb Professor of Statistics at Harvard University.
Hoff, P. D. (2009). A First Course in Bayesian Statistical Methods. Springer Verlag. ISBN: 978-0-387-92299-7 (Print), 978-0-387-92407-6 (Online). E-book available -- check your library for availability.
Albert, J. (2009). Bayesian Computation with R. Springer Verlag. ISBN: 978-0-387-92297-3 (Print), 978-0-387-92298-0 (Online). [search for pdf on ResearchGate]

Week 1 – The Basics of Bayesian Statistics
This started 10/3/16, and finished 10/11/16.

Introduction to Statistics with R
An introduction to the faculty members from Duke University who are presenting the material. This class is part of the "Statistics with R" specialization at Coursera.

Bayes Rule
Conditional probabilities and Bayes Rule (2 min): First real lecture about the topic. Provided historical context, and info about Thomas Bayes.
Bayes Rule and Diagnostic Testing (6 min): Contained examples of testing for HIV and RU-486.
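The diagnostic-testing lecture's use of Bayes' rule can be sketched in a few lines of R. The prevalence, sensitivity, and specificity below are illustrative assumptions, not the lecture's exact figures:

```r
# Bayes' rule for a diagnostic test (illustrative numbers, not the
# lecture's exact figures).
prevalence  <- 0.0026   # P(disease)
sensitivity <- 0.977    # P(test positive | disease)
specificity <- 0.926    # P(test negative | no disease)

# Total probability of a positive test (law of total probability):
p_pos <- sensitivity * prevalence + (1 - specificity) * (1 - prevalence)

# Posterior probability of disease given a positive test:
p_disease_given_pos <- sensitivity * prevalence / p_pos
round(p_disease_given_pos, 3)   # small, despite an accurate-looking test
```

The point of the lecture's examples is visible here: with a rare condition, even a quite accurate test leaves the posterior probability of disease low after a single positive result.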
Bayes Updating (2 min): This was useful because it discussed how to make updates.
Bayesian vs. Frequentist definitions of Probability (4 min): This clarified that a Bayesian always starts from a prior. It also contained definitions of probability, such as N out of M occurrences vs. belief in an event happening.

Inference for a Proportion
Approach: Frequentist (3 min): RU-486 analysis. Here we started with counts of occurrences, and used those to derive H0 vs. H1 models.
Approach: Bayesian (7 min): Here we recast the problem in terms of models that partitioned the event space, then we assigned priors, and then ran updates. This was a great example of Bayes.
Effect of Sample Size (2 min): Here we saw that the additional information increased the accuracy (or differentiation) of the Bayes estimates on the different models.
Frequentist vs. Bayes Inference: This was a good summary of the above lessons, pointing out the strengths of Bayes: you can formulate models better, and you get a measure of belief in the result.

Strengthen your Understanding
There were two quizzes -- a pretest with 6 questions that are available for free, and then 10 questions in the real quiz, for which you need a paid subscription.

Learning R
This was a good session. Installed R and RStudio, ran the example. The example was brought in from the repo https://github.com/StatsWithR/statsr, which seems to be for the whole specialization. We are only dealing with the Bayes portion. It appears that StatsWithR provides a large set of useful stat functions which can be loaded into RStudio. There was a good video on RStudio, by David Langer, at https://www.youtube.com/watch?v=32o0DnuRjfg. In this case, we are loading "devtools", then "dplyr" (utilities), then "ggplot2" (plotting), then "shiny" (web applications).
You are given a problem as a deck that can be "knitted" so that you can run it. You can also call functions that are part of the lesson to get answers for the quiz.
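The week's Bayesian approach -- models that partition the event space, a prior over those models, and an update from the data -- can be sketched directly in R. The grid of candidate proportions and the counts below are illustrative, not the lecture's exact setup:

```r
# Discrete Bayesian updating, in the style of the RU-486 example:
# candidate models for a proportion p, a uniform prior over them,
# and a binomial likelihood. (Illustrative grid and counts.)
p_grid <- seq(0.1, 0.9, by = 0.1)                  # candidate models
prior  <- rep(1 / length(p_grid), length(p_grid))  # uniform prior over models
x <- 4; n <- 20                                    # observed successes / trials
likelihood <- dbinom(x, size = n, prob = p_grid)
posterior  <- prior * likelihood / sum(prior * likelihood)
round(posterior, 4)   # posterior mass concentrates near p = 0.2
```

Adding more data sharpens the posterior, which is the "effect of sample size" point from the lectures.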
Summary
Quite useful introduction. Assumes little prior experience except for knowledge of H0/H1 hypothesis testing and binomial distributions. It was also helpful to get RStudio installed and to read the background info. Next week we will move to continuous distributions, since all of these were discrete.

Week 2 – Bayesian Inference
Started this late on 10/12/16; this week officially ended on 10/16/16.
In this week, we will discuss the continuous version of Bayes' rule and show you how to use it in a conjugate family, and discuss credible intervals. By the end of this week, you will be able to understand and define the concepts of prior, likelihood, and posterior probability and identify how they relate to one another, in terms of actual statistical distributions and analysis.

Introduction
This week our learning goals are to show the continuous version of Bayes' rule and teach you how to use it in a conjugate family. The RU-486 example will allow us to discuss Bayesian modeling in a concrete way. It also leads naturally to a Bayesian analysis without conjugacy. For the non-conjugate case, there's usually no simple mathematical expression, and one must resort to computation. Finally, we shall discuss credible intervals, the Bayesian analog of frequentist confidence intervals, and Bayesian estimation and prediction. To proceed during this week, you will need to have mastered the concept of conditional probability and Bayes' rule for discrete random variables. You do not need to know calculus to complete this week's materials. However, for those who do, we shall briefly look at an integral.

Continuous Variables and Eliciting Probability Distributions
From the discrete to the continuous: This was an introduction to PDFs and similar concepts for continuous random variables. Basically right out of a textbook.
Elicitation: Where do the priors come from? Bayesians express their priors in terms of personal probabilities.
These must obey the laws of probability and take into account everything that the Bayesian knows. The second part of the lecture is an introduction to the beta distribution, which is specified in terms of alpha and beta. When these are both 1, we get a uniform distribution. When the values are larger, the distribution is more focused. It can also be skewed: beta(5,1) is skewed high, and beta(1,5) is skewed low.
Conjugacy: This is defined as having a Bayesian update approach that produces a distribution of the same family, but with different parameters -- for instance, from one beta distribution to another: from beta(a, b) to beta(a + x, b + n - x).

Three Conjugate Families
Inference on a Binomial proportion: This was about updating the distributions for new information. Here we apply the above step that adjusts a beta distribution to a real-world example, which was from the RU-486 study. In this case there were 4 failures out of 20, vs. 0 failures out of 20. In frequentist analysis, this would be 4/20, but in Bayesian analysis we apply each data point in turn.
The Gamma-Poisson Conjugate Families: Watch this again. The Prussian horse kick analysis is in this video. The data come from a Poisson distribution, and the prior and posterior are both gamma distributions. Basic definition of Poisson: number of cases, with a single parameter, lambda, which is both the mean and the variance.
The Normal-Normal Conjugate Families: In this case we start with a normal distribution, in which the mean is unknown but also normal. Analytical chemist example.

Credible Intervals and Predictive Inference
Non-conjugate Priors: This should really be part of the prior section. For instance, he is talking about distributions that are not simple or mathematical, but are a set of discrete values. For this, we need a computer. He discusses JAGS. Apparently this is a kind of sampling approach. However, this lecture is "mostly a look ahead to future material". Rather short.
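Before moving on to credible intervals, the beta-binomial conjugate update above is simple enough to write out. The counts echo the RU-486 numbers in these notes (4 "successes" in 20 trials), not the lecture's exact analysis:

```r
# Beta-binomial conjugacy: a Beta(a, b) prior plus x successes in
# n trials gives a Beta(a + x, b + n - x) posterior.
a <- 1; b <- 1          # uniform Beta(1, 1) prior
x <- 4; n <- 20         # illustrative counts, as in the notes above
a_post <- a + x         # 5
b_post <- b + n - x     # 17
post_mean <- a_post / (a_post + b_post)  # prior mean 0.5 pulled toward x/n
round(post_mean, 3)
```

This is why conjugacy matters: the whole update reduces to parameter arithmetic, with no integration or simulation required.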
Credible Intervals: This had some very important definitions about what confidence intervals mean in frequentist statistics, and how Bayesians can generate what we call credible intervals.
Predictive Inference: Often we are not interested in determining which of two models is more likely, but in what to expect about the next events. In this example, we have a coin for which p may be 0.7 or may be 0.4. We have priors, then we create posteriors, then we create a prediction. Simple values can be handled without integration.

Strengthening your Understanding
1) Which of the following statements is true?
- The likelihood is a mixture between the prior and posterior. No, the likelihood is completely independent of the prior. The likelihood is the probability of observing the data given the model parameters, whereas the prior represents a subjective distribution imposed on those model parameters prior to observing the data.
- The prior is a mixture between the likelihood and posterior. No, the prior is completely independent of the likelihood.
- The posterior is a mixture between the prior and likelihood. Yes, the posterior is proportional to the likelihood times the prior, which means that it is influenced by both of them.
2) Which of the following distributions would be a good choice of prior to use if you wanted to determine if a coin is fair when you have a strong belief that the coin is fair? (Assume a model where we call heads a success and tails a failure.)
Solution: Since a Beta(a, b) distribution corresponds to observing data with prior mean a/(a + b) and prior sample size a + b, we seek an answer with a/(a + b) = 0.5 and a + b large. Of the answer choices given, a Beta(50, 50) distribution corresponds to prior mean 0.5 with a larger sample size than any other answer.
3) If Amy is trying to make inferences about the average number of customers that visit Macy's between noon and 1 p.m., which of the following distributions represents a conjugate prior?
Solution: The number of customers that visit Macy's between noon and 1 p.m. can only take on integer values and has no clearly defined upper bound. As such, a Poisson distribution with parameter λ would be a good model for customer arrivals. The corresponding conjugate prior for λ, the average number of customers that visit Macy's in the time period, is a Gamma distribution, which we learned from the lectures.
4) Suppose that you sample 24 M&Ms from a bag and find that 3 of them are yellow. Assuming that you place a uniform Beta(1,1) prior on the proportion of yellow M&Ms p, what is the posterior probability that p < 0.2?
Solution: The Beta distribution is conjugate to the Binomial distribution; the update rule is that the posterior Beta will have parameter values α + x, β + n − x, where α and β are the parameters of the Beta(1,1) prior. In this case, x = 3 and n = 24. Hence, the posterior distribution of p is a Beta(4,22) distribution. To find the posterior probability that p < 0.2, we use the following command in R: pbeta(q = 0.2, shape1 = 4, shape2 = 22). This gives a posterior probability of approximately 0.766.
5) Suppose you are given a coin and told that the coin is either biased towards heads (p = 0.6) or biased towards tails (p = 0.4). Since you have no prior knowledge about the bias of the coin, you place a prior probability of 0.5 on the outcome that the coin is biased towards heads. You flip the coin twice and it comes up tails both times. What is the posterior probability that your next two flips will be heads?
Solution: To solve this problem, we first need to find the posterior probability that the coin is biased towards heads.
Using Bayes' rule, we find that
P(p = 0.6 | {T, T}) = (0.4^2)(0.5) / ((0.4^2)(0.5) + (0.6^2)(0.5)) = 4/13
To find the posterior probability that the next two flips will be heads, we note that
P({T, T, H, H} | {T, T}) = P({T, T, H, H} | p = 0.6, {T, T}) P(p = 0.6 | {T, T}) + P({T, T, H, H} | p = 0.4, {T, T}) P(p = 0.4 | {T, T})
= (0.6)^2 (4/13) + (0.4)^2 (9/13) = 72/325 ≈ 0.222

Learning R
More downloads, and another lab. This one is about credible intervals.

Summary of Week
This was a very interesting discussion of distributions and Bayesian updating. The credible interval discussion was unique, and the lab covered it well.

Week 3
This began on 10/15/16; I continued to study it through about 10/26/16.

Losses and Decision-making
Losses and Decision-making (3 mins): This begins with basic definitions. But the posterior distribution on its own is not always sufficient. Sometimes one wants to express one's inference as a credible interval to indicate a range of likely values for the parameter. What is the one-number estimate? This is linked to risk analysis, which is built around minimizing expected loss. The context is what you would summarize from the Bayes posterior (for instance, what would you tell a patient?). One approach might be to use the median of the distribution. Another would be to use the mode of the distribution.
Working with Loss Functions (6 mins): Here we are defining loss functions and giving an example: a car sales example. The example provides a distribution from analysis, shown as a dot plot. What single number do you report? Try 30. A 0/1 loss function (L0) means that you get 1 right and 50 wrong, hence the loss is 50. You want to minimize the loss; hence report the mode. Next consider a linear loss, L1; hence report the median. L2 is the squared difference; report the mean.
Minimizing Expected Loss for Hypothesis Testing (5 mins): One straightforward way of choosing between H1 and H2 would be to choose the one with the higher posterior probability.
In other words, you reject H1 if the posterior probability of H1 is smaller than the posterior probability of H2. However, since hypothesis testing is a decision problem, we should also consider a loss function. She defines loss functions for medical false positives and false negatives.
Posterior Probabilities of Hypotheses and Bayes Factors (6 mins): The prior odds is defined as the ratio of the prior probabilities assigned to the hypotheses or models we're considering. The posterior odds is then the product of the Bayes factor and the prior odds for these two hypotheses. The Bayes factor quantifies the evidence of data arising from hypothesis one versus hypothesis two. Then there is a discussion about scales for interpreting the Bayes factor. To recap, in this video we defined prior odds, posterior odds, and Bayes factors. We learned about scales by which we can interpret these values for model selection. We also re-emphasized that in Bayesian testing, the order in which we evaluate the models or hypotheses does not matter, since the Bayes factor of hypothesis two versus hypothesis one is simply the reciprocal of the Bayes factor for hypothesis one versus hypothesis two.

Comparing Two Means
[This was a really thorough discussion of how to compare two means that were found from a survey. It states the assumptions, states the hypotheses, and states the means of analysis. It seems like a frequentist approach might be easier, but it doesn't produce as much information.]
Comparing Two Proportions using Bayes Factors: Assumptions (8 mins): In summary, we reviewed assumptions for basing an inference for comparing two proportions. We presented a default prior distribution for Bayesian null hypothesis testing, using conjugate prior distributions. We are working with a Survey USA poll about bullying. H1: p(males) = p(females), and H2: p(males) != p(females). We introduced pooling of beliefs to construct a prior distribution for the common proportion p under H1.
Other assumptions included independence. We reviewed finding the posterior distribution of parameters of interest under the two hypotheses. At this point, you should be able to find point estimates of the probabilities as well as credible intervals.
Comparing Two Proportions using Bayes Factors (4 mins): In this video, we continue our comparison of two proportions using Bayesian inference. Recall that in the last video, we used a Survey USA poll to illustrate inference about two proportions, using conjugate beta prior distributions. We used independent beta prior distributions for the probabilities that male and female parents would report bullying under hypothesis two. The idea of pooling was introduced. Here are the posterior distributions for male, female, and pooled. Now find the posterior probability of H1.
Comparing Two Paired Means using Bayes Factors (7 mins): In this case, we are considering how to analyze and report on two mean values, which are the amounts of zinc in water samples at the surface and at the bottom. Like the bullying analysis, this was introduced in the inferential statistics course. The hypotheses are H1: same, and H2: different. Bayesian paired analysis assumes independence of samples in the data. Student's t-distribution gets mentioned here.
What to Report? (7 mins): Here we are seeking to create a reportable analysis of the prior video. In these samples, the Bayes factor for H1 : H2 is 0.015 (not at all likely), while H2 : H1 is 64 (very likely). Our results are based on the default prior distribution. In our case we found that there was a high probability that the bottom concentrations exceed the surface concentrations. The credible interval provides a range of the most plausible values of the mean difference. While the results suggest statistical significance, subject matter expertise is needed in order to say whether the differences or absolute levels are of a magnitude meaningful for health considerations.
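The odds arithmetic behind these reported numbers is short enough to check directly. This sketch uses the Bayes factor reported above for the zinc example (BF of H2 versus H1 of 64) with equal prior probabilities on the two hypotheses:

```r
# Posterior odds = Bayes factor x prior odds.
prior_odds <- 0.5 / 0.5              # P(H2) / P(H1) = 1 (equal priors)
bf_21 <- 64                          # evidence for H2 over H1, as reported
post_odds <- bf_21 * prior_odds
post_prob_h2 <- post_odds / (1 + post_odds)  # 64/65, about 0.985
bf_12 <- 1 / bf_21                   # reciprocal; about 0.016,
                                     # consistent with the reported 0.015
```

This also illustrates the lecture's point that the order of the hypotheses does not matter: the Bayes factor in the other direction is simply the reciprocal.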
Comparing Two Independent Means
Posterior Probability, p-values and Paradoxes (7 mins): Lindley's paradox. Also introduces the Cauchy distribution as a potential prior. An unusual case of ESP analysis here. The claim was that a person could influence the probabilities of machine-generated zeros and ones. Out of the trials there were approximately 52 million ones, resulting in a sample mean of 0.500177, which is pretty close to 0.5. The null hypothesis asserted that the results were generated by a machine operating with a chance of 0.5, whereas the alternative was the unspecified hypothesis H2 that mu was not equal to one-half. There is a discussion of the Cauchy distribution. Subjective information on the scale of the distribution can also help refine the prior distribution by reducing the prior support on values that are not a priori plausible. While hypothesis testing does have its place, think carefully about whether testing a point null, such as mu equals mu naught, or mu equals one-half, makes sense. Reporting credible intervals and considering practical significance in terms of effect size may be more meaningful. In the case of the ESP analysis, it is more likely that there are biases in the random number generator.
Comparing Two Independent Means (7 mins): Why is this different from the above? Above we assumed that the bottom mean and the surface mean were not independent. Here we have set up a study that is. We are using the "distracted eaters" study: does playing a computer game during lunch affect fullness, memory for lunch, and later snack intake? The distracted group ate about 52 grams, while the non-distracted group ate about 27 grams. What is the difference between these? Unfortunately, there's no closed-form expression for the distribution of the difference of two Student t distributions. However, we can use simulation to draw samples from the posterior distribution using what is called Monte Carlo sampling.
From Monte Carlo samples, we simulate possible values of the parameters from their posterior distributions. In this case, first we generate a large number of values from the Student t distribution for the mean of Group A. Second, we generate an equivalent number from the Student t distribution for the mean of Group B. Then we calculate Monte Carlo averages.
Comparing Two Independent Means: Hypothesis Testing (7 mins): In this case we are seeking to prepare a result for the above analysis. Set up the hypotheses H1: mu1 = mu2 and H2: mu1 != mu2. The null hypothesis here means that nothing is going on. Use an intrinsic prior to specify the prior distribution for the three parameters in the null model, which include the common mean mu and the two variances, and for the four parameters under the alternative hypothesis (two means and two variances in the two groups under H2). At this point, we are again using the Monte Carlo approach, and she refers to the JAGS implementation of this. In this case, no hypothesis is more plausible, and more data are needed.

Strengthening your Understanding
Practice Quiz with 4 questions. I had trouble with the following:
True or False: If the posterior distribution is normally distributed, the estimate that minimizes posterior expected loss is the same, regardless of whether the loss function is 0/1, linear, or quadratic. Correct Response: True.
Which of the following statements is false? A Bayes factor of less than .01 suggests that the evidence in favor of one of the hypotheses is barely worth mentioning. Correct Response: Correct Answer. A Bayes factor of less than .01 will yield strong evidence in favor of one of the hypotheses.

Learning R
More downloads, and another lab. This one is about decision-making.

Summary of Week
The concept of Bayes factors was something that I had never heard of before. However, it provides some metrics for indicating the strength of a belief about a decision, with a level of domain and unit independence.
The lectures were quite detailed, and covered several ways to reach conclusions. I need to read up on the Cauchy distribution: it is the distribution of the ratio of two independent standard normal random variables. The Cauchy distribution is often used in statistics as the canonical example of a "pathological" distribution, since both its mean and its variance are undefined. It is one of the few stable distributions with a probability density function that can be expressed analytically, the others being the normal distribution and the Lévy distribution.

Week 4
This began on 10/22/16, and I had to study this over a several-week period because of London travel.

Bayesian Regression
In the previous module we introduced Bayesian decision making using a variety of loss functions, and we discussed how to minimize expected loss for hypothesis testing. We also introduced the concept of Bayes factors and gave some examples of how they can be used in Bayesian hypothesis testing for two proportions and two means. We wrapped up the unit with a discussion of how findings from credible intervals compare to those from a hypothesis test, and when to reject, accept, or wait.
In this new module, we'll introduce Bayesian inference in multiple regression models, and show how Bayes factors and posterior probabilities can be used for variable selection. We will introduce the concept of Bayesian model averaging, an alternative to variable selection that will allow you to make inferences and predictions using an ensemble of models. Finally, we demonstrate modern simulation methods to search for regression models of high posterior probability when an enumeration of all possible subsets is not feasible.

Simple and Multiple Bayesian Regression
Bayesian Simple Linear Regression (8 mins): It seems like we are using Bayes to understand the uncertainty in the Ordinary Least Squares results.
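A minimal sketch of that point, under the assumption (made in this course) of the noninformative reference prior: the posterior for the regression coefficients then centers on the OLS estimates, and credible intervals coincide numerically with the usual frequentist confidence intervals. This uses R's built-in mtcars data, not the course's dataset:

```r
# Bayesian simple linear regression under the reference prior:
# the OLS fit carries the posterior summaries directly.
fit <- lm(mpg ~ wt, data = mtcars)   # ordinary least squares fit
coef(fit)                    # posterior means of intercept and slope
confint(fit, level = 0.95)   # numerically equal to 95% credible intervals
                             # under the reference prior
```

The interpretation changes even though the numbers do not: the credible interval is a direct probability statement about the coefficient, given the data.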
Checking for Outliers (4 mins): In this case we are using Bayes to determine if an outlier should be ignored.
Bayesian Multiple Regression (4 mins): Again we take a standard multiple regression model, and use Bayes to understand the probabilities that the coefficients have the values we are estimating.

Bayesian Model Uncertainty and Model Averaging
Model Selection Criteria (5 mins): This lecture defines BIC, the Bayesian Information Criterion.
Bayesian Model Uncertainty (7 mins): This begins to look like the approaches used at ADS. Each model is a scenario, and each model has a weight or probability.
Bayesian Model Averaging (7 mins): Watch this.
Markov Chain Monte Carlo Stochastic Exploration (4 mins): Watch this.
Priors for Bayesian Model Uncertainty (8 mins): Watch this.
R Demo: Crime and Punishment (9 mins): Watch this.
Decisions under Model Uncertainty (7 mins): Watch this.

Strengthening your Understanding
Practice Quiz with 10 questions. I had trouble with the following:
True or False: If the r-squared value is low (less than 0.2), then the model assumptions in Bayesian regression are violated. Apparently this is false.
True or False: When selecting a single model after conducting an analysis with Bayesian model averaging, the model with the highest posterior model probability should be chosen. Apparently this is false.

Learning R
The example loads a library called "BAS", for Bayesian Model Averaging. The discussion starts out similar to regular multiple regression, then creates models with each of the possible combinations of predictors. For each combination a posterior probability is computed (starting with a uniform prior across all of the models).

Thoughts on the Week
This material was very important: first, it resembled the ADS or Wagner models, and second, it began to add more of the computational framework around Bayesian models and statistics.

Week 5
I started this on 11/10, after returning from London. The course ended on 11/13, but I will continue to watch videos.
Perspectives on Bayesian Applications
This week consists of interviews with statisticians on how they use Bayesian statistics in their work, as well as the final project in the course.
Bayesian Inference: a talk with James Berger (9 mins): To be filled in.
Bayesian Methods and Big Data: a talk with David Dunson (8 mins): He is discussing analysis of brain scans and data. One of the really exciting things in Bayesian statistics is trying to design completely new algorithms, new ways to do inference, and new types of models for hugely high-dimensional, complicated data while accounting for uncertainty. The really distinct characteristic of Bayesian methods is their ability to characterize uncertainty. The really interesting problems in big data arise when you have very large numbers of variables, or you're trying to fit a really flexible model like a nonparametric model, or you have something like rare events. For example, in computational advertising, we might be looking at people going from a large number of websites to a small set of client websites. There might be hundreds of thousands of websites and then a hundred client websites, and those transitions can be really rare. So even though you have millions of people, you have rare events. We found that the uncertainty in those transitions is super important; if you just do an optimization approach, you might be quite misled.
Bayesian Methods in Biostatistics and Public Health: a talk with Amy Herring (4 mins): To be filled in.
Bayes in Industry: a talk with Steve Scott of Google (9 mins): To be filled in.

Peer Review Project
To be filled in.

Thoughts on the Week
To be filled in.