Notes on Bayesian Statistics Coursera class
Created 10/07/16
Updated 10/09/16, Updated 10/12/16, Updated 10/25/16, Updated 11/13/16
Introduction
https://www.coursera.org/learn/bayesian/home/welcome
This course describes Bayesian statistics, in which one's inferences about parameters or hypotheses are updated as
evidence accumulates. You will learn to use Bayes’ rule to transform prior probabilities into posterior probabilities,
and be introduced to the underlying theory and perspective of the Bayesian paradigm. The course will apply
Bayesian methods to several practical problems, to show end-to-end Bayesian analyses that move from framing the
question to building models to eliciting prior probabilities to implementing in R (free statistical software) the final
posterior distribution. Additionally, the course will introduce credible regions, Bayesian comparisons of means and
proportions, Bayesian regression and inference using multiple models, and discussion of Bayesian prediction.
Course Mechanics
Welcome to Bayesian Statistics! You’re joining thousands of learners currently enrolled in the course. I'm excited to
have you in the class and look forward to your contributions to the learning community.
To begin, I recommend taking a few minutes to explore the course site. Review the material we’ll cover each week,
and preview the assignments you’ll need to complete to pass the course. Click Discussions to see forums where
you can discuss the course material with fellow students taking the class. Be sure to introduce yourself to everyone
in the Meet and Greet forum.
If you have questions about course content, please post them in the forums to get help from others in the course
community. For technical problems with the Coursera platform, visit the Learner Help Center.
Do I need to use "R"?
Yes. You will need R and RStudio to do the labs and the project. Both are free and publicly available. You will need
administrator access to your computer to install this software. There are step-by-step video instructions and
additional help under "Resources."
How will I be graded?
There are four graded quizzes (weeks 1-4), four lab exercises (weeks 1-4), and a peer-reviewed assignment (week 5)
-- the data analysis project in this course. You will need to pass each graded assignment in order to pass the whole
course.
After passing the course, your final grade for the course will be:
10% for each quiz + 10% for each lab + 20% for the data analysis project.
Book Summary
Gelman, Andrew, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin (2013) Bayesian Data Analysis, Third Edition (Chapman & Hall/CRC Texts in Statistical Science). Bought this on Amazon.com. Andrew Gelman is Professor of Statistics and Professor of Political Science at Columbia University. He has published over 150 articles in statistical theory, methods, and computation, and in application areas including decision analysis, survey sampling, political science, public health, and policy. Donald B. Rubin is John L. Loeb Professor of Statistics at Harvard University.
Hoff, P. D. (2009) A First Course in Bayesian Statistical Methods. Springer Verlag ISBN: 978-0-387-92299-7
(Print) 978-0-387-92407-6 (Online) E-Book available - check your library for availability.
Albert, J. (2009) Bayesian Computation with R. Springer Verlag ISBN: 978-0-387-92297-3 (Print) 978-0-387-92298-0 (Online) [search for pdf on ResearchGate]
Week 1 – The Basics of Bayesian Statistics
This started 10/3/16, and finished 10/11/16
Introduction to Statistics with R
An introduction to the faculty members from Duke University who are presenting the material.
This class is part of the “Statistics with R” specialization at Coursera.
Bayes Rule
Conditional probabilities and Bayes Rule (2 min)
First real lecture about the topic. Provided a historical context, and info about Thomas Bayes.
Bayes Rule and Diagnostic Testing (6 min)
Contained examples of testing for HIV and RU-486.
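As a sketch of the diagnostic-testing calculation in R (the sensitivity, specificity, and prevalence below are illustrative assumptions, not the lecture's exact numbers):
sensitivity <- 0.99   # P(test positive | disease)
specificity <- 0.99   # P(test negative | no disease)
prevalence  <- 0.001  # P(disease) in the population
p_positive <- sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
posterior  <- sensitivity * prevalence / p_positive  # P(disease | test positive)
posterior  # small despite an accurate test, because the disease is rare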
Bayes Updating (2 min)
This was useful because it discussed how to make updates.
Bayesian vs. Frequentist definitions of Probability (4 min)
This clarified that a Bayesian always starts from a prior. It also contrasted definitions of probability: relative frequency (N out of M occurrences) vs. degree of belief in an event happening.
Inference for a Proportion
Approach: Frequentist (3 min)
RU-486 analysis. Here we started with counts of occurrences, and used those to derive the H0 vs. H1 models.
Approach: Bayesian (7 min)
Here we recast the problem in terms of models that partitioned the event space, then we assigned priors, and then ran
updates. This was a great example of Bayes.
Effect of Sample Size (2 min)
Here we saw that the additional information increased the accuracy (or differentiation) of the Bayes estimates on the
different models.
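A minimal sketch of this discrete updating in R (the grid of candidate models and the counts are illustrative assumptions, not the exact RU-486 numbers):
p_grid <- seq(0.1, 0.9, by = 0.1)                  # candidate models for P(success)
prior  <- rep(1 / length(p_grid), length(p_grid))  # uniform prior over the models
x <- 4; n <- 20                                    # assumed successes and trials
likelihood <- dbinom(x, size = n, prob = p_grid)
posterior  <- prior * likelihood / sum(prior * likelihood)
round(posterior, 3)   # more data concentrates the mass on fewer models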
Frequentist vs. Bayes Inference
This was a good summary of the above lessons, pointing out the strengths of Bayes in that you can formulate models
better, and you get a measure of belief in the result.
Strengthening your Understanding
There were two quizzes – a pretest with 6 questions that is available for free, and then 10 questions in the real quiz, which requires a paid subscription.
Learning R
This was a good session. Installed R and RStudio, and ran the example. The example was brought in from the repo https://github.com/StatsWithR/statsr, which seems to serve the whole specialization; we are only dealing with the Bayes portion. It appears that StatsWithR provides a large set of useful statistical functions which can be loaded into RStudio.
There was a good video on RStudio at https://www.youtube.com/watch?v=32o0DnuRjfg. It is by David Langer.
In this case, we are loading “devtools”, then “dplyr” (utilities), then “ggplot2” (plotting), then “shiny” (web
applications).
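The setup on my machine looked roughly like this (a sketch; the exact install commands in the course videos may differ):
install.packages(c("devtools", "dplyr", "ggplot2", "shiny"))
devtools::install_github("StatsWithR/statsr")   # course package from the repo above
library(dplyr); library(ggplot2); library(statsr)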
You are given each problem as a deck that can be “knitted” so that you can run it. You can also call functions that are part of the lesson to get answers for the quiz.
Summary
Quite useful introduction. It assumes little prior experience beyond knowledge of H0/H1 hypothesis testing and binomial distributions. It was also helpful to get RStudio installed and to read the background info.
For next week, we will move to continuous distributions, since all of these were discrete.
Week 2 – Bayesian Inference
Started this late on 10/12/16; this week officially ended on 10/16/16.
In this week, we will discuss the continuous version of Bayes' rule and show you how to use it in a conjugate family,
and discuss credible intervals. By the end of this week, you will be able to understand and define the concepts of
prior, likelihood, and posterior probability and identify how they relate to one another, in terms of actual statistical
distributions and analysis.
Introduction
This week our learning goals are to show the continuous version of Bayes' rule and teach you how to use it in a
conjugate family.
The RU-486 example will allow us to discuss Bayesian modeling in a concrete way.
It also leads naturally to a Bayesian analysis without conjugacy. For the non-conjugate case, there's usually no
simple mathematical expression, and one must resort to computation.
Finally, we shall discuss credible intervals, the Bayesian analog of frequentist confidence intervals, and Bayesian
estimation and prediction.
To proceed during this week, you will need to have mastered the concept of conditional probability and Bayes' rule
for discrete random variables. You do not need to know calculus to complete this week's materials. However, for
those who do, we shall briefly look at an integral.
Continuous Variables and Eliciting Probability Distributions
From the discrete to the continuous
This was an introduction to the PDF and similar concepts for continuous random variables. Basically right out of a textbook.
Elicitation
Where do the priors come from? Bayesians express their priors in terms of personal probabilities. These must obey the laws of probability, and take into account everything that the Bayesian knows. The second part of the lecture is an introduction to the beta distribution, which is specified in terms of alpha and beta. When these are both 1, we get a uniform distribution. When the values are larger, the distribution is more concentrated. It can also be skewed: Beta(5,1) is skewed high, and Beta(1,5) is skewed low.
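A quick visual check of these shapes in R (a sketch):
x <- seq(0, 1, length.out = 200)
plot(x, dbeta(x, 1, 1), type = "l", ylim = c(0, 5), ylab = "density")  # uniform
lines(x, dbeta(x, 5, 1), lty = 2)   # skewed high
lines(x, dbeta(x, 1, 5), lty = 3)   # skewed low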
Conjugacy
This is defined as a Bayesian update that produces a posterior distribution in the same family as the prior, but with different parameters – for instance, from one beta distribution to another: from Beta(a, b) to Beta(a + x, b + n − x) after observing x successes in n trials.
Three Conjugate Families
Inference on a Binomial proportion
This was about updating the distributions given new information. Here we apply the above step that adjusts a beta distribution to a real-world example from the RU-486 study. In this case there were 4 failures out of 20, vs. 0 failures out of 20. In frequentist analysis, this would be 4/20, but in Bayesian analysis we apply each data point in turn.
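The update for the 4-failures-out-of-20 arm, assuming a uniform Beta(1, 1) prior for illustration:
a <- 1; b <- 1        # prior Beta(a, b); the uniform choice is an assumption here
x <- 4; n <- 20       # observed failures and trials
a_post <- a + x       # posterior is Beta(5, 17)
b_post <- b + n - x
qbeta(c(0.025, 0.975), a_post, b_post)   # a 95% posterior interval for p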
The Gamma-Poisson Conjugate Families
Watch this again. The Prussian horse-kick analysis is in this video. The data come from a Poisson distribution, and the prior and posterior are both gamma distributions. Basic definition of the Poisson: a count of cases, with a single parameter, lambda, which is both the mean and the variance.
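A hedged sketch of the gamma-Poisson update in R; the prior and the counts below are illustrative, not the actual horse-kick data:
a <- 1; b <- 1              # prior Gamma(shape = a, rate = b) on lambda
x <- c(0, 1, 0, 2, 0, 1)    # illustrative yearly counts
a_post <- a + sum(x)        # shape update
b_post <- b + length(x)     # rate update
a_post / b_post             # posterior mean of lambda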
The Normal-Normal Conjugate families
In this case we start with a normal distribution in which the mean is unknown but itself normally distributed. The example involves an analytical chemist.
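A sketch of the normal-normal update for an unknown mean with a known sampling variance (all numbers here are illustrative assumptions):
m0 <- 10; s0 <- 2          # prior: mu ~ N(m0, s0^2)
sigma <- 1                 # assumed known sampling sd
x <- c(9.8, 10.4, 10.1)    # illustrative measurements
post_var  <- 1 / (1 / s0^2 + length(x) / sigma^2)
post_mean <- post_var * (m0 / s0^2 + sum(x) / sigma^2)
c(post_mean, sqrt(post_var))   # posterior mean and sd of mu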
Credible Intervals and Predictive Inference
Non-conjugate Priors
This should really be part of the prior section. For instance, he talks about distributions that are not simple or mathematical, but are instead a set of discrete values. For this, we need a computer. He discusses JAGS; apparently this is a kind of sampling approach. However, this lecture is “mostly a look ahead to future material”. Rather short.
Credible Intervals
This had some very important definitions about what confidence intervals mean in frequentist statistics, and how
Bayesian can generate what we call credible intervals.
Predictive Inference
Often we are not interested in determining which of two models is more likely, but in what to expect about the next events.
In this example, we have a coin which may be p=0.7 or may be p=0.4. We have priors, then we create posteriors,
then we create a prediction. Simple values can be handled without integration.
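The coin example worked as a sketch in R (the equal priors and the single observed head are my illustrative assumptions):
p <- c(0.7, 0.4)                     # the two candidate models
prior <- c(0.5, 0.5)                 # assumed equal priors
x <- 1                               # suppose one flip comes up heads
like <- dbinom(x, size = 1, prob = p)
post <- prior * like / sum(prior * like)
sum(post * p)                        # posterior predictive P(next flip is heads)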
Strengthening your Understanding
1) Which of the following statements is true?
- The likelihood is a mixture between the prior and posterior. No, the likelihood is completely independent of the prior. The likelihood is the probability of observing the data given the model parameters, whereas the prior represents a subjective distribution imposed on those model parameters prior to observing the data.
- The prior is a mixture between the likelihood and posterior. No, the prior is completely independent of the likelihood.
- The posterior is a mixture between the prior and likelihood. Yes, the posterior is proportional to the likelihood times the prior, which means that it is influenced by both of them.
2) Which of the following distributions would be a good choice of prior to use if you wanted to determine if a coin is
fair when you have a strong belief that the coin is fair? (Assume a model where we call heads a success and tails a
failure).
Solution: Since a Beta(a, b) distribution corresponds to observing data with prior mean a/(a + b) and prior sample size a + b, we seek an answer with a/(a + b) = 0.5 and a + b large. Of the answer choices given, a Beta(50, 50) distribution corresponds to prior mean 0.5 with a larger sample size than any other answer.
3) If Amy is trying to make inferences about the average number of customers that visit Macy’s between noon and 1
p.m., which of the following distributions represents a conjugate prior?
Solution: The number of customers that visit Macy’s between noon and 1 p.m. can only take on integer
values and has no clearly defined upper bound. As such, a Poisson distribution with parameter λ would be a
good model for customer arrivals. The corresponding conjugate prior for λ, the average number of
customers that visit Macy’s in the time period, is a Gamma distribution, which we learned from the
lectures.
4) Suppose that you sample 24 M&Ms from a bag and find that 3 of them are yellow. Assuming that you place a
uniform Beta(1,1) prior on the proportion of yellow M&Ms p, what is the posterior probability that p < 0.2?
Solution: The Beta distribution is conjugate to the Binomial distribution; the update rule is that the posterior Beta will have parameter values α + x, β + n − x, where α and β are the parameters of the prior Beta. In this case, x = 3 and n = 24. Hence, the posterior distribution of p is a Beta(4, 22) distribution. To find the posterior probability that p < 0.2, we use the following command in R: pbeta(q = 0.2, shape1 = 4, shape2 = 22). This gives a posterior probability of approximately 0.766.
5) Suppose you are given a coin and told that the coin is either biased towards heads (p = 0.6) or biased towards tails
(p = 0.4). Since you have no prior knowledge about the bias of the coin, you place a prior probability of 0.5 on the
outcome that the coin is biased towards heads. You flip the coin twice and it comes up tails both times. What is the
posterior probability that your next two flips will be heads?
Solution: To solve this problem, we first need to find the posterior probability that the coin is biased
towards heads. Using Bayes’ rule, we find that
P(p = 0.6 | {T, T}) = (0.4^2)(0.5) / [(0.4^2)(0.5) + (0.6^2)(0.5)] = 4/13
To find the posterior probability that the next two flips will be heads, we note that
P({T, T, H, H} | {T, T}) = P({T, T, H, H} | p = 0.6, {T, T}) P(p = 0.6 | {T, T}) + P({T, T, H, H} | p = 0.4, {T, T}) P(p = 0.4 | {T, T})
= (0.6)^2 (4/13) + (0.4)^2 (9/13) = 72/325 ≈ 0.222
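Checking this arithmetic in R:
prior <- c(0.5, 0.5)                        # P(p = 0.6), P(p = 0.4)
like  <- c(0.4, 0.6)^2                      # P({T, T} | p) for each model
post  <- prior * like / sum(prior * like)   # 4/13 and 9/13
sum(post * c(0.6, 0.4)^2)                   # P(next two flips heads) ≈ 0.222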
Learning R
More downloads, and another lab. This one is about credible intervals.
Summary of Week
This was a very interesting discussion of distributions and Bayesian updating.
The credible interval discussion was unique, and the lab covered it well.
Week 3
This began on 10/15/16; I continued to study it through about 10/26/16.
Losses and decision-making
Losses and Decision-making (3 mins)
This begins with basic definitions. But the posterior distribution on its own is not always sufficient. Sometimes one wants to express one's inference as a credible interval to indicate a range of likely values for the parameter.
What is the one-number estimate? This is linked to risk analysis, which is built around minimizing expected loss. The context is what you would summarize from the Bayes posterior (for instance, what would you tell a patient?). One approach might be to use the median of the distribution. Another would be to use the mode of the distribution.
Working with Loss Functions (6 mins)
Here we are defining loss functions and giving an example: a car sales example. The example provides a distribution from analysis, shown as a dot plot. What single number do you report? Try 30. Under the 0/1 loss function, L0, you get 1 right and 50 wrong, hence the loss is 50. You want to minimize the loss, hence report the mode. Next, consider a linear (absolute) loss, L1; here report the median. L2 is the squared difference; report the mean.
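A sketch of how the three losses pick different summaries; the posterior draws here are illustrative, not the car-sales data:
draws <- c(rep(30, 10), rep(35, 25), rep(40, 16))   # illustrative posterior draws
as.numeric(names(which.max(table(draws))))  # L0 (0/1 loss): report the mode
median(draws)                               # L1 (absolute loss): report the median
mean(draws)                                 # L2 (squared loss): report the mean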
Minimizing Expected Loss for Hypothesis Testing (5 mins)
One straightforward way of choosing between H1 and H2 would be to choose the one with the higher posterior
probability. In other words, you reject H1 if the posterior probability of H1 is smaller than the posterior probability
of H2. However, since hypothesis testing is a decision problem, we should also consider a loss function.
She defines loss functions in terms of medical false positives and false negatives.
Posterior probabilities of Hypotheses and Bayes factors (6 mins)
The prior odds is defined as the ratio of the prior probabilities assigned to the hypotheses or models we're considering. The posterior odds is then the product of the Bayes factor and the prior odds for these two hypotheses. The Bayes factor quantifies the evidence for the data arising from hypothesis one versus hypothesis two.
Then a discussion about scales for interpreting the Bayes factor. To recap, in this video we defined prior odds, posterior odds, and Bayes factors. We learned about scales by which we can interpret these values for model selection. We also re-emphasized that in Bayesian testing, the order in which we evaluate the models or hypotheses does not matter, since the Bayes factor of hypothesis two versus hypothesis one is simply the reciprocal of the Bayes factor for hypothesis one versus hypothesis two.
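These relationships in a small R sketch (the numbers are illustrative):
prior_odds <- 0.5 / 0.5          # P(H1) / P(H2), equal priors assumed
bf_12 <- 4                        # Bayes factor of H1 vs. H2 (illustrative)
posterior_odds <- bf_12 * prior_odds
1 / bf_12                         # Bayes factor of H2 vs. H1 is the reciprocal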
Comparing Two Means
[This was a really thorough discussion of how to compare two means that were found from a survey. What it does is state the assumptions, state the hypotheses, and state the means of analysis. It seems like a frequentist approach might be easier, but it doesn't produce as much information.]
Comparing two Proportions using Bayes factors: assumptions (8 mins)
In summary, we reviewed assumptions for Bayesian inference when comparing two proportions.
We presented a default prior distribution for Bayesian null hypothesis testing, using conjugate prior distributions. We are working with a Survey USA poll about bullying: H1: p(males) = p(females) and H2: p(males) != p(females). We introduced pooling of beliefs to construct a prior distribution for the common proportion p under H1. Other assumptions included independence. We reviewed finding the posterior distribution of the parameters of interest under the two hypotheses. At this point, you should be able to find point estimates of the probabilities as well as credible intervals.
Comparing two Proportions using Bayes factors (4 mins)
In this video, we will continue our comparison of two proportions using Bayesian inference.
Recall that in our last video, we used a Survey USA poll to illustrate inference about two proportions, using conjugate beta prior distributions. We used independent beta prior distributions for the probabilities that male and female parents would report bullying under hypothesis two, and introduced the idea of pooling. Here are the posterior distributions for male, female, and pooled. Now find the posterior probability of H1.
Comparing two Paired Means using Bayes factors (7 mins)
In this case, we are considering how to analyze and report on two mean values, which are the amounts of zinc in water samples at the surface and at the bottom. Like the bullying analysis, this was introduced in the inferential statistics course.
The hypotheses are H1: same, and H2: different.
Bayesian pair analysis assumes independence of samples in the data.
Student’s t-distribution gets mentioned here.
What to report? (7 mins)
Here we are seeking to create a reportable analysis of the prior video.
In these samples, the Bayes factor for H1 : H2 is 0.015 (not at all likely), while H2 : H1 is 64 (very likely)
Our results are based on the default prior distribution. In our case we found that there was a high probability that the bottom concentrations exceed the surface concentrations. The credible interval provides a range of the most plausible values of the mean difference. While the results suggest statistical significance, subject-matter expertise is needed in order to say whether the differences, or the absolute levels, are of a magnitude meaningful for health considerations.
Comparing Two Independent Means
Posterior probability, p-values and paradoxes (7 mins)
Lindley’s paradox. Also introduces the Cauchy distribution as a potential prior.
An unusual case of ESP analysis appears here. The claim was that a person could influence the probabilities of machine-generated zeros and ones.
Out of the trials there were approximately 52 million ones, resulting in a sample mean of 0.500177, which is pretty close to 0.5. The null hypothesis asserted that the results were generated by a machine operating with a chance of 0.5, whereas the alternative was the unspecified hypothesis H2 that mu was not equal to one-half.
Discussion about the Cauchy distribution.
Subjective information on the scale of the distribution can also help refine the prior distribution by reducing the prior support on values that are not a priori plausible. While hypothesis testing does have its place, think carefully about whether testing a point null, such as mu equal to some mu naught, or mu equal to one-half, makes sense. Reporting credible intervals and considering practical significance in terms of effect size may be more meaningful.
In the case of the ESP analysis, it is more likely that there are biases in the random number generator.
Comparing two independent means (7 mins)
Why is this different from the above? Above, we assumed that the bottom mean and the surface mean were not independent. Here we have set up a study that is. We are using the “distracted eaters” study: does playing a computer game during lunch affect fullness, memory for lunch, and later snack intake? Distracted eaters snacked about 52 grams, while non-distracted eaters snacked about 27 grams. What is the difference between these?
Unfortunately, there's no closed-form expression for the distribution of the difference of two Student t distributions. However, we can use simulation to draw samples from the posterior distribution using what is called Monte Carlo sampling. With Monte Carlo samples, we simulate possible values of the parameters from their posterior distributions. In this case, first we generate a large number of values from the Student t distribution for the mean of Group A. Second, we generate an equivalent number from the Student t distribution for the mean of Group B.
Then we calculate Monte Carlo averages.
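A minimal Monte Carlo sketch of that procedure; the centers, scales, and degrees of freedom below are illustrative assumptions, not the fitted posterior parameters:
set.seed(42)
n_sim <- 10000
mu_A <- 52 + 5 * rt(n_sim, df = 20)    # draws for Group A's posterior mean
mu_B <- 27 + 5 * rt(n_sim, df = 20)    # draws for Group B's posterior mean
diff <- mu_A - mu_B
mean(diff)                             # Monte Carlo average of the difference
quantile(diff, c(0.025, 0.975))        # a 95% credible interval for the difference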
Comparing two independent means: hypothesis testing (7 mins)
In this case we are seeking to prepare a result for the above analysis. Set up the hypotheses H1: mu1 = mu2 and H2: mu1 != mu2. The null hypothesis here means that nothing is going on.
Use an intrinsic prior to specify the prior distribution for the three parameters in the null model, which include the common mean mu and the two variances, and for the four parameters under the alternative hypothesis: there are two means and two variances in the two groups under H2.
At this point, we are again using a Monte Carlo approach, and she refers to the JAGS implementation of this.
In this case, no hypothesis is more plausible, and more data are needed.
Strengthening your Understanding
Practice Quiz with 4 questions. I had trouble with the following:
True or False: If the posterior distribution is normally distributed, the estimate that minimizes posterior expected
loss is the same, regardless of whether the loss function is 0/1, linear, or quadratic.
Correct Response
True
Which of the following statements is false?
A Bayes factor of less than .01 suggests that the evidence in favor of one of the hypotheses is barely worth mentioning.
Correct Response
This statement is false: a Bayes factor of less than .01 actually yields strong evidence in favor of one of the hypotheses.
Learning R
More downloads, and another lab. This one is about decision-making.
Summary of Week
The concept of Bayes factors was something that I had never heard of before. However, it provides some metrics
for indicating the strength of a belief about a decision, with a level of domain and unit independence.
The lectures were quite detailed, and covered several ways to reach conclusions. I need to read up on the Cauchy distribution: it is the distribution of the ratio of two independent standard normal random variables. The Cauchy distribution is often used in statistics as the canonical example of a “pathological” distribution, since both its mean and its variance are undefined. It is one of the few stable distributions with a probability density function that can be expressed analytically, the others being the normal distribution and the Lévy distribution.
Week 4
This began on 10/22/16, and I had to study this over a several-week period because of London travel.
Bayesian Regression
In the previous module we introduced Bayesian decision making using a variety of loss functions, and we discussed how to minimize expected loss for hypothesis testing. We also introduced the concept of Bayes factors and gave some examples of how they can be used in Bayesian hypothesis testing for two proportions and two means. We wrapped up the unit with a discussion of how findings from credible intervals compare to those from a hypothesis test, and when to reject, accept, or wait. In this new module, we'll introduce Bayesian inference in multiple regression models, and show how Bayes factors and posterior probabilities can be used for variable selection. We will introduce the concept of Bayesian model averaging, an alternative to variable selection that will allow you to make inferences and predictions using an ensemble of models. Finally, we demonstrate modern simulation methods to search for regression models of high posterior probability when an enumeration of all possible subsets is not feasible.
Simple and Multiple Bayesian Regression
Bayesian simple linear regression (8 mins)
Seems like we are using Bayes to understand the uncertainty in the Ordinary Least Squares results.
Checking for outliers (4 mins)
In this case we are using Bayes to determine if an outlier should be ignored.
Bayesian multiple regression (4 mins)
Again we take a standard multiple regression model, and use Bayes to understand the probabilities that the
coefficients have the values we are estimating.
Bayesian Model Uncertainty and Model Averaging
Model selection criteria (5 mins)
This lecture defines the BIC, or Bayesian Information Criterion.
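In R the BIC is available directly for fitted models; a sketch using a built-in dataset:
fit <- lm(mpg ~ wt + hp, data = mtcars)   # an illustrative regression
BIC(fit)                                  # lower BIC indicates a preferable model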
Bayesian model uncertainty (7 mins)
This begins to look like the approaches used at ADS. Each model is a scenario, and each model has a weight or
probability.
Bayesian model averaging (7 mins)
Watch this
Markov Chain Monte Carlo
Stochastic exploration (4 mins)
Watch this
Priors for Bayesian model uncertainty (8 mins)
Watch this
R Demo: crime and punishment (9 mins)
Watch this
Decisions under model uncertainty (7 mins)
Watch this
Strengthening your Understanding
Practice Quiz with 10 questions. I had trouble with the following:
True or False: If the r-squared value is low (less than 0.2), then the model assumptions in Bayesian regression are
violated. Apparently this is false.
True or False: When selecting a single model after conducting an analysis with Bayesian model averaging, the
model with the highest posterior model probability should be chosen. Apparently this is false.
Learning R
The example loads a library (“BAS”) for Bayesian Model Averaging. The discussion starts out similar to regular multiple regression, then creates models for each of the possible combinations of variables. For each combination a posterior probability is created (starting from a uniform prior across all of the models).
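A hedged sketch of what the model-averaging call might look like with the BAS package (the formula, data, and prior choices here are my assumptions, not the lab's actual code):
library(BAS)
fit <- bas.lm(mpg ~ ., data = mtcars,
              prior = "BIC",             # prior on the coefficients
              modelprior = uniform())    # uniform prior across all models
summary(fit)   # posterior probabilities for the enumerated models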
Thoughts on the Week
This material was very important: first, it resembled the ADS or Wagner models, and second, it began to add more of the computational framework around Bayesian models and statistics.
Week 5
I started this on 11/10, after returning from London. The course ended on 11/13, but I will continue to watch videos.
Perspectives on Bayesian Applications
This week consists of interviews with statisticians on how they use Bayesian statistics in their work, as well as the
final project in the course.
Bayesian Inference: a talk with James Berger (9 mins)
To be filled in
Bayesian methods and big data: a talk with David Dunson (8 mins)
He is discussing analysis of brain scans and data. One of the really exciting things in Bayesian statistics is trying to design completely new algorithms, new ways to do inference, and new types of models for hugely high-dimensional, complicated data while allowing for uncertainty. The really distinct characteristic of Bayesian methods is their ability to characterize uncertainty.
The really interesting problems in big data arise when you have really large numbers of variables, or you're trying to fit a really flexible model like a nonparametric model, or you have something like rare events. For example, in computational advertising, we might be looking at people going from a large number of websites to a small set of client websites. There might be hundreds of thousands of websites and then a hundred client websites, and those transitions can be really rare. So even though you have millions of people, you have rare events. We found that the uncertainty in those transitions is super important; if you just do an optimization approach, you might be quite misled.
Bayesian methods in biostatistics and public health: a talk with Amy Herring (4 mins)
To be filled in
Bayes in industry: a talk with Steve Scott of Google (9 mins)
To be filled in
Peer Review Project
To be filled in
Thoughts on the Week
To be filled in