SPH6004 Advanced Biostatistics
Part 1: Bayesian Statistics
Chapter 1: Introduction to Bayesian Statistics

Golden rule: please stop me to ask questions

Schedule (lecturers):

Week  Starting  Tuesday    Friday
1     13 Jan    Alex       [Alex]
2     20 Jan
3     27 Jan
4     3 Feb     Alex       [Alex]
5     10 Feb    Alex       [Alex]
6     17 Feb    Alex       [Hyungwon]
R     24 Feb
7     3 Mar     Hyungwon   Hyungwon
8     10 Mar    Hyungwon   Hyungwon
9     17 Mar    Hyungwon   Hyungwon
10    24 Mar    YY         YY
11    1 Apr     YY         YY
12    7 Apr     YY         YY

Topics for weeks 1-6: introduction to Bayesian statistics and importance sampling (week 1), then Markov chain Monte Carlo, JAGS and STAN, hierarchical modelling, variable selection and model checking, and Bayesian inference for mathematical models.

Objectives
● Describe the differences between Bayesian and classical statistics
● Develop appropriate Bayesian solutions to nonstandard problems: describe the model, fit it, and relate the analysis to the problem
● Describe the differences between computational methods used in Bayesian inference, understand how they work, and implement them in a programming language
● Understand modelling and data-analytic principles

Expectations
Know already:
● Basic and intermediate statistics
● The likelihood function
● Generalised linear models
Expected to:
● Pick up programming in R
● Read the notes

Why the profundity?
● Bayes' rule is THE way to invert conditional probabilities
● ALL probabilities are conditional
● Bayes' rule therefore provides the 'calculus' to manipulate probability, moving from p(A|B) to p(B|A)
For early detection of breast cancer, starting at some age, women are encouraged to have routine screening, even if they have no symptoms. Imagine you conduct such screening using mammography. (A problem posed to physicians by Prof Gerd Gigerenzer.)

The following information is available about asymptomatic women aged 40 to 50 in your region who have mammography screening:
• The probability a woman has breast cancer is 0.8%
• If she has breast cancer, the probability is 90% that she has a positive mammogram
• If she does not have breast cancer, the probability is 7% that she still has a positive mammogram

The challenge:
• Imagine a woman who has a positive mammogram
• What is the probability she actually has breast cancer?

Their answers...
"I never inform my patients about statistical data."
"I would tell the patient that mammography is not so exact, and I would in any case perform a biopsy."

Can we write the above mathematically?
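The challenge is a single application of Bayes' rule; a minimal sketch in Python (the numbers come from the slide, the variable names are mine):

```python
# Bayes' rule for the mammography problem.
p_cancer = 0.008             # P(B = 1): prevalence in this age group
p_pos_given_cancer = 0.90    # P(M = 1 | B = 1): sensitivity
p_pos_given_healthy = 0.07   # P(M = 1 | B = 0): false-positive rate

# P(M = 1) by the law of total probability
p_pos = p_cancer * p_pos_given_cancer + (1 - p_cancer) * p_pos_given_healthy

# P(B = 1 | M = 1) by Bayes' rule
p_cancer_given_pos = p_cancer * p_pos_given_cancer / p_pos
print(round(p_cancer_given_pos, 3))  # 0.094
```

Despite the positive mammogram, the probability of breast cancer is under 10%, because the disease is rare and false positives are comparatively common.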
Key point 1
• p(B = 1 | A = 1) is the probability prior to observing the mammogram
• p(B = 1 | M = 1, A = 1) is the probability after observing it
• Bayes' rule provides the way to update the prior probability to reflect the new information, giving the posterior probability
• (Even the prior is a posterior)

Key point 2
● Bayes' rule allows you to switch from
– pr(something known | something unknown)
● to
– pr(something unknown | something known)

Bayesians and frequentists
● Bayesians: Bayes' rule is used to switch to pr(unknowns | knowns) in all situations in which there is uncertainty, including parameter estimation.
● Frequentists: Bayes' rule is used only to make probability statements about events that could, in principle, be repeatedly observed. Parameter estimation is done using methods that perform well under some arbitrary desiderata, such as being unbiased, and uncertainty is quantified by appealing to large samples.

The Thai AIDS vaccine trial: the modified intention-to-treat analysis

              Seroconverted   Participated
Vaccine arm   51              8197
Placebo arm   74              8198

Q: what is the "underlying" probability pv of infection over this time window for those on the vaccine arm?

What does that actually mean?
• Participants are not randomly selected from the population: they are referred or volunteer
• Participants must meet eligibility requirements
• They are not representative of the Thai population
• The risk of infection differs between Thailand and, eg, Singapore
• Nebulous: the risk of infection in a hypothetical second trial in the same group of participants
• Hope that pv/pu (pu being the corresponding probability on the placebo arm) has some relevance in other settings

Model for the data
• Seems appropriate to assume Xv ~ Bin(Nv, pv)
• Xv = 51 = the number of vaccinees infected
• Nv = 8197 = the number of vaccinees
• pv = ?
Aims
• A point estimate to summarise the data
• An interval estimate to summarise uncertainty
• (Later) a measure of evidence that the vaccine is effective

Refresher: the frequentist approach
• The traditional approach to estimating pv: find the value of pv that maximises the probability of the data, given that the hypothetical value were the true value
– using calculus, or
– numerically (Newton-Raphson, simulated annealing, cross entropy, etc)
– in either case, work with the log-likelihood:
ℓ(pv) = Xv log pv + (Nv − Xv) log(1 − pv) + constant
• Differentiating with respect to the argument we want to maximise over gives
dℓ/dpv = Xv/pv − (Nv − Xv)/(1 − pv)
• Setting the derivative to zero, adding a hat, and solving gives
p̂v = Xv/Nv
which is just the empirical proportion infected
• To quantify the uncertainty we might take a 95% interval; you probably know
p̂v ± 1.96 √(p̂v(1 − p̂v)/Nv)
• (This involves cheating: assuming you know pv and assuming the sample size is close to infinity. There are actually better equations for small samples)

Interpretation
• The maximum likelihood estimate of pv is not the most likely value of pv
• Classical statisticians cannot make probabilistic statements about parameters
• There is not a 95% chance that pv lies in the interval (0.45, 0.79)%
• Rather, 95% of such intervals over your lifetime (with no systematic error and samples that are not too small) will contain the true value

Tackling it Bayesianly
• Target: point and interval estimates
• Intermediate: the probability of the parameter pv given the data Xv and Nv, ie the posterior for pv:
p(pv | Xv, Nv) = p(Xv | pv, Nv) p(pv | Nv) / ∫ p(Xv | π, Nv) p(π | Nv) dπ
(the first factor in the numerator is the likelihood function, the second the prior for pv; π is a dummy variable)
• The likelihood function is the same as before
• What is the prior?

What is the prior?
• There is no "the" prior
• There is "a" prior: you choose it, just as you choose a Binomial model for the data
• It represents information on the parameter (the proportion of vaccinees that would be infected) before the data are in hand
• It is perhaps justifiable to assume that all probabilities between 0 and 1 are equiprobable before the data are observed
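The frequentist refresher above can be reproduced in a few lines; a minimal sketch in Python (1.96 is the usual normal quantile):

```python
import math

x_v, n_v = 51, 8197        # infections and sample size on the vaccine arm

# Maximum likelihood estimate: the empirical proportion infected
p_hat = x_v / n_v

# Approximate 95% interval from the large-sample normal approximation
se = math.sqrt(p_hat * (1 - p_hat) / n_v)
lo, hi = p_hat - 1.96 * se, p_hat + 1.96 * se
print(f"{100 * p_hat:.2f}% ({100 * lo:.2f}%, {100 * hi:.2f}%)")  # 0.62% (0.45%, 0.79%)
```

This reproduces the point estimate and the (0.45, 0.79)% interval quoted in the interpretation slide.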
What is the prior? (continued)
• The uniform choice can be written p(pv | Nv) = 1{0 ≤ pv ≤ 1}, where 1{A} = 1 if A is true and 0 if A is false
• Nv can be dropped from the condition, as I assume the sample size and the probability of infection are independent

What is the posterior?
p(pv | Xv, Nv) = C × pv^Xv (1 − pv)^(Nv − Xv), for pv in the range (0, 1) and C a constant
Two ways to deal with C:
1. The smart way (later)
2. The dumb way (now)

The dumb way
• Take a grid of values for pv, finely spaced, over a sensible range
• Evaluate the log posterior (up to +log C)
• Transform to the posterior (up to ×C)
• Approximate the integral by a sum over the grid
• Rescale to get rid of C, exploiting the fact that the posterior is a pdf and integrates to 1

Point estimates
• If you have a sample x1, x2, ... from a distribution, you can represent its overall location using the mean, the median or the mode
• Similarly, you can report the mean, median or mode of the posterior as a point estimate

In R:

Method   Estimate
Mean     0.63%
Mode     0.63%
Median   0.62%
MLE      0.62%

Uncertainty
• Two common methods to get an uncertainty interval (credible interval):
– quantiles of the posterior (eg the 2.5th and 97.5th percentiles)
– the highest posterior density (HPD) interval
• Since there is a 95% chance that a parameter value drawn from the posterior falls in such an interval, the interpretation is the one many people (wrongly) give to confidence intervals

In R, the resulting intervals are (0.47, 0.82)%, (0.45, 0.79)% and (0.47, 0.81)%

Important points
• In some situations it doesn't really matter whether you do a Bayesian or a classical analysis, as the results are effectively the same:
– the sample size is large, so asymptotic theory is justified
– there is no prior/external information for the analysis
– someone has already developed a classical routine
• In other situations, Bayesian methods come into their own!
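The "dumb way" can be written out directly; a minimal grid-approximation sketch in plain Python (the grid range and resolution are my own choices, and the lecture uses R):

```python
import math

x_v, n_v = 51, 8197
n_grid, upper = 20000, 0.02
grid = [(i + 0.5) * upper / n_grid for i in range(n_grid)]  # grid over (0, 0.02)
dx = upper / n_grid

# Log posterior up to a constant: binomial log-likelihood + flat U(0,1) prior
log_post = [x_v * math.log(p) + (n_v - x_v) * math.log(1 - p) for p in grid]
m = max(log_post)
post = [math.exp(lp - m) for lp in log_post]   # subtract max to avoid underflow
z = sum(post) * dx
post = [f / z for f in post]                   # rescale so it integrates to ~1

mean = sum(p * f for p, f in zip(grid, post)) * dx
mode = grid[max(range(n_grid), key=lambda i: post[i])]
cdf, median = 0.0, grid[-1]
for p, f in zip(grid, post):
    cdf += f * dx
    if cdf >= 0.5:
        median = p
        break
print(f"mean {100 * mean:.2f}%, mode {100 * mode:.2f}%, median {100 * median:.2f}%")
```

The mean lands at about 0.63% and the mode at about 0.62%, in line with the table of point estimates.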
Philosophical points
• If you really love frequentism and hate Bayesianism, you can pragmatically use Bayesian approaches and interpret them like classical ones
• If vice versa, you can
– use classical estimates from the literature as if they were Bayesian
– arguably interpret classical point/interval estimates the way you want to

Priors and posteriors
• A prior probability of breast cancer reflects the information you have before observing the mammogram: all you know is the risk class the patient sits in. The posterior probability of breast cancer reflects the information after observing the mammogram.
• Likewise, a prior probability density function for pv reflects the information you have before the study results are known. The posterior probability density function reflects the information after the study, including anything known before and everything from the study itself.
• How much knowledge, how much uncertainty?

Justification
• Statistician Ms A is analysing some data. She comes up with a model for the data based on some simplifying assumptions. She must justify this choice if others are to believe her.
• For instance, Ms A wants to do a logistic regression with outcome: got infected by H1N1, as measured by serology; and predictors: age, gender, recent overseas travel, number of children in the household, ... There is no reason why the effect of age on the risk of infection should be linear in the logit of risk. There is no reason why each predictor's effect should be additive on the logit of risk. There is no reason why individuals should be taken to be independent. These are choices made by the statistician.
• Bayesian statistician Mr B is analysing some data. He must come up with a model for the data and for the parameters. He too must justify his choices.

Support
• Each parameter of a model has a support, and the prior should match it:
X ~ Bin(n, p):  p ∈ [0, 1]
Y ~ N(μ, σ²):  μ ∈ ℝ, σ² ∈ ℝ⁺
Yi ~ N(a + b xi, σ²):  a ∈ ℝ, b ∈ ℝ, σ ∈ ℝ⁺
• All a bit silly (the prior's support does not match the parameter's):
p ~ N(0, 100²),  σ ~ Be(10.2, 3.9),  b ~ exp(1000)

Priors for multiple parameters
Yi ~ N(a + b xi, σ²), a ∈ ℝ, b ∈ ℝ, σ ∈ ℝ⁺
• You must specify a joint prior for all parameters, eg p(a, b, σ)
• Often easiest to assume the parameters are a priori independent, ie p(a, b, σ) = p(a) p(b) p(σ)
• (Note this does not force them to be independent a posteriori)
• But you can incorporate dependency if appropriate, eg if you analyse dataset 1 and use its posterior as a prior for dataset 2

Aim for this part
• Look at different classes of priors:
– informative, non-informative
– proper, improper
– conjugate

Informative and non-informative priors
• Informative: encapsulates information beyond that available solely in the data directly at hand. For instance, if someone has previously estimated the risk of infection by HIV in Thai adults and reported point and interval estimates, you could take those and convert them into an appropriate prior distribution.
• Non-informative: the opposite; a distribution that is flat or approximately flat over the range of parameter values with high likelihood. Eg pv ~ U(0,1) is non-informative as it is flat over the range 0.5-1.5% where the data tell you pv should be. Eg μ ~ U(−1000000, 1000000) might be non-informative for a parameter on the real line, as might N(0, 1000²).

When to choose which?
Use a non-informative prior if:
1. Your primary data set has so much information in it that you can estimate the parameter with no problems
2. You only have one data set
3. You have no really solid estimates from the literature with which to supplement the information from your primary data
4. You want to approximate a frequentist analysis

Use an informative prior if (respectively):
1. Your primary data set doesn't give enough information to estimate all the unknowns well (see the next chapter for an example)
2. You have multiple data sets and can best analyse them one at a time
3. You have really good estimates from the literature that everyone accepts
4. You are analysing the data for your own benefit, to make a decision, say, and do not need the acceptance of others

Q: I've decided I want a non-informative prior. But what form?

Parameter support    Possible non-informative priors
[0, 1]               U(0,1), Be(1,1), Be(1/2,1/2)
Positive real line   U(0, ∞), U(0, big number), exp(big mean), gamma(big variance?), log-N(mean 1, big variance?), truncated N(0, big variance)
Real line            U(−∞, ∞), U(−big number, big number), N(0, big variance)

The exact choice rarely makes a difference.

Q: I've decided I want an informative prior and have found an estimate in the literature. So, how?

Aim for this part
• Look at different classes of priors:
– informative, non-informative
– proper, improper
– conjugate

Proper and improper priors
• Recall: ∫ f_X(x) dx = 1
• Distributions are supposed to integrate to 1, and prior distributions really should, too
• A prior that integrates to 1 is proper
• One that doesn't, eg p(θ) ∝ 1 for θ on the whole real line, is improper

Proper and improper posteriors
• An improper posterior is a bad outcome!
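On the "found an estimate in the literature, so how?" question: the slides leave the recipe open, but one common approach is moment matching; pick a family whose support matches the parameter's (here a beta, for a probability) and solve for the hyperparameters that reproduce the reported mean and uncertainty. A sketch in Python, with a purely hypothetical literature estimate:

```python
def beta_from_mean_var(m, v):
    """Method-of-moments Be(alpha, beta) prior with mean m and variance v.

    Requires 0 < m < 1 and v < m * (1 - m).
    """
    k = m * (1 - m) / v - 1
    return m * k, (1 - m) * k

# Hypothetical published estimate: risk 1.0% with standard error 0.2%
alpha, beta = beta_from_mean_var(0.010, 0.002 ** 2)
print(alpha, beta)  # roughly Be(24.7, 2449)
```

The resulting Be(α, β) then serves as the informative prior, and by construction has the published point estimate as its mean.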
Prior      Posterior
Proper     Proper
Improper   Proper
Improper   Improper

• A proper prior leads to a proper posterior; an improper prior can lead to either a proper or an improper posterior

Bad likelihoods
• If the likelihood is 'badly behaved' then not only do you need a proper prior, you need an informative prior, as there is insufficient information in the data to estimate that parameter (or those parameters)

Aim for this part
• Look at different classes of priors:
– informative, non-informative
– proper, improper
– conjugate

Conjugate priors
• With our binomial model, we moved from a prior for pv that was beta (U(0,1) = Be(1,1)) to a posterior for pv that was beta
• We therefore say that the beta is conjugate to the binomial:
if pv ~ Be(α, β) and Xv | pv ~ Bin(Nv, pv), then pv | Xv ~ Be(α + Xv, β + Nv − Xv)
• There are a handful of other data models with conjugate priors; we may encounter some later in the course
• Most real problems do not have conjugate priors, though
• If a conjugate prior exists, it makes sense to exploit it
• Eg for the Thai vaccine trial, once you realise pv is beta a posteriori, you can summarise the posterior directly: with a Be(1,1) prior, the posterior mean is (1 + Xv)/(2 + Nv)

[Figures: examples of conjugate and non-conjugate priors.]

Information to Bayesians
• prior + data → posterior
• posterior 1 (used as the new prior) + data 2 → posterior 2
• and so on, as further data arrive
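Because the beta is conjugate to the binomial, the Thai-trial posterior can be summarised with closed-form expressions instead of a grid; a minimal sketch in Python (flat Be(1,1) prior, as in the lecture):

```python
import math

x_v, n_v = 51, 8197
a0, b0 = 1, 1                      # flat Be(1,1) prior
a, b = a0 + x_v, b0 + n_v - x_v    # conjugate update: posterior is Be(52, 8147)

post_mean = a / (a + b)            # equals (1 + x_v) / (2 + n_v)
post_mode = (a - 1) / (a + b - 2)  # equals the MLE x_v / n_v under this prior
post_sd = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
print(f"mean {100 * post_mean:.2f}%, sd {100 * post_sd:.3f}%")
```

Quantile-based credible intervals then come straight from the Be(52, 8147) quantile function (eg `qbeta` in R).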
A Gedanken
• Consider experiments to estimate a probability p given a series of Bernoulli trials xi, with yi = Σj=1:i xj
• Use a Be(α, β) prior for p
• Experimenter 1, instead of waiting for all the data to come in, recalculates the posterior from scratch, based on yi and (α, β), each time a data point comes in
• Experimenter 2 uses his last posterior and xi to recalculate the posterior
• The two experimenters, using the same prior and the same data, end with the same posterior
• Experimenter 1 started afresh each time with the original prior and all the data so far
• Experimenter 2 updated the old posterior with the new datum

Implications
• If data come to you piecemeal, it doesn't matter whether you analyse them once at the end, or at each intermediate point, updating your prior as you go
• (In practice one or the other may be more convenient: eg if the posterior is not analytic, it makes sense to estimate/approximate once rather than once per datum)
• You can take estimates from the literature and convert them into priors
• You can always treat an old posterior obtained elsewhere as a prior

What did we learn in chapter 1?
• Bayes' rule: applied to the probability of a state of nature (breast cancer) given evidence (a mammogram) and background risk (age)
• A refresher on frequentist estimation
• Estimating a proportion given x, n: saw how Bayes' rule could be used to derive the posterior probability density of a parameter given data
• Priors
• Accumulation of evidence
• But we don't yet know how to do Bayesian inference for problems with more than one parameter!

Chapter 2 & 3: computing posteriors
• Importance sampling
• Markov chain Monte Carlo
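The Gedanken about the two experimenters can be checked in a few lines; a sketch in Python with simulated Bernoulli data (the prior and simulation settings are arbitrary):

```python
import random

random.seed(1)
data = [1 if random.random() < 0.3 else 0 for _ in range(50)]  # Bernoulli(0.3) draws
a0, b0 = 2.0, 5.0   # an arbitrary Be(alpha, beta) prior

# Experimenter 1: one batch update using all the data at the end
a1 = a0 + sum(data)
b1 = b0 + len(data) - sum(data)

# Experimenter 2: treat the current posterior as the prior for each new datum
a2, b2 = a0, b0
for x in data:
    a2, b2 = a2 + x, b2 + (1 - x)

print((a1, b1) == (a2, b2))  # True: same prior + same data -> same posterior
```

Batch updating and one-datum-at-a-time updating give identical Be(α + y, β + n − y) posteriors, which is exactly the point of the Gedanken.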