Statistical Inference for Astrophysics
A short course for astronomers, Cardiff University, 16-17 Feb 2005
Graham Woan, University of Glasgow

Lecture Plan
Lectures 1 & 2:
• Why statistical astronomy?
• What is probability?
• Estimating parameter values
  o the Bayesian way
  o the Frequentist way
Lectures 3 & 4:
• Assigning Bayesian probabilities
• Testing hypotheses
• Monte Carlo methods

Why statistical astronomy?
Generally, statistics has a bad reputation:
"There are three types of lies: lies, damned lies and statistics" -- attributed by Mark Twain to Benjamin Disraeli
...often for good reason. The Economist (Jun 3rd 2004) reported that two researchers at the University of Girona in Spain had found that 38% of a sample of papers in Nature contained one or more statistical errors.

Data analysis methods are often regarded as simple recipes (see http://www.numerical-recipes.com/), but in analysis, as in life, sometimes the recipes don't work as you expect:
o low number counts
o distant sources
o correlated 'residuals'
o incorrect assumptions
can produce systematic errors and/or misleading results.

...and the tools can be clunky:
"The trouble is that what we [statisticians] call modern statistics was developed under strong pressure on the part of biologists. As a result, there is practically nothing done by us which is directly applicable to problems of astronomy." -- Jerzy Neyman, founder of frequentist hypothesis testing

Moreover, we can observe only the one Universe (Bennett et al 2003).

The astrophysicist's shopping list: we want tools capable of
o dealing with very faint sources
o handling very large data sets
o correcting for selection effects
o diagnosing systematic errors
o avoiding unnecessary assumptions
o estimating parameters and testing models.

Key question: how do we infer properties of the Universe from incomplete and imprecise astronomical data?
Our goal: to make the best inference, based on our observed data and any prior knowledge, reserving the right to revise our position if new information comes to light.
Let's come to this problem afresh with an astrophysicist's eye, bypassing some of the jargon of orthodox statistics and going right back to plain probability.

Right-thinking gentlemen #1
"A decision was wise, even though it led to disastrous consequences, if the evidence at hand indicated it was the best one to make; and a decision was foolish, even though it led to the happiest possible consequences, if it was unreasonable to expect those consequences." -- Herodotus, c. 500 BC
We should do the best with what we have, not with what we wished we had.

Right-thinking gentlemen #2
"Probability theory is nothing but common sense reduced to calculation." -- Pierre-Simon Laplace (1749-1827)

Right-thinking gentlemen #3: Occam's razor
"Frustra fit per plura, quod fieri potest per pauciora." ("It is vain to do with more what can be done with less.") -- William of Occam (1288-1348)
Everything else being equal, we favour models which are simple.
The meaning of probability
Definitions, algebra and useful distributions

But what is "probability"?
• There are three popular interpretations of the word, each with an interesting history:
  – probability as a measure of our degree of belief in a statement
  – probability as a measure of the limiting relative frequency of outcome of a set of identical experiments
  – probability as the fraction of favourable (equally likely) possibilities.
• We will call these the Bayesian, Frequentist and Combinatorial interpretations.
• Note there are signs of trouble here:
  – How do you quantify "degree of belief"?
  – How do you define "relative frequency" without using ideas of probability?
  – What does "equally likely" mean?
• Thankfully, at some level, all three interpretations agree on the algebra of probability, which we will present in Bayesian terms.

Algebra of (Bayesian) probability
• The probability of a statement [such as "y = 3", "the mass of a neutron star is 1.4 solar masses" or "it will rain tomorrow"] is a number between 0 and 1, such that p(X) = 1 if X is true, p(X) = 0 if X is false, and 0 < p(X) < 1 if we are not sure.
• For some statement X,
    p(X) + p(X̄) = 1,
  where the bar denotes the negation of the statement -- the Sum Rule.
• If there are two statements X and Y, then the joint probability is
    p(X,Y) = p(Y) p(X|Y),
  where the vertical line denotes the conditional statement "X given Y is true" -- the Product Rule.
• From these we can deduce that
    p(X + Y | I) = p(X|I) + p(Y|I) − p(X,Y|I),
  where "+" denotes "or" and I represents common background information -- the Extended Sum Rule.
• ...and, because p(X,Y) = p(Y,X), we get
    p(X|Y,I) = p(X|I) p(Y|X,I) / p(Y|I),
  which is called Bayes' Theorem (Thomas Bayes, 1702-1761).
• Note that these results are also applicable in frequentist probability theory, with a suitable change in the meaning of "p".

Bayes' theorem is the appropriate rule for updating our degree of belief when we have new data:
    p(model | data, I) = p(model | I) p(data | model, I) / p(data | I),
i.e. posterior = prior × likelihood / evidence.
[Note that the word "evidence" is sometimes used for something else (the 'log odds'). We will stick to the p(d|I) definition here.] We can usually calculate all these terms.

• We can also deduce the marginal probabilities. If X and Y are propositions that can take on values drawn from {x_1, x_2, ..., x_n} and {y_1, y_2, ..., y_m}, then
    p(x_i) = p(x_i) Σ_{j=1..m} p(y_j|x_i) = Σ_{j=1..m} p(x_i) p(y_j|x_i) = Σ_{j=1..m} p(x_i, y_j),
  since Σ_{j=1..m} p(y_j|x_i) = 1. This gives us the probability of X when we don't care about Y. In these circumstances, Y is known as a nuisance parameter.
• All these relationships can be smoothly extended from discrete probabilities to probability densities, e.g.
    p(x) = ∫ p(x,y) dy,
  where p(y)dy is the probability that y lies in the range y to y + dy.
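As a quick illustration (mine, not from the original slides; the numbers in p_xy are arbitrary), the following sketch checks the product rule, Bayes' theorem and marginalisation on a small discrete joint distribution:

```python
import numpy as np

# Hypothetical joint distribution p(x_i, y_j): 2 values of X, 3 values of Y
p_xy = np.array([[0.10, 0.20, 0.05],
                 [0.30, 0.25, 0.10]])
assert np.isclose(p_xy.sum(), 1.0)    # probabilities sum to 1

p_x = p_xy.sum(axis=1)                # marginalise over the nuisance parameter Y
p_y = p_xy.sum(axis=0)                # marginalise over X

# Product rule: p(x,y) = p(y) p(x|y)
p_x_given_y = p_xy / p_y              # column j holds p(x | y_j)
assert np.allclose(p_x_given_y * p_y, p_xy)

# Bayes' theorem: p(x|y) = p(x) p(y|x) / p(y)
p_y_given_x = (p_xy.T / p_x).T        # row i holds p(y | x_i)
assert np.allclose((p_y_given_x.T * p_x).T / p_y, p_x_given_y)

print("marginal p(x):", p_x)          # [0.35 0.65]
```

The astronomical examples that follow all repeat this same pattern, with physical quantities in place of x and y.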
These equations are the key to Bayesian inference -- the methodology upon which (much) astronomical data analysis is now founded. A clear introduction is given by Devinder Sivia (OUP); see also the free book by Prasenjit Saha (QMW, London) at http://ankh-morpork.maths.qmw.ac.uk/%7Esaha/book (fuller bibliography tomorrow).

Example
• A gravitational wave detector may have seen a type II supernova as a burst of gravitational radiation. Burst-like signals can also come from instrumental glitches, and only 1 in 10,000 bursts is really a supernova, so the data are checked for glitches by examining veto channels. The test is expected to confirm the burst is astronomical in 95% of cases in which it truly is, and in 1% of cases when it truly is not. The burst passes the veto test! What is the probability we have seen a supernova?

Answer: denote
  S = "burst really is a supernova",  G = "burst really is a glitch",
  + = "test says it's a supernova",  − = "test says it's a glitch",
and let I represent the information that the burst seen is typical of those used to deduce the numbers in the question. Then we are told that
  p(S|I) = 0.0001,  p(G|I) = 0.9999   (prior probabilities for S and G)
  p(+|S,I) = 0.95,  p(+|G,I) = 0.01   (likelihoods for S and G).

• But we want the probability that it's a supernova, given that it passed the veto test: p(S|+,I). By Bayes' theorem,
    p(S|+,I) = p(S|I) p(+|S,I) / p(+|I),
and we are directly told everything on the right-hand side except p(+|I), the probability that any burst candidate would pass the veto test. If the burst is either a supernova or a hardware glitch, we can marginalise over these alternatives:
    p(+|I) = p(+,S|I) + p(+,G|I) = p(S|I) p(+|S,I) + p(G|I) p(+|G,I),
so
    p(S|+,I) = (0.0001 × 0.95) / (0.0001 × 0.95 + 0.9999 × 0.01) ≈ 0.01.

• So, despite passing the test, there is only a ~1% probability that the burst is a supernova! Veto tests have to be blisteringly good if supernovae are rare. Why? Because most bursts that pass the test are just instrumental glitches -- it really is just common sense reduced to calculation.
• Note however that by passing the test, the probability that this burst is from a supernova has increased by a factor of 100 (from 0.0001 to 0.01).
• Moral: p(+|S,I) ≠ p(S|+,I). The first (0.95) is the likelihood: how consistent is the data with a particular model? The second (0.01) is the posterior: how probable is the model, given the data?
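A minimal numerical check of this example (illustrative; the variable names are mine):

```python
# Veto-test example: posterior probability that the burst is a supernova
prior_S, prior_G = 1e-4, 1 - 1e-4   # 1 in 10,000 bursts is really a supernova
like_S, like_G = 0.95, 0.01         # p(pass|S, I) and p(pass|G, I)

evidence = prior_S * like_S + prior_G * like_G   # p(pass|I), by marginalisation
posterior_S = prior_S * like_S / evidence
print(f"p(S | pass, I) = {posterior_S:.4f}")     # ~0.0094, i.e. about 1%
```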
The basis of frequentist probability

• Bayesian probability theory is simultaneously a very old and a very young field:
  o old: it is the original interpretation of Bernoulli, Bayes and Laplace...
  o young: it is the 'state of the art' in (astronomical) data analysis.
• But Bayesian probability theory was rejected for several centuries, because probability as degree of belief was seen as too subjective; hence the frequentist approach.

Probability = 'long-run relative frequency' of an event, which at first appears to be something we can measure objectively. E.g., rolling a die: what is p(1)? If the die is 'fair' we expect
    p(1) = p(2) = ... = p(6) = 1/6.
These probabilities are fixed (but unknown) numbers. We can imagine rolling the die M times; the number rolled is a random variable, with a different outcome each time.
• We define
    p(1) = lim_{M→∞} n(1)/M,
  so that p(1) = 1/6 if the die is 'fair'.
• But the objectivity is an illusion: the definition assumes each outcome is equally likely (i.e. equally probable).
• It also assumes an infinite series of identical trials.
• And what can we say about the fairness of the die after (say) 5 rolls, or 10, or 100?

In the frequentist approach, a lot of mathematical machinery is specifically defined to let us address frequency questions:
• we take the data as a random sample of size M, drawn from an assumed underlying pdf;
• the sampling distribution is derived from the underlying pdf and M;
• we define an estimator: a function of the sample data that is used to estimate properties of the pdf.
But how do we decide what makes an acceptable estimator?

Example: measuring a galaxy redshift. Let the true redshift be z_0, a fixed but unknown parameter. We use two telescopes to estimate z_0, and compute the sampling distributions p(ẑ_1) and p(ẑ_2) of the two estimators by modelling the errors.
1. Small telescope, low-dispersion spectrometer. Unbiased: repeating the observation a large number of times, the average estimate equals z_0,
    E(ẑ_1) = ∫ ẑ_1 p(ẑ_1|z_0) dẑ_1 = z_0,
BUT var(ẑ_1) is large, due to the low dispersion.
2. Large telescope, high-dispersion spectrometer, but a faulty astronomer (e.g. wrong calibration)! Biased:
    E(ẑ_2) = ∫ ẑ_2 p(ẑ_2|z_0) dẑ_2 ≠ z_0,
BUT var(ẑ_2) is small. Which is the better estimator?

What about the sample mean? Let x_1, ..., x_M be a random sample from a pdf p(x) with mean μ and variance σ². Then the sample mean is
    μ̂ = (1/M) Σ_{i=1}^M x_i,
and one can show that E(μ̂) = μ: an unbiased estimator. But bias is defined formally in terms of an infinite set of randomly chosen samples, each of size M. What can we say with a finite number of samples, each of finite size? Before that, a quick simulation of the two redshift estimators, and some definitions.
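To make the bias/variance trade-off concrete, here is a small simulation (not from the slides; the true redshift, bias and dispersions are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
z0 = 0.30          # true redshift (hypothetical)
M = 100_000        # number of imagined repeated observations

# Estimator 1: unbiased but high variance (small telescope, low dispersion)
z1_hat = rng.normal(loc=z0, scale=0.05, size=M)
# Estimator 2: biased (wrong calibration) but low variance (large telescope)
z2_hat = rng.normal(loc=z0 + 0.02, scale=0.005, size=M)

for name, z in [("z1_hat", z1_hat), ("z2_hat", z2_hat)]:
    rms = np.sqrt(np.mean((z - z0) ** 2))
    print(f"{name}: mean = {z.mean():.4f}, std = {z.std():.4f}, rms error = {rms:.4f}")
```

On a root-mean-square error criterion the biased estimator can actually win, which is why "which is the better estimator?" has no universal answer.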
Some important probability distributions: quick definitions

Discrete case
1) Poisson pdf, e.g. the number of photons per second counted by a CCD, or the number of galaxies per square degree counted by a survey. With n the number of detections and λ the mean rate,
    p(n) = λ^n e^(−λ) / n!
The Poisson pdf assumes the detections are independent, and that there is a constant rate λ. One can show that Σ_{n=0}^∞ p(n) = 1.
2) Binomial pdf: the number of 'successes' r from N observations, for two mutually exclusive outcomes ('heads' and 'tails'), e.g. the number of binary stars, Seyfert galaxies, supernovae... With θ the probability of 'success' for a single observation,
    p_N(r) = [N! / (r!(N−r)!)] θ^r (1−θ)^(N−r).
One can show that Σ_{r=0}^N p_N(r) = 1.

Continuous case
1) Uniform pdf:
    p(x) = 1/(b−a) for a ≤ x ≤ b, and 0 otherwise.
2) Central, or Normal pdf (also known as Gaussian):
    p(x) = (1/(σ√(2π))) exp[−(x−μ)²/(2σ²)].

Cumulative distribution function (CDF):
    P(a) = ∫_{−∞}^a p(x) dx = Prob(x < a).

Measures and moments of a pdf
• The nth moment of a pdf is defined as
    ⟨x^n⟩ = Σ_x x^n p(x|I)   (discrete case)
    ⟨x^n⟩ = ∫ x^n p(x|I) dx  (continuous case).
• The 1st moment is called the mean or expectation value, E(x) = ⟨x⟩.
• The 2nd moment is called the mean square, ⟨x²⟩.
• The variance is defined as
    var(x) = Σ_x (x − ⟨x⟩)² p(x|I)  or  ∫ (x − ⟨x⟩)² p(x|I) dx,
and is often written σ². In general var(x) = ⟨x²⟩ − ⟨x⟩², and σ = √var(x) is called the standard deviation.

pdf (type)            | form                                      | mean    | variance
Poisson (discrete)    | p(r) = λ^r e^(−λ) / r!                    | λ       | λ
Binomial (discrete)   | p_N(r) = [N!/(r!(N−r)!)] θ^r (1−θ)^(N−r)  | Nθ      | Nθ(1−θ)
Uniform (continuous)  | p(x) = 1/(b−a)                            | (a+b)/2 | (b−a)²/12
Normal (continuous)   | p(x) = (1/(σ√(2π))) exp[−(x−μ)²/(2σ²)]    | μ       | σ²

• The median divides the CDF into two equal halves:
    P(x_med) = ∫_{−∞}^{x_med} p(x′) dx′ = 0.5,
so Prob(x < x_med) = Prob(x > x_med) = 0.5.
• The mode is the value of x for which the pdf is a maximum.
• For a Normal pdf, mean = median = mode = μ.
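The table can be checked in a few lines with scipy.stats (a sketch; the parameter values are arbitrary choices of mine):

```python
from scipy import stats

lam, N, theta, a, b, mu, sigma = 3.0, 20, 0.3, 1.0, 5.0, 0.0, 2.0  # arbitrary

dists = {
    "Poisson":  stats.poisson(lam),       # mean lam,      variance lam
    "Binomial": stats.binom(N, theta),    # mean N*theta,  variance N*theta*(1-theta)
    "Uniform":  stats.uniform(a, b - a),  # mean (a+b)/2,  variance (b-a)^2/12
    "Normal":   stats.norm(mu, sigma),    # mean mu,       variance sigma^2
}
for name, d in dists.items():
    print(f"{name:8s} mean = {d.mean():.3f}, variance = {d.var():.3f}")
```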
Parameter estimation: the Bayesian way

In the Bayesian approach, we can test our model in the light of our data (i.e. rolling the die) and see how our degree of belief in its 'fairness' evolves, for any sample size, considering only the data that we did actually observe:
    p(model | data, I) ∝ p(data | model, I) × p(model | I),
    posterior (what we know now) ∝ likelihood (influence of our observations) × prior (what we knew before).

Astronomical example #1: probability that a galaxy is a Seyfert 1.
We want to know the fraction θ of Seyfert galaxies that are type 1. How large a sample do we need to measure this reliably? Model the observations as binomial, with θ the global fraction of Seyfert 1s. Suppose we sample N Seyferts and observe r Seyfert 1s; the probability of obtaining the observed data, given the model -- the likelihood of θ -- is
    p(r | θ, N) ∝ θ^r (1−θ)^(N−r).
• But what do we choose as our prior? This has been the source of much argument between Bayesians and frequentists, though it is often not that important.
• We can sidestep this for a moment by realising that if our data are good enough, the choice of prior shouldn't matter: the likelihood then dominates the posterior.

Consider a simulation of this problem using two different priors p(θ|I): a flat prior, with all values of θ equally probable, and a normal prior peaked at θ = 0.5. Updating the posterior p(θ | data, I) galaxy by galaxy, the simulated sequence ran:
• 0 galaxies; 1 galaxy (S1); 2 galaxies (S1+S1); 3 (S1+S1+S2); 4 (S1+S1+S2+S2); 5 (S1+S1+S2+S2+S2);
• 10 galaxies (5 S1 + 5 S2); 20 (7 S1 + 13 S2); 50 (17 S1 + 33 S2); 100 (32 S1 + 68 S2);
• 200 (59 S1 + 141 S2); 500 (126 S1 + 374 S2); 1000 galaxies (232 S1 + 768 S2).

What do we learn from all this?
• As our data improve (e.g. as our sample increases), the posterior pdf narrows and becomes less sensitive to our choice of prior.
• The posterior conveys our (evolving) degree of belief in different values of θ, in the light of our data.
• If we want to express our result as a single number we could perhaps adopt the mean, median or mode.
• We can use the variance of the posterior pdf to assign an uncertainty for θ.
• It is very straightforward to define confidence intervals.

Bayesian confidence intervals: we are 95% sure that θ lies between θ_1 and θ_2, where 95% of the area under the posterior pdf lies between θ_1 and θ_2. Note that the confidence interval is not unique, but we can define the shortest C.I.
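A grid-based sketch of the simulation just described (my reconstruction, not the original code; the width 0.1 of the normal prior is a guess, and r, N are the final counts of the sequence above):

```python
import numpy as np

theta = np.linspace(0.0, 1.0, 1001)      # grid over the Seyfert 1 fraction
flat_prior = np.ones_like(theta)
normal_prior = np.exp(-0.5 * ((theta - 0.5) / 0.1) ** 2)

r, N = 232, 1000                         # 232 Seyfert 1s in 1000 galaxies

def posterior(prior):
    like = theta**r * (1 - theta)**(N - r)  # binomial likelihood (constants drop out)
    post = prior * like                     # for much larger N, work with logs instead
    return post / np.trapz(post, theta)     # normalise on the grid

for name, prior in [("flat", flat_prior), ("normal", normal_prior)]:
    post = posterior(prior)
    mean = np.trapz(theta * post, theta)
    sd = np.sqrt(np.trapz((theta - mean) ** 2 * post, theta))
    print(f"{name:6s} prior: posterior mean = {mean:.3f} +/- {sd:.3f}")
```

With 1000 galaxies the two posteriors very nearly coincide, which is the point of the exercise; rerun with r, N from earlier in the sequence to watch the prior's influence fade.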
Astronomical example #2: flux density of a GRB.
Take gamma-ray bursts to be equally luminous events, distributed homogeneously in the Universe. We see three gamma-ray photons from a GRB in an interval of 1 s. What is the flux F of the source?
• The seat-of-the-pants answer is F = 3 photons/s, with an uncertainty of about √3, but we can do better than that by including our prior information on luminosity and homogeneity. Call this background information I. Homogeneity implies that the probability that the source is in any particular volume of space is proportional to the volume, so the prior probability that the source is in a thin shell of radius r is
    p(r|I) dr ∝ 4πr² dr.
• But the sources have a fixed luminosity L, so r and F are directly related by F = L/(4πr²), hence |dF/dr| ∝ r^(−3). The prior on F is therefore
    p(F|I) = p(r|I) |dr/dF| ∝ F^(−5/2).
Interpretation: low-flux sources are intrinsically more probable, as there is more space for them to sit in.
• We now apply Bayes' theorem to determine the posterior for F after seeing n photons:
    p(F|n,I) ∝ p(F|I) p(n|F,I).
• The likelihood for F comes from the Poisson nature of photon counting:
    p(n|F,I) = F^n exp(−F) / n!,
so finally
    p(F|n,I) ∝ F^(n−5/2) exp(−F).
[Figure: the posterior p(F|n=3,I), with the most probable value of F equalling 0.5 photons/s, well below the naive estimate F = n/T = 3.]
• Clearly it is more probable that this is a distant source from which we have seen an unusually high number of photons than that it is an unusually nearby source from which we have seen the expected number of photons. (The most probable value of F is n − 5/2, approaching n for n ≫ 1.)
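The posterior and its mode are easy to verify numerically (a sketch; the grid range is my choice):

```python
import numpy as np

n = 3
F = np.linspace(1e-3, 15.0, 5000)        # flux grid, photons/s
post = F**(n - 2.5) * np.exp(-F)         # p(F|n,I) ∝ F^(n-5/2) e^(-F)
post /= np.trapz(post, F)                # normalise

print("mode =", F[np.argmax(post)])      # ≈ n - 5/2 = 0.5 photons/s
print("mean =", np.trapz(F * post, F))   # the long tail pulls the mean above the mode
```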
Parameter estimation: the frequentist way

• Recall that in frequentist (orthodox) statistics, probability is a limiting relative frequency of outcome, so only random variables can have frequentist probabilities, as only these show variation with repeated measurement. We therefore can't talk about the probability of a model parameter, or of a hypothesis. E.g., a measurement of a mass is a random variable, but the mass itself is not.
• So no orthodox probabilistic statement can be interpreted as directly referring to the parameter in question! For example, orthodox confidence intervals do not indicate the range in which we are confident the parameter value lies. That's what Bayesian intervals do.
• So what do they mean?

• Orthodox parameter estimation proceeds as follows. Imagine we have some data {d_i} that we want to use to estimate the value of a parameter a in a model. These data must depend on the true value of a in a known way, but as they are random variables all we can say is that we know, or can estimate, p({d_i}) for some given a.
1. Use the data to construct a statistic (i.e. a function of the data), called â, that can be used as an estimator for a. A good estimator will have a pdf p(â|a) that depends heavily on a, and which is sharply peaked at, or very close to, a.
2. One such estimator is the maximum likelihood (ML) estimator, constructed in the following way: given the distribution p(d|a) from which the data are drawn, construct the sampling distribution p({d_i}|a), which is just p(d_1|a)·p(d_2|a)·p(d_3|a)... if the data are independent. Interpret this as a function of a, called the likelihood L(a), and call the value of a that maximises it for the given data â_ML. The corresponding sampling distribution, p({d_i}|â_ML), is the one from which the data were 'most likely' drawn.
...but remember that this is just one value of â_ML, albeit drawn from a population that has an attractive average behaviour. And we still haven't said anything about a.
3. Define a confidence interval around a enclosing, say, 95% of the expected values of â_ML from repeated experiments. The half-width δ of this interval may be known from the sampling pdf, or it may have to be estimated too.
4. We can now say that Prob(a − δ < â_ML < a + δ) = 0.95. Note this is a probabilistic statement about the estimator, not about a. However, the expression can be restated: 0.95 is the relative frequency with which the statement "a lies in the region â_ML ± δ" is true, over many repeated experiments. The disastrous shorthand for this is "a = â_ML ± δ with 95% confidence". Note that this is not a statement about our degree of belief that a lies in the numerical interval generated in our particular experiment.

Example: Bayesian and frequentist methods compared
• Two gravitational wave (GW) detectors see 7 simultaneous burst-like signals in the data from one hour, consisting of GW signals and spurious signals. When the two sets of data are offset in time by 10 minutes there are 9 simultaneous signals. What is the true GW burst rate? (Note: no need to be an expert to realise there is not much GW activity here!)
• A frequentist solution: take the events as Poisson, characterised by the background rate r_b and the GW rate r_g. The offset data estimate the background:
    r̂_b = 9, with sampling variance σ̂_b² = 9 (Poisson),
and the zero-offset data estimate the joint rate:
    r̂_b + r̂_g = 7, with sampling variance 7,
so
    r̂_g = 7 − 9 = −2, with σ̂_g² = 9 + 7 = 16, i.e. σ̂_g = 4,
where σ² is the variance of the sampling distribution.
• So we would quote our result as r_g = −2 ± 4. But burst rates can't be negative! What does this mean? In fact it is quite consistent with our definition of a frequentist confidence interval: our value of r̂_g is drawn from its sampling distribution p(r̂_g|r_g), which will look something like a broad peak of width ~4 centred on the true rate; our particular sample simply fell low. So, in the shorthand of the subject, this result is quite correct.

• The Bayesian solution: we'll go through this carefully. Our job is to determine r_g; in Bayesian terms, that means we are after p(r_g|data). We start by realising there are two experiments here: one to determine the background rate and one to determine the joint rate, so we will also determine p(r_b|data). If the background count comes from a Poisson process of mean r_b, then the probability of n events is
    p(n|r_b) = r_b^n exp(−r_b) / n!,
which is our Bayesian likelihood for r_b. We will choose a prior probability for r_b proportional to 1/r_b. This is the scale-invariant prior that encapsulates total ignorance of the non-zero rate (of course, in reality we may have something that constrains r_b more tightly a priori). See later...
• So our posterior for r_b, based on the background counts, is
    p(r_b|n) ∝ (1/r_b) r_b^n exp(−r_b) / n!,
which, normalising and setting n = 9, gives
    p(r_b|n=9) = r_b^8 exp(−r_b) / 8!
[Figure: the background-rate posterior p(r_b|n=9), peaking near r_b = 8.]
• The probability of seeing m coincident bursts, given the two rates, is
    p(m|r_b, r_g) = (r_b + r_g)^m exp[−(r_b + r_g)] / m!
• And our joint posterior for the rates is, by Bayes' theorem,
    p(r_b, r_g|m) ∝ p(r_b, r_g) p(m|r_b, r_g).
The joint prior can be split into a probability for r_b, which we have just evaluated, and a prior for r_g. The GW rate may well be zero, so we will say that we are equally ignorant over whether to expect 0, 1, 2, ... counts from this source; this translates into a uniform prior on r_g, and our joint prior is
    p(r_g, r_b) ∝ r_b^8 exp(−r_b) / 8!
Finally, we get the posterior on r_g by marginalising over r_b:
    p(r_g|m=7) ∝ ∫ [r_b^8 exp(−r_b)/8!] × [(r_b + r_g)^7 exp(−(r_b + r_g))/7!] dr_b,
with the first bracket the prior and the second the likelihood.
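This marginalisation is awkward analytically but trivial numerically. A sketch (the grids are my choices; m = 7 and n = 9 as in the problem):

```python
import numpy as np
from scipy.special import factorial

rb = np.linspace(1e-3, 40.0, 4000)               # integration grid for r_b
prior_rb = rb**8 * np.exp(-rb) / factorial(8)    # p(r_b | n=9)

def posterior_rg(rg_grid, m=7):
    post = np.array([
        np.trapz(prior_rb * (rb + rg)**m * np.exp(-(rb + rg)) / factorial(m), rb)
        for rg in rg_grid                        # marginalise over r_b for each r_g
    ])
    return post / np.trapz(post, rg_grid)        # normalise

rg = np.linspace(0.0, 15.0, 600)
p = posterior_rg(rg)
print("posterior mode at r_g =", rg[np.argmax(p)])
print("P(r_g > 4) =", np.trapz(p[rg > 4], rg[rg > 4]))
```

The resulting curve is the posterior described next: zero for negative rates (by the prior), largest at small r_g, with appreciable probability out to r_g of 4 or 5.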
...giving our final answer to the problem. [Figure: the posterior p(r_g|n=9, m=7), compared with the frequentist result r_g = −2 ± 4 (to 1 sigma).] Note that p(r_g < 0) = 0, due to the prior, and that the true value of r_g could very easily be as high as 4 or 5.

Let's see how this result would change if the background count were 3 rather than 9 (joint count still 7). [Figure: the posterior p(r_g|n=3, m=7).] Again, this looks very reasonable.

Bayesian hypothesis testing, and why we already know how to do it

• In Bayesian analysis, hypothesis testing can be performed as a generalised application of Bayes' theorem. Generally, a hypothesis differs from a parameter in that many values of the parameter(s) are consistent with one hypothesis: hypotheses are models that depend on parameters.
• Note however that we cannot define the probability of one hypothesis given some data d without defining all the alternatives in its class, i.e.
    prob(H_1|d,I) = prob(H_1|I) prob(d|H_1,I) / Σ_i prob(H_i|I) prob(d|H_i,I),
where the sum runs over all possible hypotheses, and often this is impossible. So questions like "do the data fit a gaussian?" are not well-formed until you list all the curves the data could fit. A well-formed question would be "do the data fit a gaussian better than these other curves?", or more usually "do the data fit a gaussian better than a lorentzian?".
• Simple comparisons can be expressed as an odds ratio,
    O_12 = prob(H_1|d,I)/prob(H_2|d,I) = [prob(H_1|I)/prob(H_2|I)] × [prob(d|H_1,I)/prob(d|H_2,I)].
• The odds ratio can be divided into the prior odds (the first bracket) and the Bayes factor (the second).
• The prior odds simply express our prior preference for H_1 over H_2, and are set to 1 if you are indifferent.
• The Bayes factor is just the ratio of the evidences, as defined in the earlier lectures. Recall that for a model that depends on a parameter a,
    p(a|d,H_1,I) = p(a|H_1,I) p(d|a,H_1,I) / p(d|H_1,I),
so the evidence is simply the joint probability of the parameter(s) and the data, marginalised over all hypothesis parameter values:
    p(d|H_1,I) = ∫ p(a|H_1,I) p(d|a,H_1,I) da.
• Example: a spacecraft is sent to a moon of Saturn and, using a penetrating probe, detects a liquid sea deep under the surface, at 1 atmosphere pressure and a temperature of −3°C. However, the thermometer has a fault, so that the temperature reading may differ from the true temperature by as much as ±5°C, with a uniform probability within this range. Determine the temperature of the liquid, assuming it is water (liquid within 0 < T < 100°C) and then assuming it is ethanol (liquid within −80 < T < 80°C). What are the odds of it being ethanol? [Based loosely on a problem by John Skilling.]

• Call the water hypothesis H_1 and the ethanol hypothesis H_2.
For H_1: the prior on the temperature is
    p(T|H_1) = 0.01 for 0 ≤ T ≤ 100, and 0 otherwise.
The likelihood of the temperature is the probability of the data d, given the temperature:
    p(d|T,H_1) = 0.1 for |d − T| ≤ 5, and 0 otherwise;
thought of as a function of T, for d = −3, this is a box of width 10°C spanning −8 ≤ T ≤ 2.
The posterior for T is the normalised product of the prior and the likelihood, giving
    p(T|d,H_1) = 0.5 for 0 ≤ T ≤ 2, and 0 otherwise:
H_1 only allows the temperature to be between 0 and 2°C. The evidence for water (as we defined it) is
    p(d|H_1) = ∫ p(d|T,H_1) p(T|H_1) dT = 0.1 × 0.01 × 2 = 0.002.
For H_2: by the same arguments,
    p(T|d,H_2) = 0.1 for −8 ≤ T ≤ 2, and 0 otherwise,
and the evidence for ethanol is
    p(d|H_2) = ∫ p(d|T,H_2) p(T|H_2) dT = 0.1 × (1/160) × 10 = 0.00625.

• So under the water hypothesis we have a tighter possible range for the liquid's temperature, but it may not be water. In fact, the odds of it being water rather than ethanol are
    O_12 = [prior odds] × [Bayes factor] = 1 × (0.002/0.00625) = 0.32,
which means about 3:1 in favour of ethanol. Of course this depends on our prior odds too, which we have set to 1. If the choice were between water and whisky under the surface of the moon, the result would be very different, though the Bayes factor would be roughly the same!
• Why do we prefer the ethanol option? Because too much of the prior for temperature, assuming water, falls at values that are excluded by the data. In other words, the water hypothesis is unnecessarily complicated. Bayes factors naturally implement Occam's razor in a quantitative way.
[Figure: the overlap integral of prior and likelihood (= the evidence) is greater for H_2 (ethanol, 0.00625) than for H_1 (water, 0.002).]

• To look at this a bit more generally, we can split the evidence into two approximate parts, the maximum of the likelihood and an "Occam factor":
    p(d|H) = ∫ p(x|H) p(d|x,H) dx ≈ L_max × (likelihood_range / prior_range),
i.e. evidence ≈ maximum likelihood × Occam factor. The Occam factor penalises models that include wasted parameter space, even if they show a good ML fit.
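Tying the example together, a numerical check of the two evidences and the Bayes factor (a sketch; the grid resolution is my choice):

```python
import numpy as np

d = -3.0                                  # measured temperature, deg C
T = np.linspace(-100.0, 150.0, 250_001)   # fine temperature grid

def evidence(T_lo, T_hi):
    prior = ((T >= T_lo) & (T <= T_hi)) / (T_hi - T_lo)   # uniform liquid range
    like = (np.abs(d - T) <= 5.0) * 0.1                   # +/-5 deg C box likelihood
    return np.trapz(prior * like, T)                      # overlap integral

Z_water = evidence(0.0, 100.0)       # ≈ 0.002
Z_ethanol = evidence(-80.0, 80.0)    # ≈ 0.00625
print(f"Bayes factor water:ethanol = {Z_water / Z_ethanol:.2f}")   # ≈ 0.32
```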
• This is a very powerful property of the Bayesian method. Example: you're given a time series of 1000 data points comprising a number of sinusoids embedded in gaussian noise. Determine the number of sinusoids and the standard deviation of the noise.
• We could think of this as comparing hypotheses H_n that there are n sinusoids in the data, with n ranging from 0 to n_max. Equivalently, we could consider it as a parameter-fitting problem, with n an unknown parameter within the model. The joint posterior for the n signals, with amplitudes {A}_n and frequencies {ω}_n, and noise variance σ², given the overall model and the data {D}, is
    p(n, {A}_n, {ω}_n, σ | {D}, I) ∝ p(n, {A}_n, {ω}_n, σ | I) p({D} | n, {A}_n, {ω}_n, σ).
The likelihood term, based on gaussian noise, is
    p({D} | n, {A}_n, {ω}_n, σ) ∝ ∏_j (1/(σ√(2π))) exp[ −(D_j − Σ_i A_i sin(ω_i t_j))² / (2σ²) ],
and we can set the priors as independent and uniform over sensible ranges.
• Our result for n is just its marginal probability,
    p(n | {D}, I) = ∫ p(n, {A}_n, {ω}_n, σ | {D}, I) d{A}_n d{ω}_n dσ,
and similarly we could marginalise for σ. Recent work, in collaboration with Umstatter and Christensen, has explored this for 1000 data points with 50 embedded signals and σ = 2.6. Result: around 33 signals can be recovered from the data; the rest are indistinguishable from noise, and the inferred σ is consequently higher.
• This has implications for the analysis of LISA data, which is expected to contain many (perhaps 50,000) signals from white-dwarf binaries. The data will contain resolvable binaries and binaries that just contribute to the overall noise (either because they are faint or because their frequencies are too close together). Bayes can sort these out without having to introduce ad hoc acceptance and rejection criteria, and without needing to know the "true noise level" (whatever that means!).

Frequentist hypothesis testing, and why we should tread carefully, if at all

Significance tests: a note on why these should really be avoided! The method goes like this:
– To test a hypothesis H_1, consider another hypothesis, called the null hypothesis H_0, the truth of which would deny H_1. Then argue against H_0...
– Use the data you have gathered to compute a test statistic T_obs (e.g. the chi-squared statistic) which has a calculable pdf p(T|H_0) if H_0 is true. This can be calculated analytically or by Monte Carlo methods.
– Look where your observed value of the statistic lies in this pdf, and reject H_0 based on how far into the wings of the distribution you have fallen.
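For concreteness, a minimal sketch of such a test (my own toy example: ten residuals that are standard normal under H_0, tested with a chi-squared statistic):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
residuals = rng.normal(size=10)        # stand-in data, N(0,1) under H0
T_obs = np.sum(residuals**2)           # chi-squared statistic, 10 dof under H0

p_value = stats.chi2(df=10).sf(T_obs)  # tail probability beyond T_obs
print(f"T_obs = {T_obs:.2f}, p-value = {p_value:.3f}")
# H0 is 'rejected at the 5% level' if p_value < 0.05. Note that no alternative
# hypothesis enters anywhere, which is precisely the complaint quoted next.
```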
• H_0 is rejected at the x% level if x% of the probability lies to the right of the observed value of the statistic (or is 'worse' in some other sense). This makes no reference to how improbable the observed value is under any alternative hypothesis (not even H_1!). So...

"An hypothesis [H_0] that may be true is rejected because it has failed to predict observable results that have not occurred. This seems a remarkable procedure. On the face of it, the evidence might more reasonably be taken as evidence for the hypothesis, not against it." -- Harold Jeffreys, Theory of Probability (1939)

• E.g., you are given a data point T_obs affected by gaussian noise, drawn either from N(μ = −1, σ = 0.5) or from N(μ = +1, σ = 0.5):
    H_0: the datum comes from μ = −1
    H_1: the datum comes from μ = +1.
Test whether H_1 is true.
• Here our statistic is simply the value of the observed data point. We will choose a critical region of T > 0, so that if T_obs > 0 we reject H_0 and therefore accept H_1.
• Formally, this can go wrong in two ways:
  – a Type I error occurs when we reject the null hypothesis when it is true;
  – a Type II error occurs when we accept the null hypothesis when it is false;
both of which we should strive to minimise. The probabilities ('p-values') of these are the tail areas on either side of the critical boundary T = 0 (about 5% each for this problem). If T_obs > 0 then we 'reject the null hypothesis at the 5% level' and therefore accept H_1.
• But note that we do not consider the relative probabilities of the hypotheses (we can't! Orthodox statistics does not allow the idea of hypothesis probabilities), so the results can be misleading.
• For example, let T_obs = 0.01. This lies just inside the critical region, so we reject H_0 in favour of H_1 at the 5% level, despite the fact that we know a value of 0.01 is just as likely under both hypotheses (actually just as unlikely, but it has happened).
• Note that the Bayes factor for this same result is ~1, reflecting the intuitive answer that you can't decide between the hypotheses on the basis of such a datum.

Moral: the subtleties of p-values are rarely reflected in the papers that quote them, and many errors of Type III (misuse!) occur. Always take them with a pinch of salt, and avoid them if possible; there are better tools available.
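The contrast is easy to reproduce (a sketch of the comparison above, with T_obs = 0.01 as in the slide):

```python
from scipy import stats

T_obs = 0.01
H0 = stats.norm(-1.0, 0.5)     # datum ~ N(-1, 0.5) under H0
H1 = stats.norm(+1.0, 0.5)     # datum ~ N(+1, 0.5) under H1

p_value = H0.sf(T_obs)                          # tail area beyond T_obs under H0
bayes_factor = H1.pdf(T_obs) / H0.pdf(T_obs)    # likelihood ratio for this datum
print(f"p-value under H0   = {p_value:.3f}")    # ~0.02: 'reject H0 at the 5% level'
print(f"Bayes factor H1:H0 = {bayes_factor:.2f}")  # ~1.1: no basis to decide
```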
Assigning Bayesian probabilities (you have to start somewhere!)

• We have done much on the manipulation of probabilities, but not much on the initial assignment of likelihoods and priors. Here are a few words... The principle of insufficient reason (Bernoulli, 1713) helps us out: if we have N mutually exclusive possibilities for an outcome, and no reason to believe one more than another, then each should be assigned the probability 1/N.
• So the probability of throwing a 6 on a die is 1/6, if all you know is that it has 6 faces.
• The key is to realise that this is one example of a more general principle: your state of knowledge is such that the probabilities are invariant under some transformation group (here, the exchange of numbers on the faces).
• Using this idea we can extend the principle of insufficient reason to continuous parameters.
• A parameter that is not known to within a change of origin (a 'location parameter') should have a uniform probability:
    p(x) = p(x + a) = constant.
• A parameter for which you have no knowledge of scale (a 'scale parameter') is a location parameter in its logarithm, so
    p(x) ∝ 1/x,
the so-called 'Jeffreys prior'.
• Note that both these priors are improper (you can't normalise them), so their unfettered use is restricted. However, you are rarely that ignorant about the value of a parameter.
• More generally, given some information I that we wish to use to assign probabilities {p_1...p_n} to n different possibilities, the most honest way of assigning the probabilities is the one that maximises
    H = −Σ_{i=1}^n p_i log p_i
subject to the constraints imposed by I. H is traditionally called the information entropy of the distribution, and measures the amount of uncertainty in the distribution in the sense of Shannon (though there are several routes to the same result). Honesty demands we maximise it; otherwise we are implicitly imposing further constraints not contained in I.
• For example, the maximum entropy solution for the probability of seeing k events, given only the information that they are characterised by a single rate constant, is the Poisson distribution. If you are told the first and second moments of a continuous distribution, the maxent solution is a gaussian, etc.

Monte Carlo methods: how to do those nasty marginalisations

1. Uniform random numbers, U[0,1]: see Numerical Recipes (http://www.numerical-recipes.com/).

2. Transformed random variables. Suppose we have x ~ U[0,1] and let y = y(x). Then p(y)dy = p(x)dx (the probability of a number between y and y+dy equals the probability of a number between x and x+dx), so
    p(y) = p(x(y)) |dx/dy|,
with the modulus because probability must be positive. For example, with y = a + (b−a)x we get y ~ U[a,b]. Numerical Recipes uses the transformation method to provide y ~ N(0,1).

3. Probability integral transform. Suppose we can compute the CDF P(x) of some desired random variable:
  1) draw y ~ U[0,1];
  2) compute x = P⁻¹(y);
  3) then x ~ p(x).

4. Rejection sampling. Suppose we want to sample from some pdf p(x), and we know that p(x) ≤ q(x) for all x, where q(x) is a simpler function called the proposal distribution:
  1) sample x_1 from q(x);
  2) sample y ~ U[0, q(x_1)];
  3) if y < p(x_1), ACCEPT; otherwise REJECT.
The set of accepted values {x_i} is a sample from p(x). The method can be very slow if the region between q(x) and p(x) is too large, particularly in high-dimensional problems such as the LISA problem considered earlier.

5. Metropolis-Hastings algorithm. Speed this up by letting the proposal distribution depend on the current sample:
  o sample an initial point x(1);
  o sample a tentative new state x′ from Q(x′; x(1)) (e.g. a gaussian centred on x(1));
  o compute
      a = [p(x′) Q(x(1); x′)] / [p(x(1)) Q(x′; x(1))];
  o if a ≥ 1, accept; otherwise accept with probability a.
  On acceptance, x(2) = x′; on rejection, x(2) = x(1).
The sequence {x(t)} is a Markov chain. Note that rejected samples appear in the chain as repeated values of the current state.

A histogram of the contents of the chain for a parameter converges to the marginal pdf for that parameter. In this way, high-dimensional posteriors can be explored and marginalised to return parameter ranges and interrelations inferred from complex data sets (e.g. the WMAP results). See http://www.statslab.cam.ac.uk/~mcmc/pages/links.html
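A compact Metropolis sketch (mine; the proposal is a symmetric gaussian, so the Q terms cancel in the ratio, and the target is the earlier GRB posterior p(F|n=3) ∝ F^(1/2) e^(−F), purely as a stand-in):

```python
import numpy as np

rng = np.random.default_rng(42)

def log_target(x):
    # Unnormalised log posterior; zero probability for x <= 0
    return 0.5 * np.log(x) - x if x > 0 else -np.inf

def metropolis(log_p, x0, n_steps=50_000, step=1.0):
    chain = np.empty(n_steps)
    x, lp = x0, log_p(x0)
    for t in range(n_steps):
        x_new = x + step * rng.normal()     # tentative state from the proposal
        lp_new = log_p(x_new)
        if np.log(rng.uniform()) < lp_new - lp:
            x, lp = x_new, lp_new           # accept
        chain[t] = x                        # a rejection repeats the current state
    return chain

chain = metropolis(log_target, x0=1.0)[5_000:]   # discard burn-in
print("posterior mean ≈", chain.mean())          # ≈ 1.5 for this target
# A histogram of `chain` approximates the marginal pdf of the parameter.
```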
The end (nearly)

Final thoughts. Things to keep in mind:
– Priors are fundamental, but not always influential. If you have good data, most reasonable priors return very similar posteriors.
– Degree of belief is subjective but not arbitrary: two people with the same degree of belief agree about the value of the corresponding probability.
– Write down the probability of everything, and marginalise over those parameters that are of no interest.
– Don't pre-filter if you can help it. Work with the data, not with statistics of the data.

Bayesian bibliography for astronomy
There are an increasing number of good books and articles on Bayesian methods. Here are just a few:
• E.T. Jaynes, Probability Theory: The Logic of Science, CUP, 2003
• D.J.C. MacKay, Information Theory, Inference and Learning Algorithms, CUP, 2004
• D.S. Sivia, Data Analysis: A Bayesian Tutorial, OUP, 1996
• T.J. Loredo, From Laplace to Supernova SN 1987A: Bayesian Inference in Astrophysics, 1990; copy at http://bayes.wustl.edu/gregory/articles.pdf
• G.L. Bretthorst, Bayesian Spectrum Analysis and Parameter Estimation, 1988; copy at http://bayes.wustl.edu/glb/book.pdf
• G. D'Agostini, Bayesian Reasoning in High-Energy Physics: Principles and Applications, http://preprints.cern.ch/cgi-bin/setlink?base=cernrep&categ=Yellow_Report&id=99-03
• Soon to appear: P. Gregory, Bayesian Logical Data Analysis for the Physical Sciences, CUP, 2005
Review sites:
• http://bayes.wustl.edu/
• http://www.bayesian.org/