Statistics for Astrophysics, Week 2
continuously improving...
Robert da Silva, UCSC, May 24, 2013

Outline
● Last Week:
  ● Discrete Distributions
  ● Bayes's Theorem, a.k.a. Model Fitting
  ● Conjugacy
● This Week:
  ● Continuous Distributions
  ● Non-conjugacy: MCMC (Gibbs, Metropolis-Hastings)
  ● Censored Data
● Next Week:
  ● Model Selection
  ● Decision Theory?

Notation
Distributions are at the heart of statistics. We write
$x \sim N(0, \sigma^2)$
where "$\sim$" reads "distributed as", $x$ is the variable we are describing, $N$ is the name of the distribution we are using, and $(0, \sigma^2)$ are the parameters of the distribution. Equivalently, $p(x) = N(x \mid \mu, \sigma^2)$, or $p(x) \sim N(0, \sigma^2)$.

Expectation
$E(f(x)) = \int f(x)\,p(x)\,dx$
$E(x)$ = mean
$\sigma^2$ = variance = $\mathrm{Var}(x) = E(x^2) - [E(x)]^2 \ge 0$
$E(1) = \int p(x)\,dx = 1$, where the integral is over the whole region in which the PDF is non-zero.

Bayes' Rule
$p(\theta \mid y) = \dfrac{p(y \mid \theta)\,p(\theta)}{p(y)}$
Here $\theta$ is the parameters of your model and $y$ is your data: $p(y \mid \theta)$ is the "likelihood" of your data, $p(\theta)$ is the prior on the model parameters, $p(\theta \mid y)$ is the posterior distribution of the model parameters, and $p(y)$ is the normalizing constant. If you can, choose a functional form of $p(\theta)$ such that your posterior has a nice clean form, i.e. conjugacy.

Normal
The Standard Normal is $N(0, 1)$.

Law of Large Numbers
If you have enough data, everything starts to look Gaussian*:
● Binomial
● Poisson
● Neg-Binomial
● Gamma$(\alpha, \beta)$
● Beta$(\alpha, \beta)$
* But "enough data" could be a lot more than the amount you have!

$\chi^2$ Distribution
A special case of the Gamma distribution.
$Q = \sum_{i=1}^{k} Z_i^2$
will be $\chi^2$ distributed with $k$ DOF if the $Z_i$ are standard normals; an example of a $Z_i$ is $(x - \mu)/\sigma$. (A Monte Carlo check of this construction appears after the Scaled-Inverse-$\chi^2$ slide below.)

Sampling Distributions
For the normal, certain statistics follow specific distributions:
$p(\bar X \mid \mu, \sigma^2) \sim N(\mu, \sigma^2/n)$
$\hat\sigma_0^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i - \mu)^2, \qquad \frac{n\hat\sigma_0^2}{\sigma^2} \sim \chi^2_n$
$\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar X)^2, \qquad \frac{n\hat\sigma^2}{\sigma^2} \sim \chi^2_{n-1}$

Likelihood
Once you know $y$, the likelihood takes a single value if you know $\theta$. You can then vary $\theta$ to see what values are most likely to produce the data.
Sampling distribution: the distribution for a statistic given parameters $\theta$.
Likelihood: once you know the data, you can evaluate the sampling distribution at the observed $y$ and plot it as a function of $\theta$.
This is a key part of Bayesian inference because it lets you invert the probability: it lets you go from $p(y \mid \theta)$ to $p(\theta \mid y)$.

Inverse-$\chi^2$ Distribution
If you had a $\chi^2_k$ variable and inverted it, you would get an Inv-$\chi^2_k$.

Scaled-Inverse-$\chi^2$ Distribution
If you had a $\chi^2_k$ variable $x$ and wanted the distribution of $\tau^2/x$, you would get this. Also written as Inv-$\chi^2(\nu, \tau^2)$.
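As a quick sanity check of the $\chi^2$ construction above, here is a minimal Monte Carlo sketch in Python (the choices $k = 5$ and 100,000 draws are illustrative, not from the slides):

```python
import numpy as np
from scipy import stats

# The sum of k squared standard normals should follow chi^2 with k DOF.
rng = np.random.default_rng(1)
k, n_draws = 5, 100_000
Q = (rng.standard_normal((n_draws, k)) ** 2).sum(axis=1)

print(Q.mean(), Q.var())                      # chi^2_k has mean k and variance 2k
print(stats.kstest(Q, stats.chi2(df=k).cdf))  # KS test against the chi^2_k CDF
```

The same samples also echo the scaled-inverse-$\chi^2$ slide: `tau2 / Q`, for some scale `tau2`, gives draws from the corresponding inverse family.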
Normal with Known Variance, Unknown Mean
Consider the case where your noise is dominated by background, which you measure to have a noise of $\sigma = 2$. You have 6 measurements; what is the mean?
$\theta = [\mu, \sigma^2], \qquad y = [4.64,\ 3.47,\ 6.06,\ 0.61,\ 6.96,\ 1.82]$
$p(y_i \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{1}{2\sigma^2}(y_i - \mu)^2\right)$
$p(y \mid \mu, \sigma^2) = \prod_{i=1}^{n} p(y_i \mid \mu, \sigma^2) = \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^{\!n} \exp\!\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \mu)^2\right)$
Completing the square gives
$p(y \mid \mu, \sigma^2) = \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^{\!n} \exp\!\left(-\frac{1}{2\sigma^2}\left[n(\mu - \bar y)^2 + \sum_{i=1}^{n}(y_i - \bar y)^2\right]\right)$
$p(y \mid \mu, \sigma^2) \propto \sigma^{-n} \exp\!\left(-\frac{n}{2\sigma^2}(\bar y - \mu)^2\right)$
So what should the prior look like?
$p(\mu) \propto \exp\!\left(-\frac{1}{2v_0^2}(\mu - m_0)^2\right)$
$p(\mu \mid y, \sigma^2) \sim N(m_1, v_1^2)$
$m_1 = \dfrac{\frac{m_0}{v_0^2} + \frac{n\bar y}{\sigma^2}}{\frac{1}{v_0^2} + \frac{n}{\sigma^2}}$ (weight by the inverse variances)
$\dfrac{1}{v_1^2} = \dfrac{1}{v_0^2} + \dfrac{n}{\sigma^2}$ (add the inverse variances)
[Figure: prior, likelihood, and posterior for $\mu$ in a simulated example with $\mu = 5$, $\sigma^2 = 4$; observed $\bar y = 3.93$, $\sigma^2 = 4$]

Normal with Known Mean, Unknown Variance
The classic calibration: we know the mean and measure multiple times to understand our errors.
$p(y \mid \mu, \sigma^2) \propto (\sigma^2)^{-n/2} \exp\!\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \mu)^2\right)$
Defining $s^2 = \frac{1}{n}\sum_{i=1}^{n}(y_i - \mu)^2$,
$p(y \mid \mu, \sigma^2) \propto (\sigma^2)^{-n/2} \exp\!\left(-\frac{n s^2}{2\sigma^2}\right)$
We want a scaled inverse-$\chi^2$ prior here:
$p(\sigma^2) \sim \text{Inv-}\chi^2(\nu_0, \tau^2), \qquad p(\sigma^2) \propto (\sigma^2)^{-(1+\nu/2)} \exp\!\left(-\frac{\nu\tau^2}{2\sigma^2}\right)$
$p(\sigma^2 \mid y, \mu) \propto p(\sigma^2)\,p(y \mid \mu, \sigma^2) \sim \text{Inv-}\chi^2\!\left(\nu_0 + n,\ \frac{\nu_0\tau^2 + n s^2}{\nu_0 + n}\right)$
[Figure: posterior for $\sigma^2$ in a simulated example with true $\sigma^2 \equiv 4$]

Monte Carlo Methods
Consider the question of the probability of you winning a game of solitaire. This is a very difficult problem: it involves annoying details of how the cards are shuffled and appear, and it even involves taking into account the skill of the player. You could spend years of your life trying to figure this out, or... you could just play a few games!
Lose, Lose, Win, Lose $\Rightarrow P(\text{Win}) = 1/4$

Monte Carlo Methods
There are a lot of ways of working with probabilities; one is to analytically calculate the PDFs. That way you have nearly infinite accuracy for a wide variety of tasks, but it can be impossible in many instances. So what can we do? Instead of analytically solving for the PDF, we can just obtain a large sample of numbers drawn from the distribution. Then we can do whatever we want. For example:
● Calculate the 95% confidence range for the PDF
● Determine what fraction of values fall in a particular region
● Plug those values into some analysis to propagate uncertainty
● Ask: what is the maximum value I expect to see in a sample of size N?

Monte Carlo Methods
Last week James asked how to tell if a Poisson variable is overdispersed. One simple way: assume it is Poisson, draw N equivalent data sets, and see what fraction of them have as large a discrepancy between the mean and variance as your data. If your data is a significant outlier, then try something more robust (like the negative binomial). (A sketch of this check in Python appears after the Drawing Random Numbers slide below.)
Counts data: $y = [3, 6, 4, 9, 2]$; $E(x) = 4.8$, $\mathrm{Var}(x) = 7.7$, $\mathrm{Var}(x)/E(x) \approx 1.60$

Monte Carlo Methods
My personal favorite Monte Carlo technique is calibration. You have some data with some signal and some noise. Make a whole bunch of fake data with the same level of noise and put it through:
Fake data with known properties → Your software → Outputs that you can compare to the true values

Drawing Random Numbers
Most packages have their own built-in routines for drawing random numbers from standard distributions.
R: the language used by most statisticians obviously has the cleanest implementation. rgamma() gives you a random gamma, rbeta() a beta, etc.
IDL: randomn() supports binomial, single-parameter gamma, uniform, Poisson, and normal distributions. Other routines fill in the other gaps (try googling).
Python: scipy.stats has nice implementations of most common distributions.
Many distributions can be constructed by manipulating other distributions:
Inverse-Gamma ~ 1/Gamma(); $N(20, 10^2)$ ~ $20 + 10\,N(0, 1)$
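Putting these pieces together, here is a minimal sketch of the overdispersion check from a few slides back, using the counts data given there (N = 10,000 replicate data sets is an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)
y = np.array([3, 6, 4, 9, 2])
obs_ratio = y.var(ddof=1) / y.mean()        # observed variance-to-mean ratio

# Draw equivalent data sets under the Poisson assumption (lambda = sample mean)
# and see how often they look at least as overdispersed as the real data.
N = 10_000
fake = rng.poisson(lam=y.mean(), size=(N, len(y)))
fake_ratio = fake.var(axis=1, ddof=1) / fake.mean(axis=1)
print((fake_ratio >= obs_ratio).mean())     # fraction as extreme as observed
```

If that fraction is small, the Poisson assumption is suspect and something more robust, like the negative binomial, is worth trying.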
Normal with Unknown Mean and Variance
There is a way to tackle this with conjugacy, but it's not as instructive as showing an entirely different method: the Gibbs sampler, a form of MCMC. The basic idea is that we know how to solve the problem when we know the mean or the variance, so we can break it into those two problems:
1. Initialize to some reasonable values for $\mu$ and $\sigma^2$, called $\mu_0$ and $\sigma_0^2$
2. At iteration $i$, draw a value from $p(\mu \mid \sigma_{i-1}^2, y)$ and call it $\mu_i$
3. Then draw a value from $p(\sigma^2 \mid \mu_i, y)$ and call it $\sigma_i^2$
4. Continue steps 2-3 for a large number $N \sim 10^4$ of steps, and ignore the first 1/3
You now have $2N/3$ joint samples of $(\mu, \sigma^2)$. Your samples will reasonably approximate draws from $p(\mu, \sigma^2 \mid y)$. Do whatever you want with them!

MCMC: Gibbs Sampler
In general, you can implement a Gibbs sampler as follows (a sketch for the normal model follows below):
1. Identify your parameter vector $\theta$ of length $D$ and set initial values $\theta^0$
2. For $t = 1, \ldots, N$: for $i = 1, \ldots, D$, draw $\theta_i^t$ from the full conditional $p(\theta_i \mid \theta_1^t, \ldots, \theta_{i-1}^t, \theta_{i+1}^{t-1}, \ldots, \theta_D^{t-1}, y)$ (with the obvious edge cases)
Thus each parameter element is updated conditional on the values of the other components of $\theta$: the current iteration's values if they have already been updated, and the previous iteration's values if they have not. This is also known as alternating conditional sampling.
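Here is a minimal sketch of this Gibbs sampler for the normal model, using the six measurements and the conjugate full conditionals from the earlier slides. The prior settings (m0, v02, nu0, tau2) are made-up weak values for illustration, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(42)
y = np.array([4.64, 3.47, 6.06, 0.61, 6.96, 1.82])
n = len(y)

m0, v02 = 0.0, 100.0    # assumed weak N(m0, v0^2) prior on mu
nu0, tau2 = 1.0, 1.0    # assumed weak Inv-chi^2(nu0, tau^2) prior on sigma^2

N = 10_000
mu, sig2 = y.mean(), y.var()        # step 1: reasonable starting values
samples = np.empty((N, 2))

for t in range(N):
    # step 2: mu | sigma^2, y ~ N(m1, v1^2)  (weight/add the inverse variances)
    v12 = 1.0 / (1.0 / v02 + n / sig2)
    m1 = v12 * (m0 / v02 + n * y.mean() / sig2)
    mu = rng.normal(m1, np.sqrt(v12))

    # step 3: sigma^2 | mu, y ~ Inv-chi^2(nu0 + n, (nu0*tau2 + n*s2)/(nu0 + n));
    # a Scaled-Inv-chi^2(nu, scale) draw is nu * scale / chi^2_nu
    s2 = np.mean((y - mu) ** 2)
    nu_n = nu0 + n
    scale = (nu0 * tau2 + n * s2) / nu_n
    sig2 = nu_n * scale / rng.chisquare(nu_n)

    samples[t] = mu, sig2

keep = samples[N // 3:]             # step 4: ignore the first third
print(keep.mean(axis=0))            # posterior means of (mu, sigma^2)
```

The kept rows are (approximately) joint draws from $p(\mu, \sigma^2 \mid y)$, so marginal intervals are just percentiles of each column.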
Censored Data
@&#%! Now imagine that our data had lower limits:
$y = [4.64,\ 3.47,\ 6.06,\ {>}0.61,\ 6.96,\ 1.82]$
How do we tackle this problem?

Censored Data
Easy! Just add the true value to $\theta$:
$\theta = [\mu, \sigma^2, y_4]$
So what is the full conditional of $y_4$?
$p(y_4 \mid \mu_t, \sigma_t^2, y_1, y_2, y_3, y_5, y_6) = p(y_4 \mid \mu_t, \sigma_t^2) \sim N(\mu_t, \sigma_t^2)\,\mathbf{1}_{y_4 > 0.61}$
This is a truncated normal distribution.

Censored Data
What is the likelihood of a lower limit, $L$?
$p(L \mid \theta) = \int_L^{\infty} p(y \mid \theta)\,dy$
More generally, you can think of any interval as $[L, U]$:
$p([L, U] \mid \theta) = \int_L^U p(y \mid \theta)\,dy$
where you can let $U$ or $L$ go to limiting values for lower/upper limits.

Rejection Sampling
How do you draw from a truncated normal distribution? The simplest way is to draw from a normal and reject and replace any draws in the undesired range. Obviously, if you are out in the tails this is very expensive and bad, so you would use something else (typically the transformation method). But this method is more powerful than just restricting ranges! Consider $x \sim$ Uniform(0,1), and reject draws above 0.6 with probability 1/2. What kind of PDF will this produce?
[Figure: the resulting piecewise-constant PDF, $p(x)$ vs. $x$]
More generally: draw from a proposal distribution $g(x)$ such that, for some $c$, $c\,g(x)$ is always greater than $f(x)$. Then reject with probability $1 - \frac{f(x)}{c\,g(x)}$.

Marginalization
Before we saw that $p(\theta_1, \theta_2) = p(\theta_1 \mid \theta_2)\,p(\theta_2)$. How do we get $p(\theta_1)$ from $p(\theta_1, \theta_2)$?
$p(\theta_1) = \int p(\theta_1, \theta_2)\,d\theta_2$
Here $\theta_2$ is a nuisance variable. This can be done via analytic integration (hard) or via Monte Carlo methods (easy): for Monte Carlo, just draw $(\theta_1, \theta_2)$ jointly and throw away $\theta_2$. For example:
$p(\mu \mid y) = \int p(\mu, \sigma^2 \mid y)\,d\sigma^2 = \int p(\mu \mid \sigma^2, y)\,p(\sigma^2 \mid y)\,d\sigma^2$

Student-t
This marginal is a Student-t; as df → ∞ it approaches a Gaussian.

Student-t with Unknown DOF
Imagine that our science demanded that we know the posterior distribution for the DOF:
$p(y \mid \nu) = \frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\!\left(\frac{\nu}{2}\right)} \left(1 + \frac{y^2}{\nu}\right)^{-\frac{\nu+1}{2}}$
As the complexity of this expression suggests, there is no clean conjugacy that works. Sometimes the likelihood and/or the prior don't even have analytic forms. This can also be an issue if you have actual prior knowledge that is multimodal or is not of the nice clean conjugate form. So what can we do?

MCMC: Metropolis-Hastings Algorithm
1. Pick a starting point $\theta^0$
2. For $t = 1, \ldots, N$ do the following:
(a) Sample a new $\theta^*$ from a jumping distribution $J_t(\theta^* \mid \theta^{t-1})$. This distribution does not have to be symmetric. Commonly it is a Gaussian.
(b) Calculate $r = \dfrac{p(\theta^* \mid y)\,/\,J_t(\theta^* \mid \theta^{t-1})}{p(\theta^{t-1} \mid y)\,/\,J_t(\theta^{t-1} \mid \theta^*)}$
(c) Set $\theta^t = \theta^*$ with probability $\min(r, 1)$, otherwise $\theta^t = \theta^{t-1}$
3. Throw away the first $N_0$ draws and use the rest as draws from the posterior.
This means you only need to be able to calculate the likelihood and prior given the parameters. There are other methods of making the jumping distribution; see affine-invariant ensemble methods, e.g. emcee.

MCMC: Metropolis Algorithm
1. Pick a starting point $\theta^0$
2. For $t = 1, \ldots, N$ do the following:
(a) Sample a new $\theta^*$ from a jumping distribution $J_t(\theta^* \mid \theta^{t-1})$. This distribution must be symmetric: $J_t(\theta^* \mid \theta^{t-1}) = J_t(\theta^{t-1} \mid \theta^*)$. Commonly it is a Gaussian.
(b) Calculate $r = \dfrac{p(\theta^* \mid y)}{p(\theta^{t-1} \mid y)}$
(c) Set $\theta^t = \theta^*$ with probability $\min(r, 1)$, otherwise $\theta^t = \theta^{t-1}$
3. Throw away the first $N_0$ draws and use the rest as draws from the posterior.

When to Use What
● If you can, you should use a conjugate method
● If that isn't possible, see if you can find the full conditionals and try a Gibbs sampler
● If all else fails, use Metropolis-Hastings, which can require careful tuning (a sketch follows below)
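Here is a minimal Metropolis sketch for the Student-t DOF problem above. Everything concrete in it is an assumption for illustration: the data are simulated from a t distribution with ν = 5, the sampler works on log ν with an assumed flat prior on log ν, and the jump scale of 0.3 is untuned:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y = stats.t.rvs(df=5, size=200, random_state=rng)  # simulated data, nu_true = 5

def log_post(log_nu):
    # Student-t log-likelihood, plus an assumed flat prior on log(nu)
    return stats.t.logpdf(y, df=np.exp(log_nu)).sum()

N, step = 20_000, 0.3            # chain length and Gaussian jump scale (untuned)
chain = np.empty(N)
theta = np.log(10.0)             # starting point
lp = log_post(theta)

for t in range(N):
    prop = theta + step * rng.normal()         # symmetric Gaussian jump
    lp_prop = log_post(prop)
    if np.log(rng.uniform()) < lp_prop - lp:   # accept with probability min(r, 1)
        theta, lp = prop, lp_prop
    chain[t] = theta

nu = np.exp(chain[N // 3:])                    # drop burn-in, back-transform
print(nu.mean(), np.percentile(nu, [2.5, 97.5]))
```

Because the jump is symmetric in $\log\nu$, the Hastings correction cancels and the acceptance ratio reduces to the posterior ratio, exactly as in the Metropolis slide.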
A Quick Detour into Confidence Intervals
[Figure: a 2D 95% confidence region for $(\theta_1, \theta_2)$, with its projections onto each axis marked]
What is $\Pr(x_1 < \theta_1 < x_2)$? It is $> 0.95$ for each parameter. If the pink lines bounded the 95 percent range for each variable, the maximum probability inside the box would be $0.95^2 < 0.95$. That means the projections of the confidence region are too conservative. The marginals are the real way of doing this.

A Quick Detour into Confidence Intervals
The standard non-Bayesian technique I have seen is:
1) Find the best-fitting model in terms of $\chi^2$ and call it $\hat\theta$
2) Then move out in parameter space until the $\chi^2$ has increased by an amount $\Delta\chi^2$
What does this assume?

$\Delta\chi^2$ as a function of confidence level $p$ and number of parameters $\nu$ (Numerical Recipes in C++, pg. 815):

p (%)     ν=1    ν=2    ν=3    ν=4    ν=5    ν=6
68.27    1.00   2.30   3.53   4.72   5.89   7.04
90       2.71   4.61   6.25   7.78   9.24   10.6
95.45    4.00   6.18   8.02   9.72   11.3   12.8
99       6.63   9.21   11.3   13.3   15.1   16.8
99.73    9.00   11.8   14.2   16.3   18.2   20.1
99.99    15.1   18.4   21.1   23.5   25.7   27.9

A Quick Detour into Confidence Intervals
Assume that your posterior is Gaussian:
$p(\theta \mid y) \sim N(\mu_\theta, \Sigma_\theta)$
$p(\theta \mid y) \propto |\Sigma_\theta|^{-1/2} \exp\!\left(-\frac{1}{2}(\theta - \mu_\theta)^T \Sigma_\theta^{-1} (\theta - \mu_\theta)\right)$
$\log p(\theta \mid y) = \text{constant} - \frac{1}{2}\log|\Sigma_\theta| - \frac{1}{2}(\theta - \mu_\theta)^T \Sigma_\theta^{-1}(\theta - \mu_\theta)$
The first two terms are constant w.r.t. $\theta$; the last is $-\frac{1}{2}$ times a $\chi^2_d$-distributed variable. So you can find the "posterior maximum" $\hat\theta$ and go out to $\Delta\chi^2 = x$ such that $\Pr(\chi^2_d \le x) = 0.95$ for 95% confidence intervals.
For 2D, a multivariate Gaussian can only produce rotated ellipses. This is just a benchmark, a heuristic that probably shouldn't be used for real serious work. You should be using MCMC realizations to figure out your contours.
[Figure: a skewed 2D posterior; is the maximum the best?]

Conclusion
Be careful with your confidence intervals! $\chi^2$ isn't the best way to do this: you want MCMC! The peak of the posterior is not always the best, so don't just use the peak value $\pm$ an interval. You need to think about the full distribution, including all of its joint properties!
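As a closing check, the $\Delta\chi^2$ table above is just the quantile function of the $\chi^2_\nu$ distribution, so it can be reproduced in a few lines (a minimal sketch in Python):

```python
from scipy import stats

# Each Delta-chi^2 entry is the chi^2_nu quantile at confidence level p.
for p in (0.6827, 0.90, 0.9545, 0.99, 0.9973, 0.9999):
    row = [stats.chi2.ppf(p, df=nu) for nu in range(1, 7)]
    print(f"{100 * p:6.2f}%  " + "  ".join(f"{x:5.2f}" for x in row))
```

This also makes the assumption explicit: the thresholds are exact only when the posterior (or likelihood surface) is Gaussian, which is the point of the slides above.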