Statistics For Astrophysics
Week 2
continuously improving...
Robert da Silva, UCSC, May 24, 2013
Outline

● Last Week:
  ● Discrete Distributions
  ● Bayes's Theorem a.k.a. Model Fitting
  ● Conjugacy
● This Week:
  ● Continuous Distributions
  ● Non-conjugacy
    – MCMC: Gibbs, Metropolis-Hastings
    – Censored Data
● Next Week:
  ● Model Selection
  ● Decision Theory?
Notation

Distributions are at the heart of statistics.

$x \sim N(0, \sigma^2)$

Read "$\sim$" as "distributed as": $x$ is the variable we are describing, $N$ is the name of the distribution we are using, and $0$ and $\sigma^2$ are the parameters of the distribution. Written as a density:

$p(x) = N(x \mid \mu, \sigma^2)$
Expectation

$E(f(x)) = \int f(x)\, p(x)\, dx$

$E(x)$ = mean

$\sigma^2 = \mathrm{Variance} = \mathrm{Var}(x) = E(x^2) - [E(x)]^2 \ge 0$

$E(1) = \int p(x)\, dx = 1$

The integral is over the whole region where the PDF is non-zero.
Bayes' Rule

$p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(y)}$

Here $p(y \mid \theta)$ is the "likelihood" of your data, $p(\theta)$ is the prior on the model parameters, $p(\theta \mid y)$ is the posterior distribution of the model parameters, and $p(y)$ is the normalizing constant.

$\theta$: parameters of your model
$y$: your data

If you can, choose a functional form of $p(\theta)$ such that your posterior has a nice clean form, i.e. conjugacy.
Normal

The Standard Normal is $N(0, 1)$.
Law of Large Numbers

● If you have enough data, everything starts to look Gaussian*:
  ● Binomial
  ● Poisson
  ● Neg-Binomial
  ● Gamma

* But "enough data" could be a lot more than the amount you have!
[Plots: the Gamma(α, β) and Beta(α, β) distributions]
χ² Distribution

A special case of the Gamma distribution.

$Q = \sum_{i=1}^{k} Z_i^2$

will be $\chi^2$ distributed with $k$ DOF if the $Z_i$ are standard normals. An example of a $Z_i$ is $(x - \mu)/\sigma$.
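As a quick numerical check of this construction, one can sum squared standard normals and compare with scipy's χ² distribution; a minimal sketch (numpy/scipy assumed available):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
k = 3
# Q = sum of k squared standard normals should be chi^2 with k DOF
Q = (rng.standard_normal(size=(100_000, k)) ** 2).sum(axis=1)
print(Q.mean(), stats.chi2(df=k).mean())  # both close to k
```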
Sampling Distribution

For the normal, certain statistics follow specific distributions:

$p(\bar{X} \mid \mu, \sigma^2) \sim N(\mu, \sigma^2/n)$

$\hat{\sigma}_0^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \mu)^2, \qquad \frac{n \hat{\sigma}_0^2}{\sigma^2} \sim \chi^2_n$

$\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2, \qquad \frac{n \hat{\sigma}^2}{\sigma^2} \sim \chi^2_{n-1}$
Sampling Distribution ↔ Likelihood

Once you know $y$, the likelihood takes a single value if you know $\theta$. You can then vary $\theta$ to see which values are most likely to produce the data.

Sampling distribution: the distribution of a statistic given parameters $\theta$.
Likelihood: once you know the data, you can evaluate the value of the sampling distribution at the observed $y$ and plot it as a function of $\theta$.

This is a key part of Bayesian inference because it lets you invert the probability: it lets you go from $p(y \mid \theta)$ to $p(\theta \mid y)$.
Inverse-χ² Distribution

If you have a $\chi^2_k$ variable and invert it, you get an $\mathrm{Inv}\text{-}\chi^2_k$.
Scaled-Inverse-χ² Distribution

If you have a $\chi^2_k$ variable $x$ and want the distribution of $\tau^2/x$, you get this. Also written as $\mathrm{Inv}\text{-}\chi^2(\nu, \tau^2)$.
Normal with Known Variance, Unknown Mean

Consider the case where your noise is dominated by background, which you measure to have a noise of $\sigma = 2$. You have 6 measurements; what is the mean?

$\theta = [\mu, \sigma^2]$
$y = [4.64, 3.47, 6.06, 0.61, 6.96, 1.82]$

$p(y_i \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}(y_i - \mu)^2\right)$

$p(y \mid \mu, \sigma^2) = \prod_{i=1}^{n} p(y_i \mid \mu, \sigma^2) = \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^{n} \exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \mu)^2\right)$

Completing the square gives

$p(y \mid \mu, \sigma^2) = \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^{n} \exp\left(-\frac{1}{2\sigma^2}\left[n(\mu - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \bar{y})^2\right]\right)$

$p(y \mid \mu, \sigma^2) \propto \sigma^{-n} \exp\left(-\frac{n}{2\sigma^2}(\bar{y} - \mu)^2\right)$
Normal with Known Variance, Unknown Mean

So what should the prior look like?

$p(\mu) \propto \exp\left(-\frac{1}{2v_0^2}(\mu - m_0)^2\right)$

$p(\mu \mid y, \sigma^2) \sim N(m_1, v_1^2)$

$m_1 = \dfrac{\frac{m_0}{v_0^2} + \frac{\bar{y}\, n}{\sigma^2}}{\frac{1}{v_0^2} + \frac{n}{\sigma^2}}$ (weight by the inverse variances)

$\dfrac{1}{v_1^2} = \dfrac{1}{v_0^2} + \dfrac{n}{\sigma^2}$ (add the inverse variances)
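A minimal sketch of this conjugate update in Python; the prior values m0 and v0_sq below are illustrative assumptions (a broad prior), not values from the talk:

```python
import numpy as np

def posterior_mean_known_variance(y, sigma2, m0, v0_sq):
    # 1/v1^2 = 1/v0^2 + n/sigma^2          (add the inverse variances)
    # m1 = v1^2 * (m0/v0^2 + n*ybar/sigma^2)  (inverse-variance weighting)
    n, ybar = len(y), np.mean(y)
    v1_sq = 1.0 / (1.0 / v0_sq + n / sigma2)
    m1 = v1_sq * (m0 / v0_sq + n * ybar / sigma2)
    return m1, v1_sq

y = [4.64, 3.47, 6.06, 0.61, 6.96, 1.82]
m1, v1_sq = posterior_mean_known_variance(y, sigma2=4.0, m0=0.0, v0_sq=100.0)
```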
Normal with Known Variance, Unknown Mean

[Plot: prior and posterior for the example; true $\mu = 5$, $\sigma^2 = 4$; sample $\bar{y} = 3.93$, $\sigma^2 = 4$]
Normal with Known Mean, Unknown Variance

The classic calibration: we know the mean and measure multiple times to understand our errors.

$p(y \mid \mu, \sigma^2) \propto (\sigma^2)^{-n/2} \exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \mu)^2\right)$

$s^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - \mu)^2$

$p(y \mid \mu, \sigma^2) \propto (\sigma^2)^{-n/2} \exp\left(-\frac{n}{2\sigma^2} s^2\right)$

We want a scaled inverse-χ² prior here:

$p(\sigma^2) \sim \mathrm{Inv}\text{-}\chi^2(\nu_0, \tau^2)$

$p(\sigma^2) \propto (\sigma^2)^{-(1+\nu_0/2)} \exp\left(-\frac{\nu_0 \tau^2}{2\sigma^2}\right)$

$p(\sigma^2 \mid y, \mu) \propto p(\sigma^2)\, p(y \mid \mu, \sigma^2) \sim \mathrm{Inv}\text{-}\chi^2\!\left(\nu_0 + n,\; \frac{\nu_0 \tau^2 + n s^2}{\nu_0 + n}\right)$
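To draw from this posterior, one can use the defining relation above: if $X \sim \chi^2_\nu$, then $\nu\tau^2/X \sim \mathrm{Inv}\text{-}\chi^2(\nu, \tau^2)$. A sketch in Python; the prior values $\nu_0$ and $\tau^2$ here are illustrative assumptions:

```python
import numpy as np

def sample_scaled_inv_chi2(nu, tau2, size, rng):
    # if X ~ chi^2_nu, then nu * tau2 / X ~ Scaled-Inv-chi^2(nu, tau2)
    return nu * tau2 / rng.chisquare(nu, size=size)

rng = np.random.default_rng(0)
y = np.array([4.64, 3.47, 6.06, 0.61, 6.96, 1.82])
mu = 5.0                      # known mean
nu0, tau2 = 1.0, 1.0          # assumed prior values, for illustration only
n = len(y)
s2 = np.mean((y - mu) ** 2)
sigma2_draws = sample_scaled_inv_chi2(
    nu0 + n, (nu0 * tau2 + n * s2) / (nu0 + n), size=10_000, rng=rng)
```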
Normal with Known Mean, Unknown Variance

[Plot: posterior for $\sigma^2$; true value $\sigma^2 \equiv 4$]
Monte Carlo Methods

Consider the question of the probability of you winning a game of solitaire.

● This is a very difficult problem that involves annoying details of how the cards are shuffled and appear, and even involves needing to take into account the skill of the player.
● You could spend years of your life trying to figure this out, or...
● You could just play a few games!

Lose, Lose, Win, Lose → $P(\mathrm{Win}) = 1/4$
Monte Carlo Methods

There are a lot of ways of working with probabilities; one way is to analytically calculate the PDFs. This way you have nearly infinite accuracy for a wide variety of tasks, but it can be impossible in many instances.

So what can we do? How about, instead of analytically solving the PDF, we instead just have a large sample of numbers drawn from the distribution? Then we can do whatever we want. For example:

● Calculate the 95% confidence range for the PDF
● Determine what fraction of values fall in a particular region
● Plug those values into some analysis to propagate uncertainty
● Find the maximum value I expect to see in a sample of size N
Monte Carlo Methods

Last week James asked how you can tell if a Poisson variable is overdispersed. One simple way is to assume Poisson, draw N equivalent data sets, and see what fraction have as large a discrepancy between the mean and variance as your data. If your data is a significant outlier, then try something more robust (like the negative binomial). A sketch of this check follows below.

Count data: $y = [3, 6, 4, 9, 2]$; $E(x) = 4.8$, $\mathrm{Var}(x) = 7.7$, so $\mathrm{Var}(x)/E(x) = 1.60$
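A minimal version of that check in Python; the number of fake data sets N and the use of the sample variance (ddof=1) are my assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
y = np.array([3, 6, 4, 9, 2])
obs_ratio = y.var(ddof=1) / y.mean()        # observed variance-to-mean ratio

# Draw N Poisson data sets of the same size and mean, and ask how often
# a variance-to-mean ratio at least this extreme arises under Poisson.
N = 100_000
fake = rng.poisson(y.mean(), size=(N, y.size))
fake_ratio = fake.var(axis=1, ddof=1) / fake.mean(axis=1)
p_value = np.mean(fake_ratio >= obs_ratio)
```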
Monte Carlo Methods

My personal favorite Monte Carlo technique is calibration. You have some data with some signal and some noise. Make a whole bunch of fake data with the same level of noise and put it through your pipeline:

Fake data with known properties → Your software → Outputs that you can compare to the true values
Drawing Random Numbers

Most packages have their own built-in routines for drawing random numbers from standard distributions.

R: the language used by most statisticians obviously has the cleanest implementation. rgamma() gives you a random gamma, rbeta() a beta, etc.

IDL: randomn() supports binomial, single-parameter gamma, uniform, Poisson, and normal distributions. Other routines fill in the gaps (try googling).

Python: scipy.stats has nice implementations of most common distributions.

Many distributions can be constructed by manipulating other distributions:

Inverse-Gamma ~ 1/Gamma()
N(20, 10) ~ 20 + 10·N(0, 1)
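For example, in Python (a sketch using scipy.stats; the parameter values are arbitrary):

```python
import scipy.stats as st

st.gamma(a=2.0).rvs(size=5)          # random gammas, like R's rgamma()
st.beta(a=2.0, b=3.0).rvs(size=5)    # random betas, like R's rbeta()

# constructing distributions from other distributions:
inv_gamma_draws = 1.0 / st.gamma(a=2.0).rvs(size=5)    # Inverse-Gamma ~ 1/Gamma
shifted_normals = 20.0 + 10.0 * st.norm().rvs(size=5)  # N(20, 10) ~ 20 + 10 N(0,1)
```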
Normal with Unknown Mean and Variance

There is a way to tackle this with conjugacy, but it's not as instructive as showing an entirely different method: the Gibbs sampler (MCMC). The basic idea is that we know how to solve the problem when we know the mean or the variance, so we can break it into those problems:

1. Initialize to some reasonable values for $\mu$ and $\sigma^2$, called $\mu_0$ and $\sigma_0^2$.
2. At iteration $i$, draw a value from $p(\mu \mid \sigma^2_{i-1}, y)$ and call it $\mu_i$.
3. Then draw a value from $p(\sigma^2 \mid \mu_i, y)$ and call it $\sigma_i^2$.
4. Continue steps 2–3 for a large number $N \sim 10^4$ of steps; ignore the first 1/3.

You now have $2N/3$ joint samples of $(\mu, \sigma^2)$. Your samples will reasonably approximate draws from $p(\mu, \sigma^2 \mid y)$. Do whatever you want with them! (A minimal sketch follows below.)
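Here is a sketch of this sampler in Python, under the assumption of flat noninformative priors ($p(\mu) \propto 1$, $p(\sigma^2) \propto 1/\sigma^2$) so the full conditionals take standard closed forms; with the informative priors from the earlier slides you would swap in the conjugate updates instead:

```python
import numpy as np

def gibbs_normal(y, n_iter=10_000, seed=1):
    rng = np.random.default_rng(seed)
    n, ybar = len(y), np.mean(y)
    sigma2 = np.var(y)                     # step 1: reasonable starting value
    samples = np.empty((n_iter, 2))
    for t in range(n_iter):
        # step 2: p(mu | sigma2, y) = N(ybar, sigma2/n) under a flat prior on mu
        mu = rng.normal(ybar, np.sqrt(sigma2 / n))
        # step 3: p(sigma2 | mu, y) = Scaled-Inv-chi^2(n, s2), s2 = mean((y - mu)^2)
        s2 = np.mean((y - mu) ** 2)
        sigma2 = n * s2 / rng.chisquare(n)
        samples[t] = mu, sigma2
    return samples[n_iter // 3:]           # step 4: ignore the first third

y = np.array([4.64, 3.47, 6.06, 0.61, 6.96, 1.82])
draws = gibbs_normal(y)                    # joint samples of (mu, sigma2)
```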
MCMC: Gibbs Sampler

In general, you can implement a Gibbs sampler as follows:

1. Identify your parameter vector $\theta$ of length $D$ and set initial values $\theta^0$.
2. For $t = 1, \ldots, N$: for $i = 1, \ldots, D$, draw $\theta_i^t$ from the full conditional $p(\theta_i \mid \theta_1^t, \ldots, \theta_{i-1}^t, \theta_{i+1}^{t-1}, \ldots, \theta_D^{t-1}, y)$, with obvious edge cases.

Thus each parameter element is updated conditional on the values of the other components of $\theta$, which are the values of the current iteration if they have already been updated and of the previous iteration if they have not.

Also known as alternating conditional sampling.
Censored Data

@&#%! Now imagine that our data had lower limits:

$y = [4.64, 3.47, 6.06, >0.61, 6.96, 1.82]$

How do we tackle this problem? Easy! Just add the true value to $\theta$:

$\theta = [\mu, \sigma^2, y_4]$

So what is the full conditional of $y_4$?

$p(y_4 \mid \mu_t, \sigma_t^2, y_1, y_2, y_3, y_5, y_6) = p(y_4 \mid \mu_t, \sigma_t^2) \sim N(\mu_t, \sigma_t^2)\, \mathbf{1}_{>0.61}$

This is a truncated normal distribution.
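In Python, scipy.stats.truncnorm gives these draws directly (it takes the truncation bounds in standard-normal units). A sketch; the values of $\mu_t$ and $\sigma_t$ below are illustrative:

```python
import numpy as np
import scipy.stats as st

def draw_truncated_normal(mu, sigma, lower, size=1, seed=None):
    a = (lower - mu) / sigma                 # truncation point in standard units
    return st.truncnorm(a=a, b=np.inf, loc=mu, scale=sigma).rvs(
        size=size, random_state=seed)

# inside the Gibbs loop: impute y4 given the current (mu_t, sigma_t)
y4 = draw_truncated_normal(mu=3.9, sigma=2.0, lower=0.61)
```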
Censored Data
What is the likelihood of a lower limit, $L$?

$p(L \mid \theta) = \int_L^{\infty} p(y \mid \theta)\, dy$

More generally, you can think of any interval as $[L, U]$:

$p([L, U] \mid \theta) = \int_L^U p(y \mid \theta)\, dy$

where you can let $U$ or $L$ go to limiting values for lower/upper limits.
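For a normal model this integral is just a difference of CDFs; a minimal sketch in Python (the μ and σ values are illustrative):

```python
import numpy as np
import scipy.stats as st

def interval_likelihood(L, U, mu, sigma):
    # p([L, U] | theta) = integral of the normal PDF from L to U
    dist = st.norm(loc=mu, scale=sigma)
    return dist.cdf(U) - dist.cdf(L)

interval_likelihood(0.61, np.inf, mu=3.9, sigma=2.0)  # a lower limit: U -> infinity
```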
Rejection Sampling

How do you draw from a truncated normal distribution? The simplest way is to draw from a normal and reject and replace any draws in the undesired range. Obviously, if you are out in the tails this is very expensive and bad, so you would use something else (typically the transformation method).

But this method is more powerful than just restricted ranges! Consider $x \sim \mathrm{Uniform}(0,1)$ and reject draws above 0.6 with probability $1/2$. What kind of PDF will this produce?

[Plot: the resulting $p(x)$ vs. $x$]
Rejection Sampling

Draw from a proposal distribution $g(x)$ such that, for some $c$, $c\,g(x)$ is always greater than $f(x)$. Then reject each draw with probability $1 - \frac{f(x)}{c\,g(x)}$.
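A generic sketch in Python; the half-normal target, exponential proposal, and the constant $c = \sqrt{2e/\pi}$ (which bounds $f/g$, with equality at $x = 1$) are my choices for illustration:

```python
import numpy as np

def rejection_sample(f, g_draw, g_pdf, c, n, seed=2):
    """Draw x from the proposal g; accept with probability f(x)/(c*g(x)).
    Requires c*g(x) >= f(x) everywhere."""
    rng = np.random.default_rng(seed)
    out = []
    while len(out) < n:
        x = g_draw(rng)
        if rng.uniform() < f(x) / (c * g_pdf(x)):
            out.append(x)
    return np.array(out)

# example: half-normal target f(x) on x >= 0, exponential proposal g(x) = e^{-x}
f = lambda x: np.sqrt(2.0 / np.pi) * np.exp(-0.5 * x**2)
samples = rejection_sample(f,
                           lambda rng: rng.exponential(1.0),
                           lambda x: np.exp(-x),
                           c=np.sqrt(2.0 * np.e / np.pi), n=1000)
```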
Marginalization

Before, we saw that $p(\theta_1, \theta_2) = p(\theta_1 \mid \theta_2)\, p(\theta_2)$. How do we get $p(\theta_1)$ from $p(\theta_1, \theta_2)$?

$p(\theta_1) = \int p(\theta_1, \theta_2)\, d\theta_2$

e.g., $\theta_2$ is a nuisance variable. This can be done via analytic integration (hard) or via Monte Carlo methods (easy). For Monte Carlo, just draw $(\theta_1, \theta_2)$ jointly and throw away $\theta_2$.
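With the joint Gibbs draws from the sketch earlier (the `draws` array, columns $(\mu, \sigma^2)$), this is one line:

```python
# Monte Carlo marginalization: keep the mu column, throw away sigma2
mu_marginal = draws[:, 0]
```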
Student-t

As df → ∞, the Student-t approaches a Gaussian.

$p(\mu \mid y) = \int p(\mu, \sigma^2 \mid y)\, d\sigma^2 = \int p(\mu \mid \sigma^2, y)\, p(\sigma^2 \mid y)\, d\sigma^2$
Student-t with Unknown DOF

Imagine that our science demanded that we know the posterior distribution for the DOF:

$p(y \mid \nu) = \frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\, \Gamma\!\left(\frac{\nu}{2}\right)} \left(1 + \frac{y^2}{\nu}\right)^{-\frac{\nu+1}{2}}$

As is suggested by the complexity of this expression, there is no clean conjugacy that works. Sometimes the likelihood and/or the prior don't even have analytic forms. This can also be an issue if you have actual prior knowledge that is multimodal or not of the nice clean conjugate form.

So what can we do?
MCMC: Metropolis-Hastings Algorithm

1. Pick a starting point $\theta^0$.
2. For $t = 1, \ldots, N$ do the following:
   (a) Sample a new $\theta^*$ from a jumping distribution $J_t(\theta^* \mid \theta^{t-1})$. This distribution does not have to be symmetric. Commonly it is a Gaussian.
   (b) Calculate $r = \dfrac{p(\theta^* \mid y)\,/\,J_t(\theta^* \mid \theta^{t-1})}{p(\theta^{t-1} \mid y)\,/\,J_t(\theta^{t-1} \mid \theta^*)}$
   (c) Set $\theta^t = \theta^*$ with probability $\min(r, 1)$; otherwise $\theta^t = \theta^{t-1}$.
3. Throw away the first $N_0$ draws and use the rest as draws from the posterior.

This means you only need to be able to calculate the likelihood and prior given the parameters. There are other methods of constructing the jumping distribution: see affine-invariant ensemble methods, e.g. emcee.
MCMC: Metropolis Algorithm

1. Pick a starting point $\theta^0$.
2. For $t = 1, \ldots, N$ do the following:
   (a) Sample a new $\theta^*$ from a jumping distribution $J_t(\theta^* \mid \theta^{t-1})$. This distribution must be symmetric: $J_t(\theta^* \mid \theta^{t-1}) = J_t(\theta^{t-1} \mid \theta^*)$. Commonly this is a Gaussian.
   (b) Calculate $r = \dfrac{p(\theta^* \mid y)}{p(\theta^{t-1} \mid y)}$
   (c) Set $\theta^t = \theta^*$ with probability $\min(r, 1)$; otherwise $\theta^t = \theta^{t-1}$.
3. Throw away the first $N_0$ draws and use the rest as draws from the posterior.
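A minimal random-walk sketch in Python: with a symmetric Gaussian jump the $J_t$ terms cancel, so this is the Metropolis special case (for the general Metropolis-Hastings ratio you would add the log-$J_t$ correction to the acceptance test). The step size, burn-in fraction, and the flat prior on $\mu$ in the usage example are illustrative assumptions:

```python
import numpy as np

def metropolis(log_post, theta0, n_iter=20_000, step=0.5, seed=3):
    rng = np.random.default_rng(seed)
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    lp = log_post(theta)
    chain = np.empty((n_iter, theta.size))
    for t in range(n_iter):
        proposal = theta + step * rng.standard_normal(theta.size)  # symmetric J_t
        lp_prop = log_post(proposal)
        if np.log(rng.uniform()) < lp_prop - lp:  # accept with probability min(r, 1)
            theta, lp = proposal, lp_prop
        chain[t] = theta
    return chain[n_iter // 4:]                    # throw away the first N0 draws

# example: posterior of mu for the earlier data, sigma^2 = 4, flat prior on mu
y = np.array([4.64, 3.47, 6.06, 0.61, 6.96, 1.82])
log_post = lambda mu: -0.5 * np.sum((y - mu[0]) ** 2) / 4.0
chain = metropolis(log_post, theta0=[0.0])
```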
When to Use What

● If you can, you should use a conjugate method.
● If that isn't possible, see if you can find the full conditionals and try a Gibbs sampler.
● If all else fails, use Metropolis-Hastings, which can require careful tuning.
A Quick Detour into Confidence Intervals

What is $\Pr(x_1 < \theta_1 < x_2)$? It is $> 0.95$.

[Plot: a 95% confidence region in the $(\theta_1, \theta_2)$ plane; its projection onto each parameter axis, $(x_1, x_2)$ for $\theta_1$ and $(y_1, y_2)$ for $\theta_2$, contains more than 0.95 of the probability.]

If the pink lines bounded the 95 percent range for each variable, the maximum probability inside the box would be $0.95^2 < 0.95$. That means the projections of the confidence region are too conservative. The marginals are the real way of doing this.
A Quick Detour into Confidence Intervals

The standard non-Bayesian technique I have seen is:

1) Find the best-fitting model in terms of $\chi^2$ and call it $\hat{\theta}$.
2) Then move out in parameter space until the $\chi^2$ has increased by an amount $\Delta\chi^2$.

What does this assume? The $\Delta\chi^2$ thresholds below are from Numerical Recipes in C++, pg 815; columns give the number of parameters $\nu$.
p (%)     ν = 1   ν = 2   ν = 3   ν = 4   ν = 5   ν = 6
68.27      1.00    2.30    3.53    4.72    5.89    7.04
90         2.71    4.61    6.25    7.78    9.24   10.6
95.45      4.00    6.18    8.02    9.72   11.3    12.8
99         6.63    9.21   11.3    13.3    15.1    16.8
99.73      9.00   11.8    14.2    16.3    18.2    20.1
99.99     15.1    18.4    21.1    23.5    25.7    27.9
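These thresholds are just percent points of the $\chi^2$ distribution with $\nu$ degrees of freedom, so you can reproduce the table with scipy:

```python
from scipy.stats import chi2

chi2.ppf(0.6827, df=1)   # ~ 1.00
chi2.ppf(0.90,   df=2)   # ~ 4.61
chi2.ppf(0.9973, df=3)   # ~ 14.2
```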
A Quick Detour into Confidence Intervals

Assume that your posterior is Gaussian:

$p(\theta \mid y) \sim N(\mu_\theta, \Sigma_\theta)$

$p(\theta \mid y) \propto |\Sigma_\theta|^{-1/2} \exp\left(-\tfrac{1}{2}(\theta - \mu_\theta)^T \Sigma_\theta^{-1} (\theta - \mu_\theta)\right)$

$\log p(\theta \mid y) = \text{constant} - \tfrac{1}{2}\log|\Sigma_\theta| - \tfrac{1}{2}(\theta - \mu_\theta)^T \Sigma_\theta^{-1} (\theta - \mu_\theta)$

The first two terms are constant with respect to $\theta$, and the quadratic form is a $\chi^2_d$-distributed variable, so $\log p$ is a constant minus $\tfrac{1}{2} \times$ a $\chi^2_d$-distributed variable. So you can find the "posterior maximum" $\hat{\theta}$ and go out to $\Delta\chi^2 = x$ such that $\Pr(\chi^2_d \le x) = 0.95$ for 95% confidence intervals.

In 2D, a multivariate Gaussian can only produce rotated ellipses. This is just a benchmark, a heuristic that probably shouldn't be used for real serious work. You should be using MCMC realizations to figure out your contours.

[Plot: non-elliptical posterior contours in the $(\theta_1, \theta_2)$ plane, which no rotated ellipse can reproduce. Is the maximum the best?]
A Quick Detour into Confidence Intervals: Conclusion

Be careful with your confidence intervals! $\chi^2$ isn't the best way to do this. You want MCMC!

The peak of the posterior is not always the best. So don't just use the peak value ± an interval. You need to think about the full distribution, including all of its joint properties!