FREQUENTLY ASKED QUESTIONS
October 5, 2010
Content Questions
What is the relation between the binomial, Gaussian, Poisson, and
Landau distributions? How can we “jump” from one distribution
to another?
The Gaussian and Poisson distributions are both limits of the binomial distribution, and the Gaussian is also a limit of the Poisson distribution. Specifically:
• The Poisson is the limit of the binomial for p small and n >> µ: see
FAQ 2 for how this comes about mathematically.
• The Gaussian is the limit of the Poisson for large µ. For a mathematical
argument (using Stirling’s approximation for the log of a factorial), see
Barlow, p. 40.
• The Gaussian is the limit of the binomial for large µ = np.
I think “jump” is not quite the right word... if appropriate conditions
are satisfied, the distributions approach each other, and how close is “close
enough” to replace one distribution by another depends on how good your
answer needs to be. For example, the Poisson distribution looks a lot like
a Gaussian with the same mean and a sigma equal to √µ, for large µ. It’s
not too bad a match for µ = 5 or so (might be good enough for what you
need, or not, to approximate the Poisson as a Gaussian), and the Poisson
and the Gaussian look almost indistinguishable for µ = 30. If you play with
the Mathematica notebook from last lecture you can get a feeling for this.
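If you'd rather not use Mathematica, here is a minimal sketch of the same kind of comparison in Python (assuming numpy, scipy, and matplotlib are available; the variable names are mine): overlay the Poisson distribution for a chosen µ on a Gaussian with the same mean and sigma = √µ.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm, poisson

for mu in (5, 30):                                  # small vs. larger mean
    k = np.arange(0, int(mu + 5 * np.sqrt(mu)))     # counts to show
    plt.step(k, poisson.pmf(k, mu), where="mid", label=f"Poisson, mu={mu}")
    x = np.linspace(0, k[-1], 400)
    plt.plot(x, norm.pdf(x, loc=mu, scale=np.sqrt(mu)),
             label=f"Gaussian, mean mu, sigma sqrt(mu), mu={mu}")

plt.xlabel("x (counts)")
plt.ylabel("probability")
plt.legend()
plt.show()

For µ = 5 the two curves differ visibly in the tails; for µ = 30 they are hard to tell apart.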
The Landau distribution (not sure if I read it correctly?) is something else
– I did not mention this distribution in class. This is a specific distribution
that in fact doesn’t look Gaussian (it has a long tail). It describes energy
loss in thin layers of matter, and it can be derived from the physical principles
describing that process.
What if our Poisson distribution isn’t counting– what if µ has units?
The argument of the exponential, −µ, in the Poisson distribution, must be
dimensionless– this distribution always describes some kind of counting. In
the distribution P(x; µ), µ is the mean value of the distribution of x, and x
is always “the number of something”.
Note that while the x values corresponding to the measurement are discrete, the mean µ need not be an integer.
Another note: in the limit of large µ, the x values get finely spaced,
and can eventually be treated as a continuous variable as the distribution becomes
Gaussian– and for the Gaussian, the values of x can correspond to some
dimension-bearing quantity, like length or energy.
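A tiny illustration of that point (a sketch assuming scipy is available): the counts x are integers, but the mean µ can be any positive number, integer or not.

from scipy.stats import poisson

mu = 3.7                      # a perfectly legal, non-integer mean
for x in range(8):            # x itself is always "the number of something"
    print(x, poisson.pmf(x, mu))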
Can you explain the meaning of likelihood as the probability of measuring x1, x2, ..., xN at the same time?
Suppose you have a data set of N measurements x1, x2, ..., xN: this is a
result of measuring the specific values x1 and x2 and x3, ..., and xN. This
is what I mean by “measuring them at the same time”: your dataset is the
assemblage of the specific measurements.
Now you imagine your (unknown for the moment) mean value is µ0: supposing this µ0 is the right parameter describing the distribution, then given
that parameter, the probability of measuring some particular xi is P(xi; µ0).
The probability of measuring all of them, i.e. the probability of measuring
the actual data set you got, is Πi P(xi; µ0) (assuming the measurements are
independent). This probability product is the “likelihood”: it’s the probability of getting your particular data set, given the parameter. (Then you
maximize as a function of the parameter to get the best parameter– more on
that later.)
To make a simple example: take a coin, which might or might not be loaded,
so that the probability of getting heads is p. Say you
toss the coin N = 3 times. x = 1 is heads, and x = 0 is tails. Suppose your
data set is 0, 0, 1.
The probability of the first measurement being tails is 1 − p; the probability of the second measurement being tails is 1 − p; the probability of the
third measurement being heads is p. So the likelihood of getting the specific
data set you got (given p) is (1 − p)(1 − p)p.
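If it helps, here is a minimal numerical version of that example (a sketch in Python; the grid-search approach and the names are mine): evaluate the likelihood (1 − p)(1 − p)p over a grid of p values and pick out the maximum.

import numpy as np

data = [0, 0, 1]                                  # tails, tails, heads

def likelihood(p, data):
    # product of per-toss probabilities: p for heads (1), 1 - p for tails (0)
    return np.prod([p if x == 1 else 1 - p for x in data])

p_grid = np.linspace(0, 1, 1001)
L = np.array([likelihood(p, data) for p in p_grid])
print("p that maximizes the likelihood:", p_grid[np.argmax(L)])   # about 1/3

The maximum sits at p = 1/3, the observed fraction of heads, which is what maximizing the likelihood analytically gives as well.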
We’ll see more examples of likelihoods later.
How do Gaussian and Poisson mean and mean-error estimations
differ in practice?
In practice, we can usually treat a Poisson as having Gaussian properties if
µ is greater than 8 or 10 or so. Not infrequently, for µ as small as 5 or so
it’s also OK to treat it as Gaussian (for the purpose of using the arithmetic
mean x̄ as the estimator of the true mean, that’s typically fine). Taking
σ = √µ ∼ √x̄ is also fine– the square root relation between µ and σ is exact
for a Poisson. However, there are cases when it’s better to take Poissonianness into account: for example, when setting limits based on observations of
very small numbers of events, or in ratios of small numbers.
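One way to see the practical difference (a sketch, assuming scipy; the choice of µ values is mine): compare the Poisson probability of a count landing within ±√µ of the mean to the corresponding Gaussian probability of about 68%.

import numpy as np
from scipy.stats import norm, poisson

for mu in (2, 5, 30):
    sigma = np.sqrt(mu)
    lo, hi = mu - sigma, mu + sigma
    # Poisson: sum the pmf over the integer counts inside [mu - sigma, mu + sigma]
    p_pois = poisson.cdf(np.floor(hi), mu) - poisson.cdf(np.ceil(lo) - 1, mu)
    # Gaussian with the same mean and sigma over the same interval (~0.68)
    p_gaus = norm.cdf(hi, mu, sigma) - norm.cdf(lo, mu, sigma)
    print(f"mu = {mu:>2}: Poisson {p_pois:.3f}  vs  Gaussian {p_gaus:.3f}")

The two numbers get closer as µ grows, which is the sense in which the Gaussian treatment becomes safe.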
Why is s² the best estimate of σ²?
Although I didn’t show it, this can be shown to be the “best” estimator by
a maximum likelihood method (although there are some complications as to
the meaning of “best”: you actually want an “unbiased” estimator, for which
the expectation value is the true value, as we discussed a bit last class. For this
the 1/(N − 1) factor is desirable when s² is estimated using x̄ rather than the true
mean.)
There are also cases where you may be better off estimating sigma from
knowledge of your apparatus, or from a known property of the distribution–
for example, if you're dealing with a Poisson, it usually works well to assume
σ = √x̄.
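A quick way to convince yourself about the 1/(N − 1) factor (a Monte Carlo sketch, assuming numpy; the sample size and true variance are arbitrary choices of mine): draw many small samples from a Gaussian of known σ² and average the two candidate estimators.

import numpy as np

rng = np.random.default_rng(1)
true_var = 4.0                      # known true sigma^2
N = 5                               # small samples make the bias easy to see

samples = rng.normal(0.0, np.sqrt(true_var), size=(100_000, N))
xbar = samples.mean(axis=1, keepdims=True)
ss = ((samples - xbar) ** 2).sum(axis=1)    # sum of squared deviations from x-bar

print("divide by N - 1:", (ss / (N - 1)).mean())   # close to 4.0 (unbiased)
print("divide by N:    ", (ss / N).mean())         # biased low, about 3.2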
What’s the relation between the estimation and a fit, e.g. a Gaussian fit? Do we do the estimation first, then do the fit?
Well, a fit of a data set to a function is a more general thing than just
estimating a mean: for the fit, you are comparing each point to a predicted
value for a given function parameterization, and minimizing with respect to
the parameters to find the best estimates of the parameters (and the function
could be a Gaussian or could be something else). You don’t necessarily need
to estimate a mean (or other parameter) before launching into a fit algorithm.
However, in practice, it’s often useful to start the fit minimization off with
decent estimates of the parameters, to save computation time– so it can
be a good idea to estimate mean or other parameters first, then feed these
parameters to the fit. I’ll talk about fitting later.
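As a small illustration of that last point (a sketch assuming numpy and scipy; the Gaussian model and fake data are mine): use x̄ and s as the starting values for a Gaussian fit to a histogram.

import numpy as np
from scipy.optimize import curve_fit

def gauss(x, A, mu, sigma):
    return A * np.exp(-0.5 * ((x - mu) / sigma) ** 2)

rng = np.random.default_rng(0)
data = rng.normal(10.0, 2.0, size=2000)          # stand-in measurements

counts, edges = np.histogram(data, bins=40)
centers = 0.5 * (edges[:-1] + edges[1:])

# start the minimization from simple estimates: the peak height, x-bar, and s
p0 = (counts.max(), data.mean(), data.std(ddof=1))
popt, pcov = curve_fit(gauss, centers, counts, p0=p0)
print("fitted mu, sigma:", popt[1], popt[2])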
How do the off-diagonal entries in M manifest in the probability
distribution?
The off-diagonal entries in the error matrix M will appear as cross-terms in
the joint probability P (u, v)’s exponential argument– we’ll get to this next
class.
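As a hedged preview (a sketch assuming M is the covariance matrix of u and v, so its off-diagonal element is ρσu σv, and that the exponent of the joint Gaussian is −(1/2) (u, v) M⁻¹ (u, v)ᵀ): inverting an M with a nonzero off-diagonal element produces exactly the u·v cross-term.

import numpy as np

sigma_u, sigma_v, rho = 1.0, 2.0, 0.5
M = np.array([[sigma_u**2,              rho * sigma_u * sigma_v],
              [rho * sigma_u * sigma_v, sigma_v**2]])

Minv = np.linalg.inv(M)
# The quadratic form (u, v) Minv (u, v)^T expands to
# Minv[0,0] u^2 + 2 Minv[0,1] u v + Minv[1,1] v^2,
# so 2*Minv[0,1] is the coefficient of the u*v cross-term.
print("u*v coefficient:", 2 * Minv[0, 1])   # zero if rho = 0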
Can you explain the contours of equal probability? Where did the
different amplitudes really come from – just σu and σv ?
I’m not sure what you mean by “amplitudes”... but the concentric ellipses
I drew were intended to be lines in u and v space for which the probability
is equal. If you look at the joint binormal probability expression,

P(u, v) = (1/(2πσu σv)) exp[−(1/2)(u²/σu² + v²/σv²)],

if you choose (u²/σu² + v²/σv²) = constant, the probability P(u, v)
will have a constant value. This equation, (u²/σu² + v²/σv²) = constant, describes
an ellipse in u − v space. You can draw a different ellipse for a different
chosen constant; if you choose (u²/σu² + v²/σv²) = 1, that's the contour at which the
probability is down from its maximum by a factor of 1/√e, just as the 1D
Gaussian is down by a factor of 1/√e at one σ away from the mean. The
ellipse (u²/σu² + v²/σv²) = 1 is the “1σ contour”. The semi-major and semi-minor
axis lengths are given by the σv and σu values.
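A quick numerical check of the 1/√e statement (a sketch assuming numpy; this is just the uncorrelated binormal written above): pick any point on the 1σ ellipse and compare P(u, v) to P(0, 0).

import numpy as np

sigma_u, sigma_v = 1.5, 0.7

def P(u, v):
    # uncorrelated binormal density from the expression above
    return (1.0 / (2 * np.pi * sigma_u * sigma_v)) * \
           np.exp(-0.5 * (u**2 / sigma_u**2 + v**2 / sigma_v**2))

# parameterize the 1-sigma ellipse: u = sigma_u cos(t), v = sigma_v sin(t)
t = 1.234                                         # any angle works
u, v = sigma_u * np.cos(t), sigma_v * np.sin(t)
print(P(u, v) / P(0.0, 0.0), 1 / np.sqrt(np.e))   # both about 0.6065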
For the ellipse, when moving u, why would the max of v be the
same?
It won’t be the same maximum value of v, but the maximum value of probability will always be for the same value of v. For the vertical ellipse I drew (for
non-correlated variables), along any vertical line at constant u, the maximum
value of probability as a function of v is always at v = 0.