Physics 6720 – Introduction to Statistics
December 5, 2012

1 Statistics of Counting
Often an experiment yields a result that can be classified according to a set of
discrete events, giving rise to an integer count or set of integer counts as a result.
For example, a measurement of radioactive decay may yield the number of counts
in a detector over a period of time, and the measurement of the scattering of a beam of
particles from a target may yield so many counts over a particular range of deflection
angles. Any series of measurements whose results can be classified into
histogram bins produces counts. In this section we will discuss
briefly the statistics of counting, with particular emphasis on Poisson statistics.
1.1 Binomial Distribution
We begin with the binomial distribution. Here we consider an experiment that is
repeated many times. There are two possible outcomes: A and B. The probability
for outcome A is p and the probability for outcome B is 1 − p. We assume that
each experiment has the same probability for each outcome and that there is no
correlation between the outcome of one experiment and that of another. We may
then ask the question, out of N repetitions of the experiment, what is the probability
that we get A exactly k times? For example, suppose N = 4 and k = 3. The answer
is found by first asking for the probability for a particular sequence of outcomes
AABA, for example. The probability is just the product of the probabilities for
each event: pp(1 − p)p. This statement makes use of the fact that there is no
correlation between one experiment and another. Since our question doesn’t ask
for a particular order of outcomes, but just any order that yields 3 A’s out of 4
trials, we then ask how many different ways there are of getting 3 A’s. We can
enumerate them: BAAA, ABAA, AABA, and AAAB. Since the probability for
each is the same, the probability for any of them is four times the probability for
just one of them. So the probability is 4p^3(1 − p) to get 3 A’s out of 4 trials. The
general expression is called the binomial distribution. The probability for getting k
A’s (and N − k B’s) out of N trials is
P(k, N) = \frac{N!}{k!\,(N-k)!}\, p^k (1-p)^{N-k}.    (1)
Notice that the binomial probabilities generate the binomial series, which adds
up to 1, as it should:
\sum_{k=0}^{N} P(k, N) = [p + (1 - p)]^N = 1.    (2)
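As a quick numerical check, a few lines of Python (a sketch; the function name binomial_prob is ours, not part of these notes) evaluate Eq. (1) for the N = 4, k = 3 example above and confirm the normalization of Eq. (2):

    from math import comb

    def binomial_prob(k, N, p):
        # probability of exactly k outcomes A in N independent trials, Eq. (1)
        return comb(N, k) * p**k * (1 - p)**(N - k)

    p = 0.3
    print(binomial_prob(3, 4, p))                          # 4 p^3 (1-p) = 0.0756
    print(sum(binomial_prob(k, 4, p) for k in range(5)))   # sums to 1, Eq. (2)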
1.2 Poisson Distribution
The Poisson distribution applies in cases in which the probability for getting A is
very small compared with other possible outcomes. In that case we would use the
binomial formula with a value of k much smaller than N . For example, suppose we
were counting radioactive decays as a function of time and we observe the decays
over a time interval dt that is much smaller than the decay lifetime, so the amount
of radioactive material available for decay does not change noticeably during the
time of observation. If the decay rate is λ and we consider just one single atom,
the probability that it decays in a time interval dt is p = λdt (true as long as this
is very small). Call this event A. If it doesn’t decay (probability 1 − p) we call it
event B. If we now consider N atoms we can use the binomial distribution to give
us the probability that k atoms out of the N atoms decay in the time interval dt.
We expect that on average there will be k̄ = pN decays. Let’s find the probability
for getting k events in the limit of large N , if the expected (average) number k̄ is
constant as we take the limit. Notice that to keep k̄ constant we have to decrease p as
we increase N . This means we are decreasing the time interval dt as we increase N .
To get the probability, we start with the binomial distribution, substitute p = k̄/N
and take the limit
P(k, \bar{k}) = \lim_{N \to \infty} \frac{N!}{k!\,(N-k)!}\, (\bar{k}/N)^k (1 - \bar{k}/N)^{N-k}.    (3)
After some algebra, using the Stirling approximation for the factorial and the Taylor
expansion for the exponential function, we get the Poisson distribution:
P(k, \bar{k}) = \frac{\bar{k}^k e^{-\bar{k}}}{k!}.    (4)
This distribution is normalized to 1 as well. The sum generates the Taylor series
for the exponential function:
\sum_{k=0}^{\infty} P(k, \bar{k}) = e^{\bar{k}} e^{-\bar{k}} = 1.    (5)
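The limit can also be checked numerically. The sketch below (Python; the parameter values are illustrative) holds k̄ = pN fixed while N grows, and the binomial probability approaches the Poisson value of Eq. (4):

    from math import comb, exp, factorial

    def binomial_prob(k, N, p):
        return comb(N, k) * p**k * (1 - p)**(N - k)

    def poisson_prob(k, kbar):
        return kbar**k * exp(-kbar) / factorial(k)

    kbar, k = 5.0, 3
    for N in (10, 100, 10000):
        print(N, binomial_prob(k, N, kbar / N))   # tends to the Poisson value
    print(poisson_prob(k, kbar))                  # about 0.1404 for kbar = 5, k = 3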
We will return to a discussion of properties of the Poisson distribution after
discussing the Gaussian normal distribution.
2 Normal Distribution
2.1 Populations and Their Means and Standard Deviations
In order to develop confidence in the result of a measurement of a single quantity,
such as the length of a table top, we often repeat the measurement process a number
of times. The results of the measurement vary because of difficulties in reading the
meter stick scale to the last tenth of a millimeter, and for other reasons. Suppose
we repeated the measurement N times, getting a list of values x_i. Our best guess
for the true value is usually the average of these values:
\bar{x}^* = \sum_{i=1}^{N} x_i / N,    (6)
which is also called the “mean” value of this sample set of observations. In our
notation, x̄∗ indicates our best, imperfect estimate of the true value x̄. If we repeat
the measurement an infinite number of times, ideally, the mean value should approach the “true” value of the measurement. The statistical way to describe what
is happening is that our set of N measurements is a sample of N values taken from
an infinite “population”. The true population mean is given by
\bar{x} = \lim_{N \to \infty} \sum_{i=1}^{N} x_i / N.    (7)
We might ask of this infinite population, what is the probability of getting a
value of x in the range (x, x + dx) when we make a measurement? This probability
is expressed in terms of a probability function P (x) as P (x)dx. The factor dx is
necessary because as the interval width dx gets smaller, the probability of getting a
value in that tiny range must get smaller in proportion to dx. If we make enough
measurements, we can begin to construct this probability function, but usually we
don’t make enough measurements to know it very well. So we often assume for
want of any better reason that the probability is given by the Gaussian distribution
function (normal distribution)
P(x) = \exp[-(x - \bar{x})^2 / 2\sigma^2] / (\sqrt{2\pi}\,\sigma).    (8)
In this expression the true mean of the population is x̄ and σ is the true “standard
deviation”. This probability is normalized so that
\int_{-\infty}^{\infty} P(x)\, dx = 1,    (9)
i.e. the probability of measuring any value of x is 1. The Gaussian distribution is
peaked at x = x̄ and falls off on either side of x̄ over a distance in x that is controlled
by the value of σ. If σ is large, the fall off is slow and the most probable values of
x are in a broad range around x̄; if σ is small, the fall off is rapid, and the most
probable values of x are narrowly clustered around x̄. A property of the Gaussian
distribution is that the probability of making a measurement and getting a value in
the range from x̄ − σ to x̄ + σ is about 68%. (This value is found by calculating the
integral under the probability distribution from x̄ − σ to x̄ + σ.) Thus in common
usage, we say that for a single measured value of x, the result is x̄ ± σ. The standard
deviation of a quantity is sometimes called the “error” in that quantity, so we say
the error in a single measurement is σ. The statement that x lies in the range
x̄ ± σ is a statement we can make with 68% confidence. That means the result of
a measurement will fall outside this range in about 32% of repetitions of the
experiment.
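The 68% figure quoted above is easy to verify numerically; the one-line Python check below simply evaluates the Gaussian integral from x̄ − σ to x̄ + σ, which reduces to an error function:

    from math import erf, sqrt

    # P(xbar - sigma < x < xbar + sigma) for a Gaussian is erf(1/sqrt(2))
    print(erf(1 / sqrt(2)))   # 0.6827, i.e. about 68%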
A measure of the width of this peak is given by
\mathrm{Var}(x) = \sigma^2 = \int_{-\infty}^{\infty} (x - \bar{x})^2 P(x)\, dx.    (10)
This is just the average of (x − x̄)² over the population.
If we made an infinite number of measurements, we would be able to determine
the two parameters x̄ and σ of the distribution exactly. With a finite set of measurements, however, we can estimate them. To estimate the mean value, we simply
compute the average of the measurements xi :
\bar{x}^* = \langle x \rangle = \sum_{i=1}^{N} x_i / N.    (11)
Notice that we have put a star on x̄∗ to distinguish the estimate from the true value
x̄. The sample also permits an estimate of this population standard deviation σ. It
is just
\mathrm{Var}(x_1, x_2, \ldots) = \sigma^{*2} = \sum_{i=1}^{N} (x_i - \langle x \rangle)^2 / (N - 1).    (12)
The quantity σ* is the estimated standard deviation, and its square is called the
estimated variance of x from the mean value x̄, or just the estimated variance of x.[1]
Another useful formula is obtained by expanding the square on the right side to
give
(N - 1)\sigma^{*2} = \sum_i x_i^2 - 2\langle x \rangle \sum_i x_i + N \langle x \rangle^2 = N(\langle x^2 \rangle - \langle x \rangle^2).    (13)
Here ⟨x²⟩ means the average of the x_i². In other words, the estimated variance is just the
difference between the average of the squares and the square of the average, times
N/(N − 1).
As an exercise in this course, you will be asked to write a program that reads a
list of values x_i and calculates x̄* and σ*.
[1] This expression for σ*² is based on the sample mean x̄*, and so is biased. To compensate for the bias, we divide by N − 1 instead of N.
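A minimal sketch of such a program (in Python, with made-up measurement values) applies Eqs. (11) and (12) directly:

    def sample_stats(values):
        # estimated mean, Eq. (11), and estimated standard deviation, Eq. (12)
        n = len(values)
        mean = sum(values) / n
        var = sum((x - mean) ** 2 for x in values) / (n - 1)
        return mean, var ** 0.5

    # hypothetical measurements of the table-top length, in meters
    print(sample_stats([3.001, 2.998, 3.002, 3.000, 2.999]))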
2.2 The Error in the Estimated Mean
So we see that if we have a finite data “sample”, we can get an estimate of the true
values of x̄ and σ. But how far is our estimate x̄∗ from the true value x̄? This is
the central question of every measurement, because it tells us how much confidence
we may put in our result. Measurements without error ranges are meaningless! For
example, there is really no meaning to the statement that the length of the table top
is 3 meters, because the associated error might be a kilometer. There is meaning
only if we can associate an error with this figure and say, for example, that the
length is 3.00 meters with an error of plus or minus 0.01 meter.
Now suppose we make N measurements to make up one data “sample” on one
day and make another N measurements to make a second sample on the next, and
so collect a large number of samples. We determine the estimated mean value x̄∗ for
each sample. What is the probability distribution for this estimated mean value?
Note that it is not the same as the probability distribution of the population. One
way to see this is to realize that if we take larger and larger samples, almost all of our
estimated mean values would be expected to lie closer and closer to the true mean x̄. In fact, a famous
theorem of statistics, called the “central limit” theorem states that the probability
distribution of the mean value approaches a Gaussian normal distribution as the
sample size increases, regardless of whether the underlying population distribution
P (x) is itself Gaussian. The standard deviation of the mean value is estimated by
\sigma^*_{\mathrm{mean}} = \sigma^* / \sqrt{N},    (14)
where σ* is given by Eq. (13). As the sample size grows, σ* stabilizes, and the
standard deviation of the mean shrinks as 1/√N, so that the distribution of sample
means x̄* gets sharper around the true value x̄.
So as a result of measuring one sample, we estimate the true mean value to be
\bar{x}^* \pm \sigma^*_{\mathrm{mean}} = \bar{x}^* \pm \sigma^* / \sqrt{N}.
This is a practical formula. With it we need only make N measurements, then
estimate the population mean from Eq. (11) and the population standard deviation
from Eq. (12). Then we compute the error in the mean from Eq. (14). Please bear
in mind the difference between σ*, which is the estimate of the error in a single
measurement, i.e. the “population” standard deviation, and σ*_mean, which is the
estimate of the error in our estimated mean value x̄*.
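A small simulation (a Python sketch using NumPy; the population parameters are made up) illustrates Eq. (14): the scatter of the sample means tracks σ/√N as the sample size N grows.

    import numpy as np

    rng = np.random.default_rng(0)
    true_mean, true_sigma = 3.0, 0.01          # hypothetical population

    for N in (10, 100, 1000):
        means = [rng.normal(true_mean, true_sigma, N).mean() for _ in range(2000)]
        print(N, np.std(means), true_sigma / np.sqrt(N))   # observed scatter vs sigma/sqrt(N)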
2.3 Systematic Error
The error we have been discussing so far is a statistical error. It is an error that can
be made smaller by simply making more and more measurements of the same type.
Another error that occurs all-too-frequently is a “systematic” error. For example,
in measuring the table top, it may happen that our meter stick was slightly miscalibrated, so it gave consistently large results. We would not be able to correct for
such an error by repeating the measurement. We would instead have to recalibrate
the meter stick. Sometimes we aren’t able to do the recalibration, but are assured
by the manufacturer or by some other means that the meter stick agrees with a
precise standard to within an error of, say, 0.005 m. We might then quote the result
of a measurement of the table top by saying it is 3.158 ± 0.002(stat) ± 0.005(syst)
meters, thereby identifying separately the two sources of error.
3 Properties of the Poisson Distribution
3.1 Mean and Variance
Figure 1: Poisson distribution with mean value 5
The Poisson distribution for k̄ = 5 is shown in Fig. 1. Notice that it peaks at
k = 5. Let us determine the mean and variance for the Poisson distribution. The
mean is just
\langle k \rangle = \sum_{k=0}^{\infty} k\, P(k, \bar{k}).    (15)
A little algebra gives
\langle k \rangle = \bar{k}.    (16)
This result is what we would expect, of course. The variance is given by
\mathrm{Var}(k) = \langle k^2 \rangle - \langle k \rangle^2 = \sum_{k=0}^{\infty} k^2 P(k, \bar{k}) - \bar{k}^2.    (17)
A little algebra shows that the first term is just k̄(k̄ + 1) so
\mathrm{Var}(k) = \bar{k}.    (18)
This result says that the standard deviation is approximately √k̄. Actually we have
to be careful about using the term “standard deviation” for the Poisson distribution,
unless k̄ is large. For small k̄ the shape is not very much like a Gaussian, but for
large k̄ the shape approximates a Gaussian reasonably well.
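For completeness, the “little algebra” behind Eqs. (16) and (18) amounts to standard manipulations of the exponential series:
\langle k \rangle = \sum_{k=0}^{\infty} k\, \frac{\bar{k}^k e^{-\bar{k}}}{k!} = \bar{k}\, e^{-\bar{k}} \sum_{k=1}^{\infty} \frac{\bar{k}^{k-1}}{(k-1)!} = \bar{k},
\langle k^2 \rangle = \sum_{k=0}^{\infty} [k(k-1) + k]\, \frac{\bar{k}^k e^{-\bar{k}}}{k!} = \bar{k}^2 + \bar{k} = \bar{k}(\bar{k}+1),
so that Var(k) = ⟨k²⟩ − ⟨k⟩² = k̄(k̄ + 1) − k̄² = k̄.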
3.2 Bayes Theorem and Maximum Likelihood
So far we have been thinking of the probability for getting a result k if we know
that the mean value should be k̄. Now suppose we make a measurement and get k
counts, but we don’t know anything about k̄, except that it must be nonnegative,
of course. We may turn the question around and ask what is the most likely value
for k̄, given the result of our measurement. To make this turned-around idea more
concrete, we use the concept of conditional probability. We say that the Poisson
distribution P (k, k̄) tells us the probability that we get k, on the condition that the
mean value is k̄. The notation P (A|C) denotes the probability for getting A, given
that C occurs or C is true. Thus we could write
P(k|\bar{k}) = P(k, \bar{k}) = \frac{\bar{k}^k e^{-\bar{k}}}{k!}.    (19)
Now the reverse question is, “What is the probability that the mean value is k̄, given
that we just made a measurement and got k?”. This probability would be denoted
P (k̄|k). Now a trivial but important theorem due to Bayes states that
P(A|C)\, P(C) = P(C|A)\, P(A),    (20)
where P (C) is the a priori probability for C to occur, regardless of whether the
event A occurs, and P (A) is the a priori probability for A to occur, regardless of
whether the event C occurs. From this theorem we conclude that
P(\bar{k}|k) = P(k|\bar{k})\, P(\bar{k}) / P(k).    (21)
So we need to know P (k̄) and P (k) to make progress. The first is the a priori
probability for getting a particular value for k̄. If we don’t know anything about
k̄, except that it is nonnegative, then we must say that any nonnegative value
whatsoever is equally probable. Thus without benefit of knowing the outcome of
the measurement, we say P (k̄) is constant, independent of k̄ for nonnegative k̄, and
it is zero for negative k̄. So the rhs of this equation reduces simply to
P(\bar{k}|k) = N\, \frac{\bar{k}^k e^{-\bar{k}}}{k!},    (22)
where the normalization factor N = P (k̄)/P (k) can be determined by requiring that
the total probability for having any k̄ is 1. In fact it turns out that N = 1, so
P(\bar{k}|k) = \frac{\bar{k}^k e^{-\bar{k}}}{k!}.    (23)
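The statement that N = 1 can be checked with a standard gamma-function integral: requiring the total probability over all nonnegative k̄ in Eq. (22) to be 1 gives
\int_0^{\infty} \frac{\bar{k}^k e^{-\bar{k}}}{k!}\, d\bar{k} = \frac{\Gamma(k+1)}{k!} = 1,
so the flat-prior likelihood is already correctly normalized and N = 1.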
This distribution is called the likelihood function for the parameter k̄. Notice that
we are now thinking of the rhs as a continuous function of k̄ with fixed k. This result
is very remarkable, since a single measurement is giving us the whole probability
distribution! Recall that if we were to measure the length of a table top, even
if we started by assuming we were going to get a Gaussian distribution, a single
measurement would allow us only to guess x̄ and would tell us nothing about σ.
To get σ takes at least two measurements, and even then we would be putting
ourselves at the mercy of the gods of statistics for taking a chance with only two
measurements. If we weren’t so rash as to assume a Gaussian, we would have
to make many measurements of the length of the table top to get the probability
distribution in the measured length.
We now ask, what is the most probable value of k̄, given that we just found
k? This is the value with maximum likelihood. If we examine the probability
distribution, we see that it peaks at k, just as we might have expected. We may
then ask, what is the error in the determination of this value? This is a tricky
question, because the Poisson distribution is not shaped like a Gaussian distribution.
However, for large k it looks more and more like a Gaussian. Expanding the log of
the Poisson distribution for large k and fixed k̄ − k gives
P(\bar{k}|k) \approx \exp[-(k - \bar{k})^2 / (2\bar{k})] / \sqrt{2\pi\bar{k}},    (24)
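The expansion behind Eq. (24) can be sketched as follows: write ln P(k̄|k) = k ln k̄ − k̄ − ln k! and expand about the maximum at k̄ = k,
\frac{\partial}{\partial \bar{k}}\left(k \ln\bar{k} - \bar{k}\right) = \frac{k}{\bar{k}} - 1 = 0 \;\Rightarrow\; \bar{k} = k, \qquad \frac{\partial^2}{\partial \bar{k}^2}\left(k \ln\bar{k} - \bar{k}\right) = -\frac{k}{\bar{k}^2} \approx -\frac{1}{\bar{k}},
so ln P(k̄|k) ≈ const − (k̄ − k)²/(2k̄); Stirling’s approximation for k! supplies the prefactor 1/√(2πk̄),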
so for large k the error is
\sigma_{\bar{k}} = \sqrt{k}.    (25)
To summarize, a single measurement yields the entire probability distribution. For
large enough k we can say that
\bar{k} = k \pm \sqrt{k}.    (26)
To see how Bayesian statistics works, suppose we repeated the experiment and
got a new value k 0 . What is the probability distribution for k̄ in light of the new
result? Now things have changed, since the a priori probability for k̄ is no longer
constant because we already made one measurement and got k. Instead we have
P(\bar{k}) = \frac{\bar{k}^k e^{-\bar{k}}}{k!},    (27)
so
P(\bar{k}|k') = N\, \frac{\bar{k}^{k'} e^{-\bar{k}}}{k'!}\, \frac{\bar{k}^k e^{-\bar{k}}}{k!}.    (28)
Notice that the likelihood function is now the product of the individual likelihood
functions. A more systematic notation would write this function as P(k̄|k, k'), i.e.
the probability for k̄ having a particular value, given that we made two measurements and found k and k'. The normalization factor N is obtained by requiring the
total probability to be 1. For large k̄, k, and k', the most likely value of k̄ is easily
shown to be just the average
\bar{k} = (k + k')/2,    (29)
as we should have expected.
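A short numerical illustration (a Python sketch; the two counts are made up) evaluates the combined likelihood of Eq. (28) on a grid of k̄ values and locates its maximum:

    import numpy as np
    from math import factorial

    def poisson(k, kbar):
        return kbar**k * np.exp(-kbar) / factorial(k)

    k, kprime = 95, 105                    # two hypothetical measured counts
    kbar = np.linspace(50.0, 150.0, 2001)  # grid of candidate mean values
    likelihood = poisson(k, kbar) * poisson(kprime, kbar)   # product form, Eq. (28)
    print(kbar[np.argmax(likelihood)])     # close to (k + k')/2 = 100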
The Bayesian approach insists that we fold together all of our knowledge about
a parameter in constructing its likelihood function. Thus a generalization of these
results would state that the likelihood function for the parameter set C, given the
independently measured results A_1, A_2, A_3, etc., is just
P(C|A_1, A_2, A_3, \ldots) = N\, P(A_1|C)\, P(A_2|C)\, P(A_3|C) \cdots,    (30)
where N is a normalization factor. Again, this is just the product of the separate
likelihood functions. The result is completely general and applies to any probability
distribution, not just a Poisson distribution. We will use this result in discussing χ²
fits to data as a maximum likelihood search.