Physics 6720 – Introduction to Statistics
December 5, 2012

1 Statistics of Counting

Often an experiment yields a result that can be classified according to a set of discrete events, giving rise to an integer count or set of integer counts as a result. For example, the measurement of radioactive decay may yield the number of counts in a detector over a period of time. The measurement of the scattering of a beam of particles from a target may yield so many counts over a particular range of deflection angles over a period of time. Any series of measurements, the result of which can be classified into histogram bins, produces counts. In this section we discuss briefly the statistics of counting, with particular emphasis on Poisson statistics.

1.1 Binomial Distribution

We begin with the binomial distribution. Here we consider an experiment that is repeated many times. There are two possible outcomes: A and B. The probability for outcome A is p and the probability for outcome B is 1 − p. We assume that each experiment has the same probability for each outcome and that there is no correlation between the outcome of one experiment and that of another. We may then ask: out of N repetitions of the experiment, what is the probability that we get A exactly k times?

For example, suppose N = 4 and k = 3. The answer is found by first asking for the probability of one particular sequence of outcomes, say AABA. That probability is just the product of the probabilities for each event: p · p · (1 − p) · p. This statement makes use of the fact that there is no correlation between one experiment and another. Since our question doesn't ask for a particular order of outcomes, but just any order that yields 3 A's out of 4 trials, we then ask how many different ways there are of getting 3 A's. We can enumerate them: BAAA, ABAA, AABA, and AAAB. Since the probability for each is the same, the probability for any of them is four times the probability for just one of them. So the probability of getting 3 A's out of 4 trials is 4p³(1 − p).

The general expression is called the binomial distribution. The probability of getting k A's (and N − k B's) out of N trials is

    P(k, N) = \frac{N!}{k!\,(N-k)!}\, p^k (1-p)^{N-k}.    (1)

Notice that the binomial probabilities generate the binomial series, which adds up to 1, as it should:

    \sum_{k=0}^{N} P(k, N) = [p + (1-p)]^N = 1.    (2)
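A few lines of Python are enough to check Eq. (1) numerically. In the sketch below, the function name binomial_prob and the value p = 0.3 are illustrative choices, not part of the notes; it reproduces the worked example 4p³(1 − p) and verifies the normalization of Eq. (2).

    from math import comb

    def binomial_prob(k, N, p):
        """Probability of exactly k A's in N independent trials, Eq. (1)."""
        return comb(N, k) * p**k * (1 - p)**(N - k)

    p = 0.3
    # The worked example: 3 A's out of 4 trials should equal 4 p^3 (1 - p).
    print(binomial_prob(3, 4, p), 4 * p**3 * (1 - p))

    # Normalization, Eq. (2): the probabilities for k = 0..N sum to 1.
    print(sum(binomial_prob(k, 10, p) for k in range(11)))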
1.2 Poisson Distribution

The Poisson distribution applies in cases in which the probability of getting A is very small compared with other possible outcomes. In that case we would use the binomial formula with a value of k much smaller than N. For example, suppose we were counting radioactive decays as a function of time, and we observe the decays over a time interval dt that is much smaller than the decay lifetime, so the amount of radioactive material available for decay does not change noticeably during the time of observation. If the decay rate is λ and we consider just one single atom, the probability that it decays in a time interval dt is p = λ dt (true as long as this is very small). Call this event A. If it doesn't decay (probability 1 − p), we call it event B. If we now consider N atoms, we can use the binomial distribution to give us the probability that k atoms out of N decay in the time interval dt. We expect that on average there will be k̄ = pN decays. Let us find the probability of getting k events in the limit of large N, with the expected (average) number k̄ held constant as we take the limit.

Notice that to keep k̄ constant we have to decrease p as we increase N. This means we are decreasing the time interval dt as we increase N. To get the probability, we start with the binomial distribution, substitute p = k̄/N, and take the limit:

    P(k, \bar{k}) = \lim_{N \to \infty} \frac{N!}{k!\,(N-k)!} \left(\frac{\bar{k}}{N}\right)^k \left(1 - \frac{\bar{k}}{N}\right)^{N-k}.    (3)

After some algebra, using the Stirling approximation for the factorial and the Taylor expansion of the exponential function, we get the Poisson distribution:

    P(k, \bar{k}) = \frac{\bar{k}^k e^{-\bar{k}}}{k!}.    (4)

This distribution is normalized to 1 as well; the sum generates the Taylor series for the exponential function:

    \sum_{k=0}^{\infty} P(k, \bar{k}) = e^{\bar{k}} e^{-\bar{k}} = 1.    (5)

We will return to the properties of the Poisson distribution after discussing the Gaussian normal distribution.
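The limit in Eq. (3) is easy to watch numerically. The sketch below, with illustrative values k̄ = 5 and k = 3, evaluates the binomial probability at increasing N with p = k̄/N and compares it with the Poisson value of Eq. (4).

    from math import comb, exp, factorial

    def binomial_prob(k, N, p):
        return comb(N, k) * p**k * (1 - p)**(N - k)

    def poisson_prob(k, kbar):
        """Poisson probability, Eq. (4)."""
        return kbar**k * exp(-kbar) / factorial(k)

    kbar, k = 5.0, 3
    # With p = kbar/N fixing the mean, the binomial probability of Eq. (3)
    # approaches the Poisson value as N grows.
    for N in (10, 100, 1000, 10000):
        print(N, binomial_prob(k, N, kbar / N))
    print("Poisson:", poisson_prob(k, kbar))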
2 Normal Distribution

2.1 Populations and Their Means and Standard Deviations

In order to develop confidence in the result of a measurement of a single quantity, such as the length of a table top, we often repeat the measurement process a number of times. The results of the measurement vary because of difficulties in reading the meter stick scale to the last tenth of a millimeter, and for other reasons. Suppose we repeated the measurement N times, getting a list of values xᵢ. Our best guess for the true value is usually the average of these values:

    \bar{x}^* = \sum_{i=1}^{N} x_i / N,    (6)

which is also called the "mean" value of this sample set of observations. In our notation, x̄* indicates our best, imperfect estimate of the true value x̄. If we could repeat the measurement an infinite number of times, ideally, the mean value would approach the "true" value of the measurement. The statistical way to describe what is happening is that our set of N measurements is a sample of N values taken from an infinite "population". The true population mean is given by

    \bar{x} = \lim_{N \to \infty} \sum_{i=1}^{N} x_i / N.    (7)

We might ask of this infinite population: what is the probability of getting a value of x in the range (x, x + dx) when we make a measurement? This probability is expressed in terms of a probability function P(x) as P(x) dx. The factor dx is necessary because as the interval width dx gets smaller, the probability of getting a value in that tiny range must get smaller in proportion to dx. If we make enough measurements, we can begin to construct this probability function, but usually we don't make enough measurements to know it very well. So, for want of any better reason, we often assume that the probability is given by the Gaussian distribution function (normal distribution)

    P(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp[-(x - \bar{x})^2 / (2\sigma^2)].    (8)

In this expression x̄ is the true mean of the population and σ is the true "standard deviation". This probability is normalized so that

    \int_{-\infty}^{\infty} P(x)\, dx = 1,    (9)

i.e., the probability of measuring any value of x at all is 1. The Gaussian distribution is peaked at x = x̄ and falls off on either side of x̄ over a distance in x that is controlled by the value of σ. If σ is large, the falloff is slow and the most probable values of x lie in a broad range around x̄; if σ is small, the falloff is rapid, and the most probable values of x are narrowly clustered around x̄. A property of the Gaussian distribution is that the probability of making a measurement and getting a value in the range x̄ − σ to x̄ + σ is about 68%. (This value is found by calculating the integral under the probability distribution from x̄ − σ to x̄ + σ.) Thus in common usage, we say that for a single measured value of x, the result is x̄ ± σ.

The standard deviation of a quantity is sometimes called the "error" in that quantity, so we say the error in a single measurement is σ. The statement that x lies in the range x̄ ± σ is a statement we can make with 68% confidence. That means the result of a measurement is likely to fall outside this range in 32% of the repetitions of the experiment. A measure of the width of the peak is given by the variance

    \mathrm{Var}(x) = \sigma^2 = \int_{-\infty}^{\infty} (x - \bar{x})^2 P(x)\, dx.    (10)

This is just the average of (x − x̄)² over the population.

If we made an infinite number of measurements, we would be able to determine the two parameters x̄ and σ of the distribution exactly. With a finite set of measurements, however, we can only estimate them. To estimate the mean value, we simply compute the average of the measurements xᵢ:

    \bar{x}^* = \langle x \rangle = \sum_{i=1}^{N} x_i / N.    (11)

Notice that we have put a star on x̄* to distinguish the estimate from the true value x̄. The sample also permits an estimate of the population standard deviation σ:

    \mathrm{Var}(x_1, x_2, \ldots) = \sigma^{*2} = \sum_{i=1}^{N} (x_i - \langle x \rangle)^2 / (N - 1).    (12)

The quantity σ* is the estimated standard deviation, and its square is called the estimated variance of x from the mean value x̄, or just the estimated variance of x. (Because this expression is based on the sample mean x̄* rather than the true mean x̄, dividing by N would give a biased estimate; to compensate for the bias, we divide by N − 1 instead.) Another useful formula is obtained by expanding the square on the right-hand side:

    (N - 1)\,\sigma^{*2} = \sum_i x_i^2 - 2\langle x \rangle \sum_i x_i + N \langle x \rangle^2 = N\left(\langle x^2 \rangle - \langle x \rangle^2\right).    (13)

Here ⟨x²⟩ means the average of the xᵢ². In other words, the estimated variance is just N/(N − 1) times the difference between the average of the squares and the square of the average. As an exercise in this course, you will be asked to write a program that reads a list of values xᵢ and calculates x̄* and σ*.

2.2 The Error in the Estimated Mean

So we see that if we have a finite data "sample", we can get an estimate of the true values of x̄ and σ. But how far is our estimate x̄* from the true value x̄? This is the central question of every measurement, because the answer tells us how much confidence we may put in our result. Measurements without error ranges are meaningless! For example, there is really no meaning to the statement that the length of the table top is 3 meters, because the associated error might be a kilometer. There is meaning only if we can associate an error with this figure and say, for example, that the length is 3.00 meters with an error of plus or minus 0.01 meter.

Now suppose we make N measurements to make up one data "sample" on one day, make another N measurements to form a second sample on the next, and so collect a large number of samples. We determine the estimated mean value x̄* for each sample. What is the probability distribution of this estimated mean value? Note that it is not the same as the probability distribution of the population. One way to see this is to realize that if we take larger and larger samples, almost all of our sample means would be expected to lie closer and closer to the true mean x̄. In fact, a famous theorem of statistics, called the "central limit" theorem, states that the probability distribution of the mean value approaches a Gaussian normal distribution as the sample size increases, regardless of whether the underlying population distribution P(x) is itself Gaussian. The standard deviation of the mean value is estimated by

    \sigma^*_{\mathrm{mean}} = \sigma^* / \sqrt{N},    (14)

where σ* is given by Eq. (12). As the sample size grows, the estimate σ* stabilizes near σ, and the standard deviation of the mean shrinks as 1/√N, so the distribution of sample means x̄* gets sharper around the true value x̄.

So as a result of measuring one sample, we estimate the true mean value to be x̄* ± σ*_mean = x̄* ± σ*/√N. This is a practical formula. With it we need only make N measurements, then estimate the population mean from Eq. (11) and the population standard deviation from Eq. (12), and then compute the error in the mean from Eq. (14). Please bear in mind the difference between σ*, which is the estimate of the error in a single measurement, i.e. the "population" standard deviation, and σ*_mean, which is the estimate of the error in our estimated mean value x̄*.
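This practical recipe, essentially the program mentioned in the exercise above, fits in a few lines of Python. In the sketch below, the function name sample_stats and the table-top readings are made up for illustration; it computes x̄* from Eq. (11), σ* from Eq. (12), and σ*_mean from Eq. (14).

    import math

    def sample_stats(xs):
        """Return the estimated mean, Eq. (11), the estimated standard
        deviation, Eq. (12), and the error in the mean, Eq. (14)."""
        N = len(xs)
        mean = sum(xs) / N                              # x-bar*, Eq. (11)
        var = sum((x - mean)**2 for x in xs) / (N - 1)  # sigma*^2, Eq. (12)
        sigma = math.sqrt(var)
        return mean, sigma, sigma / math.sqrt(N)        # Eq. (14)

    # Hypothetical table-top measurements, in meters.
    xs = [3.01, 2.98, 3.00, 3.02, 2.99]
    mean, sigma, err = sample_stats(xs)
    print(f"length = {mean:.3f} +/- {err:.3f} m "
          f"(single-measurement error {sigma:.3f} m)")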
2.3 Systematic Error

The error we have been discussing so far is a statistical error. It is an error that can be made smaller simply by making more and more measurements of the same type. Another error that occurs all too frequently is a "systematic" error. For example, in measuring the table top, it may happen that our meter stick was slightly miscalibrated, so that it gave consistently large results. We would not be able to correct for such an error by repeating the measurement. We would instead have to recalibrate the meter stick. Sometimes we aren't able to do the recalibration, but are assured by the manufacturer or by some other means that the meter stick agrees with a precise standard to within an error of, say, 0.005 m. We might then quote the result of a measurement of the table top by saying it is 3.158 ± 0.002(stat) ± 0.005(syst) meters, thereby identifying the two sources of error separately.

3 Properties of the Poisson Distribution

3.1 Mean and Variance

[Figure 1: Poisson distribution with mean value k̄ = 5.]

The Poisson distribution for k̄ = 5 is shown in Fig. 1. Notice that it peaks at k = 5. Let us determine the mean and variance of the Poisson distribution. The mean is just

    \langle k \rangle = \sum_{k=0}^{\infty} k\, P(k, \bar{k}).    (15)

A little algebra gives

    \langle k \rangle = \bar{k}.    (16)

This is just what we would expect, of course. The variance is given by

    \mathrm{Var}(k) = \langle k^2 \rangle - \langle k \rangle^2 = \sum_{k=0}^{\infty} k^2\, P(k, \bar{k}) - \bar{k}^2.    (17)

A little algebra shows that the first term is just k̄(k̄ + 1), so

    \mathrm{Var}(k) = \bar{k}.    (18)

This result says that the standard deviation is √k̄. Actually, we have to be careful about using the term "standard deviation" for the Poisson distribution unless k̄ is large. For small k̄ the shape is not very much like a Gaussian, but for large k̄ it approximates a Gaussian reasonably well.
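Eqs. (16) and (18) are easy to verify numerically by truncating the sums in Eqs. (15) and (17); in the sketch below the cutoff k < 100 is an arbitrary choice that leaves a negligible tail for k̄ = 5.

    from math import exp, factorial

    def poisson_prob(k, kbar):
        return kbar**k * exp(-kbar) / factorial(k)

    kbar = 5.0
    ks = range(100)  # truncate the infinite sums; the tail is negligible here
    mean = sum(k * poisson_prob(k, kbar) for k in ks)       # Eq. (15)
    mean_sq = sum(k**2 * poisson_prob(k, kbar) for k in ks)
    var = mean_sq - mean**2                                 # Eq. (17)
    print(mean, var)  # both should come out close to kbar, Eqs. (16) and (18)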
3.2 Bayes Theorem and Maximum Likelihood

So far we have been thinking of the probability of getting a result k when we know that the mean value should be k̄. Now suppose we make a measurement and get k counts, but we don't know anything about k̄ except, of course, that it must be nonnegative. We may turn the question around and ask: what is the most likely value of k̄, given the result of our measurement?

To make this turned-around idea more concrete, we use the concept of conditional probability. We say that the Poisson distribution P(k, k̄) tells us the probability that we get k, on the condition that the mean value is k̄. The notation P(A|C) denotes the probability of getting A, given that C occurs or is true. Thus we could write

    P(k|\bar{k}) = P(k, \bar{k}) = \frac{\bar{k}^k e^{-\bar{k}}}{k!}.    (19)

The reverse question is, "What is the probability that the mean value is k̄, given that we just made a measurement and got k?" This probability would be denoted P(k̄|k).

Now, a trivial but important theorem due to Bayes states that

    P(A|C)\, P(C) = P(C|A)\, P(A),    (20)

where P(C) is the a priori probability for C to occur, regardless of whether the event A occurs, and P(A) is the a priori probability for A to occur, regardless of whether the event C occurs. From this theorem we conclude that

    P(\bar{k}|k) = P(k|\bar{k})\, P(\bar{k}) / P(k).    (21)

So we need to know P(k̄) and P(k) to make progress. The first is the a priori probability of getting a particular value of k̄. If we don't know anything about k̄ except that it is nonnegative, then we must say that any nonnegative value whatsoever is equally probable. Thus, without the benefit of knowing the outcome of the measurement, we say P(k̄) is constant, independent of k̄, for nonnegative k̄, and zero for negative k̄. So the right-hand side of Eq. (21) reduces simply to

    P(\bar{k}|k) = N\, \frac{\bar{k}^k e^{-\bar{k}}}{k!},    (22)

where the normalization factor N = P(k̄)/P(k) can be determined by requiring that the total probability of having any k̄ is 1. In fact it turns out that N = 1, so

    P(\bar{k}|k) = \frac{\bar{k}^k e^{-\bar{k}}}{k!}.    (23)

This distribution is called the likelihood function for the parameter k̄. Notice that we are now thinking of the right-hand side as a continuous function of k̄ with fixed k. This result is very remarkable, since a single measurement is giving us the whole probability distribution! Recall that if we were to measure the length of a table top, even if we started by assuming we were going to get a Gaussian distribution, a single measurement would allow us only to guess x̄ and would tell us nothing about σ. To get σ takes at least two measurements, and even then we would be putting ourselves at the mercy of the gods of statistics by taking a chance with only two measurements. If we weren't so rash as to assume a Gaussian, we would have to make many measurements of the length of the table top to get the probability distribution of the measured length.

We now ask: what is the most probable value of k̄, given that we just found k? This is the value with maximum likelihood. If we examine the probability distribution, we see that it peaks at k̄ = k, just as we might have expected. We may then ask: what is the error in the determination of this value? This is a tricky question, because the Poisson distribution is not shaped like a Gaussian distribution. However, for large k it looks more and more like a Gaussian. Expanding the log of the Poisson distribution for large k and fixed k̄ − k gives

    P(\bar{k}|k) \approx \exp[-(k - \bar{k})^2 / (2\bar{k})] / \sqrt{2\pi \bar{k}},    (24)

so for large k the error is

    \sigma_{\bar{k}} = \sqrt{k}.    (25)

To summarize, a single measurement yields the entire probability distribution. For large enough k we can say that

    \bar{k} = k \pm \sqrt{k}.    (26)

To see how Bayesian statistics works, suppose we repeat the experiment and get a new value k′. What is the probability distribution of k̄ in light of the new result? Now things have changed, since the a priori probability for k̄ is no longer constant: we have already made one measurement and got k. Instead we have

    P(\bar{k}) = \frac{\bar{k}^k e^{-\bar{k}}}{k!},    (27)

so

    P(\bar{k}|k') = N\, \frac{\bar{k}^{k'} e^{-\bar{k}}}{k'!}\, \frac{\bar{k}^k e^{-\bar{k}}}{k!}.    (28)

Notice that the likelihood function is now the product of the individual likelihood functions. A more systematic notation would write this function as P(k̄|k, k′), i.e., the probability of k̄ having a particular value, given that we made two measurements and found k and k′. The normalization factor N is obtained by requiring the total probability to be 1.
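The maximum-likelihood statement above is easy to check with a brute-force grid search over k̄ in Eq. (23); the grid range and spacing, and the count k = 25, are arbitrary illustrative choices.

    from math import exp, factorial

    def likelihood(kbar, k):
        """Likelihood of the mean kbar given one observed count k, Eq. (23)."""
        return kbar**k * exp(-kbar) / factorial(k)

    k = 25
    grid = [i * 0.01 for i in range(1, 6000)]  # kbar from 0.01 to 59.99
    best = max(grid, key=lambda kbar: likelihood(kbar, k))
    print(best)  # the maximum sits at kbar = k, up to the grid spacing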
For large k̄, k, and k′, the most likely value of k̄ is easily shown to be just the average,

    \bar{k} = (k + k')/2,    (29)

as we should have expected.

The Bayesian approach insists that we fold together all of our knowledge about a parameter in constructing its likelihood function. Thus a generalization of these results would state that the likelihood function for the parameter set C, given the independently measured results A₁, A₂, A₃, etc., is just

    P(C|A_1, A_2, A_3, \ldots) = N\, P(A_1|C)\, P(A_2|C)\, P(A_3|C) \cdots,    (30)

where N is a normalization factor. Again, this is just the product of the separate likelihood functions. The result is completely general and applies to any probability distribution, not just the Poisson distribution. We will use this result in discussing χ² fits to data as a maximum likelihood search.
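As a numerical illustration of Eqs. (28)–(30), the sketch below multiplies two single-count likelihood functions and locates the peak of the product; the counts k = 22 and k′ = 28 are made up.

    from math import exp, factorial

    def likelihood(kbar, k):
        return kbar**k * exp(-kbar) / factorial(k)

    k, kprime = 22, 28
    grid = [i * 0.01 for i in range(1, 6000)]
    # The combined likelihood is the product of the separate ones, Eq. (28).
    best = max(grid,
               key=lambda kbar: likelihood(kbar, k) * likelihood(kbar, kprime))
    print(best, (k + kprime) / 2)  # peak at the average, Eq. (29)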