Probability and Statistics
Many of the processes involved in the detection of particles are statistical in nature:
Number of ion pairs created when proton goes through 1 cm of gas
Energy lost by an electron going through 1 mm of lead
The understanding and interpretation of all experimental data depend
on statistical and probabilistic concepts:
“The result of the experiment was inconclusive so we had to use statistics”
how do we extract the best value of a quantity from a set of measurements?
how do we decide if our experiment is consistent/inconsistent with a given theory?
how do we decide if our experiment is internally consistent?
how do we decide if our experiment is consistent with other experiments?
Definition of probability: Let's define probability by example.
Suppose we have N trials and a specified event occurs r times.
For example, the trial could be rolling a die and the event could be rolling a 6.
We define the probability (P) of an event (E) occurring as:
P(E) = r/N as N → ∞
Examples: coin toss P(heads) = 0.5
six sided dice P(6) = 1/6
(P(1) = P(2) = P(3) = P(4) = P(5) = P(6) for “honest” die)
Remember: P(heads) should approach 0.5 the more times you toss the coin.
Obviously for a single coin toss we can never get P(heads) = 0.5!
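As a quick numerical illustration of this frequency definition, here is a minimal Python sketch (the seed and trial counts are arbitrary choices): the observed fraction of 6's drifts toward 1/6 as the number of rolls grows.

```python
import random

random.seed(1)
for n_trials in (100, 10_000, 1_000_000):
    # Count how many of n_trials rolls of an honest die give a 6
    n_six = sum(1 for _ in range(n_trials) if random.randint(1, 6) == 6)
    print(f"N = {n_trials:>9}   P(6) ~ {n_six / n_trials:.4f}   (exact 1/6 = {1/6:.4f})")
```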
Probability and Statistics
By definition probability is a non-negative real number bounded by 0 ≤ P ≤ 1.
If P = 0 then the event never occurs
If P = 1 then the event always occurs
∩ = intersection, ∪ = union
Events are independent if: P(A∩B) = P(A)·P(B)
Events are mutually exclusive (disjoint) if: P(A∩B) = 0, or equivalently P(A∪B) = P(A) + P(B)
The sum (or integral) of all probabilities if they are mutually exclusive must = 1.
Probability can be a discrete or a continuous variable.
In the discrete case only certain values of P are allowed.
example of discrete case: tossing a six-sided die.
P(xi) = Pi here xi = 1, 2, 3, 4, 5, 6 and Pi = 1/6 for all xi.
another example is tossing a coin. Only 2 choices, heads or tails.
For both of the above discrete examples (and in general) when we sum over all
mutually exclusive possibilities:
$\sum_i P(x_i) = 1$
Probability and Statistics
Continuous Probability: In this case P can be any number between 0 and 1.
We can define a “probability density function”, pdf, f (x)
$f(x)\,dx = dP(x \le a \le x + dx)$, with $a$ a continuous variable
The probability for x to be in the range a ≤ x ≤ b is:
$P(a \le x \le b) = \int_a^b f(x)\,dx$
Just like the discrete case the sum of all probabilities must equal 1.
For the continuous case this means:
$\int_{-\infty}^{\infty} f(x)\,dx = 1$
We say that f(x) is normalized to one.
NOTE: The probability for x to be exactly some number is zero since:
$\int_{a}^{a} f(x)\,dx = 0$
Aside: Probability theory is an interesting branch of mathematics.
The calculus of probabilities is closely tied to set theory.
Probability and Statistics
Examples of some common P(x)’s and f(x)’s:
Discrete P(x): binomial, Poisson
Continuous f(x): uniform (i.e. constant), Gaussian, exponential, chi square
How do we describe a probability distribution?
mean, mode, median, and variance
For a continuous distribution these quantities are defined by:
Mean (average): $\mu = \int_{-\infty}^{\infty} x\,f(x)\,dx$
Mode (most probable): $\left.\frac{\partial f(x)}{\partial x}\right|_{x=a} = 0$
Median (50% point): $0.5 = \int_{-\infty}^{a} f(x)\,dx$
Variance (width of distribution): $\sigma^2 = \int_{-\infty}^{\infty} f(x)\,(x-\mu)^2\,dx$
For a discrete distribution the mean and variance are defined by:
$\mu = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad \sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2$
Probability and Statistics
Some continuous pdfs.
(Figure: the Student t distribution for various ν, with ν = 1 giving the Cauchy (Breit-Wigner) distribution and large ν approaching a gaussian; also the chi-square distribution.)
Probability and Statistics
We use results from probability and statistics as a way of indicating how
“good” a measurement is.
The most common quality indicator is relative precision.
Relative precision = [uncertainty of measurement]/measurement
Uncertainty in a measurement is usually the square root of the variance:
σ = standard deviation
Example: we measure a table to be 10 inches with an uncertainty of 1 inch.
The relative precision is 1/10 = 0.1 or 10% (% relative precision).
σ is usually calculated using the technique of "propagation of errors".
However, this σ is not what most people think it is!
We will discuss this in more detail soon.
Probability and Statistics
Some comments on accuracy and precision:
Accuracy: The accuracy of an experiment refers to how close the experimental measurement
is to the true value of the quantity being measured.
Precision: This refers to how well the experimental result has been determined, without
regard to the true value of the quantity being measured.
Note: Just because an experiment is precise it does not mean it is accurate!!
The above figure shows various measurements of the neutron lifetime over the years.
Note the big jump downward in the 1960’s. Are any of these measurements accurate?
Binomial Probability Distributions
For the binomial distribution P is the probability
of m successes out of N trials. Here p is probability of a success
and q=1-p is probability of a failure.
$P(m, N, p) = \frac{N!}{m!\,(N-m)!}\, p^m q^{N-m}$
Does this formula make sense, e.g. if we sum over all possibilities do we get 1?
To show that this distribution is normalized properly, first remember the Binomial
Theorem:
$(a+b)^k = \sum_{l=0}^{k} \binom{k}{l} a^{k-l}\, b^{l}$
For this example a = q = 1 - p and b = p, and (by definition) a +b = 1.
$\sum_{m=0}^{N} P(m,N,p) = \sum_{m=0}^{N} \frac{N!}{m!\,(N-m)!}\, p^m q^{N-m} = (p+q)^N = 1$
Thus the distribution is normalized properly.
(Tossing a coin N times and asking for m heads is a binomial process.)
What is the mean of this distribution?
$\mu = \frac{\sum_{m=0}^{N} m\,P(m,N,p)}{\sum_{m=0}^{N} P(m,N,p)} = \sum_{m=0}^{N} m \binom{N}{m} p^m q^{N-m}$
A cute way of evaluating the above sum is to take the derivative:
$\frac{\partial}{\partial p}\sum_{m=0}^{N}\binom{N}{m} p^m q^{N-m} = 0 = \sum_{m=0}^{N} m\binom{N}{m} p^{m-1}(1-p)^{N-m} - \sum_{m=0}^{N}\binom{N}{m} p^{m}(N-m)(1-p)^{N-m-1}$
$\sum_{m=0}^{N} m\binom{N}{m} p^{m-1}(1-p)^{N-m} = \sum_{m=0}^{N}\binom{N}{m} p^{m}(N-m)(1-p)^{N-m-1}$
Multiplying both sides by p and using $\sum_{m=0}^{N}\binom{N}{m} p^{m}(1-p)^{N-m} = 1$:
$\mu = \sum_{m=0}^{N} m\binom{N}{m} p^{m}(1-p)^{N-m} = p(1-p)^{-1}\left[N(1) - \mu\right]$
$\Rightarrow \mu = Np$ for a binomial distribution.
Binomial Probability Distribution
What’s the variance of a binomial distribution?
Using a trick similar to the one used for the average we find:
$\sigma^2 = \frac{\sum_{m=0}^{N}(m-\mu)^2\,P(m,N,p)}{\sum_{m=0}^{N} P(m,N,p)} = Npq$
Example 1: Suppose you observed m special events (or successes) in a sample of N
events. The measured probability (sometimes called the "efficiency") for a special event to
occur is ε = m/N. What is the error (standard deviation σ_ε) in ε? Since N is a
fixed quantity it is plausible (we will show it soon) that the error in ε is related to the
error (standard deviation σ_m) in m by σ_ε = σ_m/N.
This leads to:
$\sigma_\varepsilon = \frac{\sigma_m}{N} = \frac{\sqrt{Npq}}{N} = \frac{\sqrt{N\varepsilon(1-\varepsilon)}}{N} = \sqrt{\frac{\varepsilon(1-\varepsilon)}{N}}$
This is sometimes called the "error on the efficiency".
Thus you want to have a sample (N) as large as possible to reduce the uncertainty in the
probability measurement!
Note: σ_ε, the "error on the efficiency", → 0 as ε → 0 or ε → 1.
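A short Python sketch (not from the original notes; the values of p and N are made up) that checks this "error on the efficiency" formula against the spread of ε = m/N over many simulated binomial experiments:

```python
import math
import random

random.seed(2)
p_true, N, n_expts = 0.25, 400, 5000   # illustrative choices

# Each experiment measures an efficiency eps = m/N from N binomial trials
effs = []
for _ in range(n_expts):
    m = sum(1 for _ in range(N) if random.random() < p_true)
    effs.append(m / N)

mean_eps = sum(effs) / n_expts
spread = math.sqrt(sum((e - mean_eps) ** 2 for e in effs) / n_expts)
formula = math.sqrt(mean_eps * (1 - mean_eps) / N)
print(f"observed spread of eps = {spread:.4f}")
print(f"sqrt(eps(1-eps)/N)     = {formula:.4f}")
```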
Binomial Probability Distributions
Example 2: Suppose a baseball player's batting average is 0.333 (1 for 3 on average).
Consider the case where the player either gets a hit or makes an out (forget about walks here!).
In this example: p = prob. for a hit = 0.333 and q = 1 - p = 0.667 (prob. for "no hit").
On average how many hits does the player get in 100 at bats?
μ = Np = 100(0.33) = 33 hits
What's the standard deviation of the number of hits in 100 at bats?
σ = (Npq)^(1/2) = (100·0.33·0.67)^(1/2) ≈ 4.7 hits
Thus we expect ≈ 33 ± 5 hits per 100 at bats.
Consider a game where the player bats 4 times:
Probability of 0/4 = (0.67)⁴ = 20%
Probability of 1/4 = [4!/(3!1!)](0.33)¹(0.67)³ = 40%
Probability of 2/4 = [4!/(2!2!)](0.33)²(0.67)² = 29%
Probability of 3/4 = [4!/(1!3!)](0.33)³(0.67)¹ = 10%
Probability of 4/4 = [4!/(0!4!)](0.33)⁴(0.67)⁰ = 1%
Note: the probability of getting at least one hit is: 1 - P(0) = 0.8
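These batting-average numbers are easy to reproduce; a minimal Python sketch:

```python
from math import comb

p, N = 0.33, 4   # probability of a hit, at bats per game
for m in range(N + 1):
    prob = comb(N, m) * p**m * (1 - p)**(N - m)
    print(f"P({m}/{N}) = {prob:.2f}")

print(f"P(at least one hit) = {1 - (1 - p)**N:.2f}")
```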
Poisson Probability Distribution
Another important discrete distribution is the Poisson distribution. Consider the following conditions:
a) p is very small and approaches 0.
For example, suppose we had a 100-sided die instead of a 6-sided die. Here p = 1/100 instead of 1/6. Suppose we had a 1000-sided die, p = 1/1000, etc.
b) N is very large, it approaches ∞.
For example, instead of throwing 2 dice, we could throw 100 or 1000 dice.
c) The product Np is finite.
(The number of counts in a time interval is a Poisson process.)
A good example of the above conditions occurs when one considers radioactive decay.
Suppose we have 25 mg of an element. This is ≈ 10²⁰ atoms.
Suppose the lifetime (τ) of this element is 10¹² years ≈ 5×10¹⁹ seconds.
The probability of a given nucleus to decay in one second is 1/τ = 2×10⁻²⁰/sec.
For this example: N = 10²⁰ (very large), p = 2×10⁻²⁰ (very small), Np = 2 (finite!)
We can derive an expression for the Poisson distribution by taking the appropriate limits of the binomial distribution.
$P(m,N,p) = \frac{N!}{m!\,(N-m)!}\, p^m q^{N-m}$
Using condition b) we obtain:
$q^{N-m} = (1-p)^{N-m} = 1 - p(N-m) + \frac{p^2(N-m)(N-m-1)}{2!} - \cdots \approx 1 - pN + \frac{(pN)^2}{2!} - \cdots \approx e^{-pN}$
$\frac{N!}{(N-m)!} = \frac{N(N-1)\cdots(N-m+1)(N-m)!}{(N-m)!} \approx N^m$
Putting this all together we obtain:
$P(m,N,p) \approx \frac{N^m p^m e^{-pN}}{m!} = \frac{\mu^m e^{-\mu}}{m!}$
Here we've let μ = pN.
It is easy to show that: μ = Np = mean of a Poisson distribution, and σ² = Np = μ = variance of a Poisson distribution.
Note: m is always an integer ≥ 0; however, μ does not have to be an integer.
Poisson Probability Distribution
Radioactivity Example:
a) What's the probability of zero decays in one second if the average = 2 decays/sec?
$P(0,2) = \frac{e^{-2}\,2^0}{0!} = e^{-2} \approx 0.135 = 13.5\%$
b) What's the probability of more than one decay in one second if the average = 2 decays/sec?
$P(>1,2) = 1 - P(0,2) - P(1,2) = 1 - \frac{e^{-2}\,2^0}{0!} - \frac{e^{-2}\,2^1}{1!} = 1 - e^{-2} - 2e^{-2} \approx 0.594 = 59.4\%$
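A minimal Python check of these two Poisson probabilities (μ = 2 decays/sec as above):

```python
from math import exp, factorial

def poisson(m, mu):
    """Poisson probability of observing m counts when the mean is mu."""
    return exp(-mu) * mu**m / factorial(m)

mu = 2.0
print(f"P(0 decays)          = {poisson(0, mu):.3f}")                       # ~0.135
print(f"P(more than 1 decay) = {1 - poisson(0, mu) - poisson(1, mu):.3f}")  # ~0.594
```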
c) Estimate the most probable number of decays/sec.
We want: $\left.\frac{\partial P(m,\mu)}{\partial m}\right|_{m=m^*} = 0$
To solve this problem it's convenient to maximize ln P(m, μ) instead of P(m, μ):
$\ln P(m,\mu) = \ln\!\left(\frac{e^{-\mu}\mu^m}{m!}\right) = -\mu + m\ln\mu - \ln m!$
In order to handle the factorial when we take the derivative we use Stirling's approximation: $\ln(m!) \approx m\ln m - m$
$\left.\frac{\partial \ln P(m,\mu)}{\partial m}\right|_{m^*} = \frac{\partial}{\partial m}\left(-\mu + m^*\ln\mu - m^*\ln m^* + m^*\right) = \ln\mu - \ln m^* - 1 + 1 = 0 \;\Rightarrow\; m^* = \mu$
In this example the most probable value for m is just the average of the distribution. Therefore if you observe m events in an experiment, the error on m is √m.
Caution: The above derivation is only approximate since we used Stirling's approximation, which is only
valid for large m. Another subtle point is that strictly speaking m can only take on integer values
while μ is not restricted to be an integer.
(Figure: comparison of binomial and Poisson distributions with mean μ = 1; one panel shows binomial N = 3, p = 1/3 vs. Poisson, the other binomial N = 10, p = 0.1 vs. Poisson. Not much difference between them here!)
Gaussian Probability Distribution
The Gaussian probability distribution (or “bell shaped curve” or Normal
distribution) is perhaps the most used distribution in all of science. Unlike the
binomial and Poisson distribution the Gaussian is a continuous distribution. It is
given by:
$p(y) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(y-\mu)^2}{2\sigma^2}}$
with μ = mean of distribution (also at the same place as mode and median), σ² = variance of distribution, and y a continuous variable (−∞ < y < ∞).
The probability (P) of y being in the range [a, b] is given by an integral:
$P(a \le y \le b) = \frac{1}{\sigma\sqrt{2\pi}}\int_a^b e^{-\frac{(y-\mu)^2}{2\sigma^2}}\, dy$
Since this integral cannot be evaluated in closed form for arbitrary a and b (at least
no one's figured out how to do it in the last couple of hundred years) the values of
the integral have to be looked up in a table.
The total area under the curve is normalized to one.
In terms of the probability integral we have:
$P(-\infty \le y \le \infty) = \frac{1}{\sigma\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{-\frac{(y-\mu)^2}{2\sigma^2}}\, dy = 1$
Quite often we talk about a measurement being a certain number of
standard deviations () away from the mean () of the Gaussian.
We can associate a probability for a measurement to be
|y − μ| ≥ nσ from the mean just by calculating the area outside of this region.
n      Prob. of exceeding ±nσ
0.67   0.5
1      0.32
2      0.05
3      0.003
4      0.00006
It is very unlikely (< 0.3%) that a measurement taken at random from a gaussian pdf will be more than 3σ from the true mean of the distribution.
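The numbers in the table above follow from the gaussian integral; a minimal Python sketch using the error function:

```python
from math import erf, sqrt

def prob_exceeding(n_sigma):
    """Two-sided probability that a gaussian measurement falls more than n_sigma from the mean."""
    return 1.0 - erf(n_sigma / sqrt(2.0))

for n in (0.67, 1, 2, 3, 4):
    print(f"n = {n:>4}   P(|y - mu| > n sigma) = {prob_exceeding(n):.5f}")
```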
Central Limit Theorem
Why is the gaussian pdf so important ?
“Things that are the result of the addition of lots of small effects tend to become Gaussian”
The above is a crude statement of the Central Limit Theorem:
A more exact statement is:
Let Y₁, Y₂, ..., Yₙ be an infinite sequence of independent random variables each with the
same probability distribution. Suppose that the mean (μ) and variance (σ²) of this
distribution are both finite. Then for any numbers a and b:
$\lim_{n\to\infty} P\!\left(a \le \frac{Y_1 + Y_2 + \cdots + Y_n - n\mu}{\sigma\sqrt{n}} \le b\right) = \frac{1}{\sqrt{2\pi}}\int_a^b e^{-y^2/2}\, dy$
(Actually, the Y's can be from different pdf's!)
Thus the C.L.T. tells us that under a wide range of circumstances the probability
distribution that describes the sum of random variables tends towards a Gaussian
distribution as the number of terms in the sum → ∞.
Alternatively,
$\lim_{n\to\infty} P\!\left(a \le \frac{\overline{Y} - \mu}{\sigma/\sqrt{n}} \le b\right) = \lim_{n\to\infty} P\!\left(a \le \frac{\overline{Y} - \mu}{\sigma_m} \le b\right) = \frac{1}{\sqrt{2\pi}}\int_a^b e^{-y^2/2}\, dy$
Note: σ_m is sometimes called "the error in the mean" (more on that later).
For the CLT to be valid:
μ and σ of the pdf must be finite
No one term in the sum should dominate the sum
Central Limit Theorem
Best illustration of the CLT.
a) Take 12 numbers (ri) from your computer’s random number generator
b) add them together
c) Subtract 6
d) get a number that is from a gaussian pdf !
Computer’s random number generator gives numbers distributed uniformly in the interval [0,1]
A uniform pdf in the interval [0,1] has μ = 1/2 and σ² = 1/12.
$P\!\left(a \le \frac{\sum_{i=1}^{12} r_i - 12\mu}{\sigma\sqrt{12}} \le b\right) = P\!\left(-6 \le \frac{\sum_{i=1}^{12} r_i - 12\cdot\tfrac{1}{2}}{\sqrt{1/12}\cdot\sqrt{12}} \le 6\right) = P\!\left(-6 \le \sum_{i=1}^{12} r_i - 6 \le 6\right) \approx \frac{1}{\sqrt{2\pi}}\int_{-6}^{6} e^{-y^2/2}\, dy$
Thus the sum of 12 uniform random numbers minus 6 is distributed as if it came from a gaussian pdf with μ = 0 and σ = 1.
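A minimal Python sketch of this recipe (the sample size is an arbitrary choice): the sum of 12 uniform random numbers minus 6 indeed comes out with mean ≈ 0, width ≈ 1, and roughly 68% of the values within ±1.

```python
import random
import statistics

random.seed(3)
n_samples = 100_000

# Sum 12 uniform [0,1] numbers and subtract 6
values = [sum(random.random() for _ in range(12)) - 6.0 for _ in range(n_samples)]

print(f"mean  = {statistics.mean(values):.3f}")    # expect ~0
print(f"sigma = {statistics.stdev(values):.3f}")   # expect ~1
within = sum(1 for v in values if abs(v) < 1) / n_samples
print(f"fraction within +-1 = {within:.3f}")       # expect ~0.68
```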
(Figure: histograms of A) 5000 single random numbers, B) 5000 pairs (r₁+r₂), C) 5000 triplets (r₁+r₂+r₃), D) 5000 12-plets (r₁+···+r₁₂), and E) 5000 12-plets (r₁+···+r₁₂−6) of random numbers, the last compared with a gaussian with μ = 0 and σ = 1.)
Propagation of errors
Suppose we measure the branching fraction BR(D⁰→π⁺π⁻) using the number
of produced D⁰ mesons (N_produced), the number of D⁰→π⁺π⁻ decays found
(N_found), and the efficiency for finding a D⁰→π⁺π⁻ decay (ε):
BR(D⁰→π⁺π⁻) = N_found/(ε·N_produced)
If we know the uncertainties (σ's) of N_produced, N_found, and ε, what is the
uncertainty on BR(D⁰→π⁺π⁻)?
More formally we could ask: given that we have a functional relationship between several
measured variables (x, y, z), i.e.
Q = f(x, y, z)
what is the uncertainty in Q if the uncertainties in x, y, and z are known? Usually when we talk
about uncertainties in a measured variable such as x we assume that the value of x
represents the mean of a Gaussian distribution and the uncertainty in x is the standard deviation
(σ) of the Gaussian distribution. A word of caution here: not all measurements can be
represented by Gaussian distributions, but more on that later!
To answer this question we use a technique called Propagation of Errors.
Propagation of errors
To calculate the variance in Q as a function of the variances in x and y we use the following:
$\sigma_Q^2 = \sigma_x^2\left(\frac{\partial Q}{\partial x}\right)^2 + \sigma_y^2\left(\frac{\partial Q}{\partial y}\right)^2 + 2\sigma_{xy}\frac{\partial Q}{\partial x}\frac{\partial Q}{\partial y}$
Note: if x and y are uncorrelated (σ_xy = 0) then the last term in the above equation is 0.
Assume we have several measurement of the quantities x (e.g. x1, x2...xi) and y (e.g. y1, y2...yi).
We can calculate the average of x and y using:
$\mu_x = \frac{1}{N}\sum_{i=1}^{N} x_i \quad\text{and}\quad \mu_y = \frac{1}{N}\sum_{i=1}^{N} y_i$
Let's define $Q_i \equiv f(x_i, y_i)$ and $\mu_Q \equiv f(\mu_x, \mu_y)$, i.e. f evaluated at the average values.
Now expand $Q_i$ about the average values:
$Q_i = f(\mu_x,\mu_y) + (x_i-\mu_x)\left.\frac{\partial Q}{\partial x}\right|_{\mu_x} + (y_i-\mu_y)\left.\frac{\partial Q}{\partial y}\right|_{\mu_y} + \text{higher order terms}$
Assume that we can neglect the higher order terms (i.e. the measured values are close to the average values). We can rewrite the above as:
$Q_i \approx \mu_Q + (x_i-\mu_x)\left.\frac{\partial Q}{\partial x}\right|_{\mu_x} + (y_i-\mu_y)\left.\frac{\partial Q}{\partial y}\right|_{\mu_y}$
We would like to find the variance of Q. By definition the variance of Q is just:
$\sigma_Q^2 = \frac{1}{N}\sum_{i=1}^{N}(Q_i - \mu_Q)^2$
Propagation of errors
If we expand the summation using the definition of Q_i − μ_Q we get:
$\sigma_Q^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i-\mu_x)^2\left(\frac{\partial Q}{\partial x}\right)^2 + \frac{1}{N}\sum_{i=1}^{N}(y_i-\mu_y)^2\left(\frac{\partial Q}{\partial y}\right)^2 + \frac{2}{N}\sum_{i=1}^{N}(x_i-\mu_x)(y_i-\mu_y)\frac{\partial Q}{\partial x}\frac{\partial Q}{\partial y}$
Since the derivatives are all evaluated at the average values (μ_x, μ_y) we can pull the derivatives outside
of the summations. Finally, remembering the definition of the variance, we can write:
$\sigma_Q^2 = \sigma_x^2\left(\frac{\partial Q}{\partial x}\right)^2 + \sigma_y^2\left(\frac{\partial Q}{\partial y}\right)^2 + \frac{2}{N}\frac{\partial Q}{\partial x}\frac{\partial Q}{\partial y}\sum_{i=1}^{N}(x_i-\mu_x)(y_i-\mu_y)$
If the measurements are uncorrelated then the summation in the above equation will be very close to
zero (if the variables are truly uncorrelated then the sum is 0) and can be neglected. Thus for
uncorrelated variables we have:
$\sigma_Q^2 = \sigma_x^2\left(\frac{\partial Q}{\partial x}\right)^2 + \sigma_y^2\left(\frac{\partial Q}{\partial y}\right)^2 \qquad\text{(uncorrelated errors)}$
If however x and y are correlated, then we define σ_xy using:
$\sigma_{xy} = \frac{1}{N}\sum_{i=1}^{N}(x_i-\mu_x)(y_i-\mu_y)$
The variance in Q including correlations is given by:
$\sigma_Q^2 = \sigma_x^2\left(\frac{\partial Q}{\partial x}\right)^2 + \sigma_y^2\left(\frac{\partial Q}{\partial y}\right)^2 + 2\sigma_{xy}\frac{\partial Q}{\partial x}\frac{\partial Q}{\partial y} \qquad\text{(correlated errors)}$
Example: Error in BR(D⁰→π⁺π⁻). Assume: N_pr = 10⁶ ± 10³, N_fd = 10 ± 3, ε = 0.02 ± 0.002.
$\sigma_{BR}^2 = \sigma_{N_{fd}}^2\left(\frac{\partial BR}{\partial N_{fd}}\right)^2 + \sigma_{\varepsilon}^2\left(\frac{\partial BR}{\partial \varepsilon}\right)^2 + \sigma_{N_{pr}}^2\left(\frac{\partial BR}{\partial N_{pr}}\right)^2 = \sigma_{N_{fd}}^2\left(\frac{1}{\varepsilon N_{pr}}\right)^2 + \sigma_{\varepsilon}^2\left(\frac{N_{fd}}{\varepsilon^2 N_{pr}}\right)^2 + \sigma_{N_{pr}}^2\left(\frac{N_{fd}}{\varepsilon N_{pr}^2}\right)^2$
$\sigma_{BR} = \left[\frac{9}{(0.02\times10^6)^2} + \frac{(4\times10^{-6})(10)^2}{(0.02)^4(10^6)^2} + \frac{(10^3)^2(10)^2}{(0.02)^2(10^6)^4}\right]^{1/2} \approx 1.6\times10^{-4}$
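The same number can be obtained directly from the propagation-of-errors formula; a minimal Python sketch using the inputs quoted above:

```python
from math import sqrt

# Measured inputs from the example above
N_fd, sig_N_fd = 10.0, 3.0
N_pr, sig_N_pr = 1.0e6, 1.0e3
eps, sig_eps = 0.02, 0.002

BR = N_fd / (eps * N_pr)

# Partial derivatives of BR = N_fd / (eps * N_pr)
dBR_dNfd = 1.0 / (eps * N_pr)
dBR_deps = -N_fd / (eps**2 * N_pr)
dBR_dNpr = -N_fd / (eps * N_pr**2)

sig_BR = sqrt((sig_N_fd * dBR_dNfd)**2 + (sig_eps * dBR_deps)**2 + (sig_N_pr * dBR_dNpr)**2)
print(f"BR = {BR:.1e} +- {sig_BR:.1e}")   # ~5.0e-04 +- 1.6e-04
```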
Propagation of errors
Example: The error in the average.
The average of several measurements, each with the same uncertainty (σ), is given by:
$\mu = \frac{x_1 + x_2 + \cdots + x_n}{n}$
$\sigma_\mu^2 = \sigma_{x_1}^2\left(\frac{\partial\mu}{\partial x_1}\right)^2 + \sigma_{x_2}^2\left(\frac{\partial\mu}{\partial x_2}\right)^2 + \cdots + \sigma_{x_n}^2\left(\frac{\partial\mu}{\partial x_n}\right)^2 = n\,\sigma_x^2\left(\frac{1}{n}\right)^2 = \frac{\sigma_x^2}{n}$
$\sigma_\mu = \frac{\sigma}{\sqrt{n}} \qquad\text{("error in the mean")}$
This is a very important result! It says that we can determine the mean better by combining measurements.
Unfortunately, the precision only increases as the square root of the number of measurements.
Do not confuse σ_μ with σ!
σ is related to the width of the pdf (e.g. gaussian) that the measurements come from.
It does not get smaller as we combine measurements.
A slightly more complicated problem is the case of the weighted average, i.e. unequal σ's:
$\mu = \frac{x_1/\sigma_1^2 + x_2/\sigma_2^2 + \cdots + x_n/\sigma_n^2}{1/\sigma_1^2 + 1/\sigma_2^2 + \cdots + 1/\sigma_n^2}$
Using the same procedure as above we obtain:
$\sigma_\mu^2 = \frac{1}{1/\sigma_1^2 + 1/\sigma_2^2 + \cdots + 1/\sigma_n^2} \qquad\text{("error in the weighted mean")}$
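A minimal Python sketch of the weighted average and its error (the measurement values here are made up for illustration):

```python
from math import sqrt

# Hypothetical measurements of the same quantity with unequal errors
x = [10.2, 9.8, 10.5, 9.9]
sigma = [0.3, 0.2, 0.5, 0.2]

weights = [1.0 / s**2 for s in sigma]
mean = sum(w * xi for w, xi in zip(weights, x)) / sum(weights)
err = sqrt(1.0 / sum(weights))   # "error in the weighted mean"

print(f"weighted mean = {mean:.3f} +- {err:.3f}")
```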
Propagation of errors
Problems with Propagation of Errors:
In calculating the variance using propagation of errors we usually assume that we are dealing with Gaussian like errors
for the measured variable (e.g. x). Unfortunately, just because x is described by a Gaussian distribution
does not mean that f(x) will be described by a Gaussian distribution.
Example: when the new distribution is Gaussian.
Let y = Ax, with A a constant and x a gaussian variable. Let the pdf for x be gaussian:
$p(x,\mu_x,\sigma_x)\,dx = \frac{1}{\sigma_x\sqrt{2\pi}}\, e^{-\frac{(x-\mu_x)^2}{2\sigma_x^2}}\, dx$
Then μ_y = Aμ_x and σ_y = Aσ_x. Putting this into the above equation we have:
$p(x,\mu_x,\sigma_x)\,dx = \frac{1}{\sigma_x\sqrt{2\pi}}\, e^{-\frac{(y/A-\mu_y/A)^2}{2\sigma_x^2}}\, dx = \frac{1}{(\sigma_y/A)\sqrt{2\pi}}\, e^{-\frac{(y-\mu_y)^2}{2\sigma_y^2}}\,\frac{dy}{A} = \frac{1}{\sigma_y\sqrt{2\pi}}\, e^{-\frac{(y-\mu_y)^2}{2\sigma_y^2}}\, dy = p(y,\mu_y,\sigma_y)\,dy$
Thus the new pdf for y, p(y, μ_y, σ_y), is also given by a gaussian probability distribution function.
(Figure: y = 2x with x = 10 ± 2. Start with a gaussian with μ = 10, σ = 2; you get another gaussian with μ = 20, σ = 4.)
Propagation of errors
Example when the new distribution is non-Gaussian: Let y = 2/x
The transformed probability distribution function for y does not have the form of a Gaussian pdf.
(Figure: y = 2/x with x = 10 ± 2. Start with a gaussian with μ = 10, σ = 2; you DO NOT get another gaussian! You get a pdf with μ = 0.2, σ = 0.04. This new pdf has longer tails than a gaussian pdf: Prob(y > μ_y + 5σ_y) = 5×10⁻³, versus 3×10⁻⁷ for a gaussian.)
Unphysical situations can arise if we use the propagation of errors results blindly!
Example: Suppose we measure the volume of a cylinder: V = πR²L.
Let R = 1 cm exactly, and L = 1.0 ± 0.5 cm.
Using propagation of errors we have: σ_V = πR²σ_L = π/2 cm³,
so V = π ± π/2 cm³.
However, if the error on V (σ_V) is to be interpreted in the Gaussian sense then
the above result says that there's a finite probability (≈ 3%) that the volume (V) is < 0, since V is
only two standard deviations away from 0!
Clearly this is unphysical! Care must be taken in interpreting the meaning of σ_V.
Maximum Likelihood Method (MLM)
Suppose we are trying to measure the true value of some quantity (xT). We make repeated
measurements of this quantity {x1 , x2 ...x n} . The standard way to estimate xT from our
measurements is to calculate the mean value of the measurements:
$\mu_x = \frac{\sum_{i=1}^{N} x_i}{N}$
and set $x_T = \mu_x$.
Does this procedure make sense?
The MLM answers this question and provides a method for estimating parameters from existing data.
Statement of the Maximum Likelihood Method (MLM):
Assume we have made N measurements of x {x1, x 2 ... xn }.
Assume we know the probability distribution function that describes x: f (x, a )
Assume we want to determine the parameter a.
The MLM says that we pick a so as to maximize the probability of getting the
measurements (the xi's) that we obtained!
How do we use the MLM?
The probability of measuring x1 is f (x1 , a )
The probability of measuring x2 is f (x2 , a )
The probability of measuring xn is f (xn , a)
If the measurements are independent, the probability of getting our measurements is:
$L = f(x_1,a)\,f(x_2,a)\cdots f(x_n,a)$
L is called the Likelihood Function. It is convenient to write L as:
$L = \prod_{i=1}^{N} f(x_i, a)$
We want to pick the a that maximizes L. Thus we want to solve:
$\left.\frac{\partial L}{\partial a}\right|_{a=a^*} = 0$
Maximum Likelihood Method (MLM)
In practice it is easier to maximize lnL rather than L itself since lnL turns the product into
a summation. However, maximizing lnL gives the same a since L and lnL are both
maximum at the same time.
$\ln L = \sum_{i=1}^{N} \ln f(x_i, a)$
The maximization condition is now:
$\left.\frac{\partial \ln L}{\partial a}\right|_{a=a^*} = \sum_{i=1}^{N}\left.\frac{\partial \ln f(x_i,a)}{\partial a}\right|_{a=a^*} = 0$
Note: acould be an array of parameters or just a single variable. The equations to
determine a range from simple linear equations to coupled non-linear equations.
Example: Let f(xa) be given by a Gaussian distribution and let athe mean of the Gaussian. We
want the best estimate of a from our set of n measurements (x1, x2, ..xn). Let’s assume that  is the
same for each measurement.

1
f ( xi , a ) 
e
 2
The likelihood function for this problem is:
 (x  a)2
n
( x  a )2
(x i  a ) 2
22
( x
 a)2
 (x
 a )2
n
n ( x  a )2
i
2 2
i
1
2
n

1
 1

 1
 i
2 2
2 2
2 2
2 2
L   f ( xi ,a )  
e

e
e
e

e 1
i 1
i 1 
2
 2 
 2  
We want to find the a that maximizes the likelihood function (actually we will maximize lnL).
n ( x  a )2 
 ln L
 
 1

i

n
ln




 0
 2  i1 2 2
a
a 

n
n
Since is the same for each data point we can factor it out:
n
 xi 
i1
n
 a  0 or
i 1
Finally, solving for a we have:
n
 xi
 na
i 1
Average !
1 n
 xi
n i1
For the case where the are different for each data point we get the weighted average:
n
xi
( 2 )
i 1 
a  n 1i
( 2 )
a 
i 1
Maximum Likelihood Method (MLM)
Example: Let f(x,a) be given by a Poisson distribution with a the mean of the Poisson.
We want the best estimate of a from our set of n measurements (x₁, x₂, ..., xₙ). The Poisson distribution
is a discrete distribution, so the (x₁, x₂, ..., xₙ) are integers. The Poisson distribution is given by:
$f(x,a) = \frac{e^{-a} a^{x}}{x!}$
The likelihood function for this problem is:
$L = \prod_{i=1}^{n} f(x_i,a) = \frac{e^{-a}a^{x_1}}{x_1!}\cdot\frac{e^{-a}a^{x_2}}{x_2!}\cdots\frac{e^{-a}a^{x_n}}{x_n!} = \frac{e^{-na}\,a^{\sum_{i=1}^{n}x_i}}{x_1!\,x_2!\cdots x_n!}$
We want to find the a that maximizes the likelihood function (actually we will maximize lnL):
$\frac{d\ln L}{da} = \frac{d}{da}\left[-na + \ln a\sum_{i=1}^{n}x_i - \ln(x_1!\,x_2!\cdots x_n!)\right] = -n + \frac{\sum_{i=1}^{n}x_i}{a} = 0$
$a = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad\text{the average!}$
Some general properties of the maximum likelihood method:
a) For large data samples (large n) the likelihood function, L, approaches a Gaussian distribution.
b) Maximum likelihood estimates are usually consistent.
By consistent we mean that for large n the estimates converge to the true value of the
parameters we wish to determine.
c) For many instances the estimate from MLM is unbiased.
Unbiased means that for all sample sizes the parameter of interest is calculated correctly.
d) The maximum likelihood estimate of a parameter is the estimate with the smallest variance. Cramer-Rao bound
We say the estimate is efficient.
e) The maximum likelihood estimate is sufficient.
By sufficient we mean that it uses all the information in the observations (the xi’s).
f) The solution from MLM is unique.
The bad news is that we must know the correct probability distribution for the problem at hand!
Maximum Likelihood Method (MLM)
How do we calculate errors (σ's) using the MLM?
Start by looking at the case where we have a gaussian pdf. The likelihood function is:
$L = \prod_{i=1}^{n} f(x_i,a) = \prod_{i=1}^{n}\frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x_i-a)^2}{2\sigma^2}} = \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^{n} e^{-\sum_{i=1}^{n}\frac{(x_i-a)^2}{2\sigma^2}}$
It is easier to work with lnL:
$\ln L = -n\ln(\sigma\sqrt{2\pi}) - \sum_{i=1}^{n}\frac{(x_i-a)^2}{2\sigma^2}$
If we take two derivatives of lnL with respect to a we get:
$\frac{\partial \ln L}{\partial a} = \sum_{i=1}^{n}\frac{x_i-a}{\sigma^2}$
$\frac{\partial^2 \ln L}{\partial a^2} = \sum_{i=1}^{n}\left(-\frac{1}{\sigma^2}\right) = -\frac{n}{\sigma^2}$
For the case of a gaussian pdf we get the familiar result:
$\sigma_a^2 = \frac{\sigma^2}{n} = \left[-\frac{\partial^2 \ln L}{\partial a^2}\right]^{-1}$
The big news here is that the variance of the parameter of interest
is related to the 2nd derivative of the likelihood function.
Since our example uses a gaussian pdf the result is exact. More important, the result is
asymptotically true for ALL pdf’s since for large samples (n) all likelihood functions
become “gaussian”.
Maximum Likelihood Method (MLM)
The previous example was for one variable. We can generalize the result to the case where
we determine several parameters from the likelihood function (e.g. a₁, a₂, ..., aₙ):
$V_{ij} = \left[-\frac{\partial^2 \ln L}{\partial a_i\,\partial a_j}\right]^{-1}$
Here Vij is a matrix, (the “covariance matrix” or “error matrix”) and it is evaluated at the values
of (a1, a2, … an) that maximize the likelihood function.
In practice it is often very difficult or impossible to evaluate the 2nd derivatives.
The procedure most often used to determine the variances in the parameters relies on the property
that the likelihood function becomes gaussian (or parabolic) asymptotically.
We expand lnL about the ML estimate for the parameters. For the one parameter case we have:
$\ln L(a) = \ln L(a^*) + \left.\frac{\partial \ln L}{\partial a}\right|_{a^*}(a-a^*) + \frac{1}{2!}\left.\frac{\partial^2 \ln L}{\partial a^2}\right|_{a^*}(a-a^*)^2 + \cdots$
Since we are evaluating lnL at the value of a (= a*) that maximizes L, the term with the 1st derivative is zero.
Using the expression for the variance of a on the previous page and neglecting higher order terms we find:
$\ln L(a) \approx \ln L_{max} - \frac{(a-a^*)^2}{2\sigma_a^2} \quad\text{or}\quad \ln L(a^* \pm n\,\sigma_a) \approx \ln L_{max} - \frac{n^2}{2}$
Thus we can determine the ±nσ limits on the parameters by finding the values where lnL
decreases by n²/2 from its maximum value.
(This is what MINUIT does.)
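A minimal Python sketch of this procedure (a toy exponential-lifetime fit with a made-up sample size): scan lnL, find its maximum, and take the ±1σ interval where lnL has dropped by 1/2.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.exponential(scale=100.0, size=1000)   # toy decay times, true tau = 100

def lnL(tau):
    # log-likelihood for the exponential pdf f(t) = exp(-t/tau)/tau
    return -len(data) * np.log(tau) - data.sum() / tau

taus = np.linspace(80.0, 130.0, 5001)
vals = np.array([lnL(t) for t in taus])
tau_star, lnL_max = taus[vals.argmax()], vals.max()

# "1 sigma points": where lnL has fallen by 1/2 from its maximum
inside = taus[vals >= lnL_max - 0.5]
print(f"tau* = {tau_star:.1f}, 1-sigma interval ~ [{inside.min():.1f}, {inside.max():.1f}]")
print(f"compare tau*/sqrt(n) = {tau_star / np.sqrt(len(data)):.1f}")
```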
Maximum Likelihood Method (MLM)
Example: MLM and determining the slope and intercept of a line.
Assume we have a set of measurements (x₁, y₁), (x₂, y₂), ..., (xₙ, yₙ), that the
points are thought to come from a straight line, y = a + bx, and that the measurements come
from a gaussian pdf. The likelihood function is:
$L = \prod_{i=1}^{n} f(x_i,a,b) = \prod_{i=1}^{n}\frac{1}{\sigma_i\sqrt{2\pi}}\, e^{-\frac{(y_i - q(x_i,a,b))^2}{2\sigma_i^2}} = \prod_{i=1}^{n}\frac{1}{\sigma_i\sqrt{2\pi}}\, e^{-\frac{(y_i - a - b x_i)^2}{2\sigma_i^2}}$
We wish to find the a and b that maximizes the likelihood function L.
Thus we need to take some derivatives:
$\frac{\partial \ln L}{\partial a} = \frac{\partial}{\partial a}\sum_{i=1}^{n}\left[\ln\!\left(\frac{1}{\sigma_i\sqrt{2\pi}}\right) - \frac{(y_i-a-bx_i)^2}{2\sigma_i^2}\right] = \sum_{i=1}^{n}\frac{y_i-a-bx_i}{\sigma_i^2} = 0$
$\frac{\partial \ln L}{\partial b} = \frac{\partial}{\partial b}\sum_{i=1}^{n}\left[\ln\!\left(\frac{1}{\sigma_i\sqrt{2\pi}}\right) - \frac{(y_i-a-bx_i)^2}{2\sigma_i^2}\right] = \sum_{i=1}^{n}\frac{x_i\,(y_i-a-bx_i)}{\sigma_i^2} = 0$
We have to solve the two equations for the two unknowns, a and b .
We can get an exact solution since these equations are linear in a and b.
$a = \frac{\displaystyle\sum_{i=1}^{n}\frac{y_i}{\sigma_i^2}\sum_{i=1}^{n}\frac{x_i^2}{\sigma_i^2} - \sum_{i=1}^{n}\frac{y_i x_i}{\sigma_i^2}\sum_{i=1}^{n}\frac{x_i}{\sigma_i^2}}{\displaystyle\sum_{i=1}^{n}\frac{1}{\sigma_i^2}\sum_{i=1}^{n}\frac{x_i^2}{\sigma_i^2} - \left(\sum_{i=1}^{n}\frac{x_i}{\sigma_i^2}\right)^{2}}
\qquad\text{and}\qquad
b = \frac{\displaystyle\sum_{i=1}^{n}\frac{1}{\sigma_i^2}\sum_{i=1}^{n}\frac{x_i y_i}{\sigma_i^2} - \sum_{i=1}^{n}\frac{y_i}{\sigma_i^2}\sum_{i=1}^{n}\frac{x_i}{\sigma_i^2}}{\displaystyle\sum_{i=1}^{n}\frac{1}{\sigma_i^2}\sum_{i=1}^{n}\frac{x_i^2}{\sigma_i^2} - \left(\sum_{i=1}^{n}\frac{x_i}{\sigma_i^2}\right)^{2}}$
Chi-Square (χ²) Distribution
Chi-square (χ²) distribution:
Assume that our measurements (the xᵢ's) come from a gaussian pdf with mean = μ.
Define a statistic called chi-square:
$\chi^2 = \sum_{i=1}^{n}\frac{(x_i-\mu)^2}{\sigma_i^2}$
It can be shown that the pdf for χ² is:
$p(\chi^2,n) = \frac{1}{2^{n/2}\,\Gamma(n/2)}\left[\chi^2\right]^{n/2-1} e^{-\chi^2/2}, \qquad 0 \le \chi^2 \le \infty$
This is a continuous pdf.
It is a function of two variables, χ² and n = number of degrees of freedom. (Γ = "Gamma Function")
A few words about the number of degrees of freedom n:
n = # data points − # of parameters calculated from the data points
Reminder: If you collected N events in an experiment and you histogram
your data in n bins before performing the fit, then you have n data points!
(For n ≥ 20, P(χ² > y) can be approximated using a gaussian pdf with y = (2χ²)^(1/2) − (2n−1)^(1/2).)
(Figure: χ² distribution for different numbers of degrees of freedom ν.)
EXAMPLE: You count cosmic ray events in 15 second intervals and sort the data into 5 bins:
number of intervals with 0 cosmic rays: 2
number of intervals with 1 cosmic ray: 7
number of intervals with 2 cosmic rays: 6
number of intervals with 3 cosmic rays: 3
number of intervals with 4 cosmic rays: 2
Although there were 36 cosmic rays in your sample you have only 5 data points.
RULE of THUMB: A good fit has χ²/DOF ≈ 1.
EXAMPLE: We have 10 data points with μ and σ the mean and standard deviation of the data set.
If we calculate μ and σ from the 10 data points then n = 8.
If we know μ and calculate σ, OR if we know σ and calculate μ, then n = 9.
If we know μ and σ then n = 10.
MLM, Chi-Square, and Least Squares Fitting
Assume we have n data points of the form (yᵢ, σᵢ) and we believe a functional
relationship exists between the points:
y = f(x, a, b, ...)
In addition, assume we know (exactly) the xᵢ that goes with each yᵢ.
We wish to determine the parameters a, b, ...
A common procedure is to minimize the following χ² with respect to the parameters:
$\chi^2 = \sum_{i=1}^{n}\frac{\left[y_i - f(x_i,a,b,\ldots)\right]^2}{\sigma_i^2}$
If the yᵢ's are from a gaussian pdf then minimizing the χ² is equivalent to the MLM.
However, often the yᵢ's are NOT from a gaussian pdf.
In these instances we call this technique "χ² fitting" or "Least Squares Fitting".
Strictly speaking, we can only use a χ² probability table when y is from a gaussian pdf.
However, there are many instances where even for non-gaussian pdf's the above sum approximates a χ² pdf.
From a common sense point of view, minimizing the above sum makes sense
regardless of the underlying pdf.
Least Squares Fitting
Example: Leo's 4.8 (p. 107). The following data from a radioactive source
were taken at 15 s intervals. Determine the lifetime (τ) of the source.
The pdf that describes radioactivity (or the decay of a charmed particle) is:
$N(t) = N(0)\,e^{-t/\tau}$
(Technically the pdf is |dN(t)/(N(0)dt)| = N(t)/(N(0)τ).)
As written the above pdf is not linear in τ. We can turn this into a linear problem by
taking the natural log of both sides:
$\ln N(t) = \ln N(0) - t/\tau \;\Rightarrow\; y = C + Dt$
We can now use the methods of linear least squares to find D and then τ.
In doing the LSQ fit what do we use to weight the data points?
The fluctuations in each bin are governed by Poisson statistics: σᵢ² = Nᵢ.
However, in this problem the fitting variable is lnN, so we must use propagation of errors
to transform the variances of N into the variances of lnN:
$\sigma_y^2 = \sigma_N^2\left(\frac{\partial y}{\partial N}\right)^2 = N\left(\frac{\partial \ln N}{\partial N}\right)^2 = N\left(\frac{1}{N}\right)^2 = \frac{1}{N}$
(Leo has a "1" here.)
 i    t_i    N_i    y_i = ln N_i
 1      0    106    4.663
 2     15     80    4.382
 3     30     98    4.585
 4     45     75    4.317
 5     60     74    4.304
 6     75     73    4.290
 7     90     49    3.892
 8    105     38    3.638
 9    120     37    3.611
10    135     22    3.091
Least Squares Fitting
The slope of the line is given by:
$D = \frac{\displaystyle\sum_{i=1}^{n}\frac{1}{\sigma_i^2}\sum_{i=1}^{n}\frac{t_i y_i}{\sigma_i^2} - \sum_{i=1}^{n}\frac{y_i}{\sigma_i^2}\sum_{i=1}^{n}\frac{t_i}{\sigma_i^2}}{\displaystyle\sum_{i=1}^{n}\frac{1}{\sigma_i^2}\sum_{i=1}^{n}\frac{t_i^2}{\sigma_i^2} - \left(\sum_{i=1}^{n}\frac{t_i}{\sigma_i^2}\right)^{2}} = \frac{652\times132800 - 2780.3\times33240}{652\times2684700 - (33240)^2} = -0.00903$
Thus the lifetime τ = −1/D = 110.7 s.
The error in the lifetime is:
$\sigma_D^2 = \frac{\displaystyle\sum_{i=1}^{n}\frac{1}{\sigma_i^2}}{\displaystyle\sum_{i=1}^{n}\frac{1}{\sigma_i^2}\sum_{i=1}^{n}\frac{t_i^2}{\sigma_i^2} - \left(\sum_{i=1}^{n}\frac{t_i}{\sigma_i^2}\right)^{2}} = \frac{652}{652\times2684700 - (33240)^2} = 1.01\times10^{-6}$
$\sigma_\tau = \sigma_D\left|\frac{\partial \tau}{\partial D}\right| = \frac{\sigma_D}{D^2} = \frac{1.005\times10^{-3}}{(9.03\times10^{-3})^2} = 12.3\ \text{s}$
τ = 110.7 ± 12.3 s.
(Figure: the data points and the line of "best fit", y = lnN vs. t.)
Caution: Leo has a factor of ½ in his error matrix (V⁻¹)ᵢⱼ, Eq. 4.72. He minimizes:
$S = \sum_i\left[\frac{y_i - f(x_i,a,b,\ldots)}{\sigma_i}\right]^2$
Using the MLM we minimized:
$-\ln L = \sum_i\frac{\left(y_i - f(x_i,a,b,\ldots)\right)^2}{2\sigma_i^2}$
Note: fitting without weighting yields τ = 96.8 s.
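A minimal Python sketch that reproduces this weighted least-squares fit from the table above (weights 1/σᵢ² = Nᵢ):

```python
import math

t = [0, 15, 30, 45, 60, 75, 90, 105, 120, 135]
N = [106, 80, 98, 75, 74, 73, 49, 38, 37, 22]
y = [math.log(n) for n in N]
w = list(N)   # weights 1/sigma_y^2 = N_i

S = sum(w)
St = sum(wi * ti for wi, ti in zip(w, t))
Stt = sum(wi * ti * ti for wi, ti in zip(w, t))
Sy = sum(wi * yi for wi, yi in zip(w, y))
Sty = sum(wi * ti * yi for wi, ti, yi in zip(w, t, y))

delta = S * Stt - St * St
D = (S * Sty - Sy * St) / delta     # fitted slope
sigma_D = math.sqrt(S / delta)

tau = -1.0 / D
sigma_tau = sigma_D / D**2
print(f"tau = {tau:.1f} +- {sigma_tau:.1f} s")   # ~110.7 +- 12.3 s
```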
Hypothesis testing
The goal of hypothesis testing is to set up a procedure(s) to allow us to decide
if a model is acceptable in light of our experimental observations.
Example: A theory predicts a branching ratio BR(B) = 2×10⁻⁵ and you measure (4 ± 2)×10⁻⁵.
The hypothesis we want to test is “are experiment and theory consistent?”
Hypothesis testing does not have to compare theory and experiment.
Example: CLEO measures the Λc lifetime to be (180 ± 7) fs while SELEX measures (198 ± 7) fs.
The hypothesis we want to test is “are the lifetime results from CLEO and SELEX consistent?”
There are two types of hypotheses tests: parametric and non-parametric
Parametric: compares the values of parameters (e.g. does the mass of proton = mass of electron ?)
Non-parametric: deals with the shape of a distribution (e.g. is angular distribution consistent with being flat?)
Consider the case of neutron decay. Suppose we have two
theories that both predict the energy spectrum of the
electron emitted in the decay of the neutron. Here a
parametric test might not be able to distinguish between
the two theories since both theories might predict the
same average energy of the emitted electron.
However a non-parametric test would be able to
distinguish between the two theories as the shape of the
energy spectrum differs for each theory.
Hypothesis testing
A procedure for using hypothesis testing:
a) Measure (or calculate) something
b) Find something that you wish to compare with your measurement (theory, experiment)
c) Form a hypothesis (e.g. my measurement is consistent with the PDG value)
d) Calculate the confidence level that the hypothesis is true
e) Accept or reject the hypothesis depending on some minimum acceptable confidence level
Problems with the above procedure:
a) What is a confidence level?
b) How do you calculate a confidence level?
c) What is an acceptable confidence level?
How would we test the hypothesis "the space shuttle is safe"?
Is 1 explosion per 10 launches safe? Or 1 explosion per 1000 launches?
A working definition of the confidence level:
The probability of the event happening by chance.
Example: Suppose we measure some quantity (X) and we know that it is described by a gaussian
pdf with μ = 0 and σ = 1. What is the confidence level for measuring X ≥ 2 (i.e. ≥ 2σ from the mean)?
$P(X \ge 2) = \int_{2}^{\infty} P(\mu,\sigma,x)\,dx = \int_{2}^{\infty} P(0,1,x)\,dx = \frac{1}{\sqrt{2\pi}}\int_{2}^{\infty} e^{-x^2/2}\,dx = 0.025$
Thus we would say that the confidence level for measuring X ≥ 2 is 0.025 or 2.5%,
and we would expect to get a value of X ≥ 2 one out of 40 tries if the underlying pdf is gaussian.
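A minimal Python check of this 2.5% number using the error function:

```python
from math import erf, sqrt

def upper_tail(z):
    """P(X >= z) for a unit gaussian (mu = 0, sigma = 1)."""
    return 0.5 * (1.0 - erf(z / sqrt(2.0)))

print(f"P(X >= 2) = {upper_tail(2.0):.3f}")   # ~0.023, i.e. roughly 2.5%, about 1 in 40
```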
Hypothesis testing
A few cautions about using confidence limits
a) You must know the underlying pdf to calculate the limits.
Example: suppose we have a scale of known accuracy (σ = 10 gm) and we weigh
something to be 20 gm. Assuming a gaussian pdf we could calculate a 2.5% chance that
our object weighs ≤ 0 gm?? We must make sure that the probability distribution is
defined in the region where we are trying to extract information.
b) What does a confidence level really mean? Classical vs Bayesian viewpoints.
Example: Suppose we measure a value x for the mean of a Gaussian distribution with
an unknown mean μ. Suppose we know the standard deviation (σ) of the distribution. It
is tempting to say:
"The probability that μ lies in the interval [x−2σ, x+2σ] is 95%."
However, according to Classical probability this is a meaningless statement! By definition
the mean (μ) is a constant, not a random variable, thus μ does not have a probability
distribution associated with it! What we can say is that we will reject any value of μ that
gives a probability of ≤ 5% of obtaining our (measured) value of x. Here we are assuming
that we are really measuring μ. But how do we really know what we are measuring?
Hypothesis testing
Hypothesis testing for gaussian variables:
We wish to test if a quantity we have measured (μ = average of n measurements)
is consistent with a known mean (μ₀).
Test: μ = μ₀;  Conditions: σ² known;  Test statistic: $\frac{\mu - \mu_0}{\sigma/\sqrt{n}}$;  Test distribution: Gaussian, mean 0, σ = 1.
Test: μ = μ₀;  Conditions: σ² unknown;  Test statistic: $\frac{\mu - \mu_0}{s/\sqrt{n}}$;  Test distribution: t(n − 1).
In the above chart t(n−1) stands for the "t-distribution" with n−1 degrees of freedom.
Example: Do free quarks exist? Quarks are nature's fundamental building blocks and are
thought to have electric charge (|q|) of either (1/3)e or (2/3)e (e = charge of electron). Suppose
we do an experiment to look for |q| = 1/3 quarks.
We measure: q = 0.90 ± 0.2. This gives μ = 0.9 and σ = 0.2.
Quark theory: q = 0.33. This is μ₀.
We want to test the hypothesis μ = μ₀ when σ is known. Thus we use the first line in the table.
$z = \frac{\mu - \mu_0}{\sigma/\sqrt{n}} = \frac{0.9 - 0.33}{0.2/\sqrt{1}} = 2.85$
We want to calculate the probability of getting a z ≥ 2.85, assuming a Gaussian pdf:
$\text{prob}(z \ge 2.85) = \int_{2.85}^{\infty} P(\mu,\sigma,x)\,dx = \int_{2.85}^{\infty} P(0,1,x)\,dx = \frac{1}{\sqrt{2\pi}}\int_{2.85}^{\infty} e^{-x^2/2}\,dx = 0.002$
The CL here is just 0.2 %! What we are saying here is that if we repeated our experiment 1000
times then the results of 2 of the experiments would measure a value q  0.9 if the true mean
was q = 1/3. This is not strong evidence for q = 1/3 quarks!
Hypothesis testing
Do charge 2/3 quarks exist?
If instead of q = 1/3 quarks we tested for q = 2/3 what would we get for the CL?
Now we have μ = 0.9 and σ = 0.2 as before, but μ₀ = 2/3.
We now have z = 1.17 and prob(z ≥ 1.17) = 0.13, so the CL = 13%.
Now free quarks are starting to get believable!
Another variation of the quark problem
Suppose we have 3 measurements of the charge q:
q1 = 1.1, q2 = 0.7, and q3 = 0.9
We don't know the variance beforehand so we must determine the variance from our data.
Thus we use the second test in the table.
$\mu = (q_1 + q_2 + q_3)/3 = 0.9$
$s^2 = \frac{\sum_{i=1}^{n}(q_i-\mu)^2}{n-1} = \frac{(0.2)^2 + (-0.2)^2 + 0^2}{2} = 0.04$
$z = \frac{\mu - \mu_0}{s/\sqrt{n}} = \frac{0.9 - 0.33}{0.2/\sqrt{3}} = 4.94$
In this problem z is described by Student’s t-distribution.
Note: Student is the pseudonym of statistician W.S. Gosset who was employed by a famous English brewery.
Just like the Gaussian pdf, in order to evaluate the t-distribution one must resort to a look up table (see
for example Table 7.2 of Barlow).
In this problem we want prob(z ≥ 4.94) when n − 1 = 2. The probability of z ≥ 4.94 is ≈ 0.02.
This is about 10× greater than the 1st part of this example where we knew the variance ahead of time.
Hypothesis testing
Tests when both means are unknown but come from a gaussian pdf:
Test: μ₁ − μ₂ = 0;  Conditions: σ₁² and σ₂² known;  Test statistic: $\frac{\mu_1-\mu_2}{\sqrt{\sigma_1^2/n + \sigma_2^2/m}}$;  Test distribution: Gaussian, mean 0, σ = 1.
Test: μ₁ − μ₂ = 0;  Conditions: σ₁² = σ₂² = σ², unknown;  Test statistic: $\frac{\mu_1-\mu_2}{Q\sqrt{1/n + 1/m}}$ with $Q^2 = \frac{(n-1)s_1^2 + (m-1)s_2^2}{n+m-2}$;  Test distribution: t(n + m − 2).
Test: μ₁ − μ₂ = 0;  Conditions: σ₁² ≠ σ₂², unknown;  Test statistic: $\frac{\mu_1-\mu_2}{\sqrt{s_1^2/n + s_2^2/m}}$;  Test distribution: approximately Gaussian, mean 0, σ = 1.
Here n and m are the number of measurements for each mean.
Example: Do two experiments agree with each other?
CLEO measures the Λc lifetime to be (180 ± 7) fs while SELEX measures (198 ± 7) fs.
$z = \frac{\mu_1-\mu_2}{\sqrt{\sigma_1^2/n + \sigma_2^2/m}} = \frac{198 - 180}{\sqrt{(7)^2 + (7)^2}} = 1.82$
$P(|z| \ge 1.82) = 1 - \int_{-1.82}^{1.82} P(\mu,\sigma,x)\,dx = 1 - \int_{-1.82}^{1.82} P(0,1,x)\,dx = 1 - \frac{1}{\sqrt{2\pi}}\int_{-1.82}^{1.82} e^{-x^2/2}\,dx = 1 - 0.93 = 0.07$
Thus 7% of the time we should expect the experiments to disagree at this level.
But is this acceptable agreement?
Hypothesis testing
A non-gaussian example: the Poisson distribution.
The following are the numbers of neutrino events detected in 10 second intervals
by the IMB experiment on 23 February 1987, around which time the supernova
SN 1987A was first seen by experimenters:
#events:       0     1    2    3    4    5    6    7    8    9
#intervals: 1042   860  307   78   15    3    0    0    0    1
Assuming the data are described by a Poisson distribution, calculate the average number of events expected in an interval:
$\mu = \frac{\sum_{i=0}^{8}(\#\text{events})_i\,(\#\text{intervals})_i}{\sum_{i=0}^{8}(\#\text{intervals})_i} = 0.774$
(= 0.777 if we include the interval with 9 events)
We can calculate a χ² assuming the data are described by a Poisson distribution.
The predicted number of intervals is given by:
$\#\text{intervals}_{\text{pred}} = \left(\sum\#\text{intervals}\right)\frac{e^{-\mu}\mu^{n}}{n!}$
$\chi^2 = \sum_{i=0}^{8}\frac{(\#\text{intervals}_i - \text{prediction}_i)^2}{\text{prediction}_i} = 3.6$
(Note: we use σ² = prediction for a Poisson.)
#events:                 0     1    2    3    4    5     6      7       8        9
predicted #intervals: 1064   823  318   82   16    2   0.3   0.03   0.003   0.0003
There are 7 (= 9 − 2) DOF's here and the probability of χ²/DOF = 3.6/7 is high (≈ 80%), indicating a good fit to a Poisson.
However, if the last data point is included:
$\chi^2 = \sum_{i=0}^{9}\frac{(\#\text{intervals}_i - \text{prediction}_i)^2}{\text{prediction}_i} = 3335 \quad\text{and}\quad \chi^2/\text{DOF} = 3335/8 = 417$
The probability of getting a χ²/DOF this large from a Poisson distribution with μ = 0.774 is ≈ 0.
Hence the nine events are most likely coming from the supernova explosion and not just from a Poisson process.
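A minimal Python sketch of this χ² test using the interval counts tabulated above (the exact χ² value depends slightly on the rounding of μ):

```python
from math import exp, factorial

counts = list(range(10))                             # events per 10 s interval
intervals = [1042, 860, 307, 78, 15, 3, 0, 0, 0, 1]  # observed number of intervals

# Mean number of events per interval, using the first 9 bins as in the notes
mu = sum(c * n for c, n in zip(counts[:9], intervals[:9])) / sum(intervals[:9])
print(f"mu = {mu:.3f}")

def chi2(n_bins):
    total = sum(intervals[:n_bins])
    out = 0.0
    for c, obs in zip(counts[:n_bins], intervals[:n_bins]):
        pred = total * exp(-mu) * mu**c / factorial(c)
        out += (obs - pred) ** 2 / pred    # sigma^2 = prediction for a Poisson
    return out

print(f"chi2 without the 9-event interval   = {chi2(9):.1f}")   # a few units for 7 DOF: good fit
print(f"chi2 including the 9-event interval = {chi2(10):.0f}")  # huge: not just a Poisson fluctuation
```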
Confidence Intervals
Confidence intervals (CI) are related to confidence limits (CL).
To calculate a CI we assume a CL and find the values of the parameters that give us the CL.
Caution
CI’s are not always uniquely defined.
We usually seek the minimum interval or symmetric interval.
Example: Assume we have a gaussian pdf with μ = 3 and σ = 1. What is the 68% CI?
We need to solve the following equation:
$0.68 = \int_a^b G(x,3,1)\,dx$
Here G(x,3,1) is the gaussian pdf with μ = 3 and σ = 1.
There are infinitely many solutions to the above equation.
We seek the solution that is symmetric about the mean (μ):
$0.68 = \int_{\mu-c}^{\mu+c} G(x,3,1)\,dx$
To solve this problem we either need a probability table, or remember that 68% of the
area of a gaussian is within ±σ of the mean.
Thus for this problem the 68% CI is [2, 4].
Example: Assume we have a gaussian pdf with μ = 3 and σ = 1.
What is the one-sided upper 90% CI?
Now we want to find the c that satisfies:
$0.9 = \int_{-\infty}^{c} G(x,3,1)\,dx$
Using a table of gaussian probabilities we find 90% of the area in the interval [−∞, μ + 1.28σ].
Thus for this problem the 90% CI is [−∞, 4.28].
Confidence Intervals
Suppose an experiment is looking for the X particle but observes no candidate events.
What can we say about the average number of X particles expected to have been produced?
First, we need to pick a pdf. Since events are discrete we need a discrete pdf: the Poisson.
Next, how unlucky do you want to be? It is common to pick 10% of the time to be unlucky.
We can now re-state the question as:
"Suppose an experiment finds zero candidate events. What is the 90% CL upper limit on the
average number of events (μ) expected assuming a Poisson pdf?"
Thus we need to solve for μ in the following equation:
$CL = 0.9 = \sum_{n=1}^{\infty}\frac{e^{-\mu}\mu^{n}}{n!}$
In practice it is much easier to solve for 1 − CL:
$1 - CL = 1 - \sum_{n=1}^{\infty}\frac{e^{-\mu}\mu^{n}}{n!} = \sum_{n=0}^{0}\frac{e^{-\mu}\mu^{n}}{n!} = e^{-\mu} \;\Rightarrow\; \mu = -\ln(1-CL)$
So, if μ = 2.3 then 10% of the time we should expect to find 0 candidates. There was nothing wrong
with our experiment; we were just unlucky.
For our example, CL = 0.9 and therefore μ = 2.3 events.
Example: Suppose an experiment finds one candidate event. What is the 95% CL upper limit
on the average number of events (μ)?
$1 - CL = 1 - \sum_{n=2}^{\infty}\frac{e^{-\mu}\mu^{n}}{n!} = \sum_{n=0}^{1}\frac{e^{-\mu}\mu^{n}}{n!} = e^{-\mu} + \mu e^{-\mu} \;\Rightarrow\; \mu = 4.74$
(The 5% includes 1 AND 0 events.)
Here we are saying that we would get 2 or more events 95% of the time if μ = 4.74.
The PDG 1994 has a good table (17.3, p. 1280) for these types of problems.
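A minimal Python sketch that solves for these Poisson upper limits numerically (a simple step search; the step size is an arbitrary choice):

```python
from math import exp, factorial

def poisson_upper_limit(n_observed, cl=0.90, step=1e-4):
    """Smallest mu with P(N <= n_observed | mu) <= 1 - CL (classical Poisson upper limit)."""
    mu = 0.0
    while True:
        mu += step
        p_le = sum(exp(-mu) * mu**k / factorial(k) for k in range(n_observed + 1))
        if p_le <= 1.0 - cl:
            return mu

print(f"0 events, 90% CL upper limit: mu = {poisson_upper_limit(0, 0.90):.2f}")   # ~2.30
print(f"1 event,  95% CL upper limit: mu = {poisson_upper_limit(1, 0.95):.2f}")   # ~4.74
```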
Maximum Likelihood Method Example
Example: Exponential decay.
pdf: $f(t,\tau) = e^{-t/\tau}/\tau$
$L = \prod_{i=1}^{n} e^{-t_i/\tau}/\tau \quad\text{and}\quad \ln L = -n\ln\tau - \sum_{i=1}^{n} t_i/\tau$
Generate events according to an exponential distribution with τ₀ = 100.
Calculate lnL vs. τ, find the maximum of lnL, and find the points where lnL = lnL_max − 1/2 (the "1σ points").
(Figure: log-likelihood function for 10 events: lnL is maximum at τ = 189, with 1σ points at (140, 265); L is not gaussian.
Log-likelihood function for 10⁴ events: lnL is maximum at τ = 100.8, with 1σ points at (99.8, 101.8); L is well fit by a gaussian.)
Maximum Likelihood Method Example
How do we calculate confidence intervals for our MLM example?
For the case of 10⁴ events we can just use gaussian statistics since the likelihood
function is, to a very good approximation, gaussian. Thus the "1σ points" will give
us 68% of the area under the gaussian curve, the "2σ points" ~95% of the area, etc.
Unfortunately, the likelihood function for the 10 event case is NOT
approximated by a gaussian. So the "1σ points" do not necessarily give you
68% of the area under the curve.
In this case we can calculate a confidence interval about the mean using a Monte
Carlo calculation as follows:
1) Generate a large number (e.g. 10⁷) of 10 event samples, each sample having a mean
lifetime equal to our original 10 event sample (τ* = 189).
2) For each 10 event sample calculate the maximum of the log-likelihood function (= τᵢ).
3) Make a histogram of the τᵢ's. This histogram is the pdf for τ.
4) To calculate an X% confidence interval about the mean, find the region where
X%/2 of the area is in the region [τ_L, τ*] and X%/2 is in the region [τ*, τ_H].
NOTE: since the pdf may not be symmetric around its mean, we may not be able
to find equal area regions below and above the mean.
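A minimal Python sketch of this Monte Carlo recipe (far fewer toy samples than the 10⁷ used in the notes, so it runs quickly; for an exponential pdf the ML estimate of τ is just the sample mean, which keeps the toy fit trivial):

```python
import numpy as np

rng = np.random.default_rng(4)
tau_star, n_events, n_toys = 189.0, 10, 200_000

# Each toy experiment: generate 10 decay times with mean tau*, take the ML estimate (= sample mean)
tau_hat = rng.exponential(scale=tau_star, size=(n_toys, n_events)).mean(axis=1)

frac_below = np.mean(tau_hat <= tau_star)
lo, hi = np.quantile(tau_hat, [frac_below - 0.34, frac_below + 0.34])
print(f"fraction of toys below tau* = {frac_below:.3f}")        # ~0.55
print(f"'+-1 sigma' region about tau*: [{lo:.0f}, {hi:.0f}]")   # roughly [140, 260]
```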
Maximum Likelihood Method Example
(Figure: histogram, on linear and semi-log scales, of the fitted τ from 10⁷ ten-event samples, each generated with τ* = 189.)
By counting events (i.e. integrating) in an interval around τ*, the histogram (actually, I printed out
the number of events in one unit steps from 0 to 650) gives the following:
54.9% of the area is in the region (0 ≤ τ ≤ 189)
"±1σ region": 34% of the area in each of the regions (139 ≤ τ ≤ 189) and (189 ≤ τ ≤ 263) (very close to the likelihood result)
90% CI region: 45% of the area in each of the regions (117 ≤ τ ≤ 189) and (189 ≤ τ ≤ 421)
The upper 95% region (i.e. 47.5% of the area above the mean) is not defined.
NOTE: the variance of an exponential distribution can be calculated analytically:
$\sigma_\tau^2 = \frac{1}{n}\int_0^{\infty}(t-\tau)^2\, e^{-t/\tau}\,\frac{dt}{\tau} = \frac{\tau^2}{n}$
Thus for the 10 event sample we expect σ = 60, not too far off from the 68% CI!
For the 10⁴ event sample, the CI's from the ML estimate of σ and the analytic σ
are essentially identical (both give σ = 1.01).
Confidence Regions
Often we have a problem that involves two or more parameters. In these instances
it makes sense to define confidence regions rather than an interval.
Consider the case where we are doing a MLM fit to two variables a, b.
Previously we have seen that for large samples the Likelihood function becomes “gaussian”:
$\ln L(a) \approx \ln L_{max} - \frac{(a-a^*)^2}{2\sigma_a^2} \quad\text{or}\quad \ln L(a^* \pm n\,\sigma_a) \approx \ln L_{max} - \frac{n^2}{2}$
We can generalize this to two correlated variables a, b:
$\ln L(a,b) = \ln L_{max} - \frac{1}{2(1-\rho^2)}\left[\frac{(a-a^*)^2}{\sigma_a^2} + \frac{(b-b^*)^2}{\sigma_b^2} - \frac{2\rho\,(a-a^*)(b-b^*)}{\sigma_a\sigma_b}\right] \quad\text{with}\quad \rho = \frac{\sigma_{ab}}{\sigma_a\sigma_b}$
The contours of constant probability are given by:
$\frac{1}{1-\rho^2}\left[\frac{(a-a^*)^2}{\sigma_a^2} + \frac{(b-b^*)^2}{\sigma_b^2} - \frac{2\rho\,(a-a^*)(b-b^*)}{\sigma_a\sigma_b}\right] = Q$
Q = 1 contains 39% of the area
Q = 2.3 contains 68% of the area
Q = 4.6 contains 90% of the area
Q = 6.2 contains 95% of the area
Q = 9.2 contains 99% of the area
The integral of the χ² pdf with 2 DOF's can be done analytically:
$P(q \le Q) = \frac{1}{2}\int_0^{Q} e^{-q/2}\,dq = 1 - e^{-Q/2}$
Confidence Regions
Example: The CLEO experiment did a maximum likelihood analysis to search
for B→π⁺π⁻ and B→K⁺π⁻ events. The results of the MLM fit are:
N_π⁺π⁻ = 16 ± 4, N_K⁺π⁻ = 25 ± 5, ρ = 0.5 (warning: these are made up numbers!)
N_π⁺π⁻ and N_K⁺π⁻ are highly correlated, since at high momentum (> 2 GeV) CLEO
has a hard time separating π's and K's.
The contours of constant probability are given by:
$\frac{1}{1-(0.5)^2}\left[\frac{(N_{\pi\pi}-16)^2}{4^2} + \frac{(N_{K\pi}-25)^2}{5^2} - \frac{2(0.5)(N_{\pi\pi}-16)(N_{K\pi}-25)}{4\cdot 5}\right] = Q$
(Figure: contours in the (N_K⁺π⁻, N_π⁺π⁻) plane at Q = 1 (39%), Q = 2.3 (68%), Q = 6.2 (95%), and Q = 9.2 (99%).)