Statistics
We collect a sample of data; what do we do with it?
- Estimate parameters (possibly of some model)
- Test whether a particular theory is consistent with our data (hypothesis testing)
Statistics is a set of tools that allows us to achieve these goals.
Statistics
- Preliminaries: an estimator $\hat{\theta}$ is a function of the data whose value, the estimate, is intended as a meaningful guess of the true value of the parameter $\theta$
  - There is no fundamental rule for doing this
- One thing we desire is consistency:

$$\lim_{n \to \infty} \hat{\theta} = \theta$$

- Another thing we desire is no bias:

$$b = E[\hat{\theta}] - \theta = 0$$
Statistics
- Some common estimators are for the mean and variance. Consider $N$ independent measurements $x_i$ of unknown $\mu$ and unknown $\sigma^2$:

$$\hat{\mu} = \frac{1}{N}\sum_{i=1}^{N} x_i \qquad \hat{\sigma}^2 = s^2 = \frac{1}{N-1}\sum_{i=1}^{N}(x_i - \hat{\mu})^2$$

- Furthermore,

$$V[\hat{\mu}] = \frac{\sigma^2}{N} \qquad V[\hat{\sigma}^2] = \frac{1}{N}\left(m_4 - \frac{N-3}{N-1}\,\sigma^4\right)$$
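As an illustration (not from the slides; Python with NumPy assumed, and the data are made up), a minimal sketch of these estimators:

  import numpy as np

  def estimators(x):
      """Sample mean and unbiased sample variance of the measurements x."""
      N = len(x)
      mu_hat = x.sum() / N                       # (1/N) * sum of x_i
      s2 = ((x - mu_hat) ** 2).sum() / (N - 1)   # unbiased estimate of sigma^2
      var_mu_hat = s2 / N                        # estimated V[mu_hat] = sigma^2 / N
      return mu_hat, s2, var_mu_hat

  x = np.random.default_rng(0).normal(5.0, 2.0, size=1000)  # hypothetical data
  print(estimators(x))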
χ² Distribution
- A common situation is that you have a set of measurements $x_i$ and you know the true value $x_i^t$ of each
  - How good are our measurements?
- Similarly, you may be comparing a histogram of data with another that contains expectation values under some hypothesis
  - How well do the data agree with this hypothesis?
- Or, if parameters of a function were estimated using the method of least squares, a minimum value of χ² was obtained
  - How good was the fit?
χ² Distribution
- Assuming
  - the measurements are independent of each other, and
  - the measurements come from a Gaussian distribution,
- one can use the "goodness-of-fit" statistic χ² to answer these questions:

$$\chi^2 = \sum_{i=1}^{n} \frac{(x_i - x_i^t)^2}{\sigma_i^2}$$

- In the case of Poisson distributed numbers, $\sigma_i^2 = x_i^t$, and this is called Pearson's χ² statistic
χ² Distribution
- The chi-square distribution is

$$f(z; n) = \frac{1}{2^{n/2}\,\Gamma(n/2)}\, z^{n/2-1} e^{-z/2} \qquad (z \ge 0)$$

  where $n = 1, 2, \ldots$ is the number of degrees of freedom, with $E[z] = n$ and $V[z] = 2n$
- The usefulness of this pdf is that for $n$ independent $x_i$ with means $\mu_i$ and variances $\sigma_i^2$,

$$z = \sum_{i=1}^{n} \frac{(x_i - \mu_i)^2}{\sigma_i^2}$$

  follows the χ² distribution with $n$ d.o.f.
χ² Distribution
[Figure not transcribed]
χ² Distribution
- The integrals (or cumulative distributions) between arbitrary points for both the Gaussian and χ² distributions cannot be evaluated analytically and must be looked up
  - What is the probability of getting a χ² > 10 with 4 degrees of freedom?
  - This number tells you the probability that random (chance) fluctuations in the data would give a value of χ² > 10
χ² Distribution
- Note the p-value is defined as

$$p = \int_{\chi^2}^{\infty} f(z; n)\, dz$$

- We'll come back to p-values in a moment
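For instance, the question above (χ² > 10 with 4 degrees of freedom) is exactly this integral. A quick numerical check (this sketch assumes SciPy, which the slides do not use):

  from scipy import stats

  p = stats.chi2.sf(10.0, df=4)   # upper-tail integral of the chi-square pdf
  print(p)                        # ~0.040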
χ² Distribution
[Figure: 1 − cumulative χ² distribution, not transcribed]
χ² Distribution
- Often one uses the reduced χ² = χ²/n; since E[χ²] = n, values near 1 indicate a reasonable fit
Hypothesis Testing
- Hypothesis tests provide a rule for accepting or rejecting hypotheses depending on the outcome of a measurement
  - Suppose H predicts $f(x|H)$ for some vector of data $x$
  - We do an experiment and measure $x_{obs}$
  - What can we say about H?
Hypothesis Testing
- Normally we define regions in x-space where the data are compatible with H and regions where they are not
Hypothesis Testing
- Let's say there is just one hypothesis H
- We can define some test statistic t whose value in some way reflects the level of agreement between the data and the hypothesis
- We can quantify the goodness-of-fit by specifying a p-value given an observed $t_{obs}$ in the experiment:

$$p = \int_{t_{obs}}^{\infty} g(t|H)\, dt$$

  - This assumes t is defined such that large values correspond to poor agreement with the hypothesis
  - g is the pdf for t
Hypothesis Testing
- Notes
  - p is not the significance level of the test
  - p is not the confidence level of a confidence interval
  - p is not the probability that H is true
    - That's Bayesian speak
  - p is the probability, under the assumption of H, of obtaining data (x or t(x)) having equal or lesser compatibility with H than $x_{obs}$
Hypothesis Testing
- Flip coins:

$$f(n_h; p_h, N) = \frac{N!}{n_h!\,(N - n_h)!}\, p_h^{n_h} (1 - p_h)^{N - n_h}$$

- Hypothesis H: the coin is fair (random), so $p_h = p_t = 0.5$
- We could take $t = |n_h - N/2|$
- Toss the coin N = 20 times and observe $n_h = 17$
  - The region of t-space with equal or lesser compatibility is $t \ge 7$
  - p-value = $P(n_h = 0, 1, 2, 3, 17, 18, 19, 20) = 0.0026$
- Is H false?
  - Don't know
  - We can say that the probability of observing a result at least this extreme (17 or more heads, or 3 or fewer) assuming H is 0.0026
  - p is the probability of observing this result "by chance"
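A numerical check of this p-value (SciPy assumed; not part of the slides):

  from scipy import stats

  N, ph = 20, 0.5
  # t = |n_h - N/2| >= 7 corresponds to n_h <= 3 or n_h >= 17
  p = stats.binom.cdf(3, N, ph) + stats.binom.sf(16, N, ph)
  print(p)   # ~0.0026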
Kolmogorov-Smirnov (K-S) Test
- The K-S test is an alternative to the χ² test when the data sample is small
- It is also more powerful than the χ² test since it does not rely on bins, though one commonly uses it that way
  - A common use is to quantify how well data and Monte Carlo distributions agree
- It also does not depend on the underlying cumulative distribution function being tested
K-S Test
Data – Monte Carlo comparison [figure not transcribed]
K-S Test
- The K-S test is based on the empirical distribution function (ECDF) $F_n(x)$:

$$F_n(x) = \frac{1}{n}\sum_{i=1}^{n} \begin{cases} 1 & \text{if } y_i \le x \\ 0 & \text{otherwise} \end{cases}$$

  for n ordered data points $y_i$
- This is a step function that increases by 1/n at the value of each ordered data point
K-S Test
- The K-S statistic is given by

$$D_n^+ = \max_x\,[F_n(x) - F(x)] \qquad D_n^- = \max_x\,[F(x) - F_n(x)]$$

  where $F(x)$ is the hypothesized distribution
- If D > some critical value obtained from tables, the hypothesis (that the data and theory distributions agree) is rejected
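A minimal unbinned example (SciPy assumed, data made up): scipy.stats.kstest returns the maximum distance between the ECDF and the hypothesized CDF, plus the corresponding p-value.

  import numpy as np
  from scipy import stats

  x = np.random.default_rng(0).normal(size=200)   # hypothetical data sample
  D, p = stats.kstest(x, "norm")                  # D = max distance to the Gaussian CDF
  print(D, p)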
K-S Test
[Figure not transcribed]
Statistics
- Suppose N independent measurements $x_i$ are drawn from a pdf $f(x;\theta)$
- We want to estimate the parameters $\theta$
  - The most important method for doing this is the method of maximum likelihood
  - A related method is that of least squares
Hypothesis Testing
- Example
  - Properties of some selected events
  - Hypothesis H: these are top quark events
- Working in x-space is hard, so usually one constructs a test statistic t instead, whose value reflects the compatibility between the data vector x and H
  - Low t: data more compatible with H
  - High t: data less compatible with H
- Since $f(x|H)$ is known, $g(t|H)$ can be determined
Hypothesis Testing
- Notes
  - p is not the significance level of the test
  - p is not the confidence level of a confidence interval
  - p is not the probability that H is true
    - That's Bayesian speak
  - p is the probability, under the assumption of H, of obtaining data (x or t(x)) having equal or lesser compatibility with H than $x_{obs}$
- Since p is a function of the r.v. x, p itself is an r.v.
  - If H is true, p is uniform in [0,1]
  - If H is not true, p peaks closer to 0
Hypothesis Testing
- Suppose we observe $n_{obs} = n_s + n_b$ events
  - $n_s$, $n_b$ are Poisson r.v.'s with means $\nu_s$, $\nu_b$
  - $n_{obs}$ is then a Poisson r.v. with mean $\nu = \nu_s + \nu_b$:

$$f(n; \nu_s, \nu_b) = \frac{(\nu_s + \nu_b)^n}{n!}\, e^{-(\nu_s + \nu_b)}$$
Hypothesis Testing
- Suppose $\nu_b = 0.5$ and we observe $n_{obs} = 5$
  - Publish / NY Times headline or not?
- Often we take H to be the null hypothesis: assume the result is a random fluctuation of the background
  - Assume $\nu_s = 0$

$$p = P(n \ge n_{obs}) = \sum_{n=n_{obs}}^{\infty} f(n; 0, \nu_b) = 1 - \sum_{n=0}^{n_{obs}-1} \frac{\nu_b^n}{n!}\, e^{-\nu_b} = 1.7 \times 10^{-4}$$

- This is the probability of observing 5 or more events resulting from chance fluctuations of the background
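A one-line numerical check of this tail sum (SciPy assumed):

  from scipy import stats

  p = stats.poisson.sf(4, 0.5)   # P(n >= 5) for mean nu_b = 0.5
  print(p)                       # ~1.7e-4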
Hypothesis Testing
- Another problem: instead of counting events, say we measure some variable x
  - Publish / NY Times headline or not?
[Figure not transcribed]
Hypothesis Testing
- Again take H to be the null hypothesis: assume the result is a random fluctuation of the background
  - Assume $\nu_s = 0$

$$p = P(n \ge 11;\ \nu_b = 3.2,\ \nu_s = 0) = 5.4 \times 10^{-4}$$

- Again, p is the probability of observing 11 or more events resulting from chance fluctuations of the background
  - How did we know where to look / how to bin?
  - Is the observed width consistent with the resolution in x?
  - Would a slightly different analysis still show a peak?
  - What about the fact that the bins on either side of the peak are low?
Least Squares
- Another approach is to compare a histogram with a hypothesis that provides expectation values
  - In this case we'd compare a vector of Poisson distributed numbers $n_i$ (the histogram) with their expectation values $\nu_i = E[n_i]$:

$$\chi^2 = \sum_{i=1}^{N} \frac{(n_i - \nu_i)^2}{\sigma_i^2} = \sum_{i=1}^{N} \frac{(n_i - \nu_i)^2}{\nu_i}$$

  - This is called Pearson's statistic
  - If the $\nu_i$ are not too small (e.g. $\nu_i > 5$), the observed χ² will follow the chi-square pdf for N dof
    - Or, more generally, for N minus the number of fitted parameters
    - The same holds for N independent measurements $y_i$ that are Gaussian distributed
Least Squares
- We can calculate the p-value as

$$p = \int_{\chi^2}^{\infty} f(z; N)\, dz \qquad \text{where } f \text{ is the } \chi^2 \text{ pdf}$$

- Recall that for the χ² pdf, $E[z] = N$, so often χ²/N is taken as the measure of agreement
- In our example: χ² = 29.8 for 20 dof, giving p = 0.073
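The same check as before applies here (SciPy assumed):

  from scipy import stats

  p = stats.chi2.sf(29.8, df=20)
  print(p)   # ~0.073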
Least Squares
- In our example, though, we have many bins with a small number of counts or 0
- We can still use Pearson's test, but we need to determine the pdf f(χ²) by Monte Carlo (see the sketch below):
  - Generate $n_i$ from a Poisson with mean $\nu_i$ in each bin
  - Compute χ² and record it in a histogram
  - Repeat a large number of times (see next slide)
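A rough sketch of this Monte Carlo procedure (Python/NumPy assumed; the expectation values and observed χ² below are hypothetical stand-ins, not the histogram from the example):

  import numpy as np

  rng = np.random.default_rng(0)
  nu = np.array([0.4, 1.1, 2.7, 4.3, 2.0, 0.6])    # hypothetical expectation values

  def pearson_chi2(n, nu):
      return np.sum((n - nu) ** 2 / nu)

  chi2_obs = 14.2                                  # stand-in for the observed chi^2
  sims = np.array([pearson_chi2(rng.poisson(nu), nu) for _ in range(100_000)])
  p = np.mean(sims >= chi2_obs)                    # p-value from the simulated f(chi^2)
  print(p)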
Least Squares
- Using the modified pdf would give p = 0.11 rather than p = 0.073
  - In either case, we won't publish
K-S Test
- Usage in ROOT (the file names here are placeholders; the original snippet left the files unopened):

  TFile *data = TFile::Open("data.root");
  TFile *MC   = TFile::Open("mc.root");
  TH1F *jet_pt   = (TH1F*) data->Get("h_jet_pt");
  TH1F *MCjet_pt = (TH1F*) MC->Get("h_jet_pt");
  Double_t KS = MCjet_pt->KolmogorovTest(jet_pt);

- Notes
  - The returned value is the probability of the test
    - << 1 means the two histograms are not compatible
  - The returned value is not the maximum K-S distance, though you can obtain this with option "M"
- Also available in the statistical toolbox in MATLAB
Limiting Cases
- Binomial
- Poisson
- Gaussian
Nobel Prize or IgNobel Prize?
- CDF result [figure not transcribed]
Kaplan-Meier Curve
- A patient is treated for a disease. What is the probability of an individual surviving or remaining disease-free?
  - Usually patients will be followed for various lengths of time after treatment
  - Some will survive or remain disease-free while others will not. Some will leave the study.
  - A nonparametric estimate can be obtained using a
    - Kaplan-Meier curve
    - Life table
    - Survival curve
Kaplan-Meier Curve
- Calculate a conditional probability:
  - $S(t_N) = P(t_1) \times P(t_2) \times P(t_3) \times \cdots \times P(t_N)$
- The survival function S(t) is the complement of the empirical distribution function F(t)
- We can write this as

$$S(t) = \prod_{j:\, t_j \le t} p_j \qquad p_j = 1 - \frac{d_j}{n_j}$$

  where $d_j$ is the number dying during period j and $n_j$ is the number that have survived to the beginning of period j
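An illustration of the product-limit formula (Python/NumPy assumed; the six subjects below are made up):

  import numpy as np

  def kaplan_meier(times, event):
      """Product-limit estimate of S(t) at each distinct event time.

      times: follow-up time for each subject
      event: 1 if the event (e.g. death) was observed, 0 if censored
      """
      times = np.asarray(times, dtype=float)
      event = np.asarray(event, dtype=bool)
      S, curve = 1.0, []
      for t in np.unique(times[event]):
          n_j = np.sum(times >= t)             # at risk at start of period j
          d_j = np.sum((times == t) & event)   # deaths during period j
          S *= 1.0 - d_j / n_j                 # multiply in p_j = 1 - d_j/n_j
          curve.append((t, S))
      return curve

  # Six subjects, times in months; 0 marks a subject who left the study
  print(kaplan_meier([2, 3, 3, 5, 8, 9], [1, 1, 0, 1, 0, 1]))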
Kaplan-Meier Curve
[Figure not transcribed]
Kaplan-Meier Curve
- The square root of the variance of S(t) can be calculated as

$$\sigma(p_k) = \sqrt{p_k (1 - p_k) / n_k}$$

- Assuming the $p_k$ follow a Gaussian (normal) distribution, the 95% CL interval will be

$$p_k \pm 1.96\, \sigma(p_k)$$
Gaussian Confidence Interval
[Figures not transcribed]
Gaussian Distribution
- Some useful properties of the Gaussian distribution:
  - P(x in range μ ± σ) = 0.683
  - P(x in range μ ± 2σ) = 0.9545
  - P(x in range μ ± 3σ) = 0.9973
  - P(x outside range μ ± 3σ) = 0.0027
  - P(x outside range μ ± 5σ) = 5.7×10⁻⁷
  - P(x in range μ ± 0.6745σ) = 0.5
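These numbers are easy to verify numerically (SciPy assumed):

  from scipy import stats

  for k in (0.6745, 1, 2, 3):
      print(k, 1 - 2 * stats.norm.sf(k))   # P(x within mu +/- k*sigma)
  print(2 * stats.norm.sf(5))              # P(x outside mu +/- 5*sigma), ~5.7e-7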
Gaussian Distribution
[Figure not transcribed]
Confidence Intervals
- Suppose you have a bag of black and white marbles and wish to determine the fraction f that are white. How confident are you of the initial composition? How does your confidence change after extracting n black balls?
- Suppose you are tested for a disease. The test is 100% accurate if you have the disease. The test gives a 0.2% false positive rate if you do not. The test comes back positive. What is the probability that you have the disease? (See the sketch below.)
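The second question cannot be answered without a prior. As an illustration only (the slides give no prevalence; the 1-in-1000 figure below is an assumption), Bayes' theorem gives:

  # P(disease | +) = P(+|D) P(D) / [ P(+|D) P(D) + P(+|no D) P(no D) ]
  prior = 0.001              # assumed prevalence: 1 in 1000
  p_pos_given_d = 1.0        # test is 100% accurate if you have the disease
  p_pos_given_nod = 0.002    # 0.2% false positive rate
  p = p_pos_given_d * prior / (p_pos_given_d * prior + p_pos_given_nod * (1 - prior))
  print(p)   # ~0.33, far from the "99.8%" one might naively guess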
Confidence Intervals
- Suppose you are searching for the Higgs and have a well-known expected background of 3 events. What 90% confidence limit can you set on the Higgs cross section
  - if you observe 0 events?
  - if you observe 3 events?
  - if you observe 10 events?
- The ability to set confidence limits (or claim discovery) is an important part of frontier physics
- How to do this the "correct" way is somewhat/very controversial
Confidence Intervals
- Questions
  - What is the mass of the top quark?
  - What is the mass of the tau neutrino?
  - What is the mass of the Higgs?
- Answers
  - $M_t = 172.5 \pm 2.3$ GeV
  - $M_\nu < 18.2$ MeV
  - $M_H > 114.3$ GeV
- More correct answers
  - $M_t = 172.5 \pm 2.3$ GeV with CL = 0.683
  - $0 < M_\nu < 18.2$ MeV with CL = 0.95
  - $\infty > M_H > 114.3$ GeV with CL = 0.95
Confidence Interval
- A confidence interval reflects the statistical precision of the experiment and quantifies the reliability of a measurement
- For a sufficiently large data sample, the mean and standard deviation of the mean provide a good interval
  - What if the pdf isn't Gaussian?
  - What if there are physical boundaries?
  - What if the data sample is small?
- Here we run into problems
Confidence Interval
- A dog has a 50% probability of being within 100 m of its master
  - You observe the dog; what can you say about its master?
    - With 50% probability, the master is within 100 m of the dog
  - But this assumes
    - the master can be anywhere around the dog
    - the dog has no preferred direction of travel
Confidence Intervals
- Neyman's construction
  - Consider a pdf $f(x;\theta) = P(x|\theta)$
  - For each value of θ, we construct a horizontal line segment $[x_1, x_2]$ such that $P(x \in [x_1, x_2]\,|\,\theta) = 1 - \alpha$
  - The union of such intervals for all values of θ is called the confidence belt
Confidence Intervals
- Neyman's construction
  - After performing an experiment to measure x, a vertical line is drawn through the experimentally measured value $x_0$
  - The confidence interval for θ is the set of all values of θ for which the corresponding line segment $[x_1, x_2]$ is intercepted by the vertical line
Confidence Intervals
[Figure not transcribed]
Confidence Interval
- Notes
  - The coverage condition is not unique
    - $P(x < x_1|\theta) = P(x > x_2|\theta) = \alpha/2$: called central confidence intervals
    - $P(x < x_1|\theta) = \alpha$: called upper confidence limits
    - $P(x > x_2|\theta) = \alpha$: called lower confidence limits
Poisson Confidence Interval
- We previously mentioned that the number of events produced in a reaction with cross section σ and fixed luminosity L follows a Poisson distribution with mean $\nu = \sigma \int L\, dt$
  - $P(n; \nu) = e^{-\nu}\, \nu^n / n!$
  - If the variables are discrete, by convention one constructs the confidence belt by requiring $P(x_1 < x < x_2|\theta) \ge 1 - \alpha$
- Example: measuring the Higgs production cross section assuming no background
Poisson Confidence Interval
[Figure not transcribed]
Poisson Confidence Interval
[Figure: central intervals for the Poisson distribution, $P(n|\mu) = e^{-\mu}\mu^n/n!$; the parameter μ (vertical axis, 0 to 20) vs. the count n (horizontal axis, 0 to 20), with the upper-limit curve u(n) and lower-limit curve l(n) forming the belt]
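A sketch of how the belt curves l(n) and u(n) can be computed by inverting the Poisson tail probabilities (SciPy assumed; illustrative only):

  from scipy import stats, optimize

  def poisson_central_interval(n, alpha=0.10):
      """Central 1-alpha interval [l(n), u(n)] for a Poisson mean, given count n."""
      u = optimize.brentq(lambda mu: stats.poisson.cdf(n, mu) - alpha / 2, 1e-9, 1e3)
      l = 0.0
      if n > 0:   # for n = 0 the lower limit is 0
          l = optimize.brentq(lambda mu: stats.poisson.sf(n - 1, mu) - alpha / 2, 1e-9, 1e3)
      return l, u

  print(poisson_central_interval(0))   # upper limit ~3.0, as on a later slide
  print(poisson_central_interval(5))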
Poisson Confidence Interval
[Figure not transcribed]
Poisson Confidence Interval
- Assume signal s and background b:

$$\alpha = 0.05 = P(n \le n_{obs};\, s, b) = \sum_{n=0}^{n_{obs}} \frac{(s+b)^n}{n!}\, e^{-(s+b)}$$

- Solve numerically for $s = s_{up}$; this gives an upper limit on s at CL = 1 − α
- In the special case that b = 0 and $n_{obs} = 0$:

$$\alpha = 0.05 = e^{-s_{up}} \;\Rightarrow\; s_{up} = 3$$
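A numerical version of this sketch (SciPy assumed); for b = 0 and n_obs = 0 it reproduces s_up = 3:

  from scipy import stats, optimize

  def upper_limit(n_obs, b, alpha=0.05):
      """Solve P(n <= n_obs; s + b) = alpha numerically for s = s_up."""
      return optimize.brentq(
          lambda s: stats.poisson.cdf(n_obs, s + b) - alpha, 0.0, 100.0)

  print(upper_limit(0, 0.0))   # ~3.0, the b = 0, n_obs = 0 special case
  print(upper_limit(5, 3.0))   # with an expected background of 3 events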
Poisson Confidence Interval
[Figure not transcribed]
Confidence Intervals
- Sometimes, though, confidence intervals
  - are empty
  - reduce in size when the background estimate increases
  - are smaller for a poorer experiment
  - exclude parameters for which the experiment is insensitive
- Example
  - We know that $P(x = 0|\nu = 2.3) = 0.1$, so ν < 2.3 @ 90% CL
  - If the number of background events b is 3, then since ν = s + b, the number of signal events satisfies s < −0.7: an empty (unphysical) interval
Confidence Intervals
[Figures not transcribed]
Confidence Interval
- Experiment X uses a fit to extract the neutrino mass
  - $M_\nu = -4 \pm 2$ eV
  - ⇒ $P(M_\nu < 0\ \text{eV}) = 0.98$?
Confidence Interval
- What is probability?
  - Frequentist approach
    - Developed by Venn, Fisher, Neyman, von Mises
    - The relative frequency with which something happens: number of successes / number of trials
    - Venn limit (number of trials n → ∞)
    - Assumes success appeared in the past and will occur in the future with the same probability
  - "It will rain tomorrow in Tucson" with P(S) = 0.01
    - The relative frequency with which it rains on Mondays in April is 0.01
Confidence Interval
- What is probability?
  - Bayesian approach
    - Developed by Bayes, Laplace, Gauss, Jeffreys, de Finetti
    - The degree of belief or confidence in a statement or measurement
    - Closer to what is used in everyday life
      - Is the Standard Model correct?
    - Similar to betting odds
    - Not "scientific"?
  - "It will rain tomorrow in Tucson" with P(S) = 0.01
    - The plausibility of the above statement is 0.01 (i.e. the same as if I were to draw a white ball out of a container of 100 balls, 1 of which is white)
Confidence Interval
- Usually
  - Confidence interval == frequentist confidence interval
  - Credible interval == Bayesian posterior probability interval
    - But you'll also hear "Bayesian confidence interval"
- Probability
  - P = 1 − α
    - α = 0.05 ⇒ P = 95%
Confidence Interval
- Suppose you wish to determine a parameter θ whose true value $\theta_t$ is unknown
- Assume we make a single measurement of an observable x whose pdf $P(x|\theta)$ depends on θ
  - Recall this is the probability of obtaining x given θ
- Say we measure $x_0$; then we obtain $P(x_0|\theta)$
- Frequentist
  - Makes statements about $P(x|\theta)$
- Bayesian
  - Makes statements about $P(\theta_t|x_0)$
  - $P(\theta_t|x_0) = P(x_0|\theta_t)\, P(\theta_t)\, /\, P(x_0)$
- We'll stick with the frequentist approach for the moment
Confidence Interval
- (Frequentist) confidence intervals are constructed to include the true value of the parameter ($\theta_t$) with a probability of 1 − α
  - In fact this is true for any value of θ
- A confidence interval $[\theta_1, \theta_2]$ is a member of a set, such that the set has the property that $P(\theta \in [\theta_1, \theta_2]) = 1 - \alpha$
  - Perform an ensemble of experiments with fixed θ
  - The interval $[\theta_1, \theta_2]$ will vary and will cover the fixed value θ in a fraction 1 − α of the experiments
- Presumably when we make a measurement we are selecting it at random from the ensemble that contains the true value of θ, $\theta_t$
- Note we haven't said anything about the probability of $\theta_t$ being in the interval $[\theta_1, \theta_2]$, as a Bayesian would
Confidence Interval
- If $P(\theta \in [\theta_1, \theta_2]) = 1 - \alpha$ is true, we say the intervals "cover" θ at the stated confidence
- If there are values of θ for which $P(\theta \in [\theta_1, \theta_2]) < 1 - \alpha$, we say the intervals "undercover" for that θ
- If there are values of θ for which $P(\theta \in [\theta_1, \theta_2]) > 1 - \alpha$, we say the intervals "overcover" for that θ
- Undercoverage is bad
Confidence Intervals
- Neyman's construction
  - Consider a pdf $f(x;\theta) = P(x|\theta)$
  - For each value of θ, we construct a horizontal line segment $[x_1, x_2]$ such that $P(x \in [x_1, x_2]\,|\,\theta) = 1 - \alpha$
  - The union of such intervals for all values of θ is called the confidence belt
Confidence Intervals
- Neyman's construction
  - After performing an experiment to measure x, a vertical line is drawn through the experimentally measured value $x_0$
  - The confidence interval for θ is the set of all values of θ for which the corresponding line segment $[x_1, x_2]$ is intercepted by the vertical line
Confidence Intervals
[Figure not transcribed]
Confidence Interval
- Notes
  - The coverage condition is not unique
    - $P(x < x_1|\theta) = P(x > x_2|\theta) = \alpha/2$: called central confidence intervals
    - $P(x < x_1|\theta) = \alpha$: called upper confidence limits
    - $P(x > x_2|\theta) = \alpha$: called lower confidence limits
Confidence Intervals
- These confidence intervals have a confidence level of 1 − α
- By construction, $P(\theta \in [\theta_1, \theta_2]) \ge 1 - \alpha$ is satisfied for all θ, including $\theta_t$
- Another method is to consider a test of the hypothesis that the parameter's true value is θ
- If the variables are discrete, by convention one constructs the confidence belt by requiring $P(x_1 < x < x_2|\theta) \ge 1 - \alpha$
Examples
- Data consisting of a single random variable x that follows a Gaussian distribution
- Counting experiments
Poisson Confidence Interval
- We previously mentioned that the number of events produced in a reaction with cross section σ and fixed luminosity L follows a Poisson distribution with mean $\nu = \sigma \int L\, dt$
  - $P(n; \nu) = e^{-\nu}\, \nu^n / n!$
  - If the variables are discrete, by convention one constructs the confidence belt by requiring $P(x_1 < x < x_2|\theta) \ge 1 - \alpha$
- Example: measuring the Higgs production cross section assuming no background
Poisson Confidence Interval
[Figure not transcribed]
Poisson Confidence Interval
[Figure: central intervals for the Poisson distribution, $P(n|\mu) = e^{-\mu}\mu^n/n!$; the parameter μ (vertical axis, 0 to 20) vs. the count n (horizontal axis, 0 to 20), with the upper-limit curve u(n) and lower-limit curve l(n) forming the belt]
Poisson Confidence Interval
[Figures not transcribed]
Confidence Intervals
- Sometimes, though, confidence intervals
  - are empty
  - reduce in size when the background estimate increases
  - are smaller for a poorer experiment
  - exclude parameters for which the experiment is insensitive
- Example
  - We know that $P(x = 0|\nu = 2.3) = 0.1$, so ν < 2.3 @ 90% CL
  - If the number of background events b is 3, then since ν = s + b, the number of signal events satisfies s < −0.7: an empty (unphysical) interval
Confidence Intervals
[Figures not transcribed]