Download Hypothesis Tests, Significance

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
Hypothesis Tests, Significance
Significance as hypothesis test
Pitfalls
Avoiding pitfalls
Goodness-of-fit
Context is frequency statistics.
1
Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008
Significance as Hypothesis Test
When asking for the “significance” of an observation (of, perhaps a new
effect), you ask for a test of the hypotheses:
Null hypothesis H0 : There is no new effect;
against
Alternative hypothesis H1 : There is a new effect.
Reject the null hypothesis (that is, claim a new effect) if the observation falls in a region that is “unlikely” if the null hypothesis is correct.
“Significance” (as typically used in HEP, e.g., “a significance of 5σ”)
is the probability that we erroneously reject the null hypothesis. Also
called “confidence level”, or “P -value”, or the probability of a “Type I
error”.
2
Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008
Significance
A 68% confidence interval does not always tell you much about significance. The tails may be non-normal. A separate analysis is generally
required, which models the tails appropriately.
No recommendation on when to label result as “significant”. Label
implies interpretation.
– No uniform prescription seems to make sense, involves judgement.
Eg, bizarre new particle vs. expected branching fraction.
– Not our most essential experimental role; up to consumer ultimately to decide what they want to believe.
Nevertheless, people insist on making qualitative statements (“observation of”, “evidence for”, “discovery of”, “not significant”, “consistent
with”)
Code: “observation of” ≡
3
> 4σ, “evidence for” ≡
> 3σ
Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008
Significance (more semantics. . . )
From Physics Today, http://www.physicstoday.org/pt/vol-54/iss-9/p19.html
(coloring mine, references deleted)
nb: Just an observation on human nature, no criticism should be inferred.
“In March, back-to-back papers in Physical Review Letters reported the
measurement of CP symmetry violation in the decay of neutral B mesons
by groups in Japan and California. Now the word “measurement” has
been replaced by “observation” in the titles of two new back-to-back
reports by these same groups in the 27 August Physical Review Letters.
That is to say, with a lot more data and improved event reconstruction,
the BaBar collaboration at SLAC and the Belle collaboration at KEK in
Japan have at last produced the first compelling evidence of CP violation
in any system other than the neutral K mesons.”
Some people think a measurement should not be called a “measurement”
unless the result is significantly different from zero. A senior Assistant
Editor at a prominent journal suggested that “bounds on” might be more
appropriate than “measurement” in reference to a CP asymmetry angle
which was observed as consistent with zero.
Finding sin 2β = 0.00 ± 0.01 would be pretty exciting.
But it isn’t a “measurement”?
4
Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008
Computation of significance: Example
Flat normal background (avg 100/bin) + gaussian signal (100 events,
fixed) with mean 0, σ = 1
160
140
Events/0.5
120
100
80
Generated signal = 100 events
_ 39 events
Cut&Count signal = 194 +
Significance = 5.0
Expected significance = 2.7
60
40
20
0
-10
-8
-6
-4
-2
0
x
2
4
6
8
10
Distribution is normal here, so P = 5.7 × 10−7 (two-tail).
Note: H0 is flat distribution with no signal.
H1 is Nsig = 0.
5
Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008
Note: The significance is not obtained by dividing the signal estimate
(194) by the uncertainty in the signal (39), 194/39 = 5.0. That would
be akin to asking how likely a signal of the estimated size would be to
fluctuate to zero. It is a good approximation in this example, however,
because B/S is large.
The significance was estimated here with a simple cut-and-count method:
– Level of background estimated from region |x| > 3.
– Counts in “signal region” |x| < 3 added up.
– The excess in the signal region over the estimated background is divided by the square root of the estimated background in the signal region.
In this example, we assumed we knew the mean and width of the signal
we were looking for, as well as the background shape. The uncertainty
in the background estimate is negligible in this example.
A more sophisticated fit may yield a more powerful test by incorporating the known shape of the signal.
6
Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008
What about Systematic Uncertainties?
B(e+e− → Nobel prize) = 10 ± 1 ± 5.
They may be important!
– Maybe the ±5 is a systematic uncertainty in the estimate of the
background expectation. A “10σ” statistical significance is really
only a “2σ” effect.
They may be irrelevant
– Maybe the ±5 is a systematic uncertainty on the efficiency, entering
as a multiplicative factor. It makes no difference to the significance
whether the result is 10 ± 1 or 5 ± 0.5.
They may be “fuzzy”
– E.g., how is the background expectation estimated? What is the
sampling distribution?
– E.g., Are “theoretical” uncertainties present?
7
Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008
Systematic Uncertainties
“Blind checks” and “educated checks”:
– Blind check: testing for mistakes; no correction is expected. If
pass test, no contribution to systematic error. Eg, divide data into
chronological subsets and compare results.
– Educated check: measuring biases, corrections. May affect quoted
result. Always contributes to systematic error. Eg, dependence of
efficiency on model.
Quote systematic uncertainty separately from statistical
– Systematic uncertainty may contain statistical components, eg, MC
statistics in evaluation of efficiency.
8
Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008
y’
Systematic Uncertainties Example: D mixing and DCSD revisited
– Want simple procedure. Willing to
accept approximation.
– Scale statistical-only contour uniformly
along ray from best-fit value.
– Factor is 1 + m2i , where mi is
an estimate of the effect of systematic uncertainty i measured in units
of the statistical uncertainty. This
estimate is obtained by determining
the effect of the systematic uncertainty on x
2, y.
0.04
0.02
-0
-0.02
2
Physical (x’ , y’)
2
Central (x’ , y’) No CPV
95% CL CPV allowed
95% CL CPV allowed, stat only
95% CL CP conserved
95% CL CP conserved, stat only
-0.04
-0.06
-0.5
0
0.5
1
1.5
2
2
x’ / 10
2.5
-3
Method conservative (≡ lazy) in sense that scaling for a given systematic in one direction is applied uniformly in all directions. On the other hand, a linear approximation
is being made.
9
Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008
Aside: Significance as “nσ”
HEP parlance is to say an effect has, e.g., “5σ” significance.
At face value, this means the observation is “5 standard deviations” away
from the mean:
σ ≡ (x − x̄)2.
But we often don’t really mean this. Note that a 5σ effect of this sort
may not be improbable:
P(x)
0.8
P (|x − x̄| = 5σ ) = 20% !
0.1
0.1
-1
10
0
1
x
Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008
Aside: Significance as “nσ” (continued)
Instead, we often mean that the probability (P -value) for the effect is
given by the probability of a fluctuation in a normal distribution 5σ from
the mean, i.e.,
P = P (|x| > 5), for x ∈ N (0, 1)
= 5.7 × 10−7
(two-tailed probability).
But sometimes we really do mean 5σ, usually presuming that the sampling distribution is approximately normal. [This may not be an accurate
presumption when far out in the tails!]
√
Also now popular to call −2Δ ln L the “n” in “nσ”.
2
From: L0(θ = 0; x) = √ 1 exp − 12 x/σ , Lmax(
θ = x; x) = √ 1 ,
2πσ
2πσ
√
giving −2Δ ln L = Δχ2 = x/σ = n.
Desirable to be more concise by quoting probabilities, or “P -values” as
is common in the statistics world. At least say what you mean!
11
Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008
Estimating significance: Pitfalls
What are the dangers? In a nutshell:
Unknown or unknowable sampling distributions
Ways to not know the distribution:
The Improbable Tails
Systematic Unknowns
The Stopping Problem
The exploratory Bump Hunt
12
Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008
The Improbable Tails
160
140
Our earlier example was known to be normal sampling.
Events/0.5
120
100
80
Generated signal = 100 events
_ 39 events
Cut&Count signal = 194 +
Significance = 5.0
Expected significance = 2.7
60
40
20
0
-10
-8
-6
-4
-2
0
x
2
4
6
8
10
Often, this is true approximately (central limit theorem).
But for significance, often interested in distribution far into the tails.
The normal approximation may be very bad here!
If there is any doubt, need to compute the actual distribution. Typically
this is done with a “toy Monte Carlo” to simulate the distribution of the
significance statistic.
To get to the tails, this may require a fair amount of computing time.
Still need to be wary of pushing calculation beyond its validity as a model
of the actual distribution.
BaBar (and others) now routinely performs these calculations.
13
Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008
Systematic Unknowns
Nuisance parameters
– Unknown, but relevant parameters. Estimated somehow, but with
some uncertainty. Even if sampling distribution is known, cannot in
general derive exact P -values in lower dimensional parameter space.
– Central Limit Theorem is our friend.
– Can try other values besides best estimate of nuisance parameter.
– See our discussion of confidence intervals.
“Theoretical” systematic uncertainties. Guesses, no sampling distribution.
– Use worst case values when evaluating significance. requires understanding what is meant by the theory “errors”.
– Or, give the dependence, e.g., as a range.
14
Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008
The Stopping Problem
There is a strong tendency to work on an analysis until we are convinced
that we got it “right”, then we stop.
Simple example: “Keep sampling” until we are satisfied.
Motivate our example:
• Ample historical evidence that experimental measurements are sometimes biased by some preconception of what the answer “should be”.
For example, a preconception could be based on the result of another
experiment, or on some theoretical prejudice.
• A model for such a biased experiment is that the experimenter works
“hard” until s/he gets the expected result, and then quits. Let’s Consider a simple example of a distribution which could result from such
a scenario.
15
Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008
Stopping Problem: Normal likelihood function example
1
N (x; θ, 1)dx = √
2π
2/2
−(x−θ)
e
dx.
Frequency
• Consider an experiment in which a measurement of a parameter θ
corresponds to sampling from a Gaussian distribution of standard deviation one:
θ− 2
θ
x
θ+2
• Suppose the experimenter has a prejudice that θ is greater than one.
• Subconsciously, s/he makes measurements until the sample mean, m =
1 n x , is greater than one, or until s/he becomes convinced (or
i=1 i
n
tired) after a maximum of N measurements.
• The experimenter then uses the sample mean, m, to estimate θ.
16
Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008
Stopping Problem: Normal likelihood function example
For illustration, assume that N = 2. In terms of the random variables
m and n, the pdf is:
⎧
1
2
⎪
n = 1, m > 1
⎨ √12π e− 2 (m−θ) ,
f (m, n; θ) = 0,
n = 1, m < 1
⎪
2
2
1
⎩ 1 e−(m−θ)
−(x−m)
dx n = 2
−∞ e
π
4000
Histogram of sampling distribution for m, with pdf
given by above equation,
for θ = 0.
Number of experiments
3000
2000
1000
0
-4
-2
0
2
4
Sample mean
The likelihood function, as a function of θ, has the shape of a normal
distribution, given any experimental result. The peak is at θ = m, so m
is the maximum likelihood estimator for θ.
17
Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008
Stopping Problem: Normal likelihood function example
In spite of the normal form of the likelihood function, the sample
mean is not sampled from a normal distribution. The “4σ” tail is more
probable (for some θ) than the experimenter thinks.
Probability( θ < m - 4s )
0.00007
0.00006
0.00005
0.00004
0.00003
0.00002
Straight line is probability for a “4σ” fluctuation
of a normal distrinbution.
0.00001
0
-8
-6
-4
-2
0
2
4
θ
18
Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008
Stopping Problem: Normal likelihood function example
The likelihood function, as a function of θ, has the shape of a normal
distribution, given any experimental result. The peak is at θ = m, so m
is the maximum likelihood estimator for θ. In spite of the normal form of
the likelihood function, the sample mean is not sampled from a normal
distribution. The interval defined by where the likelihood function falls
by e−1/2 does not correspond to a 68% CI:
Probability (θ-<θ<θ+)
0.75
0.70
0.65
0.60
-2
0
2
4
θ
19
Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008
Stopping Problem: Normal likelihood function example
• The experimenter in this scenario thinks s/he is taking n samples from
a normal distribution, and makes probability statements (e.g., about
significance) according to a normal distribution.
• S/he gets an erroneous result because of the mistake in the distribution.
• If the experimenter realizes that sampling was actually from a nonnormal distribution, s/he can do a more careful analysis to obtain
more valid results.
Related effect: First “observations” tend to be biased high.
Example? BES early fD = 371 vs fD ∼ 220 MeV more recently. Nothing
“wrong” with this, but best estimates should average in earlier “null”
data. Importance of reporting null results and quoting combinable results, e.g., two-sided confidence intervals.
20
Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008
The Exploratory Bump Hunt
Context is the typical bump-hunting approach in HEP
That is, analysis is developed concurrent with looking at the data
Events / 20 MeV/c2
We see a “bump”, how significant is it?
40
104
103
30
BaBar
e+e− → γISRπ +π −J/ψ
P (H0) ∼ 10−11 − 10−16
102
10
1 3.6 3.8
20
4
4.2 4.4 4.6 4.8
5
10
0
3.8
4
4.2
4.4
4.6
4.8
5
m(π+π-J/ψ) (GeV/c2)
If the analysis is our typical bump-hunt approach, it is impossible to
compute the significance!
We don’t know the sampling distribution (under null hypothesis).
Hence, we cannot compute probabilities (under null hypothesis, in particular).
21
Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008
Bump Hunting – Example I
Recall our earlier example:
Flat normal background (avg 100/bin) + gaussian signal (100 events)
with mean 0, σ = 1
160
140
Events/0.5
120
100
80
Generated signal = 100 events
_ 39 events
Cut&Count signal = 194 +
Significance = 5.0
Expected significance = 2.7
60
40
20
0
-10
-8
-6
-4
-2
0
x
2
4
6
8
10
Distribution is normal here, so P = 5.7 × 10−7 (two-tail).
Note: H0 is flat distribution with no signal.
H1 is Nsig = 0.
As long as we knew at the outset that we were looking for an effect
with mean 0 and σ = 1, we make correct inferences.
22
Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008
Bump Hunting – Example II
Looking for a mass bump.
Evidence for a threshold enhancement!
160
Events/10 MeV
140
120
100
80
Background level = 100 events/bin
_ 12 events
Signal = 40 +
Statistical significance = 4.0 σ
60
40
_
One-tailed probability (P-value) = 2.6 X 10 5
20
0
0
23
10 0
20 0
30 0
40 0
50 0
_
m(pp) - 2m(p)
60 0
70 0
80 0
90 0
(MeV)
Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008
So, what does it mean?
Even though we cannot estimate probabilities, we still quote numbers
intended to relay an idea of significance, usually quoted as some number of “sigma”.
What we (usually) mean by this is:
“If I had done the analysis in a controlled manner, and had been
interested in the observed value for the mean and width, then the
null hypothesis would require a fluctuation of this number of standard
deviations of a Gaussian distribution to produce a bump as large as I
see”
24
Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008
Is it useful?
With this understanding of the meaning, it is perhaps not completely
useless, since we can interpret it in the context of our experience.
However, our experience is that we are sometimes fooled!
pentaquarks example
Life is short, we have also made great discoveries with this approach.
Just remember that quoted significances are highly misleading.
Note: The “threshold enhancement” in Example II is a statistical fluctuation! The sampling distribution was the same N (100, 10) for all bins.
Evidence for a threshold enhancement!
160
Events/10 MeV
140
120
100
80
Background level = 100 events/bin
_ 12 events
Signal = 40 +
Statistical significance = 4.0 σ
60
40
_
One-tailed probability (P-value) = 2.6 X 10 5
20
0
0
25
10 0
20 0
300
40 0
50 0
_
m(pp) - 2m(p)
60 0
70 0
80 0
90 0
(MeV)
Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008
Avoiding pitfalls
Model those tails!
“Conservatism”
– Based on our experience with the failings of our methodology, we
don’t claim new discoveries lightly.
Do it better: Be Blind
26
Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008
It can be done “better”
Better means with more meaningful significance estimates, and avoidance of bias.
Do a “blind analysis”
– Blind means you design the experiment (analysis) before you look
at the results.
Goal is to know the sampling distribution.
BaBar routinely blinds it’s analyses, when there is a well-defined quantity
(e.g., CP asymmetry or branching fraction) being measured.
27
Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008
Blinding Methodologies
There are several basic approaches to blinding an analysis, which may
be chosen according to the problem. Many variations on these themes!
Don’t look inside the box
You can look, but keep the answer hidden
Obscure the real data (e.g., adding simulated signal to data)
Design on a dataset that will be thrown away
Divide and conquer (e.g., BNL muon g − 2 result, ωp (magnetic field)
and ωa (muon precession) were analyzed independently, before combining to obtain g − 2; arXiv:hep-0401008v3)
[Reference: Klein & Roodman, Ann.Rev.Nucl.Part.Sci. 55 (2005) 141.]
28
Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008
Blind Analysis
Many BaBar results are obtained in “blind analyses”. Purpose is to
avoid introduction of bias.
∗ (2317)+ .
Not all: Exploratory analyses not blind, eg, discovery of the DsJ
Example of a blind analysis: B ± → K ±e+e−
Data after unblinding
0.5
0.4
0.3
+
+ + -
B →K e e
Signal MC
0.2
signal region
0.1
-0
large sideband
-0.1
fit region
-0.2
$ % 'E6
Δ E (GeV)
After event selection, look for signal in distribution of two kinematic
variables ΔE and mES. Monte Carlo, control sample (usually a type of
data resembling signal) studies of entire sideband and fit region. Data
only looked at in large sideband region prior to unblinding.
+ pE E
-0.3
-0.4
-0.5
5
5.05
5.1
29
5.15
5.2
5.25
5.3
mES (GeV/c2 )
M%3 'E6 C
Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008
Blind Analysis (continued)
Issue: Updating a blind analysis with additional data.
Simply add data, no change in analysis.
– May be impractical, undesirable, eg, re-reconstruction of entire dataset;
improvements in tools such as particle identification.
– Sometimes would like to work harder to optimize analysis, or optimize on different criterion (eg, for most precision instead of most
sensitivity).
Notion of “reblinding” and re-optimization.
– Considered safe to use variables which have not been inspected too
carefully in blind region.
– Not called a blind analysis.
“Pure” approach: Do a blind analysis only on the new dateset.
– OK to use whatever was learned from the old dataset.
– Can combine the results from the old and the new data.
30
Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008
Blind Analysis – Hiding the answer
In this approach, no data is hidden, just the “answer” (typically, a
number).
For example, a hidden, possibly random, offset may be applied to
the real answer to prevent it from being seen before the analysis is
designed.
Δt(Hidden) =
1
−1
(TAG)Δt + Offset,
Top: Not hidden; Bottom: Hidden
BaBar, from Klein& Roodman, op. cit.
where TAG = ±1 according to the B flavor tag. This algorithm permits
viewing the decay time distributions without revealing the asymmetry.
31
Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008
Blinding a bump hunt
No different in principle than our other blind analyses.
The more you know (about the signal you are interested in), the better
you can do.
Yes, it is harder. It takes discipline, and it might even cost “sensitivity”.
But it is at least worth thinking about.
32
Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008
What cuts do I make?
Again, the more we know, the better off we are!
– If we have good simulation of background processes, or background
data, that’s great! We’ll use it.
– The less we know, the bigger the risk that we won’t be optimal.
That’s life. Being non-optimal doesn’t mean wrong.
One approach: Sacrifice a portion of data for cut optimization.
– Ignore any “signals”. Or tune on them. . .
– If signal is real, it should show up in remaining blinded data. If it is
a fluctuation, it will disappear.
Evidence for a threshold enhancement!
1 60
140
140
Events/10 MeV
120
120
100
100
80
80
⇒⇒
Background level = 100 events/bin
_ 12 events
Signal = 40 +
Statistical significance = 4.0 σ
60
40
_
One-tailed probability (P-value) = 2.6 X 10 5
20
60
40
20
0
0
0
0
10 0
20 0
300
40 0
50 0
_
m(pp) - 2m(p)
60 0
70 0
80 0
10
20
30
40
50
60
70
80
90
90 0
(MeV)
– Systematic effects will show up both places. But that’s another
matter.
33
Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008
Significance/GOF: Counting Degrees of Freedom
The following situation arises with some frequency (with variations):
I do two fits to the same dataset (say a histogram with N bins):
Fit A has nA parameters, with χ2A [or perhaps −2 ln LA].
Fit B has a subset nB of the parameters in fit A, with χ2B , where the
nA − nB other parameters (call them θ) are fixed at zero.
What is the distribution of χ2B − χ2A?
See SWiG thread:
http://babar-hn.slac.stanford.edu:5090/HyperNews/get/Statistics/300.html
Luc Demortier in:
http://phystat-lhc.web.cern.ch/phystat-lhc/program.html
[this is recommended reading]
34
Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008
Counting Degrees of Freedom (continued)
In the asymptotic limit (that is, as long as the normal sampling distribution is a valid approximation),
Δ χ2
≡ χ2B − χ2A
is the same as a likelihood ratio (−2 ln λ) statistic for the test:
H0 : θ = 0
against H1 : some θ = 0
is distributed according to a χ2(NDOF = nA − nB ) distribution as
long as:
Δ χ2
Parameter estimates in λ are consistent (converge to the correct values)
under H0.
Parameter values in the null hypothesis are interior points of the maintained hypothesis (union of H0 and H1).
There are no additional nuisance parameters under the alternative
hypothesis.
35
Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008
Counting Degrees of Freedom (continued)
A commonly-encountered situation violates these requirements: We
compare fits with and without a signal component to estimate significance of the signal.
If the signal fit has, e.g., a parameter for mass, this constitutes an
additional nuisance parameter under the alternative hypothesis.
If the fit for signal constrains the yield to be non-negative, this violates
the interior point requirement.
5000
(1 dof )
2
4
6
8
Δχ 2 (H0-H1)
36
10
12
2
Blue: χ (1 dof )
Red: χ (2 dof )
1000
2
0
0
0
H0: Fit with no signal
H1: Float signal yield
and location
3000
2
5000
Curve: χ
Frequency
Frequency
4000
8000
2
Curve: χ (1 dof )
H0: Fit with no signal
H1: Float signal yield,
>0
constrained _
15000
12000
H0: Fit with no signal
H1: Float signal yield
0
Frequency
25000

0
2
4
6
8
Δχ 2 (H0-H1)
10
12
0
2
4
6
8
Δχ 2 (H0-H1)
Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008
10
12
Counting Degrees of Freedom - Sample fits
H0 fit
H1 fit
Signal>=0
Fix mean
40
80
x
Any signal
Float mean
0
40
80
x
Fits to data with no signal.
37
500
H1 fit
Counts/bin
800
500
800
800
500
Counts/bin
500
800
Any signal
Fix mean
0
H0 fit
H1 fit
H1 fit
Any signal
Fix mean
H1 fit
H1 fit
Any signal
Float mean
Signal>=0
Fix mean
0
40
80
0
x
40
80
x
Fits to data with a signal.
Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008
Significance – Conclusions
Evaluation of significance is a “hypothesis test”. It is essentially the
same problem as evaluating confidence intervals.
– Except for the more obvious role played by improbable tails.
Pitfalls amount to (not) knowing the sampling distribution.
Techniques exist to avoid pitfalls:
– Simulating the sampling distribution
– Vary the nuisance/theory parameters
– Blind your experiment
Doing it properly requires patience and discipline; the benefit is a more
meaningful, convincing result to yourself and to others.
38
Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008
Goodness-of-Fit
No perfect general goodness-of-fit test:
– Given a dataset generated under null hypothesis, can usually find
a test which rejects the null hypothesis (ie, choosing the test after
you see the data is dangerous).
– Given a dataset generated under alternative hypothesis, can usually
find a test for which the null passes (ie, should think about what
you want to test for).
Nominal recommendation:
– If you have a specific question, test for that.
– χ2 test when valid.
– Consider more general likelihood ratio test, Kolmogorov-Smirnov,
etc., otherwise.
– Monte Carlo evaluation of distribution of test statistic.
39
Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008
Goodness-of-Fit (continued)
But recognize when test may not answer desired question, eg, in sin 2β
analysis, a likelihood ratio (or a χ2) test on the time distribution may
Entries / 0.6 ps
have little sensitivity to testing goodness-of-fit of the asymmetry.
150
0
B tags
Background
−0
100
B tags
Raw Asymmetry
50
0
0.5
0
-0.5
-5
40
0
5
Δt (ps)
Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008
Consistency of two correlated results
BaBar has encountered several times the question of whether a new
analysis is consistent with an old analysis.
Often, new analysis is a combination of additional data plus changed
(improved. . . ) analysis of original data.
The stickiest issue is handling the correlation in testing for consistency
in the overlapping data.
People sometimes have difficulty understanding that statistical differences can arise even comparing results based on the same events.
Given a sampling θ1 , θ2 from a bivariate normal distribution N (θ, σ1, σ2, ρ),
with θ1 = θ2 = θ, the difference Δθ ≡ θ2 − θ1 is N (0, σ)-distributed
with σ 2 = σ12 + σ22 − 2ρσ1σ2 .
If the correlation is unknown, all we can say is that the variance of the
difference is in the range (σ1 − σ2)2 . . . (σ1 + σ2)2. If we at least believe
ρ ≥ 0 then the maximum variance of the difference is σ12 + σ22.
41
Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008
Consistency – Simple example of two analyses on same events
Suppose we measure a neutrino mass, m, in a sample of n = 10
independent events. The measurements are xi, i = 1, . . . , 10. Assume
the sampling distribution for xi is N (m, σi).
We may form unbiased estimator, m
1, for m:
m
1 =
1 n x
i=1 i
n
±
1
n2
n
2
i=1 σi .
The result (from a MC) is m
1 = 0.058 ± 0.039.
Then we notice that we have some further information which might be
useful: we know the experimental resolutions, σi for each measurement.
We form another unbiased estimator, m
2, for m:
n
m
2 =
xi
i=1 σ 2
n 1i
i=1 σ 2
i
± n1
1
i=1 σ 2
i
.
The result (from the same MC) is m
1 = 0.000 ± 0.016.
42
Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008
Example continued
The results are certainly correlated, so question of consistency arises
(we know the error on the difference is between 0.023 and 0.055). In this
example, the difference between the results is 0.058 ± 0.036, where the
0.036 error includes the correlation (ρ = 0.41).
43
Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008
Consistency – Evaluating the Correlation
Art Snyder developed an approximate formula for evaluating the correlation in a comparison of maximum likelihood analyses (eg, in one-dimensional case).
Suppose we perform two maximum likelihood analysis, with event likelihoods L1, L2, on
the same set of events [nb, may use different information in each analysis]. The results
are estimators θ1, θ2 for parameter θ. The correlation coefficient ρ may be estimated
according to:
N
d ln L2i
d ln L1i
i=1 Ri dθ |θ=θ1 dθ |θ=θ2
ρ ≈ ,
N d2 ln L1i
N d2 ln L2i
|
θ=θ
2
i=1 dθ
i=1 dθ2 |θ=θ0
0
where (θ0 is an expansion reference point)
d ln L
d ln L
2
2
ln
L
ln
L
d
d
1i
1i
2i
2i
Ri = 1 − (
1
−
(
θ
θ1 − θ0)
|
−
θ
)
|
|
| .
θ=θ
2
0
θ=θ
0
0
dθ2
dθ θ=θ0
dθ2
dθ θ=θ0
If θ0 ≈ θ1 ≈ θ2 , then
ρ ≈ σ̃θ1 σ̃θ2
N
d ln L1i
i=1
where
σ̃θ2k
≡ 1/
44
N dLki
i=1
dθ
|θ=θ0
2
dθ
|θ=θ0
d ln L2i
| ,
dθ θ=θ0
.
Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008
Consistency – Example: sin 2β
32 × 106 B B̄ pairs – PRL, vol 87, 27 August 2001:
sin 2β = 0.59 ± 0.14(stat) ± 0.05(syst)
62 × 106 B B̄ pairs – SLAC-PUB-9153, March 2002:
sin 2β = 0.75 ± 0.09(stat) ± 0.04(syst)
Second result includes the earlier data, re-reconstructed. Analysis involves multivariate maximum likelihood fits; reprocessing changes, eg,
relative likelihood for an event to be signal or background. Not simply
counting events. Question: are the two results statistically consistent?
If these were independent data sets, a difference of 0.16 ± 0.17 would not
be a worry. The issue is the correlation.
A specialized analysis deriving from the previous formula is performed
on the events in common between the two analyses. A correlation of
ρ = 0.87 is deduced, yielding a difference of ∼ 2.2σ.
45
Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008