Statistical methods
in LHC data analysis
part I.2
Luca Lista
INFN Napoli
Contents
• Hypothesis testing
• Neyman-Pearson lemma and likelihood ratio
• Multivariate analysis (elements)
• Chi-square fits and goodness-of-fit
• Confidence intervals
• Feldman-Cousins ordering
Multivariate discrimination
• The problem:
  – A signal and a background are characterized by n variables with different distributions in the two cases: signal and background
    • The generalization to more than two classes is a straightforward extension
  – Given a measurement (event) of the n variables with some discriminating power, (x1, …, xn), identify (discriminate) the event as signal or background
• Properties of a discriminator:
  – Selection efficiency: probability of the right answer for a signal event
  – Mis-identification probability: probability of the wrong answer for a background event
  – Purity: fraction of signal in a positively identified sample
    • Depends on the signal and background composition of the sample! It is not a property of the discriminator alone
  – Fake rate: fraction of background in a positively identified sample, = 1 − purity
Test of Hypothesis terminology
• The naming used by statisticians is usually less natural for physics applications than that of the previous slide
• H0 = null hypothesis
  – E.g.: a sample contains only background; a particle is a pion; etc.
• H1 = alternative hypothesis
  – E.g.: a sample contains background + signal; a particle is a muon; etc.
• α = significance level: probability to reject H1 if it is true, i.e. assuming H1 (a signal event is classified as background)
  – α = 1 − efficiency for signal
• β = probability to reject H0 if it is true, i.e. assuming H0 (a background event is classified as signal)
  – β = efficiency for background
• Note: in the standard statistical naming, rejecting H0 when it is true is the error of the first kind, and wrongly accepting H0 is the error of the second kind
Cut analysis
• Cut on one (or more) of the discriminating variables:
  – If x > xcut → classify as signal
  – Else (x ≤ xcut) → classify as background
• [Figure: overlapping signal and background distributions of x with the cut at xcut; the selected signal fraction is the efficiency (1 − α), the selected background fraction is the mis-id probability (β)]
Efficiency vs mis-id
• Varying the cut, both the efficiency and the mis-id probability change (see the sketch below)
• [Figure: efficiency vs mis-id probability; the curve runs from (0, 0) to (1, 1) as the cut is loosened]
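As an illustration, here is a minimal sketch of such an efficiency vs mis-id scan, assuming Gaussian-distributed signal and background in a single variable (all numbers are illustrative toys, not from the slides):

    import numpy as np

    rng = np.random.default_rng(seed=1)
    signal = rng.normal(loc=2.0, scale=1.0, size=100_000)      # signal: x ~ N(2, 1)
    background = rng.normal(loc=0.0, scale=1.0, size=100_000)  # background: x ~ N(0, 1)

    # Scan the cut x > x_cut; each cut gives one (mis-id, efficiency) point
    for x_cut in np.linspace(-2.0, 4.0, 13):
        efficiency = np.mean(signal > x_cut)   # 1 - alpha
        mis_id = np.mean(background > x_cut)   # beta
        print(f"x_cut = {x_cut:5.2f}   efficiency = {efficiency:.3f}   mis-id = {mis_id:.3f}")

Plotting efficiency vs mis-id for a fine scan of x_cut traces exactly the curve sketched above.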
Variations on cut analyses
• Cut on multiple variables
– AND/OR of single cuts
• Multi-dimensional cuts:
– Linear cuts
– Piece-wise linear cuts
– Non-linear combinations
• At some point it becomes hard to find the optimal cut values, or too many cuts are required
  – How to determine the cuts? By looking at control samples
  – Control samples can be MC, or decay channels selected from data
  – Note: the cut selection must be done a priori, before looking at the data, to avoid biases!
Straight cuts or something else?
• Straight cuts may not be optimal in all cases
Neyman-Pearson lemma
• Fixing the signal efficiency (1 − α), a selection based on the likelihood ratio gives the lowest possible mis-id probability (β):
  λ(x) = L(x|H1) / L(x|H0) > kα
• If we can't use the likelihood ratio, we can choose other discriminators, or "test statistics"
• A test statistic is any function of x (like λ(x)) that allows discriminating between the two hypotheses
Likelihood ratio discriminator
• We take the ratio of the likelihoods defined under the two hypotheses:
  λ(x) = L(x|H1, θ1, …, θN) / L(x|H0, θ1, …, θN)
• The ratio may also depend on a number of unknown parameters (θ1, …, θN)
• It is the best discriminator if the multi-dimensional likelihood is perfectly known (Neyman–Pearson lemma)
• Great effort goes into getting the ratio correct
  – E.g.: Matrix Element Techniques for the top-quark mass and single-top searches at the Tevatron
Likelihood factorization
• We take the ratio of the likelihoods defined under the two hypotheses, assuming the PDF factorizes into a product of 1-D PDFs (i.e. independent components):
  λ(x1, …, xn) = ∏i fi(xi|H1) / ∏i fi(xi|H0)
• Only approximate if the PDF does not factorize exactly
  – E.g.: in the presence of correlations
• A rotation or another judicious transformation of the variables' space may be used to remove the correlations
  – Sometimes a different transformation is needed for the signal and background hypotheses
Building projective PDFs
• PDFs for the likelihood discriminator (see the sketch below)
  – If the variables are not uncorrelated, uncorrelated variables must be found first; otherwise the plain product of 1-D PDFs is suboptimal
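A minimal sketch of the last three slides, assuming two correlated Gaussian variables (an illustrative toy model): the exact 2-D likelihood ratio, and the "projective" approximation built from the product of the 1-D marginal PDFs, which is suboptimal because it ignores the correlation:

    import numpy as np
    from scipy.stats import multivariate_normal, norm

    rho = 0.8
    cov = [[1.0, rho], [rho, 1.0]]
    pdf_sig = multivariate_normal(mean=[1.0, 1.0], cov=cov)  # L(x|H1)
    pdf_bkg = multivariate_normal(mean=[0.0, 0.0], cov=cov)  # L(x|H0)

    x = np.array([0.8, 0.3])  # one event

    # Exact likelihood ratio: the Neyman-Pearson optimal test statistic
    lambda_exact = pdf_sig.pdf(x) / pdf_bkg.pdf(x)

    # Projective approximation: product of the 1-D marginal PDFs
    lambda_proj = (norm(1.0, 1.0).pdf(x[0]) * norm(1.0, 1.0).pdf(x[1])) \
                / (norm(0.0, 1.0).pdf(x[0]) * norm(0.0, 1.0).pdf(x[1]))

    print(lambda_exact, lambda_proj)  # the two test statistics differ when rho != 0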
Likelihood ratio output
• Good separation achieved in this case
• [Figure: TMVA likelihood-ratio output distributions for signal and background, with a cut L > 0.5]
Fisher discriminator
• Combines a number of variables into a single discriminator
• Equivalent to projecting the distributions along a line
• Uses the linear combination of inputs that maximizes the distance between the means of the two classes, while minimizing the variance within each class
• [Photo: Sir Ronald Aylmer Fisher (1890–1962)]
• The maximization problem can be solved with linear algebra
Rewriting Fisher discriminant
• m1, m2 are the two samples' mean vectors
• Σ1, Σ2 are the two samples' covariance matrices
• Transform with a linear vector of coefficients w:
  J(w) = (wᵀ(m1 − m2))² / (wᵀ(Σ1 + Σ2)w) = (wᵀ SB w) / (wᵀ SW w)
  – w is normal to the discriminating hyperplane
  – SB = (m1 − m2)(m1 − m2)ᵀ is the "between-classes scatter matrix"
  – SW = Σ1 + Σ2 is the "within-classes scatter matrix"
Maximizing the Fisher discriminant
• Either compute the derivatives of J(w) w.r.t. the wi and set them to zero,
• or, equivalently, solve the eigenvalue problem:
  SB w = λ SW w, whose solution is w ∝ SW⁻¹ (m1 − m2) (see the sketch below)
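The closed-form solution lends itself to a very short implementation; a minimal numpy sketch on toy Gaussian samples (all numbers illustrative):

    import numpy as np

    rng = np.random.default_rng(seed=2)
    cov = [[1.0, 0.5], [0.5, 1.0]]
    sample1 = rng.multivariate_normal([1.0, 1.0], cov, size=5000)  # "signal"
    sample2 = rng.multivariate_normal([0.0, 0.0], cov, size=5000)  # "background"

    m1, m2 = sample1.mean(axis=0), sample2.mean(axis=0)
    S_W = np.cov(sample1, rowvar=False) + np.cov(sample2, rowvar=False)  # within-classes scatter

    w = np.linalg.solve(S_W, m1 - m2)  # Fisher coefficients: w proportional to S_W^-1 (m1 - m2)

    # The Fisher discriminant is the projection F(x) = w . x
    F1, F2 = sample1 @ w, sample2 @ w
    separation = (F1.mean() - F2.mean()) / np.sqrt(0.5 * (F1.var() + F2.var()))
    print("coefficients:", w, " separation:", separation)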
Fisher in the previous example
• Not always optimal: it is a linear cut, after all!
• [Figure: Fisher discriminant decision boundary F > 0 on the previous example]
Other discriminator methods
• Artificial Neural Networks
• Boosted Decision Trees
• These topics are beyond the scope of this tutorial
  – A brief sketch is given just for completeness
• More details in the TMVA package
  – http://tmva.sourceforge.net/
Artificial Neural Networks
• Artificial simplified model of how neurons work
• [Figure: feed-forward network with an input layer (x1, …, xp plus a bias node x0), hidden layers with weights wij, and an output layer giving y; each node applies the activation function φ to the weighted sum Σ of its inputs]
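A minimal sketch of the forward pass of such a network, with one hidden layer and a sigmoid activation function (the weights here are random placeholders, not a trained network):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))  # activation function phi

    rng = np.random.default_rng(seed=3)
    W1 = rng.normal(size=(4, 3))  # input layer (bias x0 + 2 inputs) -> 4 hidden nodes
    W2 = rng.normal(size=(1, 5))  # hidden layer (bias + 4 nodes) -> 1 output node

    def network_output(x1, x2):
        x = np.array([1.0, x1, x2])           # input layer, with bias term x0 = 1
        h = sigmoid(W1 @ x)                   # hidden layer: phi(weighted sum)
        y = sigmoid(W2 @ np.append(1.0, h))   # output layer, again with a bias node
        return y[0]

    print(network_output(0.5, -1.2))  # discriminator output y in (0, 1)

Training consists of adjusting the weights wij to minimize a loss over labelled signal and background samples (back-propagation), which is what tools like TMVA do internally.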
Network vs other discriminators
• An artificial neural network with a single hidden layer can approximate any analytical function within any given accuracy, if the number of neurons is sufficiently high
• Adding more hidden layers can make the approximation more efficient
  – i.e.: a smaller total number of neurons
• Demonstration in:
  – H. N. Mhaskar, "Neural Networks for Optimal Approximation of Smooth and Analytic Functions", Neural Computation 8(1):164–177 (1996):
    "We prove that neural networks with a single hidden layer are capable of providing an optimal order of approximation for functions assumed to possess a given number of derivatives, if the activation function evaluated by each principal element satisfies certain technical conditions"
(Boosted) Decision Trees
• Select, as usual, a set of discriminating variables
• Progressively split the sample according to subsequent cuts on single discriminating variables
• Optimize the splitting cuts in order to obtain the best signal/background separation
• Repeat the splitting until the sample contains mostly signal or background, or until the statistics of the split samples is too low to continue
• Many different trees need to be combined for a robust and effective discrimination (a "forest"); see the sketch below
• [Figure: a decision tree, with branches splitting on single variables and terminating in leaves]
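A brief sketch of a boosted "forest" of shallow decision trees, using scikit-learn here as a stand-in for TMVA (an assumption; the toy data and all settings are illustrative):

    import numpy as np
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(seed=4)
    X_sig = rng.normal(loc=1.0, size=(2000, 2))  # signal events, 2 discriminating variables
    X_bkg = rng.normal(loc=0.0, size=(2000, 2))  # background events
    X = np.vstack([X_sig, X_bkg])
    y = np.array([1] * 2000 + [0] * 2000)        # labels: 1 = signal, 0 = background

    # Many shallow trees, combined by boosting into a single discriminator
    bdt = AdaBoostClassifier(DecisionTreeClassifier(max_depth=3), n_estimators=100)
    bdt.fit(X, y)

    print("signal probability for event (1.2, 0.8):",
          bdt.predict_proba([[1.2, 0.8]])[0, 1])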
A strongly non-linear case
• [Figure: signal and background populations in the (x, y) plane, separated by a strongly non-linear boundary]
Classifiers separation
• [Figure: output distributions for signal and background from four classifiers: projective likelihood ratio, Fisher, BDT, and neural network]
Cutting on classifiers output (I)
• [Figure: decision boundaries in the (x, y) plane for the cuts Fisher > 0 and L > 0.5]
Cutting on classifiers output (II)
• [Figure: decision boundaries in the (x, y) plane for the cuts NN > 0 and BDT > 0]
Jerzy Neyman’s confidence intervals
• Scan an unknown parameter θ
• Given θ, compute the interval [x1, x2] that contains x with a probability C.L. = 1 − α
  – An ordering rule is needed!
• Invert the confidence belt, and find the interval [θ1, θ2] for a given experimental outcome x
• A fraction 1 − α of the experiments will produce an x such that the corresponding interval [θ1, θ2] contains the true value of θ (coverage probability)
• Note that the random variables are [θ1, θ2], not θ
• [Figure: confidence belt in the (x, θ) plane, from the PDG statistics review]
• Tool: RooStats::NeymanConstruction
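A minimal sketch of the construction and its inversion, assuming a Gaussian measurement x ~ N(θ, 1) with central intervals as the ordering rule (RooStats::NeymanConstruction is the real tool):

    import numpy as np
    from scipy.stats import norm

    alpha = 0.10                        # C.L. = 1 - alpha = 90%
    thetas = np.linspace(-5, 10, 1501)  # scan of the unknown parameter theta

    # Belt: for each theta, the central interval [x1, x2] containing x with prob. 1 - alpha
    x1 = thetas + norm.ppf(alpha / 2.0)
    x2 = thetas + norm.ppf(1.0 - alpha / 2.0)

    # Inversion: for an observed x, keep every theta whose interval contains it
    x_obs = 1.3
    accepted = thetas[(x1 <= x_obs) & (x_obs <= x2)]
    print(f"90% C.L. interval for theta: [{accepted.min():.3f}, {accepted.max():.3f}]")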
Ordering rule
• Different choices of the interval giving the same area 1 − α are possible
• For a fixed θ = θ0 we can make different choices:
  – Upper limit choice: the whole area α is on one side of the interval
  – Central interval: an area α/2 on each side
• [Figure: f(x|θ0) with the two choices — upper-limit interval and central interval, each containing a probability 1 − α]
Feldman-Cousins ordering
• Find the contour of the likelihood ratio that contains a probability 1 − α:
  R = { x : L(x|θ) / L(x|θbest(x)) > kα }
• [Figure: f(x|θ0) and the ordering ratio f(x|θ0)/f(x|θbest(x)); x values are included in decreasing order of the ratio until the interval contains a probability 1 − α]
• Tool: RooStats::FeldmanCousins
“Flip-flopping”
• When should one quote a central value or an upper limit?
• E.g.:
  – "Quote a 90% C.L. upper limit if the measurement is below 3σ; quote a central value otherwise"
• The choice between upper limit and central interval is decided according to the observed data
• This produces incorrect coverage!
• The Feldman–Cousins interval ordering guarantees the correct coverage
“Flip-flopping” with Gaussian PDF
• Assume a Gaussian measurement with a fixed width: σ = 1
• Policy: quote the 90% C.L. central interval μ = x ± 1.64485 for x ≥ 3; quote the 90% C.L. upper limit μ < x + 1.28155 for x < 3
• [Figure: the resulting belt in the (x, μ) plane, from Feldman and Cousins' paper; the upper-limit and central-interval pieces do not match where the policy switches]
• The coverage is only 85% for low μ!
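The 85% figure can be checked directly with a toy experiment; a minimal sketch, assuming the flip-flopping policy quoted above:

    import numpy as np

    rng = np.random.default_rng(seed=5)

    def coverage(mu, n_toys=200_000):
        x = rng.normal(loc=mu, scale=1.0, size=n_toys)
        upper = x < 3.0                                   # flip-flopping decision
        ok_upper = upper & (mu < x + 1.28155)             # 90% upper limit covers mu
        ok_central = ~upper & (np.abs(mu - x) < 1.64485)  # 90% central interval covers mu
        return np.mean(ok_upper | ok_central)

    for mu in [0.5, 1.5, 2.5, 4.0]:
        print(f"mu = {mu}: coverage = {coverage(mu):.3f}")  # ~0.85 for mu around 1.5-2.5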
Feldman-Cousins approach
• Define the acceptance range such that:
  P(x|μ) / P(x|μbest(x)) > kα
• With the physical constraint μ ≥ 0 (Gaussian case): μbest = max(x, 0), i.e. μbest = x for x ≥ 0
• The solution can be found numerically
• [Figure: the resulting belt, giving the usual errors at large x, asymmetric errors at intermediate x, and upper limits at low x]
• We will see more when talking about upper limits…
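A minimal numeric sketch of this construction for the Gaussian case with μ ≥ 0, on a grid (RooStats::FeldmanCousins is the real tool; grid spacings and the observed value are illustrative):

    import numpy as np
    from scipy.stats import norm

    alpha, dx = 0.10, 0.005
    x_grid = np.arange(-5.0, 8.0, dx)     # possible outcomes x (sigma = 1)
    mu_grid = np.arange(0.0, 6.0, 0.01)   # physical region mu >= 0
    x_obs = 0.2

    accepted = []
    for mu in mu_grid:
        mu_best = np.maximum(x_grid, 0.0)                      # best fit within mu >= 0
        R = norm.pdf(x_grid, mu) / norm.pdf(x_grid, mu_best)   # F-C ordering ratio
        order = np.argsort(-R)                                 # decreasing R
        prob = np.cumsum(norm.pdf(x_grid[order], mu)) * dx     # integrated probability
        kept = x_grid[order[: np.searchsorted(prob, 1.0 - alpha) + 1]]
        if kept.min() <= x_obs <= kept.max():   # the acceptance region is an interval in x
            accepted.append(mu)

    print(f"90% C.L. F-C interval for mu: [{min(accepted):.2f}, {max(accepted):.2f}]")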
Binomial Confidence Interval
• Using a proper Neyman belt inversion, e.g. the Feldman–Cousins method, avoids odd problems, like null errors when the estimated efficiency ε = n/N is equal to 0 or 1, which occur when using the central-limit formula:
  σε = √( ε(1 − ε) / N )
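A minimal sketch of the problem, using the exact Clopper–Pearson belt inversion via the beta distribution (named here as an illustrative alternative; the slide mentions Feldman–Cousins, which is similar in spirit):

    import numpy as np
    from scipy.stats import beta

    def clopper_pearson(k, N, cl=0.6827):
        """Exact binomial interval for an efficiency, from Neyman belt inversion."""
        a = (1.0 - cl) / 2.0
        lo = 0.0 if k == 0 else beta.ppf(a, k, N - k + 1)
        hi = 1.0 if k == N else beta.ppf(1.0 - a, k + 1, N - k)
        return lo, hi

    for k, N in [(0, 10), (10, 10), (5, 10)]:
        eps = k / N
        err_clt = np.sqrt(eps * (1.0 - eps) / N)  # central-limit formula: 0 at eps = 0 or 1!
        lo, hi = clopper_pearson(k, N)
        print(f"k={k:2d} N={N}: eps = {eps:.2f} +- {err_clt:.3f} (CLT),"
              f" exact interval [{lo:.3f}, {hi:.3f}]")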
Binned fits: minimum χ²
• Bin entries ni can be approximated as Gaussian, for a sufficiently large number of entries, with r.m.s. equal to √ni (Neyman's χ²):
  χ² = Σi (ni − μi)² / ni
• The expected number of entries μi is often approximated by the value of a continuous function f at the center xi of the bin:
  μi = f(xi; θ1, …, θm)
• The denominator ni can be replaced by μi = f(xi; θ1, …, θm) (Pearson's χ²)
• Usually simpler to implement than unbinned ML fits
• An analytic solution exists for linear and other simple problems
• Unbinned ML fits become impractical for very large sample sizes
• Binned fits can give poor results for a small number of entries
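A minimal sketch of a Neyman χ² fit to a toy histogram, together with the χ² p-value discussed on the next slide (model and numbers illustrative):

    import numpy as np
    from scipy.optimize import minimize
    from scipy.stats import chi2

    rng = np.random.default_rng(seed=6)
    edges = np.linspace(0.0, 10.0, 21)
    centers = 0.5 * (edges[:-1] + edges[1:])
    n_i = rng.poisson(50.0 + 10.0 * centers)  # toy histogram; true model f(x) = 50 + 10 x

    def f(x, theta):  # fit model, evaluated at the bin centers
        return theta[0] + theta[1] * x

    def neyman_chi2(theta):
        return np.sum((n_i - f(centers, theta)) ** 2 / n_i)

    res = minimize(neyman_chi2, x0=[1.0, 1.0])
    ndof = len(n_i) - 2  # number of bins minus number of fitted parameters
    p_value = chi2.sf(res.fun, ndof)  # right-tail integral P(chi2 > X)
    print(f"fitted theta = {res.x}, chi2/ndof = {res.fun:.1f}/{ndof}, p-value = {p_value:.3f}")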
Fit quality
• The value of the maximum likelihood obtained in a fit, compared with its expected distribution, does not give any information about the goodness of the fit
• Chi-square test:
  – The χ² of a fit with a Gaussian underlying model should be distributed according to a known PDF:
    f(χ²; n) = (χ²)^(n/2 − 1) e^(−χ²/2) / (2^(n/2) Γ(n/2)), where n is the number of degrees of freedom
  – Sometimes this is not the case, e.g. if the model cannot be sufficiently approximated with a Gaussian
  – The integral of the right-most tail, P(χ² > X), is one example of a so-called "p-value"
• Beware! p-values are not the "probability of the fit hypothesis"
  – That would be a Bayesian probability, which has a different meaning and should be computed in a different way (→ next lecture)!
Binned likelihood
• Assume our sample is a binned histogram from an event-counting experiment (obeying Poisson statistics), with no need for a Gaussian approximation
• We can build a likelihood function by multiplying the Poisson probabilities of the numbers of entries {ni} in each bin, with expected numbers of entries μi(θ1, …, θk) depending on some unknown parameters:
  L = ∏i Pois(ni; μi(θ1, …, θk)) = ∏i e^(−μi) μi^(ni) / ni!
• We can minimize the following quantity:
  −2 ln L = −2 Σi ln Pois(ni; μi(θ1, …, θk))
Binned likelihood ratio
• A better alternative to the (Gaussian-inspired) Neyman's and Pearson's χ² has been proposed by Baker and Cousins, using the likelihood ratio:
  χ²λ = −2 ln ∏i [ Pois(ni; μi) / Pois(ni; ni) ] = 2 Σi [ μi(θ) − ni + ni ln(ni / μi(θ)) ]
• Same minimum as on the previous slide, since only a constant term has been added to the log-likelihood
• It also provides goodness-of-fit information, and asymptotically obeys a χ² distribution with k − n degrees of freedom, for k bins and n parameters (Wilks' theorem); see the sketch below
S. Baker and R. Cousins, Clarification of the Use of Chi-square and
Likelihood Functions in Fits to Histograms, NIM 221:437 (1984)
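A minimal sketch of a fit minimizing the Baker–Cousins χ²λ on a low-statistics toy histogram, where Neyman's χ² would perform poorly (model and numbers illustrative; the term ni ln(ni/μi) is set to 0 for empty bins):

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(seed=7)
    edges = np.linspace(0.0, 10.0, 21)
    centers = 0.5 * (edges[:-1] + edges[1:])
    n_i = rng.poisson(5.0 * np.exp(-0.25 * centers))  # low-statistics toy histogram

    def chi2_lambda(theta):
        mu = np.exp(theta[0] + theta[1] * centers)  # model: mu_i > 0 by construction
        # 2 * sum [ mu - n + n ln(n / mu) ], with the convention 0 * ln(0) = 0
        log_term = np.where(n_i > 0, n_i * np.log(np.maximum(n_i, 1) / mu), 0.0)
        return 2.0 * np.sum(mu - n_i + log_term)

    res = minimize(chi2_lambda, x0=[1.0, -0.1], method="Nelder-Mead")
    print(f"fitted parameters = {res.x}, chi2_lambda = {res.fun:.1f}")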
Combining measurements with χ²
• Two measurements of the same quantity with different uncorrelated (Gaussian) errors: m1 ± σ1 and m2 ± σ2
• Build the χ²:
  χ² = (m − m1)²/σ1² + (m − m2)²/σ2²
• Minimize the χ²: dχ²/dm = 0
• Estimate m as the weighted average:
  m̂ = (m1/σ1² + m2/σ2²) / (1/σ1² + 1/σ2²)
• Error estimate:
  1/σ̂² = 1/σ1² + 1/σ2²
Covariance and cov. matrix
• Definitions:
  – Covariance: cov(x, y) = ⟨xy⟩ − ⟨x⟩⟨y⟩
  – Correlation: ρxy = cov(x, y) / (σx σy)
• Correlated n-dimensional Gaussian:
  f(x) = (2π)^(−n/2) |C|^(−1/2) exp( −½ (x − μ)ᵀ C⁻¹ (x − μ) )
• where Cij = cov(xi, xj) is the covariance matrix
Two-dimensional Gaussian
• Product of two independent Gaussians with different σ:
  f(x, y) = (1 / 2πσxσy) exp( −x²/2σx² − y²/2σy² )
• Apply a rotation by an angle φ in the (x, y) plane:
  x′ = x cos φ − y sin φ, y′ = x sin φ + y cos φ
Two-dimensional Gaussian (cont.)
• The rotation preserves the metric: x′² + y′² = x² + y²
• Covariance in the rotated coordinates:
  cov(x′, y′) = (σx² − σy²) sin φ cos φ
Two-dimensional Gaussian (cont.)
• A pictorial view of an iso-probability contour
• [Figure: iso-probability contour (an ellipse) in the (x, y) plane, with principal axes rotated by an angle φ and widths σx′, σy′]
1D projections
• The PDF projections onto the axes are 1D Gaussians
• The areas of the 1σ and 2σ contours differ in 1D and 2D!

  nσ       P1D     P2D
  1        0.6827  0.3934
  2        0.9545  0.8647
  3        0.9973  0.9889
  1.515    0.8702  0.6827
  2.486    0.9871  0.9545
  3.439    0.9994  0.9973

• [Figure: 1σ and 2σ ellipses in the (x, y) plane with their 1D projections]
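The numbers in the table follow from the χ² distribution with 1 and 2 degrees of freedom; a short check:

    from scipy.stats import chi2

    for n_sigma in [1.0, 2.0, 3.0, 1.515, 2.486, 3.439]:
        p1d = chi2.cdf(n_sigma**2, df=1)  # probability inside +- n sigma in 1D
        p2d = chi2.cdf(n_sigma**2, df=2)  # probability inside the n-sigma ellipse in 2D
        print(f"{n_sigma:5.3f} sigma: P_1D = {p1d:.4f}, P_2D = {p2d:.4f}")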
Generalization of χ² to n dimensions
• If we have n measurements (m1, …, mn) with an n×n covariance matrix Cij, the chi-squared can be generalized as follows:
  χ² = Σij (mi − μi) (C⁻¹)ij (mj − μj) = (m − μ)ᵀ C⁻¹ (m − μ)
• More details in the PDG statistics review
Combining correlated measurements
• Correlation coefficient ρ ≠ 0
• Build the χ² including the correlation terms:
  χ² = (m − m1, m − m2) C⁻¹ (m − m1, m − m2)ᵀ, with C = ( σ1²  ρσ1σ2 ; ρσ1σ2  σ2² )
• Minimization gives:
  m̂ = [ m1 (σ2² − ρσ1σ2) + m2 (σ1² − ρσ1σ2) ] / (σ1² − 2ρσ1σ2 + σ2²)
  σ̂² = σ1² σ2² (1 − ρ²) / (σ1² − 2ρσ1σ2 + σ2²)
Correlated errors
• The "common error" σC is defined as the fully correlated part of the two uncertainties:
  cov(m1, m2) = σC²
  (H. Greenlee, Combining CDF and D0 Physics Results, Fermilab Workshop on Confidence Limits, March 28, 2000)
• Using error propagation, this also implies that ρσ1σ2 = σC²
• The previous formulas now become:
  m̂ = [ m1 (σ2² − σC²) + m2 (σ1² − σC²) ] / (σ1² + σ2² − 2σC²)
  σ̂² = (σ1² σ2² − σC⁴) / (σ1² + σ2² − 2σC²)
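A minimal sketch of the correlated combination in matrix form, which reproduces the formulas above for any ρ (the numbers are illustrative):

    import numpy as np

    m = np.array([10.0, 12.0])     # the two measurements
    s1, s2, rho = 1.0, 2.0, 0.5    # errors and correlation coefficient
    C = np.array([[s1**2,      rho*s1*s2],
                  [rho*s1*s2,  s2**2    ]])  # covariance matrix

    Cinv = np.linalg.inv(C)
    ones = np.ones(2)
    w = Cinv @ ones / (ones @ Cinv @ ones)  # least-squares weights (they sum to 1)
    m_hat = w @ m
    sigma_hat = np.sqrt(1.0 / (ones @ Cinv @ ones))
    print(f"combined: {m_hat:.3f} +- {sigma_hat:.3f}, weights = {w}")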
Toy Monte Carlo
• Generate a large number of pseudo-experiments according to the fit model, with fixed parameters θ
• Fit all the toy samples as if they were the real data samples
• Study the distributions of the fitted quantities
• Parameter pulls: p = (θest − θtrue) / σest (see the sketch below)
  – Verify the absence of bias: ⟨p⟩ = 0
  – Verify the correct error estimate: σ(p) = 1
• The statistical uncertainty will depend on the number of Toy Monte Carlo experiments
• The distribution of the maximum likelihood (or −2 ln L) gives no information about the quality of the fit
• Goodness of fit for ML in more than one dimension is still an open and debated issue
• The likelihood ratio w.r.t. a null hypothesis is often preferred
  – Asymptotically distributed as a chi-square
  – Determine the C.L. of the fit to real data as the fraction of toy cases with a worse value of the maximum log-likelihood ratio
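A minimal sketch of a pull study, assuming the simplest possible fit, the ML estimate of a Gaussian mean, where the estimator and its error are known in closed form (θ̂ = sample mean, σest = σ/√n):

    import numpy as np

    rng = np.random.default_rng(seed=8)
    theta_true, sigma, n_events, n_toys = 5.0, 2.0, 100, 10_000

    pulls = []
    for _ in range(n_toys):
        x = rng.normal(theta_true, sigma, size=n_events)  # one toy sample
        theta_est = x.mean()                              # ML estimate of the mean
        sigma_est = sigma / np.sqrt(n_events)             # its estimated error
        pulls.append((theta_est - theta_true) / sigma_est)

    pulls = np.asarray(pulls)
    print(f"pull mean = {pulls.mean():.3f} (expect 0), pull std = {pulls.std():.3f} (expect 1)")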
Kolmogorov–Smirnov test
• Assume you have a sample {x1, …, xn}; you want to test whether the set is compatible with being produced by random variables obeying a PDF f(x)
• The test consists of building the cumulative distributions of the sample and of the PDF:
  Fn(x) = (number of xi ≤ x) / n, F(x) = ∫−∞..x f(t) dt
• The distance between the two cumulative distributions is evaluated as:
  Dn = supx |Fn(x) − F(x)|
Kolmogorov-Smirnov test in a picture
• [Figure: the empirical cumulative distribution Fn(x), a staircase from 0 to 1 stepping up at x1, x2, …, xn, compared with the smooth F(x); Dn is the maximum vertical distance between the two]
Kolmogorov distribution
• For large n:
  – Dn converges to zero (small Dn = good agreement)
  – K = √n Dn has a distribution that is independent of f(x), known as the Kolmogorov distribution (related to Brownian motion)
• The Kolmogorov distribution is:
  P(K ≤ z) = 1 − 2 Σk=1..∞ (−1)^(k−1) e^(−2k²z²)
• Caveats with the KS test:
  – Very common in HEP, but not always appropriate
  – If the shape or the parameters of the PDF f(x) are determined from the sample itself (i.e. with a fit), the distribution of √n Dn may deviate from the Kolmogorov distribution
  – A toy Monte Carlo method can be used in those cases to evaluate the distribution of √n Dn
Two-sample KS test
• We can test whether two samples {x1, …, xn} and {y1, …, ym} follow the same distribution using the distance:
  Dn,m = supx |F1,n(x) − F2,m(x)|
• The variable that asymptotically follows the Kolmogorov distribution is, in this case:
  K = √( nm / (n + m) ) Dn,m
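Both versions of the test are available in scipy; a minimal sketch on toy data (remember the caveat above: the reference PDF must not be fitted on the same sample):

    import numpy as np
    from scipy.stats import kstest, ks_2samp, norm

    rng = np.random.default_rng(seed=9)
    x = rng.normal(0.0, 1.0, size=500)
    y = rng.normal(0.2, 1.0, size=800)

    # One-sample test: is x compatible with a standard normal PDF?
    print(kstest(x, norm.cdf))  # returns the distance D_n and the p-value

    # Two-sample test: do x and y follow the same distribution?
    print(ks_2samp(x, y))       # returns D_{n,m} and the p-value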
A concrete χ² example: Electro-Weak precision tests
• SM inputs from LEP (Aleph, Delphi, L3, Opal), SLC (SLD), Tevatron (CDF, D0)
Higgs mass prediction
• Global χ² analysis, using ZFitter for the detailed SM calculations
• Correlation terms are not negligible, even across experiments (LEP energy, …)
• Higgs mass predicted from its indirect effect on the radiative corrections
References
• G. J. Feldman and R. D. Cousins, "Unified approach to the classical statistical analysis of small signals", Phys. Rev. D 57, 3873–3889 (1998)
• J. Friedman, T. Hastie and R. Tibshirani, "The Elements of Statistical Learning", Springer Series in Statistics, 2001
• A. Webb, "Statistical Pattern Recognition", 2nd edition, J. Wiley & Sons Ltd, 2002
• L. I. Kuncheva, "Combining Pattern Classifiers", J. Wiley & Sons, 2004
• Artificial neural networks:
  – B. Cheng and D. M. Titterington, "Neural Networks: A Review from a Statistical Perspective", Statist. Sci. 9(1), 2–30 (1994)
  – R. P. W. Duin, "Learned from Neural Networks", http://ict.ewi.tudelft.nl/~duin/papers/asci_00_NNReview.pdf
• Boosted decision trees:
  – R. E. Schapire, "The boosting approach to machine learning: an overview", MSRI Workshop on Nonlinear Estimation and Classification, 2002
  – Y. Freund and R. E. Schapire, "A short introduction to boosting", J. Jpn. Soc. Artif. Intell. 14(5) (1999) 771
  – B. P. Roe et al., "Boosted decision trees as an alternative to artificial neural networks for particle identification", Nucl. Instrum. Meth. A543 (2005) 577–584, http://arxiv.org/abs/physics/0408124
  – E. Bauer and R. Kohavi, "An empirical comparison of voting classification algorithms", Machine Learning 36 (1999)