Data Analysis
Wouter Verkerke
(NIKHEF)
Wouter Verkerke, UCSB
HEP and data analysis
— General introduction
Particle physics
Studies nature at distance scales < 10^-15 m
atom
nucleus
Looking at the smallest constituents of matter → building a
consistent theory that describes matter and the elementary forces
Gravity and Electromagnetism
"same forces … new models"
Theory of Relativity: Newton → Einstein
Quantum Mechanics: Maxwell → Bohr
High Energy Physics
• Working model: ‘the Standard Model’ (a Quantum Field
Theory)
– Describes constituents of matter, 3 out of 4 fundamental forces
The standard model has many open issues
Link with Astrophysics
• Temperature fluctuations in the Cosmic Microwave Background
• Rotation curves
• Gravitational lensing
→ What is dark matter?
Particle Physics today – Large Machines
Detail of Large Hadron Collider
And large experiments underground
One of the 4 LHC experiments – ATLAS
View of ATLAS during construction
Collecting data at the LHC
Data reduction, processing and storage are big issues
Analyzing the data – The goal
What we see in the detector
Fundamental physics picture
(proton-proton collision)
Extremely difficult, and not possible on an event-by-event basis anyway, due to QM
Analyzing the data – in practice
[Workflow diagram: the LHC → ATLAS detector → event reconstruction → data analysis, mirrored by physics simulation → detector simulation]
‘Easy stuff’
‘Difficult stuff’
Quantify what we observe and what we expect to see
• Methods and details are important – for certain physics
we only expect a handful of events after years of data
taking
Tools for data analysis in HEP
• Nearly all HEP data analysis happens in a single platform
– ROOT (1995-now)
– And before that PAW (1985-1995)
• Large project with many developers,
contributors, workshops
Choice of working environment R vs. ROOT
• ROOT has become the de facto HEP standard analysis environment
– Available and actively used for analyses in running experiments
(Tevatron, B factories etc.)
– ROOT is integrated in LHC experimental software releases
– The data format of the LHC experiments is (indirectly) based on ROOT → several
experiments have developed or are working on summary data formats directly usable in
ROOT
– Ability to handle very large amounts of data
• ROOT brings together a lot of the ingredients needed for
(statistical) data analysis
– C++ command line, publication-quality graphics
– Many standard mathematics and physics classes: vectors, matrices, Lorentz
vectors, physics constants…
• The line between 'ROOT' and 'external' software is not very sharp
– A lot of software developed elsewhere is distributed with ROOT (TMVA, RooFit)
– Or a thin interface layer is provided to work with an external library (GSL,
FFTW)
– Still not quite as nice & automated as R's package concept
(Statistical) software repositories
• ROOT functions as moderated repository for statistical &
data analysis tools
– Examples TMVA, RooFit
• Several HEP repository initiatives, some contain statistical
software
– PhyStat.org (StatPatternRecognition, TMVA, LepStats4LHC)
– HepForge (mostly physics MC generators),
– FreeHep
• An excellent summary of non-HEP statistical repositories is on
Jim Linnemann's statistical resources web page
– From PhyStat 2005
– http://www.pa.msu.edu/people/linnemann/stat_resources.html
Wouter Verkerke, NIKHEF
Roadmap for this course
• In this course I aim to follow the 'HEP workflow' to organize the
discussion of statistics issues
– My experience is that this is most intuitive for HEP PhD students
• HEP Introduction & Basics
– A short overview of HEP
– Distributions, the Central Limit Theorem
• Event classification
– Hypothesis testing
– Machine learning
• Parameter estimation
– Estimators: Maximum Likelihood and Chi-squared
– Mathematical tools for model building
– Practical issues arising with minimization
• Confidence intervals, limits, significance
– Hypothesis testing (again), Bayes' Theorem
– Frequentist statistics, likelihood-based intervals
– Likelihood principle and conditioning
• Discovery, exclusion and nuisance parameters
– Formulating discovery and exclusion
– Systematic uncertainties as nuisance parameters
– Treatment of nuisance parameters in statistical inference
Basic Statistics
— Mean, Variance, Standard Deviation
— Gaussian Standard Deviation
— Covariance, correlations
— Basic distributions – Binomial, Poisson, Gaussian
— Central Limit Theorem
— Error propagation
Describing your data – the Average
• Given a set of unbinned data (measurements)
{ x1, x2, …, xN}
then the mean value of x is

  \bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i

• For binned data

  \bar{x} = \frac{1}{N} \sum_i n_i x_i

– where n_i is the bin count and x_i is the bin center
– The unbinned average is more accurate, since binning rounds each measurement to its bin center
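A minimal sketch of the two averages above, using plain Python (the data values and the unit bin width are illustrative assumptions, not from the lecture):

```python
import statistics

# Unbinned vs binned mean: binning rounds each measurement to its bin
# center, so the binned mean is only an approximation of the true mean.
data = [1.2, 1.7, 2.3, 2.9, 3.4, 3.8, 4.1]

# Unbinned mean: (1/N) * sum(x_i)
unbinned_mean = statistics.mean(data)

# Bin the data into unit-width bins with centers 1.5, 2.5, 3.5, 4.5
bin_width = 1.0
counts = {}
for x in data:
    center = (int(x / bin_width) + 0.5) * bin_width
    counts[center] = counts.get(center, 0) + 1

# Binned mean: (1/N) * sum(n_i * x_i) over bin centers x_i
n_total = sum(counts.values())
binned_mean = sum(n * c for c, n in counts.items()) / n_total

print(unbinned_mean, binned_mean)  # close, but not identical
```

The two results differ only by the rounding to bin centers, which shrinks as the bins get narrower.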
Describing your data – Spread
• Variance V(x) of x expresses how much x is liable to
vary from its mean value x
  V(x) = \frac{1}{N} \sum_i (x_i - \bar{x})^2
       = \frac{1}{N} \sum_i (x_i^2 - 2 x_i \bar{x} + \bar{x}^2)
       = \frac{1}{N} \sum_i x_i^2 - 2\bar{x} \cdot \frac{1}{N} \sum_i x_i + \bar{x}^2 \cdot \frac{1}{N} \sum_i 1
       = \overline{x^2} - 2\bar{x}^2 + \bar{x}^2
       = \overline{x^2} - \bar{x}^2

• Standard deviation:  \sigma = \sqrt{V(x)} = \sqrt{\overline{x^2} - \bar{x}^2}
Different definitions of the Standard Deviation
  \sigma_{data} = \sqrt{\frac{1}{N} \sum_i (x_i - \bar{x})^2}   is the S.D. of the data sample

• Presumably our data was drawn from a parent distribution which has mean m and S.D. \sigma_{parent}

  [Illustration: the data sample (mean \bar{x}, S.D. \sigma_{data}) vs. the parent distribution from which it was drawn (mean m, S.D. \sigma_{parent})]

  – \bar{x}: mean of our sample;  m: mean of our parent distribution
  – \sigma_{data}: S.D. of our sample;  \sigma_{parent}: S.D. of our parent distribution

• Beware notational confusion!
Different definitions of the Standard Deviation
• Which definition of \sigma you use, data or parent, is a matter of
preference, but be clear which one you mean!
• In addition, you can get an unbiased estimate of \sigma_{parent} from a
given data sample using

  \hat{\sigma}_{parent} = \sqrt{\frac{1}{N-1} \sum_i (x_i - \bar{x})^2} = \sqrt{\frac{N}{N-1}} \; \sigma_{data}

  where  \sigma_{data} = \sqrt{\frac{1}{N} \sum_i (x_i - \bar{x})^2}
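The two conventions above can be checked in a few lines; as it happens, the Python standard library implements both (the sample values are an illustrative assumption):

```python
import math
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(data)
mean = sum(data) / n

# sigma_data: S.D. of the data sample itself (divide by N)
sigma_data = math.sqrt(sum((x - mean) ** 2 for x in data) / n)

# sigma_parent estimate: unbiased estimate of the parent S.D. (divide by N-1)
sigma_parent = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))

# The two are related by sqrt(N/(N-1))
assert abs(sigma_parent - math.sqrt(n / (n - 1)) * sigma_data) < 1e-12

# The stdlib implements both conventions:
# pstdev = "population" (divide by N), stdev = unbiased estimate (divide by N-1)
assert abs(sigma_data - statistics.pstdev(data)) < 1e-12
assert abs(sigma_parent - statistics.stdev(data)) < 1e-12
```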
More than one variable
• Given 2 variables x,y and a dataset consisting of pairs
of numbers
{ (x1,y1), (x2,y2), … (xN,yN) }
• Definitions of \bar{x}, \bar{y}, \sigma_x, \sigma_y as usual
• In addition, any dependence between x and y is described by
the covariance

  cov(x, y) = \frac{1}{N} \sum_i (x_i - \bar{x})(y_i - \bar{y})
            = \overline{(x - \bar{x})(y - \bar{y})}
            = \overline{xy} - \bar{x}\,\bar{y}

  (has dimension D(x)D(y))

• The dimensionless correlation coefficient is defined as

  \rho = \frac{cov(x, y)}{\sigma_x \sigma_y} \in [-1, 1]
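A short sketch of the covariance and correlation formulas, using the population (divide-by-N) convention from the slide (the x, y sample is an illustrative assumption, chosen to be nearly linear):

```python
import math

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.1, 5.9, 8.2, 10.0]
n = len(xs)

mx = sum(xs) / n
my = sum(ys) / n

# cov(x,y) = <xy> - <x><y>
cov = sum(x * y for x, y in zip(xs, ys)) / n - mx * my

sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)

# Dimensionless correlation coefficient, always in [-1, 1]
rho = cov / (sx * sy)
print(rho)  # close to +1 for this nearly linear sample
```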
Visualization of correlation
= 0
 = 0.1
 = 0.5
 = -0.7
 = -0.9
 = 0.99
Correlation & covariance in >2 variables
• Concept of covariance, correlation is easily extended to
arbitrary number of variables
  cov(x^{(i)}, x^{(j)}) = \overline{x^{(i)} x^{(j)}} - \overline{x^{(i)}}\,\overline{x^{(j)}}

• so that V_{ij} = cov(x^{(i)}, x^{(j)}) takes the form of
an n × n symmetric matrix
• This is called the covariance matrix, or error matrix
• Similarly the correlation matrix becomes

  \rho_{ij} = \frac{cov(x^{(i)}, x^{(j)})}{\sigma^{(i)} \sigma^{(j)}},   i.e.   V_{ij} = \rho_{ij} \sigma_i \sigma_j
Linear vs non-linear correlations
• The correlation coefficients used here are (linear)
Pearson product-moment correlation coefficients
– Data can have more subtle (non-linear) correlations that
are not expressed in these coefficients
Basic Distributions – The binomial distribution
• Simple experiment – Drawing marbles from a bowl
– Bowl with marbles, fraction p are black, others are white
– Draw N marbles from bowl, put marble back after each drawing
– Distribution of R black marbles in drawn sample:

  P(R; p, N) = \frac{N!}{R!(N-R)!} \, p^R (1-p)^{N-R}

  – p^R (1-p)^{N-R}: probability of a specific outcome, e.g. 'BBBWBWW'
  – N!/(R!(N-R)!): number of equivalent permutations for that outcome

  [Plot: P(R; 0.5, 4) vs R, i.e. p=0.5, N=4]

This is the binomial distribution.
Properties of the binomial distribution
• Mean:      \langle r \rangle = Np
• Variance:  V(r) = Np(1-p),   so   \sigma = \sqrt{Np(1-p)}

[Plots of the binomial distribution for p = 0.1, 0.5, 0.9 with N = 4 and N = 1000]
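The binomial formula and its moments can be verified directly (a minimal sketch for the p=0.5, N=4 case shown on the slide):

```python
import math

def binomial_pmf(r, p, n):
    """P(R; p, N) = C(N,R) * p^R * (1-p)^(N-R)"""
    return math.comb(n, r) * p**r * (1 - p) ** (n - r)

p, n = 0.5, 4
pmf = [binomial_pmf(r, p, n) for r in range(n + 1)]

# Probabilities sum to 1
assert abs(sum(pmf) - 1.0) < 1e-12

# Mean <r> = Np and variance V(r) = Np(1-p)
mean = sum(r * pmf[r] for r in range(n + 1))
var = sum((r - mean) ** 2 * pmf[r] for r in range(n + 1))
assert abs(mean - n * p) < 1e-12
assert abs(var - n * p * (1 - p)) < 1e-12
```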
HEP example – Efficiency measurement
• Example: trigger efficiency
– Usually done on simulated data, so that untriggered events are also available
Basic Distributions – the Poisson distribution
• Sometimes we don’t know the equivalent of the number
of drawings
– Example: Geiger counter
– Sharp events occurring in a (time) continuum
• What distribution do we expect in a measurement over a
fixed amount of time?
– Divide the time interval into n finite chunks
– Take the binomial formula with p = \lambda/n and let n \to \infty

  P(r; \lambda/n, n) = \frac{n!}{r!(n-r)!} \left(\frac{\lambda}{n}\right)^r \left(1 - \frac{\lambda}{n}\right)^{n-r}

  Using   \lim_{n \to \infty} \frac{n!}{r!(n-r)!} = \frac{n^r}{r!}   and   \lim_{n \to \infty} \left(1 - \frac{\lambda}{n}\right)^{n-r} = e^{-\lambda} :

  P(r; \lambda) = \frac{e^{-\lambda} \lambda^r}{r!}

This is the Poisson distribution.
Properties of the Poisson distribution
[Plots of the Poisson distribution for \lambda = 0.1, 0.5, 1, 2, 5, 10, 20, 50, 200]
More properties of the Poisson distribution
  P(r; \lambda) = \frac{e^{-\lambda} \lambda^r}{r!}

• Mean, variance:   \langle r \rangle = \lambda,   V(r) = \lambda,   \sigma = \sqrt{\lambda}

• The convolution of 2 Poisson distributions is also a Poisson
distribution, with \lambda_{AB} = \lambda_A + \lambda_B:

  P(r) = \sum_{r_A=0}^{r} P(r_A; \lambda_A) \, P(r - r_A; \lambda_B)

       = e^{-(\lambda_A + \lambda_B)} \sum_{r_A=0}^{r} \frac{\lambda_A^{r_A} \lambda_B^{r - r_A}}{r_A!\,(r - r_A)!}

       = e^{-(\lambda_A + \lambda_B)} \frac{(\lambda_A + \lambda_B)^r}{r!} \sum_{r_A=0}^{r} \frac{r!}{r_A!\,(r - r_A)!} \left(\frac{\lambda_A}{\lambda_A + \lambda_B}\right)^{r_A} \left(\frac{\lambda_B}{\lambda_A + \lambda_B}\right)^{r - r_A}

       = \frac{e^{-(\lambda_A + \lambda_B)} (\lambda_A + \lambda_B)^r}{r!}

  (the last sum is a binomial expansion of 1)
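The convolution result above is easy to check numerically; this sketch sums the convolution explicitly and compares it to a single Poisson with \lambda_A + \lambda_B (the \lambda values are illustrative assumptions):

```python
import math

def poisson_pmf(r, lam):
    """P(r; lambda) = exp(-lambda) * lambda^r / r!"""
    return math.exp(-lam) * lam**r / math.factorial(r)

lam_a, lam_b = 1.5, 2.5

for r in range(10):
    # Convolution: P(r) = sum over r_A of P(r_A; lam_a) * P(r - r_A; lam_b)
    conv = sum(poisson_pmf(ra, lam_a) * poisson_pmf(r - ra, lam_b)
               for ra in range(r + 1))
    # ... equals a single Poisson with lambda = lam_a + lam_b
    assert abs(conv - poisson_pmf(r, lam_a + lam_b)) < 1e-12
```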
HEP example – counting experiment
• Any distribution plotted on data (particularly in the case of low statistics)
Basic Distributions – The Gaussian distribution
• Look at the Poisson distribution in the limit of large N

  P(r; \lambda) = e^{-\lambda} \frac{\lambda^r}{r!}

  Take the log, substitute r = \lambda + x,
  and use   \ln(r!) \approx r \ln r - r + \ln\sqrt{2\pi r} :

  \ln P(r; \lambda) = -\lambda + r \ln\lambda - (r \ln r - r) - \ln\sqrt{2\pi r}

                    = -\lambda + r \ln\lambda - r \ln\left(\lambda \left(1 + \frac{x}{\lambda}\right)\right) + (\lambda + x) - \ln\sqrt{2\pi\lambda}

  Using   \ln(1 + z) \approx z - z^2/2 :

                    \approx x - (\lambda + x)\left(\frac{x}{\lambda} - \frac{x^2}{2\lambda^2}\right) - \ln\sqrt{2\pi\lambda}

                    \approx -\frac{x^2}{2\lambda} - \ln\sqrt{2\pi\lambda}

  Taking the exponent:

  P(x) = \frac{e^{-x^2/2\lambda}}{\sqrt{2\pi\lambda}}

  The familiar Gaussian distribution (approximation reasonable for N > 10)

  [Plots: Poisson distributions for \lambda = 1, 10, 200, approaching a Gaussian shape]
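The quality of this Gaussian approximation can be checked numerically; a sketch for \lambda = 200 (the largest case plotted), using `lgamma` to evaluate the Poisson pmf safely at large r:

```python
import math

def poisson_pmf(r, lam):
    # exp(-lam + r*ln(lam) - ln(r!)), via lgamma to avoid overflow at large r
    return math.exp(-lam + r * math.log(lam) - math.lgamma(r + 1))

lam = 200
for x in (-20, -10, 0, 10, 20):
    r = lam + x
    # Gaussian approximation derived above: exp(-x^2/2*lam) / sqrt(2*pi*lam)
    gauss = math.exp(-x**2 / (2 * lam)) / math.sqrt(2 * math.pi * lam)
    # Agreement at the few-percent level near the peak for lambda = 200
    assert abs(poisson_pmf(r, lam) - gauss) / gauss < 0.05
```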
Properties of the Gaussian distribution
  P(x; m, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-(x - m)^2 / 2\sigma^2}

• Mean and Variance:

  \langle x \rangle = \int_{-\infty}^{\infty} x \, P(x; m, \sigma) \, dx = m

  V(x) = \int_{-\infty}^{\infty} (x - m)^2 \, P(x; m, \sigma) \, dx = \sigma^2

• Integrals of the Gaussian:

  68.27% within ±1\sigma        90%   ↔ ±1.645\sigma
  95.45% within ±2\sigma        95%   ↔ ±1.96\sigma
  99.73% within ±3\sigma        99%   ↔ ±2.58\sigma
                                99.9% ↔ ±3.29\sigma
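The coverage table above follows from the Gaussian integral, which for symmetric intervals reduces to the error function; a minimal sketch:

```python
import math

def gauss_coverage(n_sigma):
    """Probability for a Gaussian variable to fall within +/- n_sigma
    of the mean: integral of P(x; m, sigma) = erf(n_sigma / sqrt(2))."""
    return math.erf(n_sigma / math.sqrt(2))

# The standard coverage numbers
assert abs(gauss_coverage(1) - 0.6827) < 1e-4
assert abs(gauss_coverage(2) - 0.9545) < 1e-4
assert abs(gauss_coverage(3) - 0.9973) < 1e-4

# And the inverse direction: the familiar 90% / 95% quantiles
assert abs(gauss_coverage(1.645) - 0.90) < 1e-3
assert abs(gauss_coverage(1.96) - 0.95) < 1e-3
```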
HEP example – high statistics counting expt
• High statistics distributions from data
Errors
• Doing an experiment → making measurements
• Measurements are not perfect → the imperfection is quantified in a
resolution or error
• Common language to quote errors:
– Gaussian standard deviation = sqrt(V(x))
– 68% probability that the true value is within the quoted error
[NB: the 68% interpretation relies strictly on a Gaussian sampling distribution,
which is not always the case; more on this later]
• Errors are usually Gaussian if they quantify a result that
is based on many independent measurements
The Gaussian as ‘Normal distribution’
• Why are errors usually Gaussian?
• The Central Limit Theorem says:
– If you take the sum X of N independent measurements x_i,
each taken from a distribution with mean m_i and variance V_i = \sigma_i^2,
then the distribution for X

  (a) has expectation value   \langle X \rangle = \sum_i m_i
  (b) has variance            V(X) = \sum_i V_i = \sum_i \sigma_i^2
  (c) becomes Gaussian as N \to \infty

– Small print: the tails converge very slowly in the CLT; be careful in assuming
a Gaussian shape beyond 2\sigma
Demonstration of Central Limit Theorem
• N=1: 5000 numbers taken at random from a uniform distribution between [0,1]
– Mean = 1/2, Variance = 1/12
• N=2: 5000 numbers, each the sum of 2 random numbers, i.e. X = x1 + x2
– Triangular shape
• N=3: Same for 3 numbers, X = x1 + x2 + x3
• N=12: Same for 12 numbers; the overlaid curve is the exact Gaussian distribution
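The demonstration above can be reproduced in a few lines; this sketch draws the same 5000 sums of uniform numbers and checks the CLT predictions for mean and variance (the seed is an arbitrary assumption for reproducibility):

```python
import random
import statistics

random.seed(42)

def sample_sums(n_terms, n_samples=5000):
    """n_samples sums X = x1 + ... + x_n of uniform[0,1] random numbers."""
    return [sum(random.random() for _ in range(n_terms))
            for _ in range(n_samples)]

for n in (1, 2, 3, 12):
    xs = sample_sums(n)
    # CLT: <X> = n * 1/2 and V(X) = n * 1/12 for sums of n uniforms
    assert abs(statistics.mean(xs) - n / 2) < 0.05 * max(1, n)
    assert abs(statistics.variance(xs) - n / 12) < 0.1 * max(1, n)
```

Plotting histograms of `xs` for n = 1, 2, 3, 12 reproduces the flat, triangular, and increasingly Gaussian shapes described above.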
Central Limit Theorem – repeated measurements
• Common case 1: Repeated identical measurements,
i.e. m_i = m, \sigma_i = \sigma for all i

  C.L.T.:   \langle X \rangle = \sum_i m_i = Nm   →   \bar{x} = \frac{X}{N} \to m

  V(\bar{x}) = \frac{1}{N^2} \sum_i V_i(X) = \frac{N\sigma^2}{N^2} = \frac{\sigma^2}{N}

  \sigma(\bar{x}) = \frac{\sigma}{\sqrt{N}}

→ The famous sqrt(N) law
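A quick Monte Carlo sketch of the sqrt(N) law: the spread of the mean of N identical Gaussian measurements shrinks like \sigma/\sqrt{N} (the \sigma, N, and seed values are illustrative assumptions):

```python
import random
import statistics

random.seed(1)

# sigma(xbar) = sigma / sqrt(N)
sigma, N = 2.0, 100

# 2000 experiments, each averaging N measurements of resolution sigma
means = [statistics.mean(random.gauss(0.0, sigma) for _ in range(N))
         for _ in range(2000)]

observed = statistics.pstdev(means)
expected = sigma / N ** 0.5   # = 0.2
assert abs(observed - expected) < 0.02
```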
Central Limit Theorem – repeated measurements
• Common case 2: Repeated measurements with
identical means but different errors
(i.e. weighted measurements, m_i = m)

  Weighted average:

  \bar{x} = \frac{\sum_i x_i / \sigma_i^2}{\sum_i 1 / \sigma_i^2}

  V(\bar{x}) = \frac{1}{\sum_i 1/\sigma_i^2}   →   \sigma(\bar{x}) = \frac{1}{\sqrt{\sum_i 1/\sigma_i^2}}

  The 'sum-of-weights' formula for the
  error on weighted measurements
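A minimal sketch of the weighted average and its sum-of-weights error (the three measurements and their errors are illustrative assumptions):

```python
import math

# Three measurements of the same quantity with different errors
values = [10.2, 9.8, 10.5]
errors = [0.5, 0.2, 1.0]

# Weights are inverse variances: w_i = 1 / sigma_i^2
weights = [1.0 / s**2 for s in errors]

# Weighted average: sum(x_i / sigma_i^2) / sum(1 / sigma_i^2)
xbar = sum(w * x for w, x in zip(weights, values)) / sum(weights)

# Sum-of-weights error: sigma(xbar) = 1 / sqrt(sum(1 / sigma_i^2))
err = 1.0 / math.sqrt(sum(weights))

# The most precise measurement dominates, and the combined
# error is smaller than any individual error
assert abs(xbar - 9.8) < abs(xbar - 10.5)
assert err < min(errors)
print(xbar, err)
```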
Error propagation – one variable
• Suppose we have   f(x) = ax + b
• How do you calculate V(f) from V(x)?

  V(f) = \langle f^2 \rangle - \langle f \rangle^2
       = \langle (ax + b)^2 \rangle - \langle ax + b \rangle^2
       = a^2 \langle x^2 \rangle + 2ab \langle x \rangle + b^2 - a^2 \langle x \rangle^2 - 2ab \langle x \rangle - b^2
       = a^2 \left( \langle x^2 \rangle - \langle x \rangle^2 \right)
       = a^2 V(x)

  → i.e.  \sigma_f = |a| \sigma_x

• More generally:

  V(f) = \left(\frac{df}{dx}\right)^2 V(x)   ;   \sigma_f = \left|\frac{df}{dx}\right| \sigma_x

– But only valid if the linear approximation is good in the range
of the error
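A Monte Carlo sketch of the linear case: propagating x through f(x) = ax + b scales the spread by |a| (the values of a, b, \sigma_x, and the seed are illustrative assumptions):

```python
import random
import statistics

random.seed(7)

# f(x) = a*x + b  =>  sigma_f = |a| * sigma_x
a, b = -3.0, 5.0
sigma_x = 0.4

# Sample x with known spread, transform, and measure the spread of f
xs = [random.gauss(2.0, sigma_x) for _ in range(20000)]
fs = [a * x + b for x in xs]

observed = statistics.pstdev(fs)
expected = abs(a) * sigma_x   # = 1.2
assert abs(observed - expected) < 0.05
```

For a non-linear f, this Monte Carlo approach keeps working even where the derivative formula (a linear approximation) breaks down.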
Error Propagation – Summing 2 variables
• Consider   f = ax + by + c

  V(f) = a^2 \left( \langle x^2 \rangle - \langle x \rangle^2 \right) + b^2 \left( \langle y^2 \rangle - \langle y \rangle^2 \right) + 2ab \left( \langle xy \rangle - \langle x \rangle \langle y \rangle \right)
       = a^2 V(x) + b^2 V(y) + 2ab \, cov(x, y)

  The familiar 'add errors in quadrature' rule is
  only valid in the absence of correlations,
  i.e. cov(x, y) = 0

• More generally:

  V(f) = \left(\frac{df}{dx}\right)^2 V(x) + \left(\frac{df}{dy}\right)^2 V(y) + 2 \left(\frac{df}{dx}\right) \left(\frac{df}{dy}\right) cov(x, y)

  \sigma_f^2 = \left(\frac{df}{dx}\right)^2 \sigma_x^2 + \left(\frac{df}{dy}\right)^2 \sigma_y^2 + 2 \left(\frac{df}{dx}\right) \left(\frac{df}{dy}\right) \rho \, \sigma_x \sigma_y

– But only valid if the linear approximation is good in the range of the error
– The correlation coefficient \rho \in [-1, +1] is 0 if x, y are uncorrelated
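A short sketch of the two-variable formula for f = ax + by, showing how a correlation term shifts the result away from the quadrature rule (the a, b, \sigma, and \rho values are illustrative assumptions):

```python
import math

# f = a*x + b*y:  V(f) = a^2 V(x) + b^2 V(y) + 2ab cov(x,y)
a, b = 1.0, 1.0
sx, sy, rho = 0.3, 0.4, -0.5

cov = rho * sx * sy
var_f = a**2 * sx**2 + b**2 * sy**2 + 2 * a * b * cov
sigma_f = math.sqrt(var_f)

# The uncorrelated case reduces to 'adding errors in quadrature'
quad = math.sqrt((a * sx) ** 2 + (b * sy) ** 2)

# A negative correlation shrinks the error below the quadrature sum
assert sigma_f < quad
print(sigma_f, quad)
```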
Error propagation – multiplying, dividing 2 variables
• Now consider   f = x \cdot y

  V(f) = y^2 V(x) + x^2 V(y)   →   \left(\frac{\sigma_f}{f}\right)^2 = \left(\frac{\sigma_x}{x}\right)^2 + \left(\frac{\sigma_y}{y}\right)^2

– The result is similar for f = x / y (math omitted)

• Other useful formulas:

  \frac{\sigma_{1/x}}{1/x} = \frac{\sigma_x}{x}      (the relative error on x and 1/x is the same)

  \sigma_{\ln x} = \frac{\sigma_x}{x}                (the error on a log is just the fractional error)
Coffee break