Probability and Statistics: Basic Concepts I
(from a physicist's point of view)
Benoit CLEMENT – Université J. Fourier / LPSC
[email protected]

Bibliography
Kendall's Advanced Theory of Statistics, Hodder Arnold Pub.
  volume 1: Distribution Theory, A. Stuart and K. Ord
  volume 2a: Classical Inference and the Linear Model, A. Stuart, K. Ord, S. Arnold
  volume 2b: Bayesian Inference, A. O'Hagan, J. Forster
The Review of Particle Physics, K. Nakamura et al., J. Phys. G 37, 075021 (2010) (+ Booklet)
Data Analysis: A Bayesian Tutorial, D. Sivia and J. Skilling, Oxford Science Publications
Statistical Data Analysis, Glen Cowan, Oxford Science Publications
Analyse statistique des données expérimentales, K. Protassov, EDP Sciences
Probabilités, analyse des données et statistiques, G. Saporta, Technip
Analyse de données en sciences expérimentales, B.C., Dunod

Sample and population
SAMPLE: finite size, selected through a random process, e.g. the results of a measurement.
POPULATION: potentially infinite size, e.g. all possible results.
Characterizing the sample, the population and the drawing procedure: probability theory (today's lecture).
Using the sample to estimate the characteristics of the population: statistical inference (tomorrow's lecture).

Random process
A random process ("measurement" or "experiment") is a process whose outcome cannot be predicted with certainty. It is described by:
- Universe: Ω = the set of all possible outcomes.
- Event: a logical condition on an outcome. It can be either true or false; an event splits the universe into 2 subsets.
An event A is identified with the subset of Ω for which A is true.
Probability
A probability function P is defined by (Kolmogorov, 1933):
P : {Events} → [0,1], A → P(A), satisfying:
- P(Ω) = 1
- P(A or B) = P(A) + P(B) if A and B = Ø

Interpretation of this number:
- Frequentist approach: if we repeat the random process a great number of times n, and count the number of times nA the outcome satisfies event A, then the ratio lim_{n→∞} nA/n = P(A) defines a probability.
- Bayesian interpretation: a probability is a measure of the credibility associated with the event.

Simple logic
The event "not A" is associated with the complement Ā: P(Ā) = 1 − P(A), and P(Ø) = 1 − P(Ω) = 0.
For the events "A and B", "A or B": P(A or B) = P(A) + P(B) − P(A and B).

Conditional probability
If an event B is known to be true, one can restrict the universe to Ω' = B and define a new probability function on this universe, the conditional probability:
P(A|B) = "probability of A given B" = P(A and B) / P(B)

Incompatibility and independence
Two incompatible events cannot be true simultaneously, hence:
P(A and B) = 0 and P(A or B) = P(A) + P(B)
Two events are independent if the realization of one is not linked in any way to the realization of the other:
P(A|B) = P(A) and P(B|A) = P(B), i.e. P(A and B) = P(A)·P(B)

Bayes' theorem
The definition of conditional probability leads to:
P(A and B) = P(A|B)·P(B) = P(B|A)·P(A)
hence relating P(A|B) to P(B|A) by Bayes' theorem:
P(B|A) = P(A|B)·P(B) / P(A)
Or, using a partition {Bi}, so that P(A) = Σi P(A and Bi) = Σi P(A|Bi)·P(Bi):
P(Bi|A) = P(A|Bi)·P(Bi) / Σi P(A|Bi)·P(Bi)
This theorem plays a major role in Bayesian inference: given data and a set of models, it translates into:
P(modeli|data) = P(data|modeli)·P(modeli) / Σi P(data|modeli)·P(modeli)

Application of Bayes' theorem
100 dice in a box: 70 are equiprobable (A), 30 have a probability 1/3 of getting a 6 (B).
You pick one die and throw it until you reach a 6, counting the number of tries. Repeating the process three times, you get 2, 4 and 1. What is the probability that the die is unbiased?
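The dice example can also be checked numerically. The sketch below is our own illustration (not part of the lecture); it applies Bayes' theorem with the partition {A, B} and a geometric likelihood P(n|p6) = (1−p6)^(n−1)·p6 for "first 6 on try n":

```python
# Posterior probability that the picked die is unbiased, given that
# the first 6 appeared on tries 2, 4 and 1 (illustrative sketch).

def likelihood(tries, p6):
    """Probability of these numbers of tries, for independent throws."""
    prob = 1.0
    for n in tries:
        prob *= (1 - p6) ** (n - 1) * p6
    return prob

tries = [2, 4, 1]
prior_A, prior_B = 0.7, 0.3          # 70 fair dice (A), 30 biased dice (B)
like_A = likelihood(tries, 1 / 6)    # fair die: p6 = 1/6
like_B = likelihood(tries, 1 / 3)    # biased die: p6 = 1/3

# Bayes' theorem with the partition {A, B}
posterior_A = like_A * prior_A / (like_A * prior_A + like_B * prior_B)
print(round(posterior_A, 2))  # 0.42, as on the slide
```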
For one throw: P(n|A) = (1 − p6)^(n−1) · p6 = 5^(n−1)/6^n and P(n|B) = 2^(n−1)/3^n.
Combining several throws (for one die, throws are independent):
P(n1 and n2 and n3 | A) = P(n1|A)·P(n2|A)·P(n3|A) = 5^(n1+n2+n3−3) / 6^(n1+n2+n3)
P(n1 and n2 and n3 | B) = 2^(n1+n2+n3−3) / 3^(n1+n2+n3)
With n1+n2+n3 = 2+4+1 = 7:
P(A | n1,n2,n3) = P(n1,n2,n3|A)·P(A) / [P(n1,n2,n3|A)·P(A) + P(n1,n2,n3|B)·P(B)]
               = (0.7 · 5⁴/6⁷) / (0.7 · 5⁴/6⁷ + 0.3 · 2⁴/3⁷) ≈ 0.42

Random variable
When the outcome of the random process is a number (real or integer), we associate to the random process a random variable X. Each realization of the process leads to a particular result X = x: x is a realization of X.
For a discrete variable, probability law: p(x) = P(X = x).
For a real variable, P(X = x) = 0; cumulative density function: F(x) = P(X < x).
dF = F(x+dx) − F(x) = P(X < x+dx) − P(X < x)
   = P(X < x) + P(x < X < x+dx) − P(X < x) = P(x < X < x+dx) = f(x)dx
Probability density function (pdf): f(x) = dF/dx.

Density function
Normalization: ∫ f(x)dx = P(Ω) = 1.
Note: discrete variables can also be described by a probability density function using Dirac distributions: f(x) = Σi p(i) δ(i − x), with Σi p(i) = 1.
By construction:
F(−∞) = P(Ø) = 0,  F(+∞) = P(Ω) = 1
F(a) = ∫_{−∞}^{a} f(x)dx,  P(a ≤ X ≤ b) = F(b) − F(a) = ∫_a^b f(x)dx

Change of variable
Probability density function of Y = φ(X), for φ bijective:
- φ increasing: X < x ⟺ Y < y, so F_Y(y) = F_Y(φ(x)) = F_X(x) = P(X ≤ x) and f_Y(y) = f(x)/φ'(x).
- φ decreasing: X < x ⟺ Y > y, so F_Y(y) = 1 − F_X(x) and f_Y(y) = −f(x)/φ'(x).
In both cases: f_Y(y) = f(x)/|φ'(x)| = f(φ⁻¹(y)) / |φ'(φ⁻¹(y))|.
If φ is not bijective, split it into several bijective parts φi:
f_Y(y) = Σi f(φi⁻¹(y)) / |φi'(φi⁻¹(y))|
Very useful for Monte Carlo: if X is uniformly distributed between 0 and 1, then Y = F⁻¹(X) has F for cumulative density function.

Multidimensional PDF (1)
Random variables can be generalized to random vectors
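The inverse-CDF Monte Carlo trick mentioned above can be sketched in a few lines. This is our own illustration for an exponential pdf f(x) = λe^(−λx), where F(x) = 1 − e^(−λx) inverts to F⁻¹(u) = −ln(1−u)/λ:

```python
# Sampling via Y = F^{-1}(X) with X uniform on [0,1] (inverse-CDF method),
# sketched for an exponential distribution. Illustrative only.
import math
import random

def sample_exponential(lam, n, seed=0):
    rng = random.Random(seed)
    # For each uniform draw u, Y = -log(1-u)/lam has cdf 1 - exp(-lam*y)
    return [-math.log(1.0 - rng.random()) / lam for _ in range(n)]

lam = 2.0
ys = sample_exponential(lam, 100_000)
mean = sum(ys) / len(ys)
print(mean)  # close to the exact mean 1/lam = 0.5
```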
X = (X1, X2, …, Xn); the probability density function becomes:
f(x)dx = f(x1, x2, …, xn) dx1 dx2 … dxn
       = P(x1 < X1 < x1+dx1 and x2 < X2 < x2+dx2 … and xn < Xn < xn+dxn)
and P(a < X < b and c < Y < d) = ∫_a^b dx ∫_c^d dy f(x,y).
Marginal density: probability of only one of the components:
fX(x)dx = P(x < X < x+dx and −∞ < Y < +∞) = ∫ f(x,y)dxdy, so fX(x) = ∫ f(x,y) dy

Multidimensional PDF (2)
For a fixed value of Y = y0: f(x|y0)dx = "probability of x < X < x+dx knowing that Y = y0" is a conditional density for X. It is proportional to f(x,y), and since ∫ f(x|y)dx = 1:
f(x|y) = f(x,y) / ∫ f(x,y)dx = f(x,y) / fY(y)
Two random variables X and Y are independent if all events of the form x < X < x+dx are independent from y < Y < y+dy:
f(x|y) = fX(x) and f(y|x) = fY(y), hence f(x,y) = fX(x)·fY(y)
Translated in terms of pdfs, Bayes' theorem becomes:
f(y|x) = f(x|y)·fY(y) / fX(x) = f(x|y)·fY(y) / ∫ f(x|y)·fY(y)dy
See A. Caldwell's lecture for detailed use of this formula for statistical inference.

Sample PDF
A sample is obtained from a random drawing within a population, described by a probability density function. We are going to discuss how to characterize, independently from one another:
- a population
- a sample
To this end, it is useful to consider a sample as a finite set from which one can randomly draw elements with equiprobability. We can then associate to this process a probability density, the empirical density or sample density:
fsample(x) = (1/n) Σi δ(x − xi)
This density is useful to translate properties of a distribution to a finite sample.

Characterizing a distribution
How do we reduce a distribution or a sample to a finite number of values?
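The marginalization fX(x) = ∫ f(x,y) dy can be checked numerically. The joint pdf below, f(x,y) = x + y on the unit square (with exact marginal fX(x) = x + 1/2), is our own toy example, not from the lecture:

```python
# Numerical marginalization of a toy joint pdf f(x,y) = x + y on [0,1]^2.
# The exact marginal is fX(x) = x + 1/2. Illustrative sketch only.

def joint(x, y):
    return x + y

def marginal_x(x, n=10_000):
    # Midpoint Riemann sum over y in [0,1]: fX(x) ≈ sum f(x, y_k) * h
    h = 1.0 / n
    return sum(joint(x, (k + 0.5) * h) for k in range(n)) * h

print(marginal_x(0.3))  # ≈ 0.8 = 0.3 + 1/2
```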
- Measure of location: reducing the distribution to one central value → the result.
- Measure of dispersion: spread of the distribution around the central value → the uncertainty/error.
- Higher-order measures of shape.
- Frequency table/histogram (for a finite sample).

Measure of location
Mean value: sum (integral) of all possible values weighted by the probability of occurrence:
sample (size n): μ = x̄ = (1/n) Σ_{i=1}^{n} xi ;  population: μ = ⟨x⟩ = ∫ x f(x)dx
Median: value that splits the distribution into 2 equiprobable parts:
∫_{−∞}^{med(x)} f(x)dx = ∫_{med(x)}^{+∞} f(x)dx = 1/2
For an ordered sample x1 ≤ x2 ≤ … ≤ xn: med(x) = x_{(n+1)/2} for odd n, med(x) = ½(x_{n/2} + x_{1+n/2}) for even n.
Mode: the most probable value = the maximum of the pdf:
df/dx = 0 and d²f/dx² < 0 at x = mod(x)

Measure of dispersion
Standard deviation (σ) and variance (v = σ²): mean value of the squared deviation from the mean:
sample (size n): v = σ² = (1/n) Σ_{i=1}^{n} (xi − μ)² ;  population: v = σ² = ∫ (x − μ)² f(x)dx
Koenig's theorem:
σ² = ∫ x² f(x)dx − 2μ ∫ x f(x)dx + μ² ∫ f(x)dx = ⟨x²⟩ − μ² = ⟨x²⟩ − ⟨x⟩²
Interquartile difference: generalizes the median by splitting the distribution into 4:
∫_{−∞}^{q1} f(x)dx = ∫_{q1}^{q2} f(x)dx = ∫_{q2}^{q3} f(x)dx = ∫_{q3}^{+∞} f(x)dx = 1/4, with q2 = med(x); δ = q3 − q1.
Others…

Bienaymé-Chebyshev
Consider the interval Δ = ]−∞, µ−a[ ∪ ]µ+a, +∞[. Then for x ∈ Δ: |x−μ|/a ≥ 1, so
f(x) ≤ ((x−μ)/a)² f(x)
P(|X−μ| ≥ a) = ∫_Δ f(x)dx ≤ ∫_Δ ((x−μ)²/a²) f(x)dx ≤ σ²/a²
Finally, replacing a by aσ, Bienaymé-Chebyshev's inequality:
P(|X−μ| ≥ aσ) ≤ 1/a²
It gives a bound on the confidence level of the interval μ ± aσ:
a                        1       2       3       4        5
Chebyshev's bound        0       0.75    0.889   0.938    0.96
Normal distribution      0.683   0.954   0.997   0.99996  0.9999994

Multidimensional case
A random vector (X,Y) can be treated as 2 separate variables, with a mean and a variance for each variable: μX, μY, σX, σY. This does not take into account correlations between the variables.
[figure: scatter plots for ρ = −0.5, ρ = 0, ρ = 0.9]
Generalized measure of dispersion: covariance of X and Y:
Cov(X,Y) = ∫ (x − μX)(y − μY) f(x,y)dxdy = ρσXσY = ⟨xy⟩ − μXμY
sample: Cov(X,Y) = (1/n) Σ_{i=1}^{n} (xi − μX)(yi − μY)
Correlation: ρ = Cov(X,Y)/(σXσY). Uncorrelated: ρ = 0.
Independent ⇒ uncorrelated, but not the converse: ρ = 0 only quantifies linear correlation.

Regression
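The sample formulas above (mean, median, variance, and Koenig's theorem σ² = ⟨x²⟩ − ⟨x⟩²) can be sketched directly; the sample values are made up for illustration:

```python
# Sample measures of location and dispersion, with a check of Koenig's
# theorem: variance = <x^2> - <x>^2. Toy sample; illustrative only.

def mean(xs):
    return sum(xs) / len(xs)

def median(xs):
    s = sorted(xs)
    n = len(s)
    if n % 2 == 1:
        return s[n // 2]
    return 0.5 * (s[n // 2 - 1] + s[n // 2])

def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)  # population convention, 1/n

xs = [1.0, 2.0, 2.0, 3.0, 7.0]
v_koenig = mean([x * x for x in xs]) - mean(xs) ** 2  # Koenig's theorem
print(mean(xs), median(xs), variance(xs))  # 3.0 2.0 4.4
```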
Measure of location:
- a point: (μX, μY)
- a curve: the line closest to the points → linear regression
Minimizing the dispersion between the line y = ax + b and the distribution:
w(a,b) = ∫ (y − ax − b)² f(x,y)dxdy  [sample: (1/n) Σi (yi − axi − b)²]
∂w/∂a = 0 ⇒ ∫ x(y − ax − b) f(x,y)dxdy = 0 ⇒ a(σX² + μX²) + bμX = ρσXσY + μXμY
∂w/∂b = 0 ⇒ ∫ (y − ax − b) f(x,y)dxdy = 0 ⇒ aμX + b = μY
Hence:
a = ρ σY/σX,  b = μY − ρ (σY/σX) μX
Fully correlated (ρ = 1) or fully anti-correlated (ρ = −1): then Y = aX + b exactly.

Decorrelation
Covariance matrix for n variables Xi: Σij = Cov(Xi, Xj):
Σ = | σ1²       ρ12σ1σ2   …  ρ1nσ1σn |
    | ρ12σ1σ2   σ2²       …  ρ2nσ2σn |
    | …                              |
    | ρ1nσ1σn   ρ2nσ2σn   …  σn²     |
For uncorrelated variables Σ is diagonal. Σ is a real and symmetric matrix, so it can be diagonalized: one can define n new uncorrelated variables Yi:
Σ' = B⁻¹ΣB = diag(σ'1², …, σ'n²),  Y = BX
The σ'i² are the eigenvalues of Σ, and B contains the orthonormal eigenvectors. The Yi are the principal components. Sorted from the largest to the smallest σ', they allow dimensional reduction.

Moments
For any function g(x), the expectation of g is: E[g(X)] = ∫ g(x) f(x)dx. It is the mean value of g.
The moments μk are the expectations of X^k:
- 0th moment: μ0 = 1 (pdf normalization)
- 1st moment: μ1 = μ (mean); X' = X − μ1 is a central variable
- 2nd central moment: μ'2 = σ² (variance)
Characteristic function: φ(t) = E[e^{itX}] = ∫ f(x) e^{itx} dx = FT[f]
From the Taylor expansion: φ(t) = Σk ∫ ((itx)^k/k!) f(x)dx = Σk μk (it)^k/k!, so
μk = (1/i^k) d^kφ/dt^k at t = 0
A pdf is entirely defined by its moments; the CF is a useful tool for demonstrations.

Skewness and kurtosis
Reduced variable: X'' = (X − μ)/σ = X'/σ
Measure of asymmetry, 3rd reduced moment: μ''3 = √β1 = γ1, the skewness. γ1 = 0 for a symmetric distribution; then mean = median.
Measure of shape, 4th reduced moment: μ''4 = β2 = γ2 + 3, the kurtosis. For the normal distribution β2 = 3 and γ2 = 0.
Generalized Koenig's theorem (central moments from raw moments):
μ'n = (−1)^n (1−n) μ^n + Σ_{k=2}^{n} [n!/(k!(n−k)!)] (−μ1)^{n−k} μk
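The decorrelation step can be illustrated for two variables, where the eigenvalues of the 2×2 covariance matrix have a closed form. This is our own sketch, not the lecture's code:

```python
# Principal variances of a 2x2 covariance matrix [[s1^2, c], [c, s2^2]]
# with c = rho*s1*s2, via the closed-form eigenvalues of a symmetric
# 2x2 matrix. Illustrative sketch of the "Decorrelation" slide.
import math

def principal_variances(s1, s2, rho):
    """Eigenvalues of the covariance matrix, largest first."""
    c = rho * s1 * s2
    a, d = s1 ** 2, s2 ** 2
    tr, det = a + d, a * d - c * c
    disc = math.sqrt(tr * tr / 4 - det)   # always >= 0 for symmetric matrices
    return tr / 2 + disc, tr / 2 - disc

lam1, lam2 = principal_variances(1.0, 1.0, 0.9)
print(lam1, lam2)  # 1.9 and 0.1: the trace (total variance) is conserved
```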
Reduced moments: μ''n = μ'n / (μ'2)^{n/2} = μ'n / σ^n

Skewness and kurtosis (2)
[figures: example distributions illustrating different skewness and kurtosis values]

Discrete distributions
Binomial distribution: randomly choosing K objects within a finite set of n, with a fixed drawing probability p.
Variable: K. Parameters: n, p (illustrated with p = 0.65, n = 10).
Law: P(k; n, p) = [n!/(k!(n−k)!)] p^k (1−p)^{n−k}
Mean: np. Variance: np(1−p).
Poisson distribution: limit of the binomial when n → +∞, p → 0, np = λ. Counting events with a fixed probability per time/space unit.
Variable: K. Parameter: λ (illustrated with λ = 6.5).
Law: P(k; λ) = e^{−λ} λ^k / k!
Mean: λ. Variance: λ.

Real distributions
Uniform distribution: equiprobability over a finite range [a,b].
Parameters: a, b. Law: f(x; a, b) = 1/(b−a) if a ≤ x ≤ b.
Mean: μ = (a+b)/2. Variance: v = σ² = (b−a)²/12.
Normal distribution (Gaussian): limit of many processes.
Parameters: μ, σ. Law: f(x; μ, σ) = (1/(σ√(2π))) e^{−(x−μ)²/(2σ²)}
Chi-square distribution: sum of the squares of n reduced normal variables:
Variable: C = Σ_{k=1}^{n} ((Xk − μ_{Xk})/σ_{Xk})²
Parameter: n. Law: f(c; n) = (1/(2^{n/2} Γ(n/2))) c^{n/2 − 1} e^{−c/2}
Mean: n. Variance: 2n.

Convergence
[figures: convergence of the binomial/Poisson laws toward the normal distribution]

Multidimensional PDFs
Multinomial distribution: randomly choosing K1, K2, …, Ks objects within a finite set of n, with a fixed drawing probability for each category p1, p2, …, ps, with ΣKi = n and Σpi = 1.
Parameters: n, p1, p2, …, ps.
Law: P(k; n, p) = [n!/(k1! k2! … ks!)] p1^{k1} p2^{k2} … ps^{ks}
Mean: μi = npi. Variance: σi² = npi(1−pi). Cov(Ki, Kj) = −npipj.
Remark: the variables are not independent. The binomial corresponds to s = 2, but has only one independent variable.
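The limit statement "Poisson = binomial with n → ∞, p → 0, np = λ" is easy to check numerically; this sketch (our own, with λ = 6.5 as on the slide) compares the two pmfs:

```python
# Poisson as the limit of the binomial (n -> inf, p -> 0, np = lam).
# Illustrative check, not part of the lecture.
import math

def binomial_pmf(k, n, p):
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    return math.exp(-lam) * lam ** k / math.factorial(k)

lam, k = 6.5, 4
approx = binomial_pmf(k, 10_000, lam / 10_000)  # large n, small p, np = lam
exact = poisson_pmf(k, lam)
print(approx, exact)  # nearly identical
```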
Multinormal distribution:
Parameters: μ, Σ.
Law: f(x; μ, Σ) = (1/√((2π)^n |Σ|)) e^{−½ (x−μ)ᵀ Σ⁻¹ (x−μ)}
If uncorrelated: f(x; μ, Σ) = Πi (1/(σi√(2π))) e^{−(xi−μi)²/(2σi²)}
For the multinormal, independent ⟺ uncorrelated.

Sum of random variables
The sum of several random variables is a new random variable: S = Σ_{i=1}^{n} Xi.
Assuming the mean and variance of each variable exist:
Mean value of S:
μS = ∫ (Σi xi) f(x1,…,xn) dx1…dxn = Σi ∫ xi fXi(xi) dxi = Σi μi
The mean is an additive quantity.
Variance of S:
σS² = ∫ (Σi (xi − μXi))² f(x1,…,xn) dx1…dxn = Σi σXi² + 2 Σ_{i<j} Cov(Xi, Xj)
For uncorrelated variables, the variance is additive: σS² = Σi σXi² → used for error combinations.

Sum of random variables (2)
Probability density function of S: fS(s). Using the characteristic function:
φS(t) = ∫ fS(s) e^{ist} ds = ∫ f(x) e^{it Σ xi} dx
For independent variables:
φS(t) = Πk ∫ fXk(xk) e^{it xk} dxk = Πk φXk(t)
The characteristic function factorizes. Finally, since the pdf is the Fourier transform of the cf:
fS = fX1 ∗ fX2 ∗ … ∗ fXn
The pdf of the sum is a convolution.
Sum of normal variables → normal.
Sum of Poisson variables (λ1 and λ2) → Poisson, λ = λ1 + λ2.
Sum of chi-2 variables (n1 and n2) → chi-2, n = n1 + n2.

Sum of independent variables
Weak law of large numbers: a sample of size n is a realization of n independent variables with the same distribution (mean μ, variance σ²). The sample mean is a realization of M = S/n = (1/n) Σ Xi.
Mean value of M: μM = μ. Variance of M: σM² = σ²/n.
Central-limit theorem: take n independent random variables of mean μi and variance σi², and form the sum of the reduced variables:
C = (1/√n) Σ_{i=1}^{n} (Xi − μi)/σi
The pdf of C converges to a reduced normal distribution:
fC(c) → (1/√(2π)) e^{−c²/2}
The sum of many random fluctuations is normally distributed.

Central limit theorem
Naive demonstration: for each Xi, the reduced variable X''i has mean 0 and variance 1.
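The statement that a sum of Poisson variables is Poisson with λ = λ1 + λ2 can be verified with a discrete convolution, the discrete analogue of fS = fX1 ∗ fX2. The sketch and its numbers are our own illustration:

```python
# Check that the sum of two independent Poisson variables (lam1, lam2)
# is Poisson with lam = lam1 + lam2, by explicit convolution of the pmfs.
import math

def poisson_pmf(k, lam):
    return math.exp(-lam) * lam ** k / math.factorial(k)

lam1, lam2 = 2.0, 4.5
s = 6
# P(S = s) = sum_k P(X1 = k) P(X2 = s - k): discrete convolution
conv = sum(poisson_pmf(k, lam1) * poisson_pmf(s - k, lam2) for k in range(s + 1))
direct = poisson_pmf(s, lam1 + lam2)
print(conv, direct)  # equal up to rounding
```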
So its characteristic function is:
φX''i(t) = 1 − t²/2 + o(t²)
Hence the characteristic function of C:
φC(t) = [φX''(t/√n)]^n = [1 − t²/(2n) + o(t²/n)]^n
For n large:
lim_{n→∞} φC(t) = lim_{n→∞} [1 − t²/(2n)]^n = e^{−t²/2}
which is the characteristic function of the reduced normal distribution, whose Fourier transform gives fC.
This is a naive demonstration, because we assumed that all the moments were defined. For the CLT, only the mean and variance are required (the full proof is much more complex).

Central limit theorem (2)
[figures: distributions of X1, (X1+X2)·√2, (X1+X2+X3)·√3 and (X1+X2+X3+X4+X5)·√5 for uniform Xi, each compared to a Gaussian]

Dispersion and uncertainty
Any measurement (or combination of measurements) is a realization of a random variable.
- Measured value: θ
- True value: θ0
Uncertainty = quantifying the difference between θ and θ0: a measure of dispersion. We will postulate: Δθ = ασθ (absolute error, always positive).
Usually one differentiates:
- Statistical error: due to the measurement pdf.
- Systematic error or bias: a fixed but unknown deviation (equipment, assumptions, …).
Systematic errors can be seen as statistical errors in a set of similar experiments.

Error sources
Observation error ΔO, position error ΔP, scaling error ΔS:
θ = θ0 + δO + δS + δP
Each δi is a realization of a random variable, with mean 0 (negligible bias) and variance σi². For uncorrelated error sources:
ΔO = ασO, ΔS = ασS, ΔP = ασP
Δtot² = (ασtot)² = α²(σO² + σS² + σP²) = ΔO² + ΔS² + ΔP²
Choice of α? If there are many sources, the total is approximately normal by the central-limit theorem:
α = 1 gives (approximately) a 68% confidence interval; α = 2 gives 95% CL (and at least 75% from Bienaymé-Chebyshev).

Error propagation
Measure: x ± Δx. Compute: f(x). What is Δf?
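Combining uncorrelated error sources in quadrature, Δtot² = ΔO² + ΔS² + ΔP², is a one-liner; the numerical values below are made up for illustration:

```python
# Quadrature combination of uncorrelated error sources, as on the
# "Error sources" slide: Dtot^2 = DO^2 + DS^2 + DP^2. Toy numbers.
import math

def combine_quadrature(*errors):
    return math.sqrt(sum(e ** 2 for e in errors))

d_obs, d_scale, d_pos = 0.3, 0.4, 1.2
d_tot = combine_quadrature(d_obs, d_scale, d_pos)
print(d_tot)  # sqrt(0.09 + 0.16 + 1.44) = 1.3
```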
[figure: tangent-line picture, Δf ≈ |df/dx| Δx]
Assuming small errors, using a Taylor expansion:
f(x+Δx) = f(x) + (df/dx)Δx + ½(d²f/dx²)Δx² + (1/6)(d³f/dx³)Δx³ + (1/24)(d⁴f/dx⁴)Δx⁴ + …
f(x−Δx) = f(x) − (df/dx)Δx + ½(d²f/dx²)Δx² − (1/6)(d³f/dx³)Δx³ + (1/24)(d⁴f/dx⁴)Δx⁴ − …
Δf = ½[f(x+Δx) − f(x−Δx)] = (df/dx)Δx + (1/6)(d³f/dx³)Δx³ + … ≈ |df/dx| Δx

Error propagation (2)
Measure: x ± Δx, y ± Δy, … Compute: f(x, y, …) → Δf?
Idea: treat the effect of each variable as a separate error source. On the surface z = f(x,y), around the measured point zm = f(xm, ym):
Δxf = |∂f/∂x| Δx,  Δyf = |∂f/∂y| Δy
Then:
Δf² = Δxf² + Δyf² + 2ρxy ΔxfΔyf = (∂f/∂x)² Δx² + (∂f/∂y)² Δy² + 2ρxy (∂f/∂x)(∂f/∂y) ΔxΔy
and in general:
Δf² = Σi (∂f/∂xi)² Δxi² + Σ_{i≠j} ρxixj (∂f/∂xi)(∂f/∂xj) ΔxiΔxj
- uncorrelated: Δf² = Σi (∂f/∂xi)² Δxi²
- fully correlated: Δf = |∂f/∂x| Δx + |∂f/∂y| Δy
- fully anticorrelated: Δf = | |∂f/∂x| Δx − |∂f/∂y| Δy |
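The multivariate propagation formula can be sketched with the partial derivatives estimated by the symmetric difference [f(x+h) − f(x−h)]/(2h), the same cancellation of even orders used in the Taylor argument above. The function f, values and errors below are made-up illustrations:

```python
# Numerical error propagation:
# Df^2 = (df/dx)^2 Dx^2 + (df/dy)^2 Dy^2 + 2 rho (df/dx)(df/dy) Dx Dy,
# with derivatives from central (symmetric) differences. Illustrative sketch.
import math

def propagate(f, x, y, dx, dy, rho=0.0, h=1e-6):
    dfdx = (f(x + h, y) - f(x - h, y)) / (2 * h)  # symmetric difference
    dfdy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    var = (dfdx * dx) ** 2 + (dfdy * dy) ** 2 + 2 * rho * dfdx * dfdy * dx * dy
    return math.sqrt(var)

area = lambda x, y: x * y      # e.g. the area of a rectangle
df = propagate(area, 3.0, 4.0, 0.1, 0.2, rho=0.0)
print(df)  # sqrt((4*0.1)^2 + (3*0.2)^2) = sqrt(0.52) ≈ 0.721
```

For uncorrelated inputs this reduces to the quadrature sum; setting rho = ±1 reproduces the fully (anti)correlated limits listed above.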