Summary of last lecture
• We continued to discuss interval estimation and nuisance parameters.
• We made the connection between the Fisher information and confidence intervals.
• 68.3% two-sided confidence intervals can be estimated by evaluating the second derivative of the log-likelihood function at the best-fit point (e.g. HESSE within MINUIT).
Jan Conrad, FK8006, Hypothesis testing

Summary of last lecture
• We then turned to discuss nuisance parameters in the Neyman construction.
• The Neyman construction can be combined with marginalisation and with profiling (the latter not so often used):
marginalisation: $P(n \mid s) = \int P(n \mid s, b)\, G(b \mid b_{\text{est}})\, db$
profiling: $\lambda(\theta_0) = L(X \mid \theta_0, \hat{\hat{\nu}}(\theta_0)) \,/\, L(X \mid \hat{\theta}, \hat{\nu})$

Summary of last lecture
• The marginalisation ("hybrid-Bayesian") method results in over-coverage (in general).
• Profiling has, as far as I know, not been checked.

Summary of last lecture
• Recommendation: likelihood-based intervals with profiling over nuisance parameters work well for most cases.
• When in doubt: Monte Carlo.

Hypothesis testing
Jan Conrad, Fysikum
[email protected], 08-553 7 8769, A5:1027
FK 8006: Statistical Methods in Physics
Last update: 28.12.2015

Hypothesis tests
• The goal of hypothesis tests is to test whether a hypothesis is compatible with the data when compared with a defined (or undefined, in which case: "goodness-of-fit") alternative.
• Conventionally, one hypothesis is called the null hypothesis (H0), the other one is called the alternative hypothesis (H1).
• The test proceeds by finding the critical region, i.e.
the region in data space w for which $P(d \in w \mid H_0) \le \alpha$,
• while at the same time maximizing $P(d \in w \mid H_1) = 1 - \beta$.

Composite hypothesis and simple hypothesis
• Simple hypothesis: H0, H1 completely specified (no free parameters).
• Composite hypothesis: H0, H1 have one or more free parameters.
• Separate families of hypotheses: H0, H1 have one or more free parameters, but different functional forms, e.g. an exponential $\exp(-\alpha E)$ versus a power law $E^{-\alpha}$.

Size, error of first and second kind
• The probability to reject H0, i.e. $\alpha = P(d \in w \mid H_0)$, is called the size of the test. In physics we call this the level of significance.

Power, test statistic
• The more sensitive the test, the better it can discriminate between the null and the alternative hypothesis; quantitatively, maximal power $P(d \in w \mid H_1) = 1 - \beta$.
• In order to achieve this goal, especially in many dimensions, the observables are often replaced by a one-dimensional function of the observables, called the test statistic: $P(t(d) \in w_t \mid H_0)$.

Example

Power and size

How to choose a test?
• Consider: $H_0: \theta = \theta_0$, $H_1: \theta = \theta_1$.
• Power: $\mathrm{pow}(\theta_1) = P(d \in w \mid \theta_1) = 1 - \beta$.
• Consider pow as a function of θ → power function $\mathrm{pow}(\theta)$.
• Still simple hypotheses.

Uniformly most powerful test (UMP)

Consistency
• A test is called consistent if the power tends to 1 as the number of observations increases.

Bias
• Consider a power curve where the probability to reject the null hypothesis θ0 is smaller if θ1 is true than if θ0 is true. This is called a biased test.

Choice of test

The Neyman-Pearson test
• The best test for given size α is the one which has the best critical region in the data space, i.e.
the region which has maximal power (1 − β).

Neyman-Pearson test
• This expectation value will be maximal for the fraction of the data space where the ratio of the likelihoods is maximal.
• This ratio is called the likelihood ratio. The criterion for the best critical region is $L(d \mid H_1) \,/\, L(d \mid H_0) > c_\alpha$.
• As this ratio needs to be calculable for all points in data space, it is only applicable for simple hypotheses. In this case, the likelihood ratio represents the uniformly most powerful test statistic.

(Maximum) likelihood ratio test
• The likelihood ratio test is the generalization of the Neyman-Pearson test to composite hypotheses.
• Composite hypotheses: $H_0: \theta \in \omega$, $H_1: \theta \in \Omega \setminus \omega$.
• Example: new physics signal: s = 0 (no signal), s > 0 (signal present); maximize over, for example, b and s.
• In this situation you define the likelihood ratio:
$\lambda = \max_{\theta \in \omega} L(X \mid \theta) \,/\, \max_{\theta \in \Omega} L(X \mid \theta)$

Likelihood ratio test continued
• The ratio can be reformulated as: you maximize w.r.t. $\theta_j$, j = 1, …, s, while fixing $\theta_i = \theta_i^0$, i = 1, …, r, in the numerator, and maximize with respect to all parameters in the denominator.

Null distribution
• This test is not necessarily UMP.
• In hypothesis tests, traditionally, a very important property was the knowledge of the null distribution.
• If H0 imposes r constraints on the s + r parameters in H1, then $-2 \ln \lambda \sim \chi^2(r)$ under H0, for n → ∞. Nowadays you would probably check this property with a Monte Carlo simulation.
• In this example, the hypotheses are nested. The asymptotic property for nested hypotheses is called Wilks' theorem. Simple example: new physics signal over known background. H1: s + b, H0: s = 0.
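The Monte Carlo check of the null distribution mentioned above can be sketched in a few lines. This is a minimal toy (my construction, not from the lecture): Gaussian data with known unit variance, where maximizing both likelihoods analytically reduces −2 ln λ to n·x̄², so Wilks' χ²(1) behaviour can be verified directly.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, n_toys = 50, 5000

# H0: mu = 0, H1: mu free, Gaussian data with known sigma = 1.
# Maximizing both likelihoods analytically gives -2 ln lambda = n * xbar^2.
xbar = rng.normal(0.0, 1.0, size=(n_toys, n)).mean(axis=1)
t = n * xbar**2  # one value of -2 ln lambda per pseudo-experiment under H0

# Wilks: t should follow chi2(1). Check the tail fraction above the
# 95% chi2(1) quantile against the nominal 5%.
frac = np.mean(t > stats.chi2.ppf(0.95, df=1))
```

Here the χ²(1) behaviour is exact by construction; in realistic fits (parameter boundaries, small samples) the toy distribution can deviate from χ², which is precisely what such a simulation reveals.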
Applicability of chi-squared approximation (Wilks' theorem)
• Nested hypotheses.
• N → ∞.
• Gaussian errors.
• Restricted region not on the boundary of the parameter space (but then there is a generalisation, Chernoff's theorem).
• Regularity conditions on the likelihood (Fisher information exists, MLE attains the minimum variance bound).

Summary last lecture
• We started to discuss hypothesis tests.
• Size of the test = probability to find the result in the critical region under the null hypothesis → rejection of the null hypothesis.
• Power of the test = probability of the result to be in the critical region under the alternative hypothesis → rejection of the null hypothesis.

Summary last lecture
• Error of the first kind: reject the null hypothesis although it is true (probability = significance level).
• Error of the second kind: accept the null hypothesis although it is false (probability = 1 minus the power).
• Consistency: power tends to 1 when N → ∞.
• Bias: probability to reject the null hypothesis is smaller if the alternative is true than if the null is true.
• If you have the choice between several tests, pick the one which provides the highest power.

Summary last lecture
• "Test" is often used equivalently to mean the test statistic, i.e. a (scalar) function of the data.
• The uniformly most powerful test statistic is given by the likelihood ratio (Neyman-Pearson lemma), for simple hypotheses.
• For composite hypotheses, the (locally most powerful) test statistic is the maximum likelihood ratio.
• Under some conditions (foremost nestedness) Wilks' theorem tells us that the likelihood ratio is distributed as $-2 \ln \lambda \sim \chi^2(r)$.

Δχ²
• Instead of a likelihood fit, you can apply a chi-square fit.
• The p-value can then be obtained from $\Delta\chi^2 = \chi^2_{\min}(H_0) - \chi^2_{\min}(H_1) \sim \chi^2$,
• where d.o.f. = number of parameters (H1) − number of parameters (H0).
• Under the same conditions as Wilks' theorem: significance = $\sqrt{\Delta\chi^2}$ (for one degree of freedom).

p-values

p-value distribution

Using p-values to find a test

• Usually, the null hypothesis is the current paradigm, which is why α is usually chosen to be small (say $10^{-7}$).
• N.B.: One usually talks about, for example, a "5 σ detection". The convention is to express α as the tail probability of a Gaussian process, in terms of the number of standard deviations.

Another example …

Trial factor and look-elsewhere effect
• In the previous example we knew the "position" of the excess.
• It is quite common that we do not know the position of the excess, but scan a spectrum to check whether there is an excess anywhere:
– Example: mass resonance in a particle search.
– Example: source of radiation on the sky.
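The "N σ" convention above can be made concrete with scipy (a sketch; the one-sided Gaussian-tail convention shown here is the common one, though conventions vary):

```python
from scipy import stats

# "5 sigma": alpha quoted as the one-sided tail probability of a
# standard Gaussian beyond 5 standard deviations.
alpha_5sigma = stats.norm.sf(5.0)   # about 2.9e-7

# For one degree of freedom, significance = sqrt(Delta chi-squared):
# the chi2(1) tail at z**2 equals the two-sided Gaussian tail at z.
z = 3.0
two_sided = stats.chi2.sf(z**2, df=1)   # equals 2 * stats.norm.sf(z)
```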
• This introduces the "look-elsewhere effect" or "trial factor correction".

Uncorrelated trials
• The corrected p-value (p_global) is calculated by estimating the probability to get at least one excess of significance p_local in a series of N trials.
• Can anyone guess what distribution can be used to calculate the corrected p-value?

Trial correction

Correlated trials
• Often search regions will overlap. In that case we have to rely entirely on Monte Carlo simulations, and/or turn to recent literature:
– S. Algeri, J. Conrad, D. A. van Dyk, B. Anderson, "Looking for a Needle in a Haystack? Look Elsewhere! A statistical comparison of approximate global p-values", e-Print: arXiv:1602.03765 [physics.data-an]
– E. Gross, O. Vitells (Weizmann Inst.), "Trial factors or the look elsewhere effect in high energy physics", Eur. Phys. J. C70 (2010) 525-530, e-Print: arXiv:1005.1891 [physics.data-an]

Bayes factors
• In Bayesian methodology you calculate the ratio of probabilities for two hypotheses (the Bayes factor).
• This is a ratio of beliefs: if R = 5, this means we believe in H0 more than in H1.
• This gives us the odds we would accept for betting on H0 and against H1.
• Certainly not a prior-independent result.

Another science example
Astrophys. J. 747 (2012) 121

Maximum likelihood test statistic

…this one is not chi-squared …
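For uncorrelated trials, the answer to the question in the slide is the binomial distribution: the corrected p-value is the probability of at least one success in N trials of probability p_local. A minimal sketch (the numbers are made up for illustration):

```python
# Probability of at least one excess at local p-value p_local
# anywhere in N independent (uncorrelated) search regions.
p_local, n_trials = 1e-4, 100
p_global = 1.0 - (1.0 - p_local) ** n_trials
# For small p_local this is approximately n_trials * p_local.
```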
• TS > 24 ≈ 4.9 sigma detection if chi-squared; here: ≈ 3 sigma.

… not there yet …

… but that is not all …

Summary hypothesis test
• The goal of hypothesis tests is to test whether a hypothesis is compatible with the data when compared with a defined (or undefined, in which case: "goodness-of-fit") alternative.
• The properties of the test are size (level of significance), power (probability that the alternative hypothesis is detected), consistency (power of the test tends to 100% with the number of samples) and bias (probability to accept the null hypothesis is larger if the alternative hypothesis is true).
• A test that is the most powerful in the entire parameter space is called "uniformly most powerful", or UMP.
• For simple hypotheses such a test always exists, i.e. the Neyman-Pearson (likelihood ratio) test.

Hypothesis test
• For composite hypotheses, we can use the maximum likelihood ratio test.
• Often the likelihood ratio cannot be easily calculated, especially in the case of many parameters (multivariates).
• In this case we will have to resort to special (multivariate) methods, e.g. machine learning.
• We will discuss this in one of the next lectures.

Summary last lecture
• We discussed hypothesis testing further.
• In particular we presented:
1) Delta-chi-squared: $\Delta\chi^2 = \chi^2_{\min}(H_0) - \chi^2_{\min}(H_1)$
2) We introduced the p-value (the probability, under the assumption of the null hypothesis, to observe data with equal or lesser compatibility with the null hypothesis relative to the data we actually measured).

Summary last lecture
2) p-values are a function of the data and as such are themselves random variates. We noted that p-values are uniformly distributed under the null.
3) We briefly introduced Bayesian hypothesis testing, in particular Bayes factors: the Bayes factor represents the odds that you are willing to bet on the null hypothesis.

Jeffreys scale
• It is not quite clear (to me at least) how to interpret the Bayes factor, but there is a conventional scale that is used, the Jeffreys scale:

R                  Strength of evidence
< 1                Negative
1 – 10^(1/2)       Barely worth mentioning
10^(1/2) – 10      Substantial
10 – 10^(3/2)      Strong
10^(3/2) – 10^2    Very strong
> 10^2             Decisive

SPECIALIZED TESTS

Addition to last lecture
• I have mentioned in passing that the maximum likelihood test is most powerful in the neighbourhood of the null hypothesis.
• This can be formulated in terms of the likelihood by applying the Neyman-Pearson lemma.

Contents
• Student's t test
• Run test
• AOV
• Wilcoxon two-sample test
• Kruskal-Wallis multi-sample test
• Spearman rank test

Student's t test

Run test
• Combination with a chi-squared.
• Run: a sequence with the same sign.

Analysis of variances I

Analysis of variances II

What do I mean by "rank"?
• E.g. (3, 5, 5, 9) → R = (1, 2.5, 2.5, 4)
• What do I mean by rank sum: add up the ranks coming from sample x → the rank sum for sample 2 is determined.
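The rank example above, with ties receiving the average of the ranks they span, matches scipy's `rankdata`; the Mann-Whitney U statistic is then built from the rank sum of one sample. A sketch (the two-sample values below are made up for illustration):

```python
from scipy.stats import mannwhitneyu, rankdata

# Tied values share the average rank, as in the slide:
# (3, 5, 5, 9) -> (1, 2.5, 2.5, 4)
ranks = rankdata([3, 5, 5, 9])

# Two-sample rank test: U is computed from the rank sum of one sample.
u_stat, p = mannwhitneyu([1.1, 2.3, 2.9], [3.1, 4.0, 5.2],
                         alternative="two-sided")
```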
The U statistic
• f(i) tabulated.

Kruskal-Wallis rank test

Spearman rank test
• d = difference in ranks.

Summary last lecture
• We discussed two- and multi-sample tests:
– Run test: ordered list of two samples; a run is a sequence from one sample → the number-of-runs test statistic can be calculated explicitly.
– Mann-Whitney rank test: ranked list of two samples → U (a function of the rank sum) → test statistic tabulated.
– Kruskal-Wallis rank test: ranked list of samples → h statistic → chi-squared.

Summary last lecture
• Student's test (mostly useful for testing whether two samples are from a Gaussian).
• AOV: a set of tests for samples drawn from Gaussians under different conditions (test for location/variance, known/unknown variance, known/unknown location).
• We turned to goodness of fit.

Statistical methods of physics
GOODNESS OF FIT

Goodness of fit
• Test of a null hypothesis with a test statistic t; in this case the alternative hypothesis is the set of all possible alternative hypotheses.
• Thus the statement being aimed at is: if H0 were true and the experiment were repeated many times, one would obtain data as likely (or less likely) than the observed data with probability p.
• p is the p-value that we already encountered. As we noted then, it makes no reference to the alternative hypothesis (the test statistic might, however).

Example: Poisson counting rate

Distribution-free tests
• In this case, we knew the distribution of the test statistic, but that is not generally true; for many cases it can be quite complicated to calculate. One therefore considers distribution-free tests, i.e. tests whose distribution is known independently of the null hypothesis.
• In that case, it is sufficient to calculate the distribution of the test statistic once and then look up the value for your particular problem.
• The most commonly applicable null distribution for such tests is the χ² distribution (used for mapping t to the p-value).

Pearson's chi-square test
• The data consist of measurements Y = Y1, …, Yk; under H0 their expectation values are f = f1, …, fk:
$T = (\mathbf{Y} - \mathbf{f})^{T}\, V^{-1}\, (\mathbf{Y} - \mathbf{f})$
• where Y denotes the data points, f the expected values under H0, and V the covariance matrix.
• This is called "Pearson's" chi-square, since for k data points it behaves like χ²(k) if Y is Gaussian.

Chi-square test for histograms
• In this case you use the asymptotic normality of a multinomial PDF to find the distribution of the corresponding quadratic form in the bin contents, where N is the total number of events in the histogram, V the covariance matrix and n the vector of bin contents. The most usual case looks a little simpler:
$T = \sum_{i=1}^{k} \frac{(n_i - N p_i)^2}{N p_i}$
• This statistic behaves like χ²(k − 1).
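The histogram chi-square above can be sketched directly (the bin contents and probabilities are made-up numbers, chosen to satisfy the Npᵢ > 5 rule of thumb discussed later):

```python
import numpy as np
from scipy import stats

observed = np.array([18, 22, 31, 29])            # bin contents n_i, N = 100
expected = 100 * np.array([0.2, 0.2, 0.3, 0.3])  # N p_i under H0

# Pearson chi-square for a histogram, compared with chi2(k - 1).
t = np.sum((observed - expected) ** 2 / expected)
p = stats.chi2.sf(t, df=len(observed) - 1)

# scipy.stats.chisquare computes the same statistic and p-value.
t2, p2 = stats.chisquare(observed, expected)
```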
• This requires Gaussianity for the $N p_i$, with the empirical requirement on the number of expected events $N p_i > 5$.

Chi-square test with estimation of parameters
• If you use the data to estimate the parameters of the parent distribution, the Pearson test statistic T no longer behaves like χ²(k − 1).
• In this case, the distribution lies between χ²(k − 1) and χ²(k − r − 1), where r is the number of parameters that have been estimated from the data.
• Usually χ²(k − r − 1) holds (e.g. for maximum likelihood).

Neyman's chi-square
• Instead of the expected number of events in the denominator, you use the observed number of events:
$T = \sum_{i=1}^{k} \frac{(n_i - N p_i)^2}{n_i}$
• Easy, and asymptotically equivalent to Pearson's chi-square.

Choosing bin size

Some more details on choosing the optimal bin size

Likelihood chi-square
• Instead of assuming Gaussianity, you can use the actual distribution of the number of events in a bin. This is known:
– Poisson, if the total number of events varies.
– Multinomial, if the total number is fixed.
• In this case you can use the binned likelihood as a test statistic.

Binned likelihood con't
• Define the likelihood for a perfect fit ($n_i = \mu_i$).
• Then the likelihood ratio becomes $\lambda = L(\mathbf{n} \mid \boldsymbol{\mu}) \,/\, L(\mathbf{n} \mid \mathbf{n})$; for Poisson bins, $\ln \lambda = \sum_i \left[ n_i \ln(\mu_i / n_i) - \mu_i + n_i \right]$,
• and we set the logarithmic term to 0 if $n_i = 0$.

Binned likelihood con't
• The test statistic $-2 \ln \lambda$ obeys asymptotically a chi-square χ²(r − 1), with r the number of bins.
• My recommendation is to use it for both parameter fitting and GOF testing.
• The unbinned likelihood (and the likelihood function itself) is usually not a good GOF.
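The Poisson binned-likelihood test statistic above can be sketched as follows (the helper name is mine, not from the lecture):

```python
import numpy as np

def likelihood_chi2(n, mu):
    """-2 ln lambda with lambda = L(n|mu) / L(n|n) for Poisson bins.

    Expands to 2 * sum( mu_i - n_i + n_i * ln(n_i / mu_i) ),
    with the logarithmic term set to 0 for empty bins.
    """
    n = np.asarray(n, dtype=float)
    mu = np.asarray(mu, dtype=float)
    safe_n = np.where(n > 0, n, 1.0)      # avoid log(0); the term is 0 anyway
    log_term = np.where(n > 0, n * np.log(safe_n / mu), 0.0)
    return 2.0 * np.sum(mu - n + log_term)

t = likelihood_chi2([5, 0, 12], [4.0, 1.5, 10.0])
```

For a perfect fit (nᵢ = µᵢ) the statistic is exactly zero, as the definition requires.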
Binned and unbinned data
• Binning data always leads to a loss of information, so in general tests on unbinned data should be superior.
• The most commonly used (distribution-free) tests for unbinned data are based on the order statistic.
• Given N independent data points x1, …, xN of the random variable X, consider the ordered sample $x_{(1)} \le x_{(2)} \le \dots \le x_{(N)}$. This is called the order statistic, with (empirical) distribution function $F_N(x)$ = fraction of the sample with $x_i \le x$.

Example
• The difference between two EDFs, used with different norms (for different tests), is now used as a test statistic.

Kolmogorov-Smirnov test
• Maximum deviation of the EDF from F(x), the expected distribution under H0:
$D_N = \max_x \, | F_N(x) - F(x) |$
• For this test statistic a null distribution can be found.

Kolmogorov test con't
• Exercise: show that $4 N D_N^2$ asymptotically behaves as χ²(2).

p-value and score distribution (null hypothesis)

p-value and score distribution

Kolmogorov test WARNING!
• The Kolmogorov test is NOT good for binned data (the option unluckily exists in some popular analysis tools, e.g. ROOT).

Kolmogorov-Smirnov two-sample test

Summary last lecture
• Last time we discussed goodness of fit.
• Test of a null hypothesis with a test statistic t; in this case the alternative hypothesis is the set of all possible alternative hypotheses.
• Thus the statement being aimed at is: if H0 were true and the experiment were repeated many times, one would obtain data as likely (or less likely) than the observed data with probability p.
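Both KS variants above are available in scipy; a sketch on toy Gaussian data (and, per the warning above, only for unbinned data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(size=200)

# One-sample KS: D = max deviation of the EDF from the hypothesized CDF.
d, p = stats.kstest(x, "norm")      # H0: standard normal

# Two-sample KS: max deviation between the two EDFs.
y = rng.normal(loc=0.1, size=200)
d2, p2 = stats.ks_2samp(x, y)
```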
Summary last lecture
• When the distribution of the test statistic is known, the p-value can be calculated (example: counting experiment, Poisson distribution).
• Often we would like to have distribution-free tests (i.e. where the null distribution is known independently of the null hypothesis):
– Neyman's chi-squared
– Pearson's chi-squared
• These behave like a chi-squared with (k − 1) degrees of freedom if $N p_i > 5$.

Chi-squared per degree of freedom

Chi-squared for fitted hypothesis

Summary last lecture
• If you use the likelihood to do parameter inference on binned data, you can also calculate a goodness of fit from it (likelihood chi-squared).
• Binning: a rule of thumb says you want bins which have the same probability under the null → we introduced a graphical method for doing this from the cumulative distribution function.
• Unbinned: g.o.f. tests for unbinned data commonly rely on the order statistic (empirical distribution function).

Combination of tests
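The slides leave the combination of tests to the figures; one standard recipe for combining independent p-values (an assumption on my part, not necessarily the one shown in the lecture) is Fisher's method, $-2 \sum_i \ln p_i \sim \chi^2(2k)$ under the common null:

```python
from scipy import stats

# Fisher's method: combine k independent p-values via
# -2 * sum(ln p_i) ~ chi2(2k) under the common null hypothesis.
pvals = [0.05, 0.20, 0.80]
stat, p_comb = stats.combine_pvalues(pvals, method="fisher")
```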