Summary of last lecture
• We continued to discuss interval estimation and nuisance parameters.
• We made the connection between the Fisher information and confidence intervals.
• 68.3% two-sided confidence intervals can be estimated by evaluating the second derivative of the log-likelihood function at the best-fit point (e.g. HESSE within MINUIT).
Summary last lecture
• We then turned to discuss nuisance parameters in the Neyman construction.
• The Neyman construction can be combined with marginalisation and with profiling (the latter not so often used):

$$P(n \mid s) = \int P(n \mid s, b)\, G(b \mid b_{\mathrm{est}})\, db \quad \text{(marginalisation)}$$

$$\lambda = \frac{L(X \mid \theta_0, \hat{\nu}_{\mathrm{best}|0})}{L(X \mid \hat{\theta}_{\mathrm{best}}, \hat{\nu}_{\mathrm{best}})} \quad \text{(profiling)}$$
Summary last lecture
• The marginalisation ("hybrid-Bayesian") method results in over-coverage (in general).
• Profiling has, as far as I know, not been checked.
Summary last lecture
• Recommendation: likelihood-based intervals with profiling over nuisance parameters work well for most cases.
• When in doubt: Monte Carlo.
Hypothesis testing
Jan Conrad, Fysikum
[email protected], 08-553 7 8769, A5:1027
FK 8006: Statistical Methods in Physics
Last update: 28.12.2015
Hypothesis tests
• The goal of hypothesis tests is to test whether a hypothesis is compatible with data when compared against a defined (or undefined, in which case: "goodness-of-fit") alternative.
• Conventionally, one hypothesis is called the null hypothesis (H0), the other the alternative hypothesis (H1).
• The test proceeds by finding the critical region w, i.e. the region in data space for which:

$$p(d \in w \mid H_0) \le \alpha$$

• while at the same time maximizing:

$$p(d \in w \mid H_1) = 1 - \beta$$
Composite hypotheses and simple hypotheses
• Simple hypothesis: H0, H1 completely specified (no free parameters).
• Composite hypothesis: H0, H1 have one or more free parameters.
• Separate families of hypotheses: H0, H1 have one or more free parameters, but different functional forms, e.g. an exponential versus a power law in E.
Size, error of first and second kind
• The probability to reject H0, i.e.

$$p(d \in w \mid H_0) = \alpha,$$

is called the size of the test. In physics we call this the level of significance.
• Error of the first kind: rejecting H0 even though it is true (probability α); error of the second kind: accepting H0 even though it is false (probability β).
Power, test statistic
• The more sensitive the test, the better it can discriminate between the null and the alternative hypothesis; quantitatively, it has maximal power:

$$p(d \in w \mid H_1) = 1 - \beta$$

• In order to achieve this, especially in many dimensions, the observables are often replaced by a one-dimensional function of the observables, called the test statistic t, with critical region w_t:

$$p(t(d) \in w_t \mid H_0) \le \alpha$$
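As an illustration (not from the slides), here is a minimal Monte Carlo sketch of these definitions: given samples of a test statistic under H0 and H1, estimate the critical value for a chosen size α and the resulting power. The Gaussian toy model and all numbers are assumptions.

import numpy as np

rng = np.random.default_rng(1)
alpha = 0.05

# Toy model (assumption): t is Gaussian, mean 0 under H0, mean 2 under H1.
t_h0 = rng.normal(0.0, 1.0, 100_000)
t_h1 = rng.normal(2.0, 1.0, 100_000)

# Critical region: t >= t_crit, chosen so that p(t >= t_crit | H0) = alpha.
t_crit = np.quantile(t_h0, 1.0 - alpha)

# Power: p(t in critical region | H1) = 1 - beta.
power = np.mean(t_h1 >= t_crit)
print(f"t_crit = {t_crit:.2f}, power = {power:.3f}")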
Example:
Power and size
How to choose a test?
• Consider:

$$H_0: \theta = \theta_0$$
$$H_1: \theta = \theta_1$$

• Power:

$$\mathrm{pow}(\theta_1) = p(d \in w \mid \theta_1) = 1 - \beta$$

• Consider pow as a function of θ → the power function pow(θ).
• These are still simple hypotheses.
Uniformly most powerful test (UMP)
Consistency
• A test is called consistent if its power tends to 1 as the number of observations increases.
Bias
• Consider a power curve where the probability to reject the null hypothesis θ0 is smaller if θ1 is true than if θ0 is true. This is called a biased test.
Choice of test
The Neyman-Pearson test
• The best test of given size α is the one with the best critical region in data space, i.e. the region with maximal power (1 − β).
Neyman-Pearson test
• This expectation value is maximal for the fraction of data space where the above ratio is maximal.
• This ratio is called the likelihood ratio. The criterion for the best critical region w is that $L(d \mid H_1)/L(d \mid H_0) > c_\alpha$ everywhere inside w, with the constant $c_\alpha$ fixed by the size α.
• Since this ratio must be calculable for every point in data space, the test is only applicable to simple hypotheses. In this case, the likelihood ratio represents the uniformly most powerful test statistic.
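A hedged sketch of the Neyman-Pearson construction for two simple hypotheses; the Gaussian model (known σ, H0: µ = 0 vs. H1: µ = 1) and the toy-experiment approach are illustrative assumptions, not the lecture's worked example.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
mu0, mu1, sigma, n = 0.0, 1.0, 1.0, 10

def log_lr(x):
    # log of L(x | H1) / L(x | H0): the Neyman-Pearson test statistic
    return np.sum(norm.logpdf(x, mu1, sigma) - norm.logpdf(x, mu0, sigma))

# Null distribution of the statistic from toy experiments under H0,
# used to fix the critical value for size alpha = 0.05.
toys = np.array([log_lr(rng.normal(mu0, sigma, n)) for _ in range(20_000)])
crit = np.quantile(toys, 0.95)

x_obs = rng.normal(mu1, sigma, n)   # pretend data generated under H1
print("reject H0:", log_lr(x_obs) > crit)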
(Maximum) likelihood ratio test
• The likelihood ratio test is the generalization of the Neyman-Pearson test to composite hypotheses.
• Composite hypotheses:

$$H_0: \theta \in \omega$$
$$H_1: \theta \in \Omega$$

with ω a subset of the full parameter space Ω.
• Example: new-physics signal: s = 0 (no signal), s > 0 (signal present); maximize over, for example, b and s.
• In this situation you define the likelihood ratio as the ratio of the maximized likelihoods.
Likelihood ratio test continued
• The ratio can be reformulated as:

$$\lambda = \frac{\max_{\theta_{r+1}, \dots, \theta_{r+s}} L(X \mid \theta_1^0, \dots, \theta_r^0, \theta_{r+1}, \dots, \theta_{r+s})}{\max_{\theta} L(X \mid \theta_1, \dots, \theta_{r+s})}$$

• i.e. you maximize w.r.t. θj, j = 1, …, s while fixing θi = θi0, i = 1, …, r in the numerator, and maximize with respect to all parameters in the denominator.
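A minimal sketch of this maximization for the s = 0 vs. s > 0 example above, assuming a simple on/off counting model (counts n in the signal region, m in a control region that constrains b); the model and all numbers are assumptions for illustration.

import numpy as np
from scipy.optimize import minimize, minimize_scalar
from scipy.stats import poisson

# Assumed on/off model: n ~ Pois(s + b), m ~ Pois(tau * b).
n, m, tau = 15, 20, 2.0

def nll(s, b):
    # negative log-likelihood of the two counting measurements
    return -(poisson.logpmf(n, s + b) + poisson.logpmf(m, tau * b))

# Numerator: maximize over the nuisance parameter b with s fixed to 0 (H0).
num = minimize_scalar(lambda b: nll(0.0, b), bounds=(1e-9, 100.0), method="bounded")

# Denominator: maximize over both s and b (H1).
den = minimize(lambda p: nll(p[0], p[1]), x0=[1.0, m / tau],
               bounds=[(0.0, None), (1e-9, None)])

print(f"-2 ln lambda = {2.0 * (num.fun - den.fun):.2f}")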
Null distribution
• This test is not necessarily UMP.
• In hypothesis tests, traditionally, a very important property was knowledge of the null distribution.
• If H0 imposes r constraints on the s + r parameters of H1, then

$$-2 \ln \lambda \sim \chi^2(r)$$

under H0, for n → ∞. Nowadays you would probably check this property with a Monte Carlo simulation.
• In this example, the hypotheses are nested. The asymptotic property for nested hypotheses is called Wilks' theorem. Simple example: a new-physics signal over known background. H1: s + b, H0: s = 0.
Applicability of the chi-squared approximation (Wilks' theorem)
• Nested hypotheses
• N → ∞
• Gaussian errors
• Restricted region not on the boundary of the parameter space (but then there is a generalisation, Chernoff's theorem)
• Regularity conditions on the likelihood (Fisher information exists, MLE attains the minimum variance bound)
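A sketch of the Monte Carlo check mentioned above, under assumptions: Gaussian data with known σ = 1 and H0: µ = 0 nested in H1: µ free (r = 1), for which −2 ln λ = n x̄² in closed form; the observed tail fractions should match χ²(1).

import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(3)
n_toys, n = 10_000, 50

# Toys generated under H0 (standard normal data).
x = rng.normal(0.0, 1.0, size=(n_toys, n))
t = n * x.mean(axis=1) ** 2          # -2 ln(lambda) for this model

# Compare observed tail fractions with the chi2(1) expectation.
for cut in (1.0, 4.0, 9.0):
    print(cut, np.mean(t > cut), chi2.sf(cut, df=1))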
Summary last lecture
• We started to discuss hypothesis tests.
• Size of the test = probability to find the result in the critical region under the null hypothesis → rejection of the null hypothesis.
• Power of the test = probability for the result to be in the critical region under the alternative hypothesis → rejection of the null hypothesis.
Summary last lecture
• Error of the first kind: reject the null hypothesis even though it is true (probability = significance level).
• Error of the second kind: accept the null hypothesis even though it is false (probability = 1 minus the power).
• Consistency: power tends to 1 when N → ∞.
• Bias: probability to reject the null hypothesis is smaller if the alternative is true than if the null is true.
• If you have the choice between several tests, pick the one which provides the highest power.
Summary last lecture
• "Test" is often used equivalently to mean the test statistic, i.e. a (scalar) function of the data.
• The uniformly most powerful test statistic is given by the likelihood ratio (Neyman-Pearson lemma), for simple hypotheses.
• For composite hypotheses the (locally most powerful) test statistic is the maximum likelihood ratio.
• Under some conditions (foremost nestedness) Wilks' theorem tells us that the likelihood ratio is distributed as:

$$-2 \ln \lambda \sim \chi^2(r)$$
Δχ²
• Instead of a likelihood fit, you can apply a chi-square fit.
• The p-value can then be obtained from:

$$\Delta\chi^2 = \chi^2_{\min}(H_0) - \chi^2_{\min}(H_1)$$

• where d.o.f. = number of parameters (H1) − number of parameters (H0).
• Under the same conditions as Wilks' theorem, the significance follows from the χ² probability of Δχ².
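A one-line sketch of the mapping from Δχ² to a p-value under Wilks' conditions; the numbers are made up.

from scipy.stats import chi2

delta_chi2, dof = 9.3, 1            # hypothetical fit result, H1 has 1 extra parameter
print(chi2.sf(delta_chi2, df=dof))  # p-value = P(chi2(dof) >= Delta chi2)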
p-values
p-value distribution
Using p-values to find a test.
• Usually, the null hypothesis is the current paradigm, which is why α is usually chosen to be small (say 10⁻⁷).
• N.B.: One usually talks about, for example, a "5 σ detection". The convention is to express α as the probability corresponding to that number of standard deviations of a Gaussian.
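A sketch of this convention, assuming the common one-sided definition (Z σ corresponds to p = P(X > Z) for a standard Gaussian X):

from scipy.stats import norm

for z in (3.0, 5.0):
    print(z, norm.sf(z))     # 3 sigma ~ 1.3e-3, 5 sigma ~ 2.9e-7

print(norm.isf(1e-7))        # significance in sigma for a given p-value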
Another example …
Trial factor and look-elsewhere effect
• In the previous example we knew the "position" of the excess.
• It is quite common that we do not know the position of the excess, but scan a spectrum to check whether there is an excess anywhere:
– Example: a mass resonance in a particle search.
– Example: a source of radiation on the sky.
• This introduces the "look-elsewhere effect" or "trial factor correction".
Uncorrelated trials
• The corrected p-value (p_global) is calculated by estimating the probability to get at least one excess of significance p_local in a series of N trials.
• Can anyone guess what distribution can be used to calculate the corrected p-value?
Trial correction
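For independent trials the standard correction treats the scan as N binomial tries, p_global = 1 − (1 − p_local)^N; a minimal sketch (the numbers are made up):

def global_p(p_local, n_trials):
    # probability of at least one excess at least as significant as p_local
    return 1.0 - (1.0 - p_local) ** n_trials

print(global_p(1e-4, 100))   # a ~3.7 sigma local excess is only ~2.3 sigma globally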
Correlated trials
• Often search regions will overlap. In that case we have to rely entirely on Monte Carlo simulations, and/or turn to the recent literature:
– Sara Algeri, Jan Conrad, David A. van Dyk, Brandon Anderson, "Looking for a Needle in a Haystack? Look Elsewhere! A statistical comparison of approximate global p-values", e-Print: arXiv:1602.03765 [physics.data-an]
– Eilam Gross, Ofer Vitells (Weizmann Inst.), "Trial factors or the look elsewhere effect in high energy physics", Eur. Phys. J. C70 (2010) 525-530, e-Print: arXiv:1005.1891 [physics.data-an]
Bayes factors
• In Bayesian methodology you calculate the ratio of probabilities for two hypotheses (the Bayes factor).
• This is a ratio of beliefs: if R = 5, this means we believe H0 five times more than H1.
• This gives us the odds we would accept for betting on H0 and against H1.
• It is certainly not a prior-independent result.
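A hedged numerical sketch of a Bayes factor for a counting measurement; the model (known background b, flat prior on s up to s_max) and all numbers are assumptions for illustration, not the lecture's example.

from scipy.integrate import quad
from scipy.stats import poisson

# Assumed setup: n observed counts; H0: mu = b, H1: mu = s + b with a
# flat prior on s in [0, s_max].
n, b, s_max = 12, 5.0, 30.0

p_h0 = poisson.pmf(n, b)                                             # evidence for H0
p_h1, _ = quad(lambda s: poisson.pmf(n, s + b) / s_max, 0.0, s_max)  # evidence for H1

print("R = P(n|H0) / P(n|H1) =", p_h0 / p_h1)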
Another science example
Astrophys.J. 747 (2012) 121
Maximum likelihood test statistic
…this one is not chi-squared…
• TS > 24 would correspond to a ~4.9 σ detection if the statistic were chi-squared distributed; here it corresponds to ~3 σ.
… not there yet …
… but that is not all …
Summary hypothesis test
• The goal of hypothesis tests is to test whether a hypothesis is compatible with data when compared against a defined (or undefined, in which case: "goodness-of-fit") alternative.
• The properties of the test are size (level of significance), power (probability that the alternative hypothesis is detected), consistency (power of the test tends to 100% with the number of samples) and bias (probability to accept the null hypothesis is larger if the alternative hypothesis is true).
• A test that is the most powerful in the entire parameter space is called "uniformly most powerful", or UMP.
• For simple hypotheses such a test always exists, namely the Neyman-Pearson (likelihood ratio) test.
Hypothesis test
• For composite hypotheses, we can use the maximum likelihood ratio test.
• Often the likelihood ratio cannot be easily calculated, especially in the case of many parameters (multivariate data).
• In this case we have to resort to special (multivariate) methods, e.g. machine learning.
• We will discuss this in one of the next lectures.
Summary last lecture
• We discussed hypothesis testing further.
• In particular we presented:
1) The delta-chi-squared:

$$\Delta\chi^2 = \chi^2_{\min}(H_0) - \chi^2_{\min}(H_1)$$

2) We introduced the p-value (the probability, under the assumption of the null hypothesis, to observe data with equal or lesser compatibility with the null hypothesis relative to the data we actually measured).
Summary last lecture
2) p-values are a function of the data and as such are themselves random variates. We noted that p-values are uniformly distributed under the null.
3) We briefly introduced Bayesian hypothesis testing, in particular Bayes factors, which represent the odds that you are willing to bet on the null hypothesis.
Jeffreys scale
• It is not quite clear (to me at least) how to interpret the Bayes factor, but there is a conventional scale that is used, the Jeffreys scale:

R                    Strength of evidence
< 1                  negative
1 to 10^(1/2)        barely worth mentioning
10^(1/2) to 10       substantial
10 to 10^(3/2)       strong
10^(3/2) to 10^2     very strong
> 10^2               decisive
Hypothesis testing and goodness of fit
SPECIALIZED TESTS
Addition to last lecture
• I mentioned in passing that the maximum likelihood test is most powerful in the neighbourhood of the null hypothesis.
• This can be formulated in terms of the likelihood by applying the Neyman-Pearson lemma.
Table of contents
• Student's t test
• Run test
• AOV (analysis of variance)
• Wilcoxon two-sample test
• Kruskal-Wallis multi-sample test
• Spearman rank test
Student's t test
Run test
Combination with a chi-squared
• Run: a sequence of deviations with the same sign (see the sketch below).
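A minimal sketch of counting runs in a sequence of fit residuals; the sign convention and the example numbers are assumptions:

import numpy as np

def n_runs(residuals):
    # a run = maximal stretch of residuals with the same sign
    s = np.sign(residuals)
    s = s[s != 0]                       # drop exact zeros
    return 1 + int(np.sum(s[1:] != s[:-1]))

print(n_runs([0.3, 0.1, -0.2, -0.4, 0.5, -0.1]))   # -> 4 runs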
Analysis of Variance I
Analysis of Variance II
What do I mean by "rank"?
• E.g. (3, 5, 5, 9) → R = (1, 2.5, 2.5, 4).
• What do I mean by rank sum: add up the ranks belonging to sample 1; since the total rank sum of N values is fixed (N(N+1)/2), the rank sum for sample 2 is then determined (see the sketch below).
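A sketch of ranks and rank sums using scipy (the sample values are made up); scipy's mannwhitneyu implements the Wilcoxon/Mann-Whitney two-sample U test listed in the table of contents.

import numpy as np
from scipy.stats import mannwhitneyu, rankdata

print(rankdata([3, 5, 5, 9]))          # [1.  2.5 2.5 4.]: ties share mid-ranks

# Rank sum of sample 1 inside the merged, ranked two-sample list.
x, y = [1.2, 3.4, 2.2], [0.5, 2.9, 4.1, 3.8]
ranks = rankdata(np.concatenate([x, y]))
print("rank sum of sample 1:", ranks[: len(x)].sum())

print(mannwhitneyu(x, y))              # U statistic and its p-value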
The U statistic
• Its distribution f(U) is tabulated.
Kruskal-Wallis Rank Test
Spearman rank test
• d = difference in ranks
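A sketch of the Spearman coefficient built from the rank differences d, ρ = 1 − 6 Σd² / (n(n² − 1)), checked against scipy; the data are made up and tie-free, where this closed form holds.

import numpy as np
from scipy.stats import rankdata, spearmanr

x = np.array([1.0, 2.0, 4.0, 5.0, 8.0])
y = np.array([2.1, 9.5, 3.9, 17.0, 6.2])

d = rankdata(x) - rankdata(y)                    # d = difference in ranks
n = len(x)
rho = 1.0 - 6.0 * np.sum(d**2) / (n * (n**2 - 1.0))
print(rho, spearmanr(x, y).correlation)          # the two agree for tie-free data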
Summary last lecture
• We discussed two- and multi-sample tests:
– Run test: ordered list of two samples; a run is a sequence from one sample → the number of runs is the test statistic → can be calculated explicitly.
– Mann-Whitney rank test: ranked list of two samples → U (a function of the rank sum) is the test statistic → tabulated.
– Kruskal-Wallis rank test: ranked list of samples → h statistic → chi-squared.
Summary last lecture
• Student's t test (mostly useful for testing whether two Gaussian samples share the same mean).
• AOV: a set of tests for samples drawn from Gaussians under different conditions (test for location/variance, known/unknown variance, known/unknown location).
• We turned to goodness of fit.
Statistical methods of physics
GOODNESS OF FIT
Goodness of fit
• Test of a null hypothesis with a test statistic t, but in this case the alternative hypothesis is the set of all possible alternative hypotheses.
• Thus the statement being aimed at is: if H0 were true and the experiment were repeated many times, one would obtain data as likely (or less likely) than the observed data with probability p.
• p is the p-value that we already encountered. As we noted then, it makes no reference to the alternative hypothesis (the test statistic might, however).
Example: Poisson counting rate
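As a hedged sketch (the rate and count are made-up numbers): for a counting experiment the one-sided p-value is the Poisson tail probability P(n ≥ n_obs | µ).

from scipy.stats import poisson

mu, n_obs = 4.2, 11
print(poisson.sf(n_obs - 1, mu))   # P(n >= n_obs | mu), the one-sided p-value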
Distribution-free tests
• In this case, we knew the distribution of the test statistic, but that is not generally true. In many cases it can be quite complicated to calculate. One therefore considers distribution-free tests, i.e. tests whose distribution is known independently of the null hypothesis.
• In that case, it is sufficient to calculate the distribution of the test statistic once and then look up the value for your particular problem.
• The most commonly applicable null distribution for such tests is the χ² distribution (used for mapping t to the p-value).
Pearson's chi-square test
• The data consist of measurements Y = Y1, …, Yk; under H0 their expectation values are f = f1, …, fk.
• Here Y denotes the data points, f the expected values under H0, and V the covariance matrix.
• This is called "Pearson's" chi-square, since for k data points it behaves like χ²(k) if Y is Gaussian.
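A sketch of the quadratic form behind Pearson's chi-square, T = (Y − f)ᵀ V⁻¹ (Y − f), with made-up numbers; T behaves like χ²(k) for Gaussian Y.

import numpy as np
from scipy.stats import chi2

Y = np.array([10.2, 8.9, 12.4])     # measurements (made up)
f = np.array([10.0, 9.5, 11.0])     # expectations under H0 (made up)
V = np.diag([1.0, 1.2, 1.5])        # covariance matrix (diagonal here)

r = Y - f
T = r @ np.linalg.solve(V, r)       # (Y - f)^T V^{-1} (Y - f)
print(T, chi2.sf(T, df=len(Y)))     # statistic and its chi2(k) p-value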
Chi-square test for histograms
• In this case you use the asymptotic normality of a multinomial PDF to find the distribution of the test statistic, where N is the total number of events in the histogram, V the covariance matrix and n the vector of bin contents.
• The most usual case looks a little simpler: the statistic behaves like χ²(k − 1). This requires Gaussianity for the Npi, with the empirical requirement on the number of expected events Npi > 5.
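A sketch of the usual histogram case, T = Σ (n_i − N p_i)² / (N p_i) ~ χ²(k − 1); the bin contents and probabilities are made up, with all N p_i > 5.

import numpy as np
from scipy.stats import chisquare

n_obs = np.array([18, 25, 31, 16, 10])        # bin contents, N = 100
p = np.array([0.20, 0.25, 0.30, 0.15, 0.10])  # H0 bin probabilities

T, pval = chisquare(n_obs, n_obs.sum() * p)   # dof defaults to k - 1
print(T, pval)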
Chi-square test with estimation of parameters
• If you use the data to estimate the parameters of the parent distribution, the Pearson test statistic T no longer behaves like χ²(k − 1).
• In this case, the distribution lies between χ²(k − 1) and χ²(k − r − 1), where r is the number of parameters estimated from the data.
• Usually χ²(k − r − 1) holds (e.g. for maximum likelihood estimation).
Neyman’s chi-square
• Instead of the expected number of events in the denominator,
you consider the observed number of events.
• Easy, and asymptotically equivalent to Pearson’s chi-square.
Choosing bin size
Some more details on choosing optimal bin size
Likelihood chi-square
• Instead of assuming Gaussianity, you can use the actual distribution of the number of events in a bin. This is known:
– Poisson, if the total number of events is variable.
– Multinomial, if the total number is fixed.
• In this case you can use the binned likelihood as a test statistic.
Binned likelihood cont'd
• Define the likelihood for a perfect fit (ni = µi).
• Then form the likelihood ratio of the actual fit to this perfect fit; we set the last term to 0 if ni = 0.
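For Poisson bins, this ratio gives the standard likelihood chi-square (what I believe is the Baker-Cousins form), −2 ln λ = 2 Σ [µ_i − n_i + n_i ln(n_i/µ_i)], with the log term set to 0 for empty bins as stated above; a minimal sketch with made-up bin contents:

import numpy as np

def likelihood_chi2(n, mu):
    # Poisson binned -2 ln(lambda); the n_i * ln(n_i / mu_i) term is 0 for n_i = 0
    n, mu = np.asarray(n, float), np.asarray(mu, float)
    safe_n = np.where(n > 0, n, 1.0)                      # avoid log(0)
    log_term = np.where(n > 0, n * np.log(safe_n / mu), 0.0)
    return 2.0 * np.sum(mu - n + log_term)

print(likelihood_chi2([5, 0, 12, 7], [6.0, 1.2, 10.5, 7.3]))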
Binned likelihood cont'd
• The test statistic asymptotically obeys a chi-square, χ²(r − 1), with r the number of bins.
• My recommendation is to use it for both parameter fitting and GOF testing.
• The unbinned likelihood (and the likelihood function itself) is usually not a good GOF statistic.
Binned and unbinned data
• Binning data always leads to a loss of information, so in general tests on unbinned data should be superior.
• The most commonly used tests for unbinned data (that are distribution-free) are based on the order statistic.
• Given N independent data points x1, …, xN of the random variable X, consider the ordered sample x(1) ≤ x(2) ≤ … ≤ x(N). This is called the order statistic; its associated distribution function is the empirical distribution function (EDF).
Example
• The difference between two EDFs, evaluated with different norms (for different tests), is then used as a test statistic.
Kolmogorov-Smirnov test
• Maximum deviation of the EDF from F(x) (the expected distribution under H0).
• For this test statistic a null distribution can be found.
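A sketch using scipy's KS test against a fully specified H0; the standard-normal data are an assumption for illustration. Note that if the parameters of F were fitted from the same data, the tabulated null distribution of D would no longer apply.

import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(4)
x = rng.normal(0.0, 1.0, 200)               # unbinned data (toy)

res = kstest(x, "norm", args=(0.0, 1.0))    # D = max |EDF - F|, plus p-value
print(res.statistic, res.pvalue)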
Kolmogorov test cont'd
• Exercise: show that it behaves as χ²(2).
p-value and score distribution (null hypothesis)
p-value and score distribution
Kolmogorov test WARNING!
• The Kolmogorov test is NOT good for binned data (the option unfortunately exists in some popular analysis tools, e.g. ROOT).
Kolmogorov-Smirnov two-sample test
Summary last lecture
• Last time we discussed goodness of fit.
• Test of a null hypothesis with a test statistic t, but in this case the alternative hypothesis is the set of all possible alternative hypotheses.
• Thus the statement being aimed at is: if H0 were true and the experiment were repeated many times, one would obtain data as likely (or less likely) than the observed data with probability p.
Summary last lecture
• When the distribution of the test statistic is known, the p-value can be calculated (example: counting experiment, Poisson distribution).
• Often we would like to have distribution-free tests (i.e. where the null distribution is known independently of the null hypothesis):
– Neyman's chi-squared
– Pearson's chi-squared
• Both behave like a chi-squared with (k − 1) degrees of freedom if Np > 5.
Chi-squared per degree of freedom
Chi-squared for fitted hypothesis
Summary last lecture
• If you use the likelihood to do parameter inference on binned data, you can also calculate a goodness of fit from it (the likelihood chi-squared).
• Binning: the rule of thumb is to choose bins which have equal probability under the null → we introduced a graphical method for doing this from the cumulative distribution function.
• Unbinned: g.o.f. tests for unbinned data commonly rely on the order statistic (empirical distribution function).
Combination of tests