To P or not to P???
Does a statistical test hold what it promises?
"There is increasing concern that in modern research, false findings may be the majority or even the vast majority of published research claims." Ioannidis (2005, PLoS Medicine)
Probably more than 70% of all medical and biological scientific studies are irreproducible!
Case 1: We compare the effects of two drugs to control blood pressure in two groups of patients.
Case 2: We test the effect of a drug to control blood pressure against a null control group.

$$P(t) = P\left(t = \frac{\text{effect size}}{\text{standard error}}\right) < 0.001$$

Case 1: We compare two observations. We use the t-test to calculate a probability of difference.
Case 2: We compare an observation against a null expectation. We use the t-test to assess the validity of a hypothesis.

We test for a significant correlation:

$$P(t) = P\left(t = \frac{r}{\sqrt{1-r^2}}\sqrt{n-2}\right) < 0.001$$

We compare an observed statistic against an unobserved null assumption.
An intuitive significance test
Formally, we test H1: r² = 0.57 against the alternative H0: r² = 0.
We test an observation against a specific null assumption; we compare two specific values of r².
This is not the same as testing the hypothesis that X and Y are correlated against the hypothesis that X and Y are not correlated.
X and Y might not be correlated and still have r² ≠ 0.
This happens when X and Y are jointly constrained by marginal settings.
Number of storks and reproductive rates in Europe (Matthews 2000)
Country       Area [km²]   Stork pairs   Stork density   Inhabitants   Annual births   Annual birth rate
Albania            28750           100         0.00348       3200000           83000               0.026
Belgium            30520             1         0.00003       9900000           87000               0.009
Bulgaria          111000          5000         0.04505       9000000          117000               0.013
Denmark            43100             9         0.00021       5100000           59000               0.012
Germany           357000          3300         0.00924      78000000          901000               0.012
France            544000           140         0.00026      56000000          774000               0.014
Greece            132000          2500         0.01894      10000000          106000               0.011
Netherlands        41900             4         0.00010      15000000          188000               0.013
Italy             301280             5         0.00002      57000000          551000               0.010
Austria            83860           300         0.00358       7600000           87000               0.011
Poland            312680         30000         0.09594      38000000          610000               0.016
Portugal           92390          1500         0.01624      10000000          120000               0.012
Spain             504750          8000         0.01585      39000000          439000               0.011
Switzerland        41290           150         0.00363       6700000           82000               0.012
Turkey            779450         25000         0.03207      56000000         1576000               0.028
Hungary            93000          5000         0.05376      11000000          124000               0.011
[Scatter plot (log scale): stork numbers against human birth numbers across countries; r² = 0.25, P < 0.05. Note: Excel gets plots at log scales wrong.]
Pseudo-correlations between X and Y arise when

X = f(U), Y = g(U), f = h(g)

[Diagram: apparent link storks → birth rate; actual links urbanisation → storks and urbanisation → birth rate]

The sample spaces of both variables are constrained by one or more hidden variables that are themselves correlated.
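A minimal sketch of this mechanism (all numbers hypothetical, not from the slides): a hidden variable U drives both X and Y, so they correlate although neither causes the other.

```python
import numpy as np

rng = np.random.default_rng(42)
u = rng.normal(size=500)               # hidden variable U (e.g. urbanisation)
x = 2.0 * u + rng.normal(size=500)     # X = f(U) + noise ("storks")
y = -1.0 * u + rng.normal(size=500)    # Y = g(U) + noise ("birth rate")

# X and Y correlate clearly although neither causes the other
print(np.corrcoef(x, y)[0, 1])         # about -0.6
```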
๐‘˜
๐‘ ๐‘˜โˆ’1 โ†
Some basic knowledge
๐‘๐‘–
๐‘˜
1
1
The sum of squared normal distributed
variates is approximately c2 distributed with
k degrees of freedom.
The sum of differently distributed variates is
approximately normally distributed with k-1
degrees of freedom (central limit theorem).
๐œ™ 0; 1 =
1
2๐œ‹
๐‘๐‘– 2
๐œ’2 ๐‘˜ โ†
1
โˆ’2๐‘๐‘– 2
๐‘’
๐‘˜
๐œ’2 ๐‘˜ โ†
1
๐‘๐‘– 2
(๐‘ฅ๐‘– โˆ’ ๐œ‡
๐œŽ
๐‘˜
2
=
1
(๐‘ฅ๐‘– โˆ’ ๐œ‡)2
=
๐œŽ2
A Poisson distribution has s2 = m
๐‘˜
1
(๐‘ฅ๐‘– โˆ’ ๐œ‡)
=
๐œŽ
2
(๐‘ฅ๐‘– โˆ’ ๐œ‡)2
๐œ‡
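A small simulation sketch (not from the slides, SciPy assumed) checking the statement that sums of k squared standard normal variates follow χ²(k):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
k = 5
# sums of k squared N(0,1) variates, 100,000 replicates
sums = (rng.normal(size=(100_000, k)) ** 2).sum(axis=1)

for q in (0.50, 0.95, 0.99):
    # simulated quantiles closely match the theoretical chi-square quantiles
    print(q, round(np.quantile(sums, q), 2), round(stats.chi2.ppf(q, df=k), 2))
```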
The χ² test
The likelihood of a model equals the probability of obtaining the observed data given certain parameter values.
The maximum likelihood refers to those parameter values of a function (model) that maximize the likelihood given the data.

$$L(\Theta \mid X) = P(X \mid \Theta)$$
Likelihood ratios (odds)

$$\chi^2(k) \leftarrow \sum_{i=1}^{k} z_i^2 \qquad \frac{z_1^2}{z_2^2} \rightarrow F$$

The quotient of two squared normally distributed random variates is F distributed.
Sir Ronald Fisher
๐œ’ 2 ๐‘˜ =๐œ’ 2 ๐‘˜1 +๐œ’ 2 ๐‘˜2 โ†’ ๐œ’ 2 ๐‘˜1 =๐œ’ 2 ๐‘˜ โˆ’๐œ’ 2 ๐‘˜2
The sum of two c2 distributions has k = k1+k2 degrees of freedom.
1
l is normally distributed. -2ln(l) is c2 distributed.
๐‘0
2
๐œ’ ๐‘˜1 โˆ’ ๐‘˜0 = โˆ’2 ln
๐‘1
The log-quotient of two normally distributed random variates is asymptotically c2
distributed (theorem of Wilk).
๐œ†=๐‘’
โˆ’2๐‘ 2
โ†’-2ln(๐œ†)=๐‘ 2
Fisher argued that hypotheses can be tested using likelihoods:

$$\Lambda = -2\ln\frac{\text{maximum likelihood of the null assumption}}{\text{maximum likelihood of the observation}} = -2\ln\frac{P(X=x \mid \theta_0)}{P(X=x \mid \theta_1)}$$

$$P(\Lambda) = P(\chi^2;\; df_{\chi^2} = df_{obs} - df_{null})$$
Classical frequentist hypothesis testing
We toss a coin 100 times and get heads 59 times. Is the coin fair?
Fisher would contrast two probabilities of a binomial process given the outcome of 59 heads: p = 1/2 versus p = 59/100.

Likelihoods with the p = 1/2 and p = 59/100 estimates:

$$P(x=59) = \binom{100}{59}\left(\frac{1}{2}\right)^{100} = 0.016$$

$$P(x=59) = \binom{100}{59}\left(\frac{59}{100}\right)^{59}\left(1-\frac{59}{100}\right)^{100-59} = 0.081$$

$$\Lambda = -2\ln\frac{0.016}{0.081} = -2\ln(0.20) = 3.26$$

$$P(\Lambda) = P(\chi^2 = 3.26;\; df_{\chi^2} = 1) = 0.93$$

The odds L(H0)/L(H1) = 0.016/0.081 = 0.2; H1 is five times more probable than H0.
The probability that the observed binomial probability q1 = 59/100 differs from q0 = 1/2 given the observed data is Pobs = 0.93.
The probability in favour of the null assumption is therefore P0 = 1 - 0.93 = 0.07.
According to Fisher, the test failed to reject the hypothesis of p = 59/100.
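A sketch of this calculation (SciPy assumed); the printed values should reproduce the slide's 0.016, 0.081, 3.26, and 0.93 up to rounding:

```python
import math
from scipy.stats import binom, chi2

L0 = binom.pmf(59, 100, 0.5)     # likelihood under the null p = 1/2      (about 0.016)
L1 = binom.pmf(59, 100, 0.59)    # likelihood at the observed p = 59/100  (about 0.081)
Lam = -2 * math.log(L0 / L1)     # likelihood ratio statistic             (about 3.26)
P = chi2.cdf(Lam, df=1)          # about 0.93; hence 1 - P = 0.07 in favour of the null
print(L0, L1, Lam, P, 1 - P)
```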
Fisher:
• The significance P of a test is the probability of the hypothesis given the data!
• The significance P of a test refers to a hypothesis to be falsified.
• It is the probability of obtaining an effect in comparison to a random assumption.
• As a hypothesis, P belongs in the discussion section of a publication.
In Fisher's view a test should falsify a hypothesis with respect to a null assumption given the data.
This is in accordance with the Popper-Lakatos approach to scientific methodology.
The Pearson-Neyman framework
Egon Pearson; Jerzy Neyman

$$P(x \geq 59) = \sum_{i=59}^{100}\binom{100}{i}\left(\frac{1}{2}\right)^{100} = 0.04$$

The likelihood result:

$$P(1-\Lambda) = 1 - P(\chi^2 = 3.26;\; df_{\chi^2} = 1) = 0.07$$

Pearson and Neyman asked what the probability of the data is given the model!
The significance value α of a test is the probability (the evidence) against the null assumption.
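The tail probability can be checked directly (SciPy assumed):

```python
from scipy.stats import binom

# P(x >= 59) under a fair coin = 1 - P(x <= 58), about 0.044
print(binom.sf(58, 100, 0.5))
```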
Type I and type II errors:

             H1 true   H0 true
Reject H0    1-P       P   (type I error)
Reject H1    Q         1-Q (type II error)

P is the probability of rejecting H0 given that H0 is true (the type I error rate).
It is not allowed to equate P and Q, the probability of rejecting H1 given that H1 is true (the type II error rate).
Classical frequentist hypothesis testing
[Figure: probability density of the test statistic b under H0 and H1; the significance P corresponds to the tail area of the H0 distribution beyond the test value. A second panel shows the cumulative distribution of b under H0, giving P(H1) as a function of the test value.]
For 50 years Pearson and Neyman prevailed because their approach is simpler in most applications.
Pearson-Neyman:
• The significance P of a test is the probability that our null hypothesis is true in comparison to a precisely defined alternative hypothesis.
• This approach does not raise concerns if we have two and only two contrary hypotheses (tertium non datur).
• As a result, P belongs in the results section of a publication.
In the view of Pearson and Neyman a test should falsify a null hypothesis with respect to the observation.
Fisher:
• A test aims at falsifying a hypothesis.
• We test for differences in the model parameters.
• P-values are part of the hypothesis development.
• We test the observed data.

Pearson-Neyman:
• A test aims at falsifying a null assumption.
• We test against assumed data that have not been measured.
• P-values are central to hypothesis testing.
• We test against something that has not been measured.

P is not the probability that H0 is true!!
1-P is not the probability that H1 is true!!
Rejecting H0 does not mean that H1 is true.
Rejecting H1 does not mean that H0 is true.
• The test does not rely on prior information.
• It does not consider additional hypotheses.
• The result is invariant to the way of testing.
A word on logic
Modus tollens:

A → ¬B
¬(¬B)
∴ ¬A

If Ryszard is from Poland, he is probably not a member of the Sejm.
Ryszard probably is a member of the Sejm.
Thus he is probably not a citizen of Poland.

If P(H1) > 0.95, H0 is probably false.
H0 is probably true.
Therefore P(H1) < 0.95.

This does not mean that H1 is probably false. It only means that we don't know.
If multiple null assumptions are possible, the results of classical hypothesis testing are difficult to interpret.
If multiple hypotheses are contrary to a single null hypothesis, the results of classical hypothesis testing are difficult to interpret.
Pearson-Neyman and Fisher testing always works properly only if there are two and only two truly contrasting alternatives.
Examples
"The pattern of co-occurrences of the two species appeared to be random (P(H0) > 0.3)."
(We cannot test for randomness.)
"We reject our hypothesis about antibiotic resistances in the Bacillus thuringiensis strains (P(H1) > 0.1)."
(We can only reject null hypotheses.)
"The two enzymes did not differ in substrate binding efficacy (P > 0.05)."
(We do not know.)
"Time of acclimation and type of injection significantly affected changes in Tb within 30 min after injection (three-way ANOVA: F5,461 = 2.29; P < 0.05)."
(With n = 466, time explains 0.5% of the variation.)
"The present study has clearly confirmed the hypothesis that non-native gobies are much more aggressive fish than are bullheads of comparable size... This result is similar to those obtained for invasive round goby in its interactions with the native North American cottid (F1,14 = 37.83)."
(If others have found the same, we should rather test for a lack of difference. The present null assumption is only a straw man.)
The Bayesian philosophy
The law of conditional probability:

$$p(A \wedge B) = p(B \wedge A)$$

$$p(A \wedge B) = p(A \mid B)\,p(B) \qquad p(B \wedge A) = p(B \mid A)\,p(A)$$

$$p(A \mid B)\,p(B) = p(B \mid A)\,p(A)$$

Theorem of Bayes:

$$p(A \mid B) = \frac{p(B \mid A)\,p(A)}{p(B)}$$

$$\text{posterior} = \frac{\text{conditional} \times \text{prior}(A)}{\text{prior}(B)}$$

Thomas Bayes (1702-1761); Abraham de Moivre (1667-1754)
A frequentist test provides a precise estimate of probability:

$$p(\text{post} \mid \text{prior}) = p(\text{post})$$

The posterior is independent of the prior.

[Figure: the posterior probability P is the same for any prior (0, 0.1, 0.5, 0.9, 0.99).]

Under a frequentist interpretation a statistical test provides an estimate of the probability in favour of our null hypothesis.
In the frequentist interpretation, probability is an objective reality.
๐‘ ๐‘๐‘œ๐‘ ๐‘ก ๐‘๐‘Ÿ๐‘–๐‘œ๐‘Ÿ =
๐‘ ๐‘๐‘Ÿ๐‘–๐‘œ๐‘Ÿ ๐‘๐‘œ๐‘ ๐‘ก
๐‘(๐‘๐‘œ๐‘ ๐‘ก)
๐‘(๐‘๐‘Ÿ๐‘–๐‘œ๐‘Ÿ)
DP
P
P
Post is mediated by prior
๐‘ ๐‘๐‘œ๐‘ ๐‘ก ๐‘๐‘Ÿ๐‘–๐‘œ๐‘Ÿ โ‰ค ๐‘(๐‘๐‘œ๐‘ ๐‘ก)
0
0.1
0. 5
0.9
0.99
A Bayesian interpretation of probability
Under a Bayesian interpretation, a statistical test estimates how much the test shifted an initial assumption about the probability of our hypothesis towards statistical significance.
Significance is the degree of belief based on prior knowledge.
The earth is round: P < 0.05 (Goodman 1995)
Often null hypotheses serve only as a straw man to "support" our hypothesis (fictional testing).
We perform a test in our bathroom to see whether the water surface in the filled bathtub is curved according to a globe or to a three-dimensional heart.
Our test gives P = 0.98 in favour of an earth-like curvature (P(H0) < 0.05).
Does this change our view about the geometry of the earth?
Does this mean that a heart model has 2% support?

[Figure: the frequentist probability level in favour of H0 (that the earth is a heart) against the Bayesian probability level in favour of H0, across priors from 0.9 down to 0.00000001.]

The higher our initial guess about the probability of our hypothesis, the less any new test contributes to further evidence.
Frequentist tests are not as strong as we think.
Confirmatory studies
A study reports that potato chips increase the risk of cancer (P < 0.01).
Tests in confirmatory studies must consider prior information:

$$p(\text{post} \mid \text{prior}) = \frac{p(\text{prior} \mid \text{post})}{p(\text{prior})}\,p(\text{post})$$

Our test provides a significance level independent of prior information only if we are quite sure about the hypothesis to be tested: P(H1) = 0.99.
However, previous work did not find a relationship, so we believe that p(H1) < 0.5.
Our test then returns a probability of P = (0.0 ... 0.5) × 0.99 < 0.5.
The posterior test is not as significant as we believe.
Bayesian prior and conditional probabilities are often not known and have to be guessed.
Frequentist inference did a good job; we have scientific progress.
Bayesian inference
Bayes factor (odds):

$$BF = \frac{p(A)}{p(B)} = \frac{p(A \mid B)}{p(B \mid A)} \qquad BF = \frac{p(t \mid H_1)}{p(t \mid H_0)}$$

We have 59 heads and 41 tails. Does this mean that heads have a higher probability?
The Bayesian approach asks what the probability of our model is with respect to any other possible model.

The frequentist approach:

$$p(x \geq 59) = \sum_{i=59}^{100}\binom{100}{i}\left(\frac{1}{2}\right)^{100} = 0.044$$

The Bayesian approach (the denominator is the marginal likelihood of 59 heads with all head probabilities x taken as equally likely):

$$K = \frac{p(x \geq 59 \mid p = 1/2)}{\binom{100}{59}\int_0^1 x^{59}(1-x)^{41}\,dx} = \frac{0.044}{0.0099} = 4.44$$

Under Bayesian logic the observed result is only about 5 times less probable than any other result.
The odds for a deviation are 4.44; 1/4.44 = 0.23.
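A sketch of this computation (SciPy assumed; scipy.special.beta is Euler's beta function, so the integral evaluates exactly to 1/101 ≈ 0.0099):

```python
from scipy.stats import binom
from scipy.special import comb, beta

tail = binom.sf(58, 100, 0.5)              # frequentist tail probability, about 0.044
marginal = comb(100, 59) * beta(60, 42)    # C(100,59) * Int x^59 (1-x)^41 dx = 1/101
print(tail / marginal)                     # K, about 4.4
```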
How to recalculate frequentist probabilities in a Bayesian framework
The Bayes factor gives the odds in favour of H0. A factor of 1/10 means that H0 is ten times less probable than H1.

Bayes factor in favour of H0   Z-score   Parametric frequentist probability
0.5                            1.177     0.239032
0.1                            2.146     0.031876
0.05                           2.448     0.014375
0.01                           3.035     0.002407
0.001                          3.717     0.000202
0.0001                         4.292     0.000018
0.00001                        4.799     0.000002

For tests approximately based on the normal distribution (Z, t, F, χ²) Goodman defined the minimal Bayes factor BF as:

$$\Lambda = \chi^2 = -2\ln\frac{p(t \mid H_0)}{p(t \mid H_1)} = -2\ln(BF) \qquad BF = \frac{p(t \mid H_0)}{p(t \mid H_1)} = e^{-\chi^2/2}$$

For large n, χ² is approximately normally distributed:

$$BF = e^{-Z^2/2} \qquad Z = \sqrt{-2\ln(BF)}$$

For a hypothesis to be 100 times more probable than the alternative model we need a parametric significance level of P < 0.0024!
Bayesian statisticians call for using P < 0.001 as the upper limit of significance!!
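A sketch reproducing the table above from Goodman's relation (SciPy assumed):

```python
import math
from scipy.stats import norm

for bf in (0.5, 0.1, 0.05, 0.01, 0.001, 0.0001, 0.00001):
    z = math.sqrt(-2 * math.log(bf))   # Z = sqrt(-2 ln BF)
    p = 2 * norm.sf(z)                 # two-sided parametric P-value
    print(f"BF = {bf:<7}  Z = {z:.3f}  P = {p:.6f}")
```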
All models are wrong but some are useful. (George Box)
[Figure: the same 19 data points fitted by a 4th-order polynomial, y = -0.375x⁴ + 14.462x³ - 164.12x² + 609.02x - 356.84 with R² = 0.9607, and by a straight line, y = 90.901x with R² = 0.5747.]

Hirotugu Akaike; William of Ockham
Occam's razor: Pluralitas non est ponenda sine necessitate (plurality must not be posited without necessity).

The sample-size-corrected Akaike criterion of model choice:

$$AIC_c = 2k - 2\ln(L) + \frac{2k(k+1)}{n-k-1}$$
Any test for goodness of fit will eventually become significant if we only enlarge the number of free parameters.

k: total number of model parameters + 1
n: sample size
L: maximum likelihood estimate of the model

[Figure: bias against number of variables; maximum information content lies at an optimum between few and many variables, trading explained variance against significance.]
2๐‘˜(๐‘˜ + 1
๐‘›โˆ’๐‘˜โˆ’1
Maximum likelihood estimated
๐ด๐ผ๐ถ = 2๐‘˜ โˆ’ 2 ln ฮ› +
by c2
๐ด๐ผ๐ถ๐‘ = 2๐‘˜ + ๐œ’ 2 +
2๐‘˜(๐‘˜ + 1)
๐‘›โˆ’๐‘˜โˆ’1
by r2
1 โˆ’ ๐‘Ÿ2
2๐‘˜(๐‘˜ + 1)
๐ด๐ผ๐ถ๐‘ = 2๐‘˜ + ๐‘™๐‘›
+
๐‘›
๐‘›โˆ’๐‘˜โˆ’1
The lower the AIC, the more parsimonious the model.

$$\Delta AIC = AIC_1 - AIC_2$$

We choose the model with the lowest AIC ("the most useful model").
This is often not the model with the lowest P-value.
AIC model selection serves to find the best descriptor of observed structure.
It is a hypothesis-generating method. It does not test for significance.
Model selection using significance levels is a hypothesis-testing method.
When to apply AIC:
General linear modelling (regression models, ANOVA, MANCOVA)
Regression trees
Path analysis
Time series analysis
Null model analysis
[Figure repeated: the 4th-order polynomial fit (R² = 0.9607) against the linear fit y = 90.901x (R² = 0.5747) on the same 19 points.]
$$AIC_c(r^2 = 0.9607) = 12 + \ln\left(\frac{1-0.9607}{19}\right) + \frac{12(6+1)}{19-6-1} = 12.81$$

$$AIC_c(r^2 = 0.5747) = 4 + \ln\left(\frac{1-0.5747}{19}\right) + \frac{4(2+1)}{19-2-1} = 0.95$$
Model selection using significance levels is a hypothesis testing method.
Significance levels and AIC must not be used together.
AIC should be used together with r2.
Large data sets
The relationship between P, r², and sample size (F-test):

$$F = \frac{r^2}{1-r^2}(n-2) \;\rightarrow\; P(F;\, 1,\, n-2)$$

[Figure: P as a function of sample size at r² = 0.01, with P = 0.95 and P = 0.9999 marked.]

Using an F-test at r² = 0.01 (regression analysis) we need 385 data points to get a significant result at P < 0.05.
At very large sample sizes (N >> 100) classical statistical tests break down.
Any statistical test will eventually become significant if we only enlarge the sample size.
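A sketch checking this claim (SciPy assumed): search for the smallest n at which r² = 0.01 becomes significant.

```python
from scipy.stats import f

r2, n = 0.01, 3
# P-value of the F-test with (1, n-2) degrees of freedom
while f.sf(r2 / (1 - r2) * (n - 2), 1, n - 2) >= 0.05:
    n += 1
print(n)   # about 385 data points
```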
100 pairs of Excel random numbers

No.   Ran1       Ran2
1     0.008328   0.107104
2     0.820474   0.309694
3     0.648093   0.798087
4     0.935418   0.164762
5     0.406203   0.178282
...   ...        ...
99    0          0
100   1          1

r = -0.051, r² = 0.003, F = 2.68, p = 0.90

With 3000 replicates:
N = 100, one pair of zeroes and ones: 7.5% significant correlations.
N = 1000, 10 pairs of zeroes and ones: 16% significant correlations.
N = 10000, 100 pairs of zeroes and ones: 99.9% significant correlations.
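A sketch of a comparable simulation (SciPy assumed; the exact recipe behind the slide's percentages is not fully specified, so the proportions here are only qualitatively similar):

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
for n, n_fixed in ((100, 1), (1000, 10), (10000, 100)):
    hits, reps = 0, 300                  # the slide used 3000 replicates
    for _ in range(reps):
        # uniform random pairs plus fixed (0,0) and (1,1) anchor pairs
        x = np.concatenate([rng.random(n), np.zeros(n_fixed), np.ones(n_fixed)])
        y = np.concatenate([rng.random(n), np.zeros(n_fixed), np.ones(n_fixed)])
        r, p = pearsonr(x, y)
        hits += p < 0.05
    print(n, hits / reps)                # share of "significant" correlations grows with n
```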
Number of species co-occurrences in comparison to a null expectation (data are simple random numbers).
The null model relies on a randomisation of the 1s and 0s in the matrix.
Species × sites (presence-absence) matrices of increasing size:

4 species × 10 sites:
Site:  1 2 3 4 5 6 7 8 9 10
A      1 0 0 1 1 0 0 0 1 0
B      1 1 0 1 0 1 0 0 1 0
C      1 0 1 0 0 1 0 1 0 0
D      0 0 0 0 1 1 1 0 1 0

8 species × 10 sites:
Site:  1 2 3 4 5 6 7 8 9 10
A      1 0 0 1 1 0 0 0 1 0
B      1 1 0 1 0 1 0 0 1 0
C      1 0 1 0 0 1 0 1 0 0
D      0 0 0 0 1 1 1 0 1 0
E      1 1 1 1 1 0 1 0 0 0
F      0 1 1 0 1 1 0 1 1 0
G      1 1 1 1 1 1 1 1 1 1
H      0 0 0 1 1 1 0 0 0 1

15 species × 10 sites:
Site:  1 2 3 4 5 6 7 8 9 10
A      1 0 0 1 1 0 0 0 1 0
B      1 1 0 1 0 1 0 0 1 0
C      1 0 1 0 0 1 0 1 0 0
D      0 0 0 0 1 1 1 0 1 0
E      1 1 1 1 1 0 1 0 0 0
F      0 1 1 0 1 1 0 1 1 0
G      1 1 1 1 1 1 1 1 1 1
H      0 0 0 1 1 1 0 0 0 1
I      0 1 0 1 0 0 0 1 1 0
J      1 1 0 1 0 1 0 0 1 0
K      1 0 1 0 0 1 0 1 0 1
L      0 0 0 0 1 1 1 0 1 1
M      1 1 1 1 0 0 1 0 0 0
N      0 1 1 0 0 1 0 1 1 1
O      1 1 1 1 0 1 0 0 0 1

[Figure: null distribution of the co-occurrence metric with the observed value Nobs marked; the null distribution narrows as the matrix grows.]
The variance of the null space decreases due to statistical averaging.

$$t = \frac{\text{effect size}}{SE} \qquad SE \rightarrow 0 \;\Rightarrow\; t \rightarrow \infty$$

• Any test that involves randomisation of a compound metric will eventually become significant due to the decrease of the standard error.
• This reduction is due to statistical averaging.
At very large sample sizes (N >> 100) classical statistical tests break down.
Instead of using a predefined significance level, use a predefined effect size or r² level.
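A sketch illustrating the averaging effect (a hypothetical setup: random presence-absence matrices with total pairwise co-occurrence as the compound metric):

```python
import numpy as np

rng = np.random.default_rng(0)
for n_species in (4, 8, 15, 60):
    vals = []
    for _ in range(2000):
        m = rng.integers(0, 2, size=(n_species, 10))     # random presence-absence matrix
        cooc = (m @ m.T).sum() - (m * m).sum()           # total pairwise co-occurrences (i != j)
        vals.append(cooc)
    vals = np.asarray(vals, dtype=float)
    # the relative spread of the null distribution shrinks as the matrix grows
    print(n_species, round(vals.std() / vals.mean(), 3))
```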
"The Wilcoxon test revealed a statistically significant difference in pH of surface water between the lagg site (Sphagno-Juncetum) and the two other sites."
Every statistical analysis must at least
present sample sizes, effect sizes, and
confidence limits.
Multiple independent testing needs
independent data.
Pattern seeking or P-fishing

Person   Blood pressure   Gender   Age class   Smoker
1        80               m        30          y
2        133              f        40          y
3        64               m        60          n
4        139              f        40          y
5        63               m        80          n
6        105              f        70          y
7        114              f        60          y

Three-way ANOVA on simple linear random numbers:

Variable                    SS       df    MS        F      P
Gender                      1        1     1.37      0.00   0.97
Age class                   15183    8     1897.85   2.32   0.02
Smoker                      4062     1     4061.61   4.97   0.03
Gender*Age class            6507     7     929.57    1.14   0.34
Gender*Smoker               1168     1     1167.74   1.43   0.23
Age class*Smoker            8203     7     1171.81   1.43   0.19
Gender*Age class*Smoker     4083     5     816.58    1.00   0.42
Error                       790913   968   817.06

Of 12 trials four gave significant results.
False discovery rates (false detection error rates): the proportion of erroneously declared significances.
Using the same test several times with the same data needs a Bonferroni correction.

Single test:

$$p(\text{not sig}) = 1 - p(\text{sig})$$

n independent tests:

$$p_{Exp}(\text{not sig}) = (1 - p_{test}(\text{sig}))^n$$

$$p_{Exp}(\text{sig}) = 1 - (1 - p_{test}(\text{sig}))^n \approx n\,p_{test}(\text{sig})$$

$$\alpha_{Exp} = 0.05 = n\,\alpha_{Test} \;\rightarrow\; \alpha_{Test} = \frac{0.05}{n}$$

The Bonferroni correction is very conservative.
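A two-line check of the approximation (plain Python): the exact experiment-wise error rate against the linear n·α rule.

```python
a = 0.05
for n in (1, 5, 10, 20):
    print(n, 1 - (1 - a) ** n, n * a)   # exact rate vs. the Bonferroni approximation
```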
False discovery rates (false detection error rates): the proportion of erroneously declared significances.
A sequential Bonferroni correction (cut-off level α = 0.01):

Test   P-value   Ranked P   Corrected cut-off    Result
4      0.001     0.001      0.01/7 = 0.001429    Sig
2      0.007     0.007      0.01/6 = 0.001667    Nsig
3      0.012     0.012      0.01/5 = 0.002       Nsig
7      0.03      0.03       0.01/4 = 0.0025      Nsig
1      0.06      0.06       0.01/3 = 0.003333    Nsig
6      0.14      0.14       0.01/2 = 0.005       Nsig
5      0.45      0.45       0.01/1 = 0.01        Nsig

$$\alpha_i = \frac{\alpha}{n-i+1}; \quad i = 1, \dots, k$$

k is the number of contrasts.

What is multiple testing?
• A single analysis?
• A single data set?
• A single paper?
• A single journal?
• A lifetime's work?
There are no established rules!
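A sketch of the sequential procedure applied to the table's P-values (plain Python):

```python
alpha = 0.01                                   # the slide's cut-off level
pvals = sorted([0.03, 0.14, 0.45, 0.001, 0.012, 0.007, 0.06])
n = len(pvals)
for i, p in enumerate(pvals, start=1):
    cutoff = alpha / (n - i + 1)               # corrected cut-off for the i-th smallest P
    sig = p < cutoff
    print(f"{p:<6} vs {cutoff:.6f} -> {'Sig' if sig else 'Nsig'}")
    if not sig:
        break                                  # all larger P-values are non-significant too
```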
A data set on health status and reproductive success of Polish storks
N: 866 stork chicks
K: 104 physiological and environmental variables
[Table residue: a wide data table with one row per chick; recoverable columns include study year, Campylobacter status, site, chick number, ring number, body weight [g], weight/age, age [days], bill length [mm], gender, number of chicks, Ht, Hb, RBC [T/l], WBC [G/l], MCV, MCH, MCHC, urea, uric acid, cholesterol [mg/dl], triglycerides [mg/dl], total protein [g/dl], HDL, LDL, AspAT [U/l], AlAT [U/l], Na, K, Ca, Mg, Fe, Zn, Cd, Sr (mg/l).]

No clear hypothesis.
Screening all pairwise relationships yields some with P < 0.000001.
Possibly the data are non-independent due to the sampling sequence.
P-fishing
• Common practice is to screen the data for significant relationships and publish these significances.
• The respective paper does not mention how many variables have been tested.
• Hypotheses are constructed post factum to match the "findings".
• "Results" are discussed as if they corroborate the hypotheses.

• Hypotheses must come from theory (deduction), not from the data.
• Inductive hypothesis testing is critical.
• If the hypotheses are intended as a simple description, don't use P-values.

If the data set is large (see the sketch after this list):
• Divide the records at random into two or more parts.
• Use one part for hypothesis generation; use the other parts for testing.
• Always use significance levels corrected for multiple testing.
• Take care of non-independence of the data. Try reduced degrees of freedom.
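A minimal sketch of such a random split (the data array is a hypothetical placeholder for the real records):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(866, 10))          # placeholder for the real records
idx = rng.permutation(len(data))
half = len(data) // 2
explore = data[idx[:half]]                 # screen this half for candidate patterns only
confirm = data[idx[half:]]                 # test the surviving hypotheses on this half
print(explore.shape, confirm.shape)
```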
Final guidelines
Don't mix data description, classification, and hypothesis testing.
Always provide sample sizes and effect sizes. If possible, provide confidence limits.

Data description and model selection:
• Rely on AIC, effect sizes, and r² only.
• Do not use P-values.
• Check for logic and reason.
• Check whether results can be reproduced.

Hypothesis testing:
• Be careful with hypothesis induction. Hypotheses should stem from theory, not from the data.
• Do not develop and test hypotheses using the same data.
• Do not use significance testing without a priori defined and theory-derived hypotheses.
• Check for logic and reason.
• Do not develop hypotheses post factum (telling just-so stories).

Testing for simple differences and relationships:
• Be careful in the interpretation of P-values. P does not provide the probability that a certain observation is true.
• P does not provide the probability that the alternative observation is true.
• Check for logic and reason.
• Don't use simple tests on very large data sets. Use effect sizes only.
• Use predefined effect sizes and explained variances.
• If possible, use a Bayesian approach.