To P or not to P???: Does a statistical test hold what it promises?

"There is increasing concern that in modern research, false findings may be the majority or even the vast majority of published research claims." (Ioannidis 2005, PLoS Medicine)
Probably more than 70% of all medical and biological scientific studies are irreproducible!

Two ways to frame the same kind of test:
• We compare the effects of two drugs to control blood pressure in two groups of patients. We compare two observations. We use the t-test to calculate a probability of difference.
• We test the effect of a drug to control blood pressure against a null control group. We compare an observation against a null expectation. We use the t-test to assess the validity of a hypothesis.

$t = \frac{\text{effect size}}{\text{standard error}}; \quad P(t) < 0.001$

We test for a significant correlation:

$t = \frac{r}{\sqrt{1-r^2}}\sqrt{n-2}; \quad P(t) < 0.001$

We compare an observed statistic against an unobserved null assumption.

An intuitive significance test: formally, we test H1: r² = 0.57 against the alternative H0: r² = 0. We test an observation against a specific null assumption; we compare two specific values of r². This is not the same as testing the hypothesis that X and Y are correlated against the hypothesis that X and Y are not correlated. X and Y might not be correlated and still have r² ≠ 0. This happens if X and Y are jointly constrained by marginal settings.

Number of storks and reproductive rates in Europe (Matthews 2000)

Country       Area [km²]  Stork pairs  Stork density  Inhabitants  Annual births  Birth rate
Albania            28750          100        0.00348      3200000          83000       0.026
Belgium            30520            1        0.00003      9900000          87000       0.009
Bulgaria          111000         5000        0.04505      9000000         117000       0.013
Denmark            43100            9        0.00021      5100000          59000       0.012
Germany           357000         3300        0.00924     78000000         901000       0.012
France            544000          140        0.00026     56000000         774000       0.014
Greece            132000         2500        0.01894     10000000         106000       0.011
Netherlands        41900            4        0.00010     15000000         188000       0.013
Italy             301280            5        0.00002     57000000         551000       0.010
Austria            83860          300        0.00358      7600000          87000       0.011
Poland            312680        30000        0.09594     38000000         610000       0.016
Portugal           92390         1500        0.01624     10000000         120000       0.012
Spain             504750         8000        0.01585     39000000         439000       0.011
Switzerland        41290          150        0.00363      6700000          82000       0.012
Turkey            779450        25000        0.03207     56000000        1576000       0.028
Hungary            93000         5000        0.05376     11000000         124000       0.011

[Figure: scatter plot of stork density against birth rate at log scales; r² = 0.25, P < 0.05. Beware: Excel gets plots at log scales wrong.]

Pseudocorrelations between X and Y arise when X = f(U), Y = g(U), f = h(g).

[Figure: path diagram; urbanisation drives both stork numbers and birth rate.]

The sample spaces of both variables are constrained by one or more hidden variables that are themselves correlated.

Some basic knowledge

• The sum of k squared standard normally distributed variates is χ² distributed with k degrees of freedom.
• The sum of many independently distributed variates is approximately normally distributed (central limit theorem).

$N(0;1): \quad f(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}$

$\frac{(n-1)s^2}{\sigma^2} \sim \chi^2_{n-1}$

$\mu = \frac{1}{n}\sum_{i=1}^{n} x_i; \qquad \sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2$

A Poisson distribution has σ² = μ.

The χ² test: the likelihood of a function equals the probability to obtain the observed data with respect to certain parameter values. The maximum likelihood refers to those parameter values of a function (model) that maximize the likelihood given the data:

$L(\Theta \mid X) = P(X \mid \Theta)$

Likelihood ratios (odds): the quotient of the variances of two normally distributed samples is F distributed:

$\frac{s_1^2}{s_2^2} \sim F_{n_1-1,\, n_2-1}$
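A minimal simulation sketch of the two χ² facts above (Python; numpy and scipy are assumed to be available, and all variable names are illustrative):

```python
# Simulation check of two distributional facts stated above.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)
reps = 100_000

# (1) The sum of k squared standard normal variates is chi^2_k distributed.
k = 5
sums = (rng.standard_normal((reps, k)) ** 2).sum(axis=1)
print(stats.kstest(sums, stats.chi2(df=k).cdf))        # KS test: no systematic deviation

# (2) (n-1) s^2 / sigma^2 is chi^2_{n-1} distributed for normal samples.
n, sigma = 10, 2.0
x = rng.normal(loc=0.0, scale=sigma, size=(reps, n))
scaled = (n - 1) * x.var(axis=1, ddof=1) / sigma**2
print(stats.kstest(scaled, stats.chi2(df=n - 1).cdf))  # KS test: no systematic deviation
```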
Sir Ronald Fisher

$\chi^2_{k} = \chi^2_{k_1} + \chi^2_{k_2} \;\Rightarrow\; \chi^2_{k_1} = \chi^2_{k} - \chi^2_{k_2}$

The sum of two χ² distributed variates is χ² distributed with k = k1 + k2 degrees of freedom.

$\lambda = e^{-\chi^2/2} \;\Leftrightarrow\; -2\ln(\lambda) = \chi^2$

The log of the quotient of two likelihoods is asymptotically χ² distributed (theorem of Wilks):

$\chi^2 = -2\ln\frac{L_0}{L_1}$

Fisher argued that hypotheses can be tested using likelihoods:

$\Lambda = -2\ln\frac{\text{maximum likelihood of the null model: } P(X = x \mid H_0)}{\text{maximum likelihood of the observation: } P(X = x \mid H_1)}$

$P(\Lambda) = P(\chi^2;\; df = df_{obs} - df_{null})$

Classical frequentist hypothesis testing: we throw a coin 100 times and get heads 59 times. Is the coin fair? Fisher would contrast the two probabilities of a binomial process given the outcome of 59 heads, p = 1/2 and p = 59/100:

$P\!\left(x = 59 \mid p = \tfrac{1}{2}\right) = \binom{100}{59} \left(\tfrac{1}{2}\right)^{59} \left(\tfrac{1}{2}\right)^{100-59} = 0.016$

$P\!\left(x = 59 \mid p = \tfrac{59}{100}\right) = \binom{100}{59} \left(\tfrac{59}{100}\right)^{59} \left(1 - \tfrac{59}{100}\right)^{100-59} = 0.081$

$\Lambda = -2\ln\frac{0.016}{0.081} = -2\ln 0.20 = 3.26$

$P(\Lambda) = P(\chi^2 = 3.26;\; df = 1) = 0.93$

The odds in favour of H0 are 0.016/0.081 = 0.2: H1 is five times more probable than H0. The probability that the observed binomial probability q1 = 59/100 differs from q0 = 1/2 given the observed data is Pobs = 0.93. The probability in favour of the null assumption is therefore P0 = 1 − 0.93 = 0.07. According to Fisher, the test failed to reject the hypothesis of p = 59/100.

Fisher:
• The significance P of a test is the probability of the hypothesis given the data!
• The significance P of a test refers to a hypothesis to be falsified.
• It is the probability to obtain an effect in comparison with a random assumption.
• As a hypothesis, P is part of the discussion of a publication.

In Fisher's view a test should falsify a hypothesis with respect to a null assumption given the data. This is in accordance with the Popper–Lakatos approach to scientific methodology.

The Pearson–Neyman framework (Egon Pearson, Jerzy Neyman)

$P(x \ge 59) = \sum_{k=59}^{100} \binom{100}{k} \left(\tfrac{1}{2}\right)^{100} = 0.04$

The likelihood result: $1 - P(\Lambda) = 1 - P(\chi^2 = 3.26;\; df = 1) = 0.07$

Pearson and Neyman asked: what is the probability of the data given the model? The significance value α of a test is the probability (the evidence) against the null assumption.

             H1 true                  H0 true
Reject H0    1 − Q                    P (type I error rate)
Reject H1    Q (type II error rate)   1 − P

P is the probability to reject H0 given that H0 is true (the type I error rate). Q is the probability to reject H1 given that H1 is true (the type II error rate). It is not allowed to equate P and Q.

[Figures: density and cumulative distribution of a test statistic b under H0, with the distribution under H1 shifted to higher test values; P(H1) grows with the test value.]

For 50 years Pearson and Neyman won, because their approach is simpler in most applications.

Pearson–Neyman:
• The significance P of a test is the probability that our null hypothesis is true in comparison to a precisely defined alternative hypothesis.
• This approach does not raise concerns if we have two and only two contrary hypotheses (tertium non datur).
• As a result, P is part of the results section of a publication.

In the view of Pearson and Neyman a test should falsify a null hypothesis with respect to the observation.

Fisher: A test aims at falsifying a hypothesis. We test for differences in the model parameters. We test the observed data. P values are part of the hypothesis development.

Pearson–Neyman: A test aims at falsifying a null assumption. We test against assumed data that have not been measured. P values are central to hypothesis testing.
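To make the contrast concrete, here is a sketch (Python; scipy assumed available) computing both answers for the coin example above:

```python
# The coin example: 59 heads in 100 throws, analysed both ways.
from math import log
from scipy.stats import binom, chi2

n, x = 100, 59

# Fisher: likelihood ratio of the null (p = 1/2) against the observation (p = 59/100)
L0 = binom.pmf(x, n, 0.5)            # ~0.016
L1 = binom.pmf(x, n, x / n)          # ~0.081
Lam = -2 * log(L0 / L1)              # ~3.26
print(chi2.cdf(Lam, df=1))           # P(Lambda) ~ 0.93, hence P0 = 0.07

# Pearson-Neyman: tail probability of the data under the null model
print(binom.sf(x - 1, n, 0.5))       # P(x >= 59 | p = 1/2) ~ 0.044
```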
P is not the probability that H0 is true!
1 − P is not the probability that H1 is true!
Rejecting H0 does not mean that H1 is true. Rejecting H1 does not mean that H0 is true.

• The test does not rely on prior information.
• It does not consider additional hypotheses.
• The result is invariant of the way of testing.

A word on logic: modus tollens

$A \rightarrow \neg B; \quad \neg(\neg B) \;\vdash\; \neg A$

If Ryszard is from Poland, he is probably not a member of the Sejm. Probably Ryszard is a member of the Sejm. Thus he is probably not a citizen of Poland.

If P(H1) > 0.95, H0 is probably false. H0 is probably true; hence P(H1) < 0.95. This does not mean that H1 is probably false. It only means that we don't know.

If multiple null assumptions are possible, the results of classical hypothesis testing are difficult to interpret. If multiple hypotheses are contrary to a single null hypothesis, the results of classical hypothesis testing are equally difficult to interpret. Pearson–Neyman and Fisher testing always works properly only if there are two and only two truly contrasting alternatives.

Examples
• "The pattern of co-occurrences of the two species appeared to be random (P(H0) > 0.3)." (We cannot test for randomness.)
• "We reject our hypothesis about antibiotic resistances in the Bacillus thuringiensis strains (P(H1) > 0.1)." (We can only reject null hypotheses.)
• "The two enzymes did not differ in substrate binding efficacy (P > 0.05)." (We do not know.)
• "Time of acclimation and type of injection significantly affected changes in Tb within 30 min after injection (three-way ANOVA: F5,461 = 2.29; P < 0.05)." (With n = 466, time explains 0.5% of the variation.)
• "The present study has clearly confirmed the hypothesis that non-native gobies are much more aggressive fish than are bullheads of comparable size... This result is similar to those obtained for invasive round goby in its interactions with the native North American cottid (F1,14 = 37.83)." (If others have found the same, we should rather test for a lack of difference. The present null assumption is only a straw man.)

The Bayesian philosophy

The law of conditional probability:

$P(A \cap B) = P(A \mid B)\,P(B) = P(B \mid A)\,P(A)$

Theorem of Bayes (Thomas Bayes, 1702-1761; Abraham de Moivre, 1667-1754):

$P(B \mid A) = \frac{P(A \mid B)\,P(B)}{P(A)}$

posterior = conditional × prior(B) / prior(A)

A frequentist test provides a precise estimate of the probability P:

$P(post \mid prior) = P(post)$

The posterior is independent of the prior. Under a frequentist interpretation a statistical test provides an estimate of the probability in favour of our null hypothesis. In the frequentist interpretation, probability is an objective reality.

$P(post \mid prior) = \frac{P(prior \mid post)\,P(post)}{P(prior)} \quad\Rightarrow\quad P(post \mid prior) \le P(post)$

The posterior is mediated by the prior.

[Figure: posterior plotted against prior levels 0, 0.1, 0.5, 0.9, 0.99 under the frequentist (flat) and the Bayesian (prior-mediated) interpretation.]

A Bayesian interpretation of probability: under a Bayesian interpretation a statistical test provides an estimate of how much the test shifted an initial assumption about the probability of our hypothesis towards statistical significance. Significance is the degree of belief based on prior knowledge.

The earth is round: P < 0.05 (Goodman 1995). Often null hypotheses serve only as a straw man to "support" our hypothesis (fictional testing).
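A short sketch of this prior mediation (Python; the prior values are illustrative, the two likelihoods are borrowed from the coin example above):

```python
# Theorem of Bayes with two exhaustive hypotheses (tertium non datur).
def posterior_h1(prior_h1, lik_h1, lik_h0):
    """P(H1 | data) = P(data | H1) P(H1) / P(data)."""
    evidence = lik_h1 * prior_h1 + lik_h0 * (1.0 - prior_h1)
    return lik_h1 * prior_h1 / evidence

lik_h1, lik_h0 = 0.081, 0.016      # coin example: P(x=59 | p=0.59), P(x=59 | p=0.5)

for prior in (0.01, 0.1, 0.5, 0.9, 0.99):
    print(prior, round(posterior_h1(prior, lik_h1, lik_h0), 3))
# The identical test result yields posteriors from ~0.05 to ~1.0:
# the posterior is mediated by the prior.
```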
We perform a test in our bathroom to check whether the water surface in the filled bathtub is curved according to a globe or to a three-dimensional heart. Our test gives P = 0.98 in favour of earth-like curvature (P(H0) < 0.05). Does this change our view about the geometry of the earth? Does this mean that the heart model has 2% support?

[Figure: Bayesian probability level in favour of H0 ("the earth is a heart") against the frequentist probability level, for priors from 0.001 down to 0.00000001.]

The higher our initial guess about the probability of our hypothesis, the less any new test contributes further evidence. Frequentist tests are not as strong as we think.

Confirmatory studies: a study reports that potato chips increase the risk of cancer (P < 0.01). Tests in confirmatory studies must consider prior information:

$P(post \mid prior) = \frac{P(prior \mid post)\,P(post)}{P(prior)}$

Our test provides a significance level independent of prior information only if we are quite sure about the hypothesis to be tested: P(H1) = 0.99. However, previous work did not find a relationship, so we believe that P(H1) < 0.5. Our test then returns a probability of P = (0.0 < P < 0.5) × 0.99 < 0.5. The posterior test is not as significant as we believed. Bayesian prior and conditional probabilities are often not known and have to be guessed. Frequentist inference did a good job; we have scientific progress.

Bayesian inference: the Bayes factor (odds) is

$BF = \frac{P(x \mid H_1)}{P(x \mid H_0)}$

We have 59 heads and 41 tails. Does this mean that heads have a higher probability? The Bayes approach asks what is the probability of our model with respect to any other possible model.

The frequentist approach:

$P(x \ge 59) = \sum_{k=59}^{100} \binom{100}{k} \left(\tfrac{1}{2}\right)^{100} = 0.044$

The Bayesian approach averages over all possible values of the head probability:

$P_x = \int_0^1 \binom{100}{59}\, x^{59}\,(1-x)^{41}\, dx = 0.0099$

$K = \frac{0.044}{0.0099} = 4.44$

Under Bayesian logic the observed result is only about four and a half times less probable than any other result. The odds for a deviation are 4.44; 1/4.44 = 0.23.

How to recalculate frequentist probabilities in a Bayesian framework: the Bayes factor gives the odds in favour of H0. A factor of 1/10 means that H0 is ten times less probable than H1.

Bayes factor in favour of H0   Z-score   Parametric frequentist probability
0.5                            1.177     0.239032
0.1                            2.146     0.031876
0.05                           2.448     0.014375
0.01                           3.035     0.002407
0.001                          3.717     0.000202
0.0001                         4.292     0.000018
0.00001                        4.799     0.000002

For tests approximately based on the normal distribution (Z, t, F, χ²) Goodman defined the minimal Bayes factor BF via the likelihood ratio:

$\Lambda = \chi^2 = -2\ln\frac{P(x \mid H_1)}{P(x \mid H_0)} \quad\Rightarrow\quad BF = \frac{P(x \mid H_0)}{P(x \mid H_1)} = e^{-\chi^2/2}$

For large n, χ² is approximately normally distributed:

$BF = e^{-Z^2/2}; \qquad Z = \sqrt{-2\ln BF}$

For a hypothesis to be 100 times more probable than the alternative model we need a parametric significance level of P < 0.0024! Bayesian statisticians call for using P < 0.001 as the upper limit of significance!

All models are wrong but some are useful.

[Figure: 19 data points fitted by a fourth-order polynomial, y = -0.375x⁴ + 14.462x³ - 164.12x² + 609.02x - 356.84 (R² = 0.9607), and by a straight line, y = 90.901x (R² = 0.5747).]

Occam's razor (William Ockham): Pluralitas non est ponenda sine necessitate.

The sample size corrected Akaike criterion of model choice (Hirotugu Akaike):

$AIC_c = 2k - 2\ln L + \frac{2k(k+1)}{n-k-1}$

Any test for goodness of fit will eventually become significant if we only enlarge the number of free parameters.
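A sketch of this effect (Python; numpy assumed available): polynomials of rising degree fitted to pure noise. r² grows with every free parameter, while the r²-based AICc used below (as reconstructed from the worked example) stops rewarding the extra terms.

```python
# Fitting pure noise with polynomials of increasing degree.
import numpy as np

rng = np.random.default_rng(seed=7)
n = 19
x = np.arange(1, n + 1, dtype=float)
y = rng.normal(size=n)                      # data without any structure

for degree in range(1, 8):
    fitted = np.polyval(np.polyfit(x, y, degree), x)
    ss_res = ((y - fitted) ** 2).sum()
    ss_tot = ((y - y.mean()) ** 2).sum()
    r2 = 1.0 - ss_res / ss_tot
    k = degree + 2                          # coefficients + intercept + variance term
    aicc = 2 * k + np.log((1 - r2) / n) + 2 * k * (k + 1) / (n - k - 1)
    print(degree, round(r2, 3), round(aicc, 2))
# r^2 rises monotonically with the number of parameters;
# AICc typically rises again and flags the overfit.
```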
[Figure: trade-off in model building. With many variables, explained variance and significance rise but bias grows; the optimum of maximum information content lies at few variables.]

k: total number of model parameters + 1; n: sample size; L: maximum likelihood estimate of the model.

Maximum likelihood estimated by χ²:

$AIC_c = 2k + \chi^2 + \frac{2k(k+1)}{n-k-1}$

Estimated by r²:

$AIC_c = 2k + \ln\frac{1-r^2}{n} + \frac{2k(k+1)}{n-k-1}$

The lower the AIC, the more parsimonious the model:

$\Delta AIC = AIC_1 - AIC_2$

We choose the model with the lowest AIC ("the most useful model"). This is often not the model with the lowest P-value.

AIC model selection serves to find the best descriptor of observed structure. It is a hypothesis generating method; it does not test for significance. Model selection using significance levels is a hypothesis testing method.

When to apply AIC:
• General linear modelling (regression models, ANOVA, MANCOVA)
• Regression trees
• Path analysis
• Time series analysis
• Null model analysis

For the two fits above (fourth-order polynomial, R² = 0.9607, k = 6, versus straight line, R² = 0.5747, k = 2; n = 19):

$AIC_c(r^2 = 0.9607) = 12 + \ln\frac{1-0.9607}{19} + \frac{12(6+1)}{19-6-1} = 12.81$

$AIC_c(r^2 = 0.5747) = 4 + \ln\frac{1-0.5747}{19} + \frac{4(2+1)}{19-2-1} = 0.95$

The straight line is the more parsimonious model despite its lower R². Significance levels and AIC must not be used together; AIC should be used together with r².

Large data sets: the relationship between P, r², and sample size. For a regression F-test,

$F = \frac{r^2}{1-r^2}(n-2) \sim F(1, n-2)$

Using an F-test at r² = 0.01 (regression analysis) we need 385 data points to get a significant result at P < 0.05. At very large sample sizes (N >> 100) classical statistical tests break down: any statistical test will eventually become significant if we only enlarge the sample size.

Pairs of Excel random numbers with pairs of zeroes and ones appended (3000 replicates; a typical single run: r = -0.051, r² = 0.003, F = 2.68, P = 0.90):
• N = 100, one pair of zeroes and ones: 7.5% significant correlations
• N = 1000, 10 pairs of zeroes and ones: 16% significant correlations
• N = 10000, 100 pairs of zeroes and ones: 99.9% significant correlations

Number of species co-occurrences in comparison to a null expectation (data are simple random numbers): the null model relies on a randomisation of the 1s and 0s in the matrix.

[Figure: three random presence-absence matrices of increasing size, and the resulting null distribution with the observed value Nobs marked.]

The variance of the null space decreases due to statistical averaging:

$t = \frac{\text{effect size}}{SE}; \qquad SE \to 0 \;\Rightarrow\; t \to \infty$

• Any test that involves randomisation of a compound metric will eventually become significant due to the decrease of the standard error.
• This reduction is due to statistical averaging.

At very large sample sizes (N >> 100) classical statistical tests break down.
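A sketch (Python; scipy assumed available) reproducing the ~385-observations claim from above and showing how P collapses with growing N at a fixed, trivial r²:

```python
# How many observations make r^2 = 0.01 "significant" at P < 0.05?
from scipy.stats import f as f_dist

def regression_p(r2, n):
    """P-value of the regression F-test: F = r^2/(1-r^2) * (n-2), df = (1, n-2)."""
    F = r2 / (1.0 - r2) * (n - 2)
    return f_dist.sf(F, 1, n - 2)

r2, n = 0.01, 4
while regression_p(r2, n) >= 0.05:
    n += 1
print(n)                                   # ~385

for n in (100, 1000, 10_000, 100_000):
    print(n, regression_p(r2, n))          # P shrinks towards 0 as N grows
```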
Instead of a predefined significance level, use a predefined effect size or r² level.

"The T-test of Wilcoxon revealed a statistically significant difference in pH of surface water between the lagg site (Sphagno-Juncetum) and the two other sites." Every statistical analysis must at least present sample sizes, effect sizes, and confidence limits. Multiple independent testing needs independent data.

Pattern seeking or P-fishing

Person   Blood pressure   Gender   Age class   Smoker
1        80               m        30          y
2        133              f        40          y
3        64               m        60          n
4        139              f        40          y
5        63               m        80          n
6        105              f        70          y
7        114              f        60          y

Three-way ANOVA on simple linear random numbers:

Variable                       SS       df    MS       F     P
Gender                         1        1     1.37     0.00  0.97
Age class                      15183    8     1897.85  2.32  0.02
Smoker                         4062     1     4061.61  4.97  0.03
Gender × Age class             6507     7     929.57   1.14  0.34
Gender × Smoker                1168     1     1167.74  1.43  0.23
Age class × Smoker             8203     7     1171.81  1.43  0.19
Gender × Age class × Smoker    4083     5     816.58   1.00  0.42
Error                          790913   968   817.06

Of 12 trials, four gave significant results.

False discovery rates (false detection error rates): the proportion of erroneously declared significances. Using the same test several times with the same data needs a Bonferroni correction.

Single test: $p(\neg sig) = 1 - p(sig)$

n independent tests:

$p_{Exp}(\neg sig) = (1 - p_{test}(sig))^n$

$p_{Exp}(sig) = 1 - (1 - p_{test}(sig))^n \approx n\, p_{test}(sig)$

$\alpha_{Exp} = 0.05 = n\,\alpha_{test} \;\Rightarrow\; \alpha_{test} = \frac{0.05}{n}$

The Bonferroni correction is very conservative. A sequential Bonferroni correction ranks the k significances and compares the i-th smallest against

$\alpha_i = \frac{\alpha}{k-i+1}, \qquad i = 1, \dots, k$

Test   P       Ranked P   Cut-off α   Corrected cut-off   Significance
7      0.03    0.001      0.01        0.001429            Sig
6      0.14    0.007      0.01        0.001667            Nsig
5      0.45    0.012      0.01        0.002               Nsig
4      0.001   0.03       0.01        0.0025              Nsig
3      0.012   0.06       0.01        0.003333            Nsig
2      0.007   0.14       0.01        0.005               Nsig
1      0.06    0.45       0.01        0.01                Nsig

k is the number of contrasts. (A code sketch of this procedure follows after the list below.)

What is multiple testing?
• A single analysis?
• A single data set?
• A single paper?
• A single journal?
• Lifetime work?
There are no established rules!
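A sketch (Python) of the sequential Bonferroni (Holm) procedure from the table above, with α = 0.01 and the seven P-values:

```python
# Sequential Bonferroni (Holm) correction for the seven tests above.
alpha = 0.01
p_values = [0.03, 0.14, 0.45, 0.001, 0.012, 0.007, 0.06]

k = len(p_values)
failed = False
for i, p in enumerate(sorted(p_values), start=1):
    cutoff = alpha / (k - i + 1)           # alpha/7, alpha/6, ..., alpha/1
    sig = (not failed) and (p < cutoff)
    if not sig:
        failed = True                      # once one test fails, all later ones fail too
    print(f"{p:<7} {cutoff:<9.6f} {'Sig' if sig else 'Nsig'}")
# Only the smallest P-value (0.001 < 0.001429) remains significant.
```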
A data set on health status and reproductive success of Polish storks: N = 866 stork chicks, K = 104 physiological and environmental variables.

[Table: excerpt of the data set. Columns include study year, Campylobacter status, gender, number of chicks, chick number, ring number, body weight [g], weight/age, age [days], bill length [mm], urine, haematology (Ht, Hb, RBC, WBC, MCV, MCH, MCHC), uric acid, cholesterol, triglycerides, total protein, HDL, LDL, AspAT, AlAT, and element concentrations (Ca, Mg, Zn, Cd, Na, K, Fe, Sr, ...). One screened relationship reaches P < 0.000001.]

Possibly the data are non-independent due to the sampling sequence.

P-fishing
• Common practice is to screen the data for significant relationships and to publish these significances.
• The respective paper does not mention how many variables have been tested.
• Hypotheses are constructed post factum to match the "findings".
• "Results" are discussed as if they corroborated the hypotheses.
• Hypotheses must come from theory (deduction), not from the data.
• Inductive hypothesis testing is critical.
• If the hypotheses are intended as a simple description, don't use P-values.

If the data set is large:
• Divide the records at random into two or more parts.
• Use one part for hypothesis generation and the other parts for testing.
• Always use significance levels corrected for multiple testing.
• Take care of non-independence of data; try reduced degrees of freedom.

Final guidelines

Don't mix data description, classification, and hypothesis testing. Always provide sample sizes and effect sizes; if possible, provide confidence limits.

Data description and model selection:
• Rely on AIC, effect sizes, and r² only.
• Do not use P-values.
• Check for logic and reason.

Hypothesis testing:
• Be careful with hypothesis induction. Hypotheses should stem from theory, not from the data.
• Do not develop and test hypotheses using the same data.
• Do not use significance testing without a priori defined and theory-derived hypotheses.
• Check for logic and reason.
• Check whether results can be reproduced.
• Do not develop hypotheses post factum (telling just-so stories).

Testing for simple differences and relationships:
• Be careful in the interpretation of P-values. P does not provide the probability that a certain observation is true, nor the probability that the alternative observation is true.
• Check for logic and reason.
• Don't use simple tests in very large data sets; use effect sizes only.
• Use predefined effect sizes and explained variances.
• If possible, use a Bayesian approach.