Chapter 2-4. Comparison of Two Independent Groups

In this chapter, we consider the situation where we want to compare two groups of subjects. This is called the "independent groups" situation, because any given subject is in only one group or the other. Different statistical tests (often called significance tests) are required when the measurements are taken on the same subjects more than once, such as with baseline and postintervention measurements.

Usually, a regression model is used in a study to test the research hypothesis and to control for confounding variables. The situation where two-independent-group significance tests, which are not regression models, are most frequently used is the "Patient Characteristics" table (Table 1) of an article.

Table 1. Patient Characteristics

Almost every researcher will report the descriptive statistics for a long list of variables, showing that the study groups (e.g., active drug intervention vs. placebo) are balanced (similarly distributed) on these variables. For example, Brady et al (2000) include the following table (only partially shown) in their JAMA article:

  Table 1. Demographic and Clinical Data

  Variable               Sertraline (n = 94)   Placebo (n = 93)   P Value
  Sex, %                                                            .48
    Female                     75.5                 71.0
    Male                       24.5                 29.0
  Age, mean (SD), y          40.2 (9.6)          39.5 (10.6)        .54
  …

Referring to this table in their text, they report, "For the total randomized sample there were no significant differences between the treatment groups in any of the baseline demographic and clinical characteristics (TABLE 1)."

_____________________
Source: Stoddard GJ. Biostatistics and Epidemiology Using Stata: A Course Manual [unpublished manuscript]. University of Utah School of Medicine, 2011. http://www.ccts.utah.edu/biostats/?pageId=5385

Chapter 2-4 (revision 17 Oct 2011) p. 1

The argument that Brady is presenting with her Table 1, and the statement that refers to it, is that the variables listed in the table have been ruled out as potential confounders.
She does this by eliminating the confounder-exposure association (see box), where exposure is the study drug and the potential confounder is any variable listed in Table 1.

Properties of a confounding factor

A confounding factor must have an effect on disease, and it must be imbalanced between the exposure groups to be compared. That is, a confounding factor must have two associations:

  1) A confounder must be associated with the disease.
  2) A confounder must be associated with the exposure.

Diagrammatically, the two necessary associations for confounding are:

                 Confounder
                /          \
       association       association
              /              \
        Exposure - - - - - Disease
            (confounded effect)

There is also a third requirement. A factor that is an effect of the exposure and an intermediate step in the causal pathway from exposure to disease will have the above associations, but causal intermediates are not confounders; they are part of the effect that we wish to study. Thus, the third property of a confounder is as follows:

  3) A confounder must not be an effect of the exposure.

Rothman (2002, p.164) criticizes the practice of statistically comparing baseline characteristics in clinical trials, which researchers do to rule out confounding (i.e., to show that a variable is not a confounder by showing that one of the associations required for confounding does not exist). Rothman argues that the degree of confounding does not depend upon statistical significance, but rather upon the strength of the associations between the confounder and both exposure and disease. He proposes that a better way to evaluate confounding, in a clinical trial or with any study design, is to statistically control for the potential confounder (using stratification or regression analyses, discussed in a later chapter) and determine whether the unconfounded result differs from the crude (the simple analysis without stratification or regression), potentially confounded result.
Personally, I think it is still useful to include a Table 1 with p values. It is a convenient way to alert readers to potential confounding variables. After that, you can go on to evaluate confounding as Rothman suggests.

In clinical trials, where randomization is used, it is frequently argued that the p values do not make sense. A p value is normally used to test whether a difference exists in the sampled population, which is the usual observational-study interpretation. In randomized clinical trials, bench experiments, or animal experiments, one starts with the same group, so there is no imbalance in the sampled population. Any observed imbalance is obviously due to the randomization, so just what does the p value mean in this situation? Still, the p value is frequently reported for such studies, because it alerts the reader and investigator to an imbalance induced by the randomization process, which in turn could induce confounding that should be controlled for, provided the sample size is large enough to allow for this.

Asymptotic Tests vs. Exact Tests

An asymptotic test gives accurate p values only for large sample sizes (as n approaches infinity), the p value being based on the Central Limit Theorem, which is discussed below. Exact tests give accurate p values for any sample size, the p values not being based on the Central Limit Theorem. Thus it can be argued that exact tests are always preferable; however, this is controversial, particularly for the 2 × 2 crosstabulation table case, as we will see below.

Two Independent Groups Comparison of a Dichotomous Variable

Suppose we have an Active Drug group and a Placebo group in our clinical trial. We wish to test whether the groups are balanced on our gender variable (equal distributions of males and females in the two study groups). The variable being tested is often referred to as the "dependent variable", and the variable defining the study groups is often referred to as the "independent variable".
This nomenclature is consistent with the idea of a deterministic function in algebra, Y = f(X), where the dependent variable (Y) depends on the value of the independent variable (X). This, however, implies Y is caused by X, which may not be the case at all. For example, there might be an intermediate variable, which is not recorded, that is the actual causal factor. For this reason, many statisticians prefer the terms "outcome" and "predictor" for the Y and X variables, which allows for simply modeling an association. (Steyerberg, 2009, p.101)

The most popular test for comparing two groups on a dichotomous dependent variable is the chi-square test (frequently called the "chi-squared" test). The second most popular test is the Fisher's exact test. There is a third test found in elementary statistics textbooks, called the "Two-Sample Test for Binomial Proportions (Normal-Theory Test)", or two-proportions z test. It is algebraically identical to the chi-square test (without Yates continuity correction). [see box] Since the chi-square test is better known, you should just use that.

Equivalence of the Chi-Square Test for a 2 × 2 Table and the Two-Proportions Z Test (Altman, 1991, pp 257-258)

Given a 2 × 2 table,

               Yes   No   Total
  Group 1       a     c   a + c = n1
  Group 2       b     d   b + d = n2
                          N = n1 + n2

we have p1 = a/(a+c), p2 = b/(b+d), and the pooled proportion p = (a+b)/N. Then the z test for comparing two proportions is given by

  z = (p1 - p2) / sqrt[ p(1 - p)(1/n1 + 1/n2) ]
    = (p1 - p2) / standard error of (p1 - p2)

Substituting the cell counts for the proportions and simplifying gives

  z^2 = N(ad - bc)^2 / [(a + b)(a + c)(b + d)(c + d)]

Thus, the chi-square statistic with 1 degree of freedom (the 2 × 2 table case) is identically the square of the z statistic (the square of the standard normal distribution).
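If you want to check this identity outside Stata, here is a minimal Python sketch (Python is not used elsewhere in this manual; the small 2 × 2 table below is purely hypothetical, chosen just to make the arithmetic concrete):

```python
import math

# Hypothetical 2x2 table (rows = groups), just to check the identity numerically:
#            success  failure
# Group 1:      7        3      n1 = 10
# Group 2:      2        7      n2 = 9
a, c = 7, 3
b, d = 2, 7
n1, n2 = a + c, b + d
N = n1 + n2

p1, p2 = a / n1, b / n2
p = (a + b) / N                      # pooled proportion

# Two-proportions z statistic
z = (p1 - p2) / math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))

# Quick computational chi-square formula for a 2x2 table
chi2 = N * (a * d - b * c) ** 2 / ((a + b) * (a + c) * (b + d) * (c + d))

# chi2 equals z squared (up to floating-point rounding)
```

Running this gives z^2 = chi2 to machine precision, which is the algebraic equivalence the box derives.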
Most statistics books provide the following formula for the chi-square test:

  chi-square = sum over all cells of (O - E)^2 / E
             = N(ad - bc)^2 / [(a + b)(a + c)(b + d)(c + d)]

where in the first formula (the theoretical formula) the sum is over all cells of the crosstabulation table, with

  O = observed cell frequency
  E = expected cell frequency (defined below)

and the second formula is the quick computational formula that is algebraically equivalent. With this formula, it is difficult to see that the test statistic is a signal-to-noise ratio, an idea introduced in Chapter 2. All statistical tests have the form of a signal-to-noise ratio (Stoddard and Ring, 1993; Borenstein M, 1997). In the box above, we saw that this formula is algebraically identical to the two-proportions z test, which is clearly a signal-to-noise ratio (an effect divided by its variability, or standard error).

The Fisher's exact test is an example of an "exact test". That is, it gives a legitimate p value even for small sample sizes. Let's begin with this test. We will use the births dataset (see box). This dataset is from a study where the investigators wanted to test the association between maternal hypertension and a preterm delivery outcome of the pregnancy.

Births Dataset (births.dta)

This dataset is distributed with the textbook: Hills M, De Stavola BL. A Short Introduction to Stata for Biostatistics. London, Timberlake Consultants Ltd. 2002. http://www.timberlake.co.uk

The dataset concerns 500 mothers who had singleton births in a large London hospital.

Codebook

  Variable   Label
  id         subject number
  bweight    birth weight (grams)
  lowbw      birth weight < 2500 g (1=yes, 0=no)
  gestwks    gestational age (weeks)
  preterm    gestational age < 37 weeks (1=yes, 0=no)
  matage     maternal age (years)
  hyp        maternal hypertension (1=hypertensive, 0=normal)
  sex        sex of baby (1=male, 2=female)
  sexalph    sex of baby (alphabetic coding: "male", "female")

Start the Stata program and read in the data,

  File
    Open
      Find the directory where you copied the course CD:
      Change to the subdirectory: datasets & do-files
      Single click on births.dta
      OK

  use "C:\Documents and Settings\u0032770.SRVR\Desktop\Biostats & Epi With Stata\Section 2 Biostatistics\datasets & do-files\births.dta", clear

  * which must be all on one line, or use:

  cd "C:\Documents and Settings\u0032770.SRVR\Desktop\"
  cd "Biostats & Epi With Stata\Section 2 Biostatistics\"
  cd "datasets & do-files"
  use births.dta, clear

In the births dataset, births.dta, let's test whether or not preterm deliveries occur more frequently for mothers with hypertension than for mothers without hypertension. We display the two variables simultaneously using a contingency table (also called a cross-tabulation table).

Requesting a crosstabulation table with the preterm outcome as the rows and maternal hypertension as the columns, so that column percents are the most useful percentages,

  Statistics
    Summaries, tables & tests
      Tables
        Two-way tables with measures of association
          Main tab: Row variable: preterm
                    Column variable: hyp
                    Cell contents: within column relative frequencies
          OK

  tabulate preterm hyp, column

  +-------------------+
  | Key               |
  |-------------------|
  | frequency         |
  | column percentage |
  +-------------------+

             |      hypertension
     preterm |         0          1 |     Total
  -----------+----------------------+----------
           0 |       375         52 |       427
             |     89.50      73.24 |     87.14
  -----------+----------------------+----------
           1 |        44         19 |        63
             |     10.50      26.76 |     12.86
  -----------+----------------------+----------
       Total |       419         71 |       490
             |    100.00     100.00 |    100.00

We observe that mothers with hypertension delivered a preterm baby more frequently than mothers without hypertension.
We can test this hypothesis,

  H0: p(hypertension present) = p(hypertension absent)
  i.e., H0: no association between preterm delivery and maternal hypertension

where p is the population proportion of preterm deliveries, with the Fisher's exact test,

  Statistics
    Summaries, tables & tests
      Tables
        Two-way tables with measures of association
          Main tab: Row variable: preterm
                    Column variable: hyp
                    Cell contents: within column relative frequencies
                    Test statistics: Fisher's exact test
          OK

  tabulate preterm hyp, column exact

  +-------------------+
  | Key               |
  |-------------------|
  | frequency         |
  | column percentage |
  +-------------------+

             |      hypertension
     preterm |         0          1 |     Total
  -----------+----------------------+----------
           0 |       375         52 |       427
             |     89.50      73.24 |     87.14
  -----------+----------------------+----------
           1 |        44         19 |        63
             |     10.50      26.76 |     12.86
  -----------+----------------------+----------
       Total |       419         71 |       490
             |    100.00     100.00 |    100.00

             Fisher's exact =                 0.001   <- use this one (2-sided test)
     1-sided Fisher's exact =                 0.000

supporting the conclusion that maternal hypertension is a risk factor for preterm delivery (p = 0.001). For the Fisher's exact test, there is no test statistic; there is only a p value. The Fisher's exact test is simply a direct probability calculation (a p value calculation). The first p value listed is the 2-sided comparison. Always report the two-sided p value (we'll see in the next chapter why we do this).

Alternatively, we could test this same hypothesis using the chi-square test.

  Statistics
    Summaries, tables & tests
      Tables
        Two-way tables with measures of association
          Main tab: Row variable: preterm
                    Column variable: hyp
                    Cell contents: within column relative frequencies
                    Test statistics: Pearson's chi-squared
          OK

  tabulate preterm hyp, chi2 column
  +-------------------+
  | Key               |
  |-------------------|
  | frequency         |
  | column percentage |
  +-------------------+

             |      hypertension
     preterm |         0          1 |     Total
  -----------+----------------------+----------
           0 |       375         52 |       427
             |     89.50      73.24 |     87.14
  -----------+----------------------+----------
           1 |        44         19 |        63
             |     10.50      26.76 |     12.86
  -----------+----------------------+----------
       Total |       419         71 |       490
             |    100.00     100.00 |    100.00

          Pearson chi2(1) =  14.3254   Pr = 0.000

We would report this p value as (p < .001). It is actually p < 0.0005, since it did not round to the third decimal place, but there is never a reason to show a p value to more than three decimal places. This is because the decision about significance is made using two decimal places (a comparison with 0.05).

Notice that Stata calls the chi-square test the "Pearson" chi-square to distinguish it from other versions of a chi-square statistic (the likelihood ratio chi-square and the Cochran-Mantel-Haenszel chi-square), which can also be computed in Stata. The Pearson chi-square test is simply the test everyone just calls the "chi-square test", so you never need to add the "Pearson" qualifier to it when you publish.

Stata provided two p values for the Fisher's exact test (a two-tailed and a one-tailed p value). For the chi-square test, Stata only provides one p value. This is the two-tailed p value. To get a one-tailed p value (in the unlikely event you need it), you simply divide the p value by 2; for example, for the preterm-by-sex comparison shown below, one-tailed p = 0.605/2 = 0.303 (Breslow and Day, 1980, p.139). For the Fisher's exact test, the one-tailed p value is not simply half the two-tailed p value, as we'll see below, so Stata provides the one-tailed p value for you.

In the box a few pages up, it was pointed out that there is another test statistic, called the two-sample test of proportions, or two-proportions z test. Since this test is algebraically identical to the chi-square test, the chi-square test is normally reported, being a more widely recognized test.
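As a cross-check outside Stata, the Pearson statistic for this table can be computed from both formulas given earlier, and its p value for 1 degree of freedom obtained from the standard normal tail area (p = erfc(sqrt(chi2/2)), since a 1-df chi-square variable is a squared standard normal). This is a sketch, not Stata's own code:

```python
import math

# Births-data table: preterm (rows) by maternal hypertension (columns)
table = [[375, 52], [44, 19]]
a, b = table[0]
c, d = table[1]
N = a + b + c + d

# Theoretical formula: sum of (O - E)^2 / E over all cells
row_tot = [a + b, c + d]
col_tot = [a + c, b + d]
chi2_def = sum((obs - rt * ct / N) ** 2 / (rt * ct / N)
               for row, rt in zip(table, row_tot)
               for obs, ct in zip(row, col_tot))

# Quick computational formula
chi2 = N * (a * d - b * c) ** 2 / ((a + b) * (a + c) * (b + d) * (c + d))

# Two-tailed p value for chi-square with 1 df
p = math.erfc(math.sqrt(chi2 / 2))
```

The two formulas agree, chi2 reproduces Stata's 14.3254, and p comes out below 0.0005, consistent with reporting p < .001.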
Just for completeness, let's compute that test using Stata.

  Statistics
    Summaries, tables & tests
      Classical tests of hypotheses
        Two-group proportions tests
          Main tab: Variable name: preterm
                    Group variable name: hyp
          OK

  prtest preterm, by(hyp)

  Two-sample test of proportion              0: Number of obs =      419
                                             1: Number of obs =       71
  ------------------------------------------------------------------------------
      Variable |       Mean   Std. Err.      z    P>|z|    [95% Conf. Interval]
  -------------+----------------------------------------------------------------
             0 |   .1050119   .0149769                      .0756578    .1343661
             1 |   .2676056   .0525401                      .1646289    .3705824
  -------------+----------------------------------------------------------------
          diff |  -.1625937    .054633                     -.2696725   -.0555149
               |  under Ho:   .0429586   -3.78   0.000
  ------------------------------------------------------------------------------
          diff = prop(0) - prop(1)                             z = -3.7849
      Ho: diff = 0

       Ha: diff < 0            Ha: diff != 0             Ha: diff > 0
    Pr(Z < z) = 0.0001    Pr(|Z| < |z|) = 0.0002     Pr(Z > z) = 0.9999

When we computed the chi-square test above, we got

             |      hypertension
     preterm |         0          1 |     Total
  -----------+----------------------+----------
           0 |       375         52 |       427
             |     89.50      73.24 |     87.14
  -----------+----------------------+----------
           1 |        44         19 |        63
             |     10.50      26.76 |     12.86
  -----------+----------------------+----------
       Total |       419         71 |       490
             |    100.00     100.00 |    100.00

          Pearson chi2(1) =  14.3254   Pr = 0.000

We cannot tell that they are algebraically identical tests from the p values, due to the insufficient number of decimal places displayed. For a 2 × 2 table, which gives a one degree of freedom chi-square test, the chi-square statistic is simply the z statistic squared. To see this,

  display (-3.7849)*(-3.7849)
  14.325468

which is identically the chi-square test statistic. You can be confident that the p values are identical as well.
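The same z statistic can be reproduced by hand from the cell counts, which also confirms the "under Ho" standard error shown in the prtest output (a sketch in Python, outside Stata):

```python
import math

# Births data: proportion preterm within each hypertension group
a, n1 = 44, 419    # preterm count and group size, hyp = 0
b, n2 = 19, 71     # preterm count and group size, hyp = 1

p1, p2 = a / n1, b / n2
p = (a + b) / (n1 + n2)                          # pooled proportion under H0

se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))  # std. error under H0
z = (p1 - p2) / se                               # two-proportions z statistic
```

This reproduces z = -3.7849 and z^2 = 14.3254 from the Stata output.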
Chi-Square Test with Continuity Correction

There is another form of the chi-square test, called the "continuity corrected chi-square test", the "chi-square test with continuity correction", or the "chi-square test with Yates continuity correction". Stata does not provide this, although it is frequently advocated in statistics textbooks. It is automatically output in the SPSS statistical software. Here is an SPSS output for a comparison that is not so significant:

  PRETERM * SEX Crosstabulation

                                        SEX
                                 male     female      Total
  PRETERM  0   Count              225        202        427
               % within SEX      87.9%      86.3%      87.1%
           1   Count               31         32         63
               % within SEX      12.1%      13.7%      12.9%
  Total        Count              256        234        490
               % within SEX     100.0%     100.0%     100.0%

  Chi-Square Tests

                                  Value    df   Asymp. Sig.   Exact Sig.   Exact Sig.
                                                 (2-sided)     (2-sided)    (1-sided)
  Pearson Chi-Square              .268(b)   1      .605
  Continuity Correction(a)        .146      1      .702
  Likelihood Ratio                .267      1      .605
  Fisher's Exact Test                                            .686         .351
  Linear-by-Linear Association    .267      1      .605
  N of Valid Cases                490

  a. Computed only for a 2x2 table
  b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 30.09.

We see that the continuity corrected chi-square test has a larger p value than the uncorrected chi-square test. We also see that the continuity corrected chi-square p value is closer to the Fisher's exact test p value. Finally, the Fisher's exact p value is the largest (most conservative) of all.

This illustrates a controversy among statisticians. One camp claims that the continuity correction should always be applied, because the p value is more accurate and because it is closer to an exact p value (the Fisher's exact p value). The other camp claims that the continuity correction should not be applied, because it takes the p value closer to the Fisher's exact test p value, which is not a good thing, because it is known that the Fisher's exact p value is conservative (does not drop below alpha, 0.05, often enough).
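The SPSS values above can be reproduced by hand. The Yates correction subtracts N/2 from |ad - bc| in the computational formula (this textbook form of the correction is standard, e.g., Altman 1991; the Python below is just a cross-check sketch):

```python
# SPSS preterm-by-sex table
a, b = 225, 202     # preterm = 0: male, female counts
c, d = 31, 32       # preterm = 1: male, female counts
N = a + b + c + d
den = (a + b) * (c + d) * (a + c) * (b + d)

chi2 = N * (a * d - b * c) ** 2 / den                     # uncorrected Pearson
chi2_yates = N * (abs(a * d - b * c) - N / 2) ** 2 / den  # Yates continuity correction
```

These match the SPSS statistics .268 (Pearson) and .146 (continuity corrected), and the correction clearly pulls the statistic down (larger p value).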
(Agresti, 1990, p.68)

Stata does not even offer a continuity corrected chi-square test. This is because the camp of statisticians against the continuity correction has made a sufficiently compelling argument.

Let's illustrate that the Fisher's exact test is conservative with a Monte Carlo simulation, computing the long-run average of 10,000 samples. We first compare the power to detect a difference between 10% and 20% using sample sizes of 300 in each group.

  *-----------------------------------------------------------
  * Compare uncorrected chi-square test and Fisher's exact test
  *-----------------------------------------------------------
  *-- step 1: compare power --
  *
  *                 |  Group A    Group B  |
  *----------------+----------------------+-----
  * Outcome Yes    |  30 (10%)   60 (20%) |  90
  *         No     | 270        240       | 510
  *----------------+----------------------+-----
  * Total          | 300        300       | 600
  *
  * sampsi .10 .20, alpha(.05) n1(300) n2(300)  --> power = 0.9145

  Times observe p<0.05 for chi-squared test:    9352 out of 10,000 samples (93.52%)
  Times observe p<0.05 for Fisher's exact test: 9183 out of 10,000 samples (91.83%)
  expected answer is power = .9145, or 91.45%

We see that the uncorrected chi-square test is slightly more powerful (93.5% vs 91.8%), about an absolute 1.5% difference.

Next we will determine whether the test is conservative, by sampling from populations with 10% and 10%, so there is no difference to be detected. We expect to get significance by chance 5% of the time.
  *-- step 2: compare alpha --
  *
  *                 |  Group A    Group B  |
  *----------------+----------------------+-----
  * Outcome Yes    |  30 (10%)   30 (10%) |  60
  *         No     | 270        270       | 540
  *----------------+----------------------+-----
  * Total          | 300        300       | 600

  Times observe p<0.05 for chi-squared test:     496 out of 10,000 samples (4.96%)
  Times observe p<0.05 for Fisher's exact test:  350 out of 10,000 samples (3.5%)
  expected answer is alpha = .05, or 5%

We see that the Fisher's exact test in this example is indeed conservative (does not show significance frequently enough). We again see about a 1.5% absolute difference between the Fisher's exact test and the chi-square test, the chi-square test outperforming the Fisher's exact test.

By some trial and error, we can find the following borderline significant case, which illustrates the frustration that can arise by limiting yourself to the Fisher's exact test (the Fisher's exact test is not significant, but the chi-square test is).

  tabi 30 46 \ 270 254, chi2 exact

             |         col
         row |         1          2 |     Total
  -----------+----------------------+----------
           1 |        30         46 |        76
           2 |       270        254 |       524
  -----------+----------------------+----------
       Total |       300        300 |       600

          Pearson chi2(1) =   3.8570   Pr = 0.050
           Fisher's exact =             0.065
   1-sided Fisher's exact =             0.033

Again, we see a 1.5% absolute difference between the Fisher's exact test and the chi-square test, the chi-square test outperforming the Fisher's exact test.
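A quick arithmetic check of this borderline case, outside Stata (the erfc identity for the 1-df chi-square p value is as described earlier; this is a sketch, not Stata's code):

```python
import math

# Borderline table from:  tabi 30 46 \ 270 254
a, b = 30, 46
c, d = 270, 254
N = a + b + c + d

chi2 = N * (a * d - b * c) ** 2 / ((a + b) * (a + c) * (b + d) * (c + d))
p = math.erfc(math.sqrt(chi2 / 2))   # two-tailed p for chi-square, 1 df
```

This reproduces chi2 = 3.8570 with p right at the 0.05 boundary, while the Fisher 2-sided p value (0.065) sits above it.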
If you are curious how this simulation was run, here is the first part:

  *-----------------------------------------------------------
  * Compare uncorrected chi-square test and Fisher's exact test
  *-----------------------------------------------------------
  *-- step 1: compare power --
  *
  *                 |  Group A    Group B  |
  *----------------+----------------------+-----
  * Outcome Yes    |  30 (10%)   60 (20%) |  90
  *         No     | 270        240       | 510
  *----------------+----------------------+-----
  * Total          | 300        300       | 600
  *
  * sampsi .10 .20, alpha(.05) n1(300) n2(300)  --> power = 0.9145

  clear
  set seed 999
  scalar chi_signif=0
  scalar fish_signif=0
  quietly set obs 600
  quietly gen group = 0 in 1/300
  quietly replace group = 1 in 301/600
  quietly gen v1=.
  quietly gen outcome=.
  forvalues x = 1/10000 {
      quietly replace v1 = uniform()   /* random number between 0 and 1 */
      quietly replace outcome = 0
      quietly replace outcome = 1 if (v1 <= .10) in 1/300
      quietly replace outcome = 1 if (v1 <= .20) in 301/600
      quietly tab outcome group, chi2 exact
      if r(p) < 0.05 {
          scalar chi_signif = chi_signif + 1
      }
      if r(p_exact) < 0.05 {
          scalar fish_signif = fish_signif + 1
      }
  }
  display "Times observe p<0.05 for chi-squared test: " ///
      chi_signif " out of 10,000 samples (" chi_signif/10000*100 "%)"
  display "Times observe p<0.05 for Fisher's exact test: " ///
      fish_signif " out of 10,000 samples (" fish_signif/10000*100 "%)"
  display "expected answer is power = .9145, or 91.45%"
  *-- end step 1 --
and here is the second part:

  *-----------------------------------------------------------
  * Compare uncorrected chi-square test and Fisher's exact test
  *-----------------------------------------------------------
  *-- step 2: compare alpha --
  *
  *                 |  Group A    Group B  |
  *----------------+----------------------+-----
  * Outcome Yes    |  30 (10%)   30 (10%) |  60
  *         No     | 270        270       | 540
  *----------------+----------------------+-----
  * Total          | 300        300       | 600

  clear
  set seed 999
  scalar chi_signif=0
  scalar fish_signif=0
  quietly set obs 600
  quietly gen v1=.
  quietly gen outcome=.
  quietly gen group = 0 in 1/300
  quietly replace group = 1 in 301/600
  forvalues x = 1/10000 {
      quietly replace v1 = uniform()   /* random number between 0 and 1 */
      quietly replace outcome = 0
      quietly replace outcome = 1 if (v1 <= .10) in 1/300
      quietly replace outcome = 1 if (v1 <= .10) in 301/600
      quietly tab outcome group, col chi2 exact
      if r(p) < 0.05 {
          scalar chi_signif = chi_signif + 1
      }
      if r(p_exact) < 0.05 {
          scalar fish_signif = fish_signif + 1
      }
  }
  display "Times observe p<0.05 for chi-squared test: " ///
      chi_signif " out of 10,000 samples (" chi_signif/10000*100 "%)"
  display "Times observe p<0.05 for Fisher's exact test: " ///
      fish_signif " out of 10,000 samples (" fish_signif/10000*100 "%)"
  display "expected answer is alpha = .05, or 5%"
  *-- end step 2 --

Exact Tests (Permutation Tests)

We will now see an explanation for why the Fisher's exact test is conservative. Exact tests are also called permutation tests. An exact p value can be computed for any nonparametric test (we define nonparametric tests later), if you have the software available (such as StatXact). Generally, asymptotic (large sample approximation) p values are computed for most nonparametric tests (the chi-square test is a good example).
The way such tests work is to compute the p value by summing the probability of the observed table along with the probabilities of all permutations of the data that are more extreme. Matthews and Farewell (1985, pp. 24-26) illustrate this approach for the Fisher's exact test. The observed data are shown in the following table:

  Tumor activity of two drugs in leukemic mice

                    Complete remission
                     yes     no    Total
  Methyl GAG          7       3      10
  6-MP                2       7       9
  Total               9      10      19

Holding the row and column totals fixed (the marginals), we construct all possible permutations of the data and compute their probabilities (these are called hypergeometric probabilities, which we will omit learning about). [MStat students: see box]

  Table 0       Table 1       Table 2       Table 3       Table 4
   0 10 |10      1  9 |10      2  8 |10      3  7 |10      4  6 |10
   9  0 | 9      8  1 | 9      7  2 | 9      6  3 | 9      5  4 | 9
  ------+--     ------+--     ------+--     ------+--     ------+--
   9 10 |19      9 10 |19      9 10 |19      9 10 |19      9 10 |19
  p=0.00001     p=0.0009      p=0.0175      p=0.1091      p=0.2864

  Table 5       Table 6       Table 7       Table 8       Table 9
   5  5 |10      6  4 |10      7  3 |10      8  2 |10      9  1 |10
   4  5 | 9      3  6 | 9      2  7 | 9      1  8 | 9      0  9 | 9
  ------+--     ------+--     ------+--     ------+--     ------+--
   9 10 |19      9 10 |19      9 10 |19      9 10 |19      9 10 |19
  p=0.3437      p=0.1910      p=0.0468      p=0.0044      p=0.00019

The tables at least as extreme as the observed table are those with table probabilities less than or equal to the probability of the observed table. So the 2-sided p value is (tables 0, 1, 2, 7, 8, 9)

  display 0.00001+0.0009+0.0175+0.0468+0.0044+0.00019
  .0698

The 1-sided p value is (tables 7, 8, 9)

  display 0.0468+0.0044+0.00019
  .05139

For Master of Statistics Students Only -- Hypergeometric Probabilities Computed in Fisher's Exact Test

This presentation follows closely that of Rice (1988, pp. 434-436). We denote a permutation of the data for the Fisher's exact test as

  N11   N12 | n1.
  N21   N22 | n2.
  ----------+----
  n.1   n.2 | n..

and assume for purposes of probability calculation that the margins of the table are fixed. For the specific permutation observed in Matthews and Farewell's (1985, pp.
24-26) example above,

  Tumor activity of two drugs in leukemic mice

                    Complete remission
                     yes     no    Total
  Methyl GAG          7       3      10
  6-MP                2       7       9
  Total               9      10      19

we consider the count N11, the number of leukemic mice treated with Methyl GAG who experience complete remission. Under the null hypothesis of no association, the distribution of N11 is that of the number of successes in 10 draws (without replacement) from a population of 9 successes and 10 failures. That is, the distribution of N11 induced by chance is hypergeometric, with probability

  P(N11 = n11) = C(n.1, n11) × C(n.2, n12) / C(n.., n1.)

where C(n, k) denotes the binomial coefficient "n choose k". For the observed table, the probability of observing 7 in the N11 cell is

  P(N11 = 7) = C(9, 7) × C(10, 3) / C(19, 10)

Performing the calculation in Stata,

  display comb(9,7)*comb(10,3)/comb(19,10)

we get

  .04676438

We verify that this is how the p values for the Fisher's exact test are computed in Stata.

  tabi 7 3 \ 2 7, chi2 exact

             |         col
         row |         1          2 |     Total
  -----------+----------------------+----------
           1 |         7          3 |        10
           2 |         2          7 |         9
  -----------+----------------------+----------
       Total |         9         10 |        19

          Pearson chi2(1) =   4.3372   Pr = 0.037
           Fisher's exact =             0.070
   1-sided Fisher's exact =             0.051

We see that the p values for the 2- and 1-sided Fisher's exact test agree with what we computed above. We also see that we missed significance with the Fisher's exact test, but would get it with the chi-square test. Unfortunately, the data are too sparse to apply the chi-square test (this violates the rule of thumb presented below).

Where Does the Conservativeness of Fisher's Exact Test Come From?

The conservativeness comes entirely from the discreteness of the test statistic. [Cytel, 2001, pp 1058-1061] An asymptotic test computes its p value by integrating the area under the curve of the sampling distribution (such as the chi-square distribution), and so conceivably one can get a p value very close to alpha = 0.05. Fisher's exact test, on the other hand, sums up a discrete number of probabilities.
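The whole permutation argument fits in a few lines of Python: enumerate every table compatible with the fixed margins, compute each hypergeometric probability, and sum the appropriate ones. This sketch reproduces the Matthews and Farewell numbers (the function name is my own, not Stata's):

```python
from math import comb

# Matthews & Farewell leukemic-mice table (rows: Methyl GAG, 6-MP;
# columns: remission yes/no); margins 10, 9 (rows) and 9, 10 (columns)
r1, r2, c1, N = 10, 9, 9, 19
a_obs = 7                      # observed (Methyl GAG, remission) cell

def table_prob(a):
    """Hypergeometric probability of the table whose (1,1) cell is a,
    with all margins held fixed."""
    return comb(r1, a) * comb(r2, c1 - a) / comb(N, c1)

# All feasible values of the (1,1) cell: Tables 0 through 9 in the text
probs = {a: table_prob(a) for a in range(max(0, c1 - r2), min(r1, c1) + 1)}
p_obs = probs[a_obs]

# 2-sided Fisher p: sum over all tables no more probable than the observed one
p_two = sum(p for p in probs.values() if p <= p_obs * (1 + 1e-9))
# 1-sided Fisher p: sum over the observed tail (a >= a_obs)
p_one = sum(p for a, p in probs.items() if a >= a_obs)
```

This gives table_prob(7) = .0468, p_two = .0698, and p_one = .0513, matching the hand calculation and Stata's tabi output. The tolerance factor on p_obs guards against floating-point ties.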
One sum might be a bit below 0.05; adding one more probability to the sum might raise it above 0.05. Since this sum has to change in discrete steps, it cannot get smoothly close to 0.05.

Minimum Expected Frequency Rule for Using the Chi-Square Test

The expected frequency of a contingency table cell is calculated as

  expected cell frequency = (row total × column total) / grand total.

There is one issue with the chi-square test which even the continuity correction does not remove. Being an asymptotic test, the chi-square test requires a sufficiently large sample size. Just how large the sample size must be is determined by the expected cell frequencies, not the observed cell counts themselves (Altman, 1991, p.253).

Daniel (1995, pp.524-526), in his statistics textbook, cites a rule attributable to Cochran (1954):

  2 × 2 table: the chi-square test should not be used if n < 20. If 20 ≤ n ≤ 40, the chi-square test should not be used if any expected frequency is less than 5. When n ≥ 40, three of the expected cell frequencies should be at least 5, and one expected frequency can be as small as 1.

  larger than 2 × 2 table (r × c table): the chi-square test can be used if no more than 20% of the cells have expected frequencies < 5 and no cell has an expected frequency < 1.

Rosner (2006, pp. 396, 428), in his statistics textbook, citing Cochran (1954), proposes the following: no more than 20% of the cells should have expected frequencies < 5, and no cell should have an expected frequency < 1. For a 2 × 2 table, no cell should have an expected frequency < 5.

Altman (1991, pp. 248, 253), in his statistics textbook, citing Cochran (1954), proposes the following: no more than 20% of the cells should have expected frequencies < 5, with no cell having an expected frequency < 1; although for a 2 × 2 table, one cell can have an expected value slightly lower than 5.
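Rosner's version of the rule is mechanical enough to code. Here is a sketch (the function name is mine; Daniel's and Altman's variants would need slightly different conditions):

```python
def chi_square_ok(table):
    """Rosner's minimum expected frequency rule (a sketch):
    no more than 20% of cells with expected frequency < 5, none < 1;
    for a 2x2 table, no expected frequency < 5."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    expected = [rt * ct / n for rt in row_totals for ct in col_totals]
    if len(expected) == 4:                      # 2 x 2 table
        return min(expected) >= 5
    small = sum(1 for e in expected if e < 5)
    return min(expected) >= 1 and small <= 0.2 * len(expected)
```

For example, the preterm-by-hypertension table easily meets the rule (smallest expected frequency 9.1), while the leukemic-mice table fails it (expected frequencies below 5), which is why we were stuck with Fisher's exact test there.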
Stata provides the expected frequencies with the expected option. For the example above,

  tabi 7 3 \ 2 7, expected

  +--------------------+
  | Key                |
  |--------------------|
  | frequency          |
  | expected frequency |
  +--------------------+

             |         col
         row |         1          2 |     Total
  -----------+----------------------+----------
           1 |         7          3 |        10
             |       4.7        5.3 |      10.0
  -----------+----------------------+----------
           2 |         2          7 |         9
             |       4.3        4.7 |       9.0
  -----------+----------------------+----------
       Total |         9         10 |        19
             |       9.0       10.0 |      19.0

We see that 75% of the cells have an expected frequency < 5, so the data are too sparse for the chi-square test to give a sufficiently accurate p value. We are stuck with the non-significant Fisher's exact test.

For the first cell, the expected frequency is 4.7. We can verify the calculation in Stata by applying the formula, expected cell frequency = (row total × column total) / grand total:

  display 10*9/19
  4.7368421

The derivation of the expected cell frequency formula is shown in the following box.

Expected cell frequency

The expected cell frequency formula comes from the "multiplication rule for independent events" in probability. If two events, A and B, are independent (here, no association between the row and column variables), then the probability that both occur is

  P(AB) = P(A)P(B)

where
  P(AB) = probability both occur
  P(A)  = probability A will occur
  P(B)  = probability B will occur

                      Column Variable
                      Yes    No    Total
  Row Variable  Yes    a      b     r1
                No     c      d     r2
                Total c1     c2      N

A probability is just the proportion of times an event occurs, so

  P(in Yes row) = r1/N    and    P(in Yes column) = c1/N

and, under independence,

  P(in Yes row and in Yes column) = (r1/N)(c1/N)

To get the expected cell frequency for cell a, we multiply this probability by the total sample size,

  (r1/N)(c1/N)(N) = (row total)(column total)/(grand total)

since the numerator N cancels with one of the denominator Ns. The expected cell frequency represents the cell count that would be expected by chance, or sampling variation.
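The boxed formula translates directly into code. This sketch computes the full matrix of expected frequencies for any table and reproduces both Stata outputs shown in this chapter:

```python
def expected_frequencies(table):
    """Expected cell frequencies under no association:
    (row total * column total) / grand total for each cell."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[rt * ct / n for ct in col_totals] for rt in row_totals]

exp_mice = expected_frequencies([[7, 3], [2, 7]])         # tabi example
exp_births = expected_frequencies([[375, 52], [44, 19]])  # preterm by hyp
```

The first table gives expected frequencies 4.7, 5.3, 4.3, 4.7, and the births table gives 365.1, 61.9, 53.9, 9.1, matching the expected option output.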
To get the expected frequencies when the data are in variables, we use

Statistics
  Summaries, tables & tests
    Tables
      Two-way tables with measures of association
        Main tab: Row variable: preterm
                  Column variable: hyp
                  Cell contents: Expected frequencies
Open

tabulate preterm hyp, expected

+--------------------+
| Key                |
|--------------------|
|     frequency      |
| expected frequency |
+--------------------+

           |      hypertens
  pre-term |         0          1 |     Total
-----------+----------------------+----------
         0 |       375         52 |       427
           |     365.1       61.9 |     427.0
-----------+----------------------+----------
         1 |        44         19 |        63
           |      53.9        9.1 |      63.0
-----------+----------------------+----------
     Total |       419         71 |       490
           |     419.0       71.0 |     490.0

What to use?

Occasionally, someone will advise researchers to always just use Fisher’s exact test, rather than the chi-square test, because the p value is always “accurate.” That is bad advice, since we saw above that Fisher’s exact test is conservative, so significance is occasionally lost needlessly. The generally more powerful, and more popular, approach is to use the uncorrected chi-square test if the expected frequency rule is met, and Fisher’s exact test if it is not. (In some rare cases, the relative power is reversed.)

Protocol/Article

Almost always you will find that authors do not give the details of expected frequencies in their articles (to save space, and because it is an elementary statistics principle) and just state:

   Categorical variables were analyzed with the chi-square test or Fisher’s exact test, as appropriate.

This short version is the way I always state it.
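The decision rule just described ("uncorrected chi-square if the expected frequency rule is met, otherwise Fisher's exact test") can be sketched in Python with scipy. The helper name choose_test is hypothetical, and the rule coded here is the simple "all expected frequencies at least 5" version for 2 × 2 tables:

```python
import numpy as np
from scipy.stats import chi2_contingency, fisher_exact
from scipy.stats.contingency import expected_freq

def choose_test(table):
    """Uncorrected chi-square if every expected cell frequency is >= 5,
    otherwise Fisher's exact test (2 x 2 tables only)."""
    table = np.asarray(table)
    if (expected_freq(table) >= 5).all():
        stat, p, df, _ = chi2_contingency(table, correction=False)
        return "chi-square", p
    _, p = fisher_exact(table)
    return "Fisher's exact", p

print(choose_test([[7, 3], [2, 7]]))        # sparse table  -> Fisher's exact
print(choose_test([[30, 46], [270, 254]]))  # large table   -> chi-square
```

For the sparse leukemic-mice table this picks Fisher's exact test (p near the 0.070 reported in the text); for the larger 600-subject table it picks the uncorrected chi-square (p near the reported 0.050).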
For completeness, you could state the following; but since it is “basic statistics”, the reviewer will not expect to see this, so I never do this:

   Comparisons between the study groups for dichotomous outcomes will be performed using the chi-square test if the minimum expected cell frequency assumption is met (80% of the cells have expected frequencies of at least 5 and no cell has an expected frequency less than 1). Otherwise, Fisher’s exact test will be used.

However, here is an example of some authors who mentioned the minimum expected frequency rule in their article (Cachel et al, N Engl J Med, 2007): “Percentages were analyzed using the chi-square test or Fisher’s exact test when expected cell counts were less than 5.”

Barnard’s Unconditional Exact Test

There is another exact test, called Barnard’s unconditional exact test, which is available in StatXact. Using the same data from above, where we just missed significance with Fisher’s exact test:

           |         col
       row |         1          2 |     Total
-----------+----------------------+----------
         1 |        30         46 |        76
         2 |       270        254 |       524
-----------+----------------------+----------
     Total |       300        300 |       600

          Pearson chi2(1) =   3.8570   Pr = 0.050
           Fisher's exact =                 0.065
   1-sided Fisher's exact =                 0.033

and entering this table into StatXact-5, we get

BARNARD'S UNCONDITIONAL TEST OF SUPERIORITY USING DIFFERENCE OF TWO
BINOMIAL PROPORTIONS

Statistic based on the observed 2 by 2 table:

Results:
-------------------------------------------------------------------------
                      1-sided P-value       2-sided P-value
Method                Pr{T .GE. t}          Pr{|T|.GE.|t|}
-------------------------------------------------------------------------
Asymp                 0.0248                0.0495
Exact                 0.0268                0.0499

we see that Barnard’s test is just as powerful as the chi-square test and is clearly superior to Fisher’s exact test.
Let’s see how Barnard’s test performs for the Matthews and Farewell example given above, which was:

Tumor activity of two drugs in leukemic mice

              Complete remission
              yes       no      Total
Methyl GAG     7         3       10
6-MP           2         7        9
Total          9        10       19

where the Fisher’s exact test result was p = 0.070, and the chi-square test was significant (p = 0.037) but clearly not appropriate for data this sparse (3 of 4 cells with expected frequency less than 5).

BARNARD'S UNCONDITIONAL TEST OF SUPERIORITY USING DIFFERENCE OF TWO
BINOMIAL PROPORTIONS

Statistic based on the observed 2 by 2 table:

Observed proportion for population <col1> : piHat_1         =  0.7778
Observed proportion for population <col2> : piHat_2         =  0.3000
Observed difference of proportions : piHat_2-piHat_1        = -0.4778
Stderr (pooled estimate of stdev of piHat_2-piHat_1)        =  0.2294
Standardized test statistic (t) : (piHat_2-piHat_1)/Stderr  =  -2.083

Results:
-------------------------------------------------------------------------
                      1-sided P-value       2-sided P-value
Method                Pr{T .LE. t}          Pr{|T|.GE.|t|}
-------------------------------------------------------------------------
Asymp                 0.0186                0.0373
Exact                 0.0260                0.0500

We see that Barnard’s test is significant (p = 0.050). Unfortunately, Barnard’s test has not received widespread use, perhaps partly because it is only available in StatXact, but an example of a paper that reports using it is Gonzalez-Martinez (2006). It appears to be a perfectly fine test. Whereas Fisher’s exact test is known to be conservative for 2 × 2 tables (the type I error rate is actually smaller than alpha), which is entirely attributable to the discreteness of the test statistic, Barnard’s test does not have that shortcoming while still maintaining the type I error rate at no more than alpha (Cytel, 2001, pp. 1058-1061).
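Although the chapter notes that Barnard's test was available only in StatXact at the time of writing, newer versions of SciPy (1.7 and later) provide scipy.stats.barnard_exact, so the leukemic-mice comparison can be sketched in Python as well. The p value may differ slightly from StatXact's, since the test statistic used by an unconditional test can differ between implementations:

```python
from scipy.stats import barnard_exact, fisher_exact

# Matthews & Farewell leukemic-mice table from the text
table = [[7, 3],
         [2, 7]]

barnard = barnard_exact(table, alternative="two-sided")  # unconditional exact test
_, fisher_p = fisher_exact(table, alternative="two-sided")

# StatXact reported Barnard p = 0.050 vs Fisher p = 0.070 for this table
print(barnard.pvalue, fisher_p)
```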
One reason Barnard’s test has not received widespread use is that no consensus has been reached among statisticians about whether a conditional exact test (Fisher’s exact test) or an unconditional exact test (Barnard’s test is one of many of these) is more appropriate, even after half a century of debate (Greenland, 1991). So, until Barnard’s test gains wider acceptance, you are better off avoiding it, so that you can stay out of the debate. It is not available in Stata, anyway.

Two Independent Groups Comparison of a Dichotomous Variable

This was sufficiently discussed above. You use the chi-square test or Fisher’s exact test, depending on the minimum expected frequency rule.

Two Independent Groups Comparison of a Nominal Variable

Here we are considering a crosstabulation table of size r × c (where r is the number of rows and c is the number of columns), which is larger than 2 × 2. Looking this situation up in the statistical test digest making up Chapter 2-3, we see that the chi-square test is suggested. The chi-square test, in this situation, still assumes a sufficiently large sample size (sufficiently large cell sizes) for the asymptotic p value to be appropriate. The minimum expected cell frequency rule of thumb, given above, again applies.

When the minimum expected frequency assumption is not met, you next use the Fisher-Freeman-Halton test, which, being an exact test, does not have that assumption.

Note: In the “old days”, before the Fisher-Freeman-Halton test was available in statistical software, the researcher had to collapse (combine) rows or columns until the minimum expected frequency assumption was satisfied.

In Stata, this Fisher-Freeman-Halton test is simply called Fisher’s exact test. Originally, Fisher’s exact test was only for 2 × 2 tables. Later (1951), Freeman and Halton extended the test to any size of contingency table, and this extension became known as the Freeman-Halton test.
To give proper credit, many statisticians call it the Fisher-Freeman-Halton test (the StatXact-5 manual refers to it as the Fisher-Freeman-Halton test, for example).

Note: Some researchers and editors are still in the old days and do not know this test exists, so you should always provide a reference for it when you use it.

As an example, we will use the crosstabulation of race with study drug, taken from Brady et al. (2000) Table 1. After computing the cell frequencies from the percents and entering these data into Stata, we get

tabi 14 8 \ 76 82 \ 4 3 , col chi2 exact expect

+--------------------+
| Key                |
|--------------------|
|     frequency      |
| expected frequency |
| column percentage  |
+--------------------+

           |         col
       row |         1          2 |     Total
-----------+----------------------+----------
         1 |        14          8 |        22
           |      11.1       10.9 |      22.0
           |     14.89       8.60 |     11.76
-----------+----------------------+----------
         2 |        76         82 |       158
           |      79.4       78.6 |     158.0
           |     80.85      88.17 |     84.49
-----------+----------------------+----------
         3 |         4          3 |         7
           |       3.5        3.5 |       7.0
           |      4.26       3.23 |      3.74
-----------+----------------------+----------
     Total |        94         93 |       187
           |      94.0       93.0 |     187.0
           |    100.00     100.00 |    100.00

          Pearson chi2(2) =   2.0018   Pr = 0.368
           Fisher's exact =                 0.358

We find that 2 cells (2/6 = 33%) have expected frequencies less than 5. Therefore, the chi-square test is not appropriate for these data. We should report the Fisher-Freeman-Halton p value (p = 0.358).

We can verify that this statistic (what Stata calls Fisher’s exact test) is actually the Fisher-Freeman-Halton test by testing it in StatXact-5.

!StatXact-5 (5.0.3)
!Unordered R x C Table: Fisher-Freeman-Halton Test

FISHER'S EXACT TEST

Statistic based on the observed 3 x 2 table(x) (rows/cols with 0 totals
are ignored):

   P(X)  : Hypergeometric Prob. of the table  =  0.0203
   FI(X) : Fisher statistic                   =  2.023

Asymptotic p-value: (based on Chi-Square distribution with 2 df)
   Pr { FI(X) .GE. 2.023 }  =  0.3637

Exact p-value and point probability:
   Pr { FI(X) .GE. 2.023 }  =  0.3583
   Pr { FI(X) .EQ. 2.023 }  =  0.0406

Indeed, we get the same p value. (Note: even StatXact calls it Fisher’s exact test in the output, calling it the Fisher-Freeman-Halton test only in the heading.)

Besides being more correct, it is a good idea not to call this test Fisher’s exact test. That way, you keep the editor, reviewer, or reader from saying, “What are you talking about? Fisher’s exact test is only available for 2 × 2 tables.”

Protocol

You could state:

   Comparisons between the study groups for unordered categorical variables will be performed using the chi-square test if the minimum expected cell frequency assumption is met (80% of the cells have expected frequencies of at least 5 and no cell has an expected frequency less than 1). Otherwise, Fisher’s exact test will be used for variables with two categories and the Fisher-Freeman-Halton test for variables with three or more categories. The Fisher-Freeman-Halton test is Fisher’s exact test generalized by Freeman and Halton to crosstabulation tables larger than 2 × 2 (Conover, 1980).

Fischer et al (N Engl J Med, 2009) used something similar, but more brief, in their statistical methods section: “The total number of thoracotomies and the number of futile thoracotomies in each group were compared by means of a chi-square test with a two-sided significance level of 0.05. When the expected number in any cell was less than five, a Fisher’s exact test for two-bytwo tables and a Fisher-Freeman-Halton test for two-by-k tables for binary comparisons were used….”

Mid-P Exact Test

Occasionally you will see an “exact mid-p test” reported. For example, you can get this using the PEPI 4.0 program EXACT2XK.EXE (Abramson and Gahlinger, 2001) when either the row or column variable has only 2 categories.
Running EXACT2XK.EXE for the above 2 × 3 table, we get

   Exact           p = 0.358
   Exact (mid-P)   p = 0.338

The Exact row is the Fisher-Freeman-Halton test, which agrees with Stata and StatXact. The Exact (mid-P) row is a variation of the test where only 1/2 of the middle probability is added to the sum, the middle probability being that of the permutation that was observed. This test was originally introduced to address the problem of the Fisher exact test being conservative. The test is legitimate, and you could use it if you wanted to. It never became completely accepted by statisticians, because this approach does not guarantee that the test maintains alpha at 0.05 (it may give significant results too often) (Cytel, 2001, pp.1059-1061).

Two Independent Groups Comparison of an Ordinal Variable

As listed in the Statistical Test Digest, for this comparison we use the Wilcoxon-Mann-Whitney test. In statistics textbooks, you will find two tests for this application: 1) the Wilcoxon rank-sum test, and 2) the Mann-Whitney U test. Rosner (1995, p. 566) points out, “The Mann-Whitney U test and the Wilcoxon rank-sum test are completely equivalent, since the same p-value is obtained by applying either test.” In Siegel and Castellan (1988, p. 128) the test is called the Wilcoxon-Mann-Whitney test. Many statisticians now call it this in order to give all three test developers credit.

In the Stata help for the command ranksum, you will find, “ranksum tests the hypothesis that two independent samples (i.e., unmatched data) are from populations with the same distribution by using the Wilcoxon rank-sum test, which is also known as the Mann-Whitney two-sample statistic (Wilcoxon 1945; Mann and Whitney 1947).”

   Mann, H. B., and D. R. Whitney. 1947. On a test of whether one of two random variables is stochastically larger than the other. Annals of Mathematical Statistics 18: 50-60.
   Wilcoxon, F. 1945. Individual comparisons by ranking methods. Biometrics 1: 80-83.
Note: Although it is more correct to call it the Wilcoxon-Mann-Whitney test, not everyone has heard the test referred to by this name. Therefore, you might consider providing a reference.

The Wilcoxon-Mann-Whitney test is always described as “a test of whether two independent groups have been drawn from the same population” (another test, called the median test, is specifically a comparison of medians, but is not as powerful as the Wilcoxon-Mann-Whitney test). By comparing ranks, it tests whether the “bulk” of the values in the population of one group is larger than that of the other group, which equates to H0: P(Group 1 > Group 2) = 1/2. Because of this construction, Siegel and Castellan (1988, p.129) point out that the test equates to a comparison of medians.

____________________________________________________________________________
Aside, on what was just said.

Just for the sake of completeness, not everyone agrees with Siegel and Castellan that the Wilcoxon-Mann-Whitney (WMW) test equates to a comparison of medians. Bergmann et al (2000) insist on being strictly precise about what the test does: “The WMW procedure tests for equality of group mean-ranks, not of group medians. This is evident from our experimental data (Table 1). However, by providing group medians or their differences in their outputs, statistics packages such as SigmaStat, Unistat, Stata, and even Arcus QuickStat may mislead investigators into supposing that the p values refer to the hypothesis that group medians are equal. This common misapprehension is not unique to statistics packages.
It appears in Siegel and Castellan (1988) and many other elementary texts on statistics.”
____________________________________________________________________________

Exercise

Notice in the Sulkowski (2000) article that the Table 1 laboratory variables were compared using the Wilcoxon-Mann-Whitney test, which Sulkowski refers to as the “nonparametric Mann-Whitney test” in the Methods section paragraph just above the table on page 76.

Although the name Wilcoxon-Mann-Whitney test has been proposed and used for decades, the test is still frequently referred to as the Wilcoxon test or the Mann-Whitney test. An example of a paper that uses the more correct name, giving all three developers credit, is Brown et al (N Engl J Med 2006), who state in their Statistical Methods, “Continuous variables were compared with the use of a two-tailed unpaired t-test ... and ordinal variables with the use of the Wilcoxon-Mann-Whitney test.”

Cytel (2001, p. 709) provides the following example of a two-sample comparison of an ordinal variable. “A randomized clinical trial of Interferon versus placebo was conducted on 44 children infected with childhood chicken pox (varicella) (Arvin, et al., 1982). One of the end points of the study was to determine whether Interferon is more effective than placebo in preventing adverse effects. There are four ordinal categories of adverse effects. The number of children falling in each category, by treatment, is:

Adverse Effect               Placebo   Interferon
None                            15         21
Life Threatening                 3          0
Death in 2-3 Weeks               1          2
Death in Less Than 1 Week        2          0

Ref: Arvin AM, Kushner JH, Feldman S, et al. (1982). Human leukocyte interferon for the treatment of varicella in children with cancer. NEJM 306:761-765.”
We can quickly enter these data using the “expand” trick, by copying the following into the do-file editor and then executing it:

clear
input ae drug count
1 0 15
2 0 3
3 0 1
4 0 2
1 1 21
2 1 0
3 1 2
4 1 0
end
expand count
drop if count==0  // must do this, otherwise it leaves that line in the file
drop count
tab ae drug       // check that data match original table

The expand command creates the number of rows of data indicated by the variable count. We then drop (delete) the variable count, which we only used as an intermediate variable for the expand command.

We now compute the Wilcoxon-Mann-Whitney test using

Statistics
  Summaries, tables & tests
    Nonparametric tests of hypotheses
      Wilcoxon rank-sum test
        Main tab: Variable: ae
                  Grouping variable: drug
Open

ranksum ae, by(drug)

(Note: you ask for the “Mann-Whitney” test from the menu, and it shows “Wilcoxon rank-sum test” on the menu dialog box. They are the same test.)

Two-sample Wilcoxon rank-sum (Mann-Whitney) test

        drug |      obs    rank sum    expected
-------------+---------------------------------
           0 |       21       519.5       472.5
           1 |       23       470.5       517.5
-------------+---------------------------------
    combined |       44         990         990

unadjusted variance     1811.25
adjustment for ties     -992.93
                       ----------
adjusted variance        818.32

Ho: ae(drug==0) = ae(drug==1)
             z =   1.643
    Prob > |z| =   0.1004   <-- report this (a two-sided p value)

Protocol Suggestion

If you wanted to be complete, you could state:

   Comparisons of two groups for ordered categorical variables (ordinal scale) will be performed using a Wilcoxon-Mann-Whitney test (many statisticians now refer to the test by this name, since the Wilcoxon rank-sum test and the Mann-Whitney U test are essentially the same and give identical p values) (Siegel and Castellan, 1988, p. 128).
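The ranksum result above can be checked against scipy's mannwhitneyu (an illustration in Python, not part of the chapter's Stata workflow). Stata's z test uses the tie-corrected normal approximation without a continuity correction, so we ask scipy for the same:

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Arvin varicella data: ordinal adverse-effect score by treatment
# (1=none, 2=life threatening, 3=death in 2-3 weeks, 4=death < 1 week)
placebo    = np.repeat([1, 2, 3, 4], [15, 3, 1, 2])   # n = 21
interferon = np.repeat([1, 2, 3, 4], [21, 0, 2, 0])   # n = 23

res = mannwhitneyu(placebo, interferon, alternative="two-sided",
                   method="asymptotic", use_continuity=False)
print(round(res.pvalue, 4))   # 0.1004, matching Stata's ranksum
```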
However, the following shorter version should be sufficient, and is recommended since the test name Wilcoxon-Mann-Whitney is sufficiently common now:

   Two-group comparisons for ordered categorical variables will be performed using the Wilcoxon-Mann-Whitney test.

Definition of Parametric and Nonparametric Tests

The tests introduced thus far are called nonparametric tests. The next test we will introduce is the Student’s t test, which is an example of a parametric test. Now, then, is a good time to formally define what parametric and nonparametric tests are. Paraphrasing Siegel and Castellan (1988, pp.33-34):

A parametric statistical test specifies certain conditions about the distribution of the dependent variable in the population from which the research sample was drawn. [The term “parametric” comes from the statistical jargon of referring to population means and standard deviations as “parameters”, in order to avoid confusion with sample means and standard deviations, which are referred to as “statistics”.] The most frequent condition is “normally distributed”. Parametric tests based on the normal distribution require that the dependent variable be measured on at least an interval scale.

A nonparametric statistical test is based on a model that specifies only very general conditions, and none regarding the specific form of the distribution from which the sample was drawn. Nonparametric tests do not require that the dependent variable be measured on an interval scale (some require an ordinal scale, and some only a nominal scale).

Nonparametric tests, then, are used when you have nominal or ordinal level variables. They are also useful when you have a highly skewed interval level variable, particularly with small sample sizes (since your data do not look anything like a normally distributed variable).
Central Limit Theorem

Chapter 2 presented the concept called statistical regularity (which statisticians also call the Strong Law of Large Numbers). It was illustrated by a simulation involving increasingly large sample sizes from a dichotomous variable with a population proportion of 0.5.

*-----------------------------------------------------------------
* Demonstrate statistical regularity by plotting proportion of 1's
* from a dichotomous variable for increasingly large sample sizes
* when population proportion is 0.5
*-----------------------------------------------------------------

[Figure: Statistical Regularity for Binomial Variable (p=0.5); the proportion of one's is plotted against sample size on a log scale and converges to 0.5.]

In statistics, there is an important second form of regularity that occurs with means, which is called the central limit theorem. Rosner (1995, p.158) provides a simple version of it:

Central Limit Theorem. The distribution of means from samples of size n from some population with mean μ and variance σ² will have an approximately normal distribution with mean μ and variance σ²/n (standard error = σ/√n), even if the sampled distribution is not normal.

What is remarkable is how fast (that is, requiring only small sample sizes) the distribution of means approaches the normal distribution. This is illustrated by a Monte Carlo simulation, where we choose samples of size n=10 from a dichotomous variable (with values 0 and 1) with population parameter p=0.5. Doing this for 1,000 samples, we get the following distribution of means:
*------------------------------------------------------------------
* Demonstrate the central limit theorem by taking samples of
* size n=10 from a dichotomous variable with p=.5
*------------------------------------------------------------------

[Figure: histogram of means computed from 1000 samples of size n=10, sampled from a dichotomous distribution with p=0.5; the distribution of means is bell-shaped and centered at 0.5.]

We see that the distribution of means is remarkably close to a normal distribution. When we increase the sample size to n=100 in the otherwise identical Monte Carlo experiment, we get:

*------------------------------------------------------------------
* Demonstrate the central limit theorem by taking samples of
* size n=100 from a dichotomous variable with p=.5
*------------------------------------------------------------------

[Figure: histogram of means computed from 1000 samples of size n=100; the distribution of means is even more tightly concentrated around 0.5.]

This simulation illustrates the “even if the sampled distribution is not normal” phrase stated above in the Central Limit Theorem definition. In the population, the histogram of individual values is simply two bars of equal height, which is a long way from being a normal distribution.

[Figure: population distribution for the above CLT simulation; a binomial distribution with p=0.5, shown as two bars of equal height at 0 and 1.]

There are many parametric tests, such as the t test and linear regression, which have the assumption that the data come from a normal distribution. That is simply a convenient way to express it in introductory statistics texts. The real assumption involves the form of the sampling distribution, and, in linear regression, the distribution of the residuals.
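The Monte Carlo experiment described above is easy to reproduce. Here is a sketch in Python with NumPy (rather than the chapter's Stata), drawing 1,000 samples of Bernoulli(0.5) values and examining the spread of the sample means:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is reproducible

def simulated_means(n, reps=1000):
    """Means of `reps` samples, each of size n, from a 0/1 variable with p=0.5."""
    return rng.integers(0, 2, size=(reps, n)).mean(axis=1)

for n in (10, 100):
    means = simulated_means(n)
    # CLT: the means should center at 0.5 with standard error 0.5/sqrt(n)
    print(n, round(means.mean(), 3), round(means.std(), 3),
          round(0.5 / np.sqrt(n), 3))
```

Raising n from 10 to 100 shrinks the spread of the means from about 0.158 to about 0.05, exactly as the σ/√n term in the theorem predicts.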
Rather than go into a precise description, suffice it to say that the above-stated Central Limit Theorem, as well as other versions of this theorem, assures us that the actual assumption about what needs to be normally distributed is taken care of if the sample size is “large enough.” It turns out that the Central Limit Theorem “kicks in” with even small sample sizes. Another way to state this is that the t test (as well as analysis of variance and linear regression) is very robust to the normality assumption, providing sufficiently accurate p values regardless of how the data are distributed in the sampled population. Therefore, you can basically just ignore the normality assumption. This robustness topic is covered in Chapter 5-10.

Two Independent Groups Comparison of an Interval Variable

The comparison of two groups on an interval scaled variable is done using the independent sample Student’s t test (the “Student’s” is generally dropped, referring to the test as the independent sample t test).

There are two versions of the t test for two independent groups. The first version has the assumption that the variances (and thus the standard deviations) of the two groups being compared are equal. The alternate version does not have this assumption. The added assumption gives the first version greater power, and so it is more widely used. The equal variance assumption is one reason the t test is a parametric test (the assumption refers to the variances, which are parameters of the sampled populations).

It is advocated by some to test the assumption of equal variances (also called the homogeneity of variance assumption) using Levene’s test for equality of variances. If the assumption holds (Levene’s test is not statistically significant), then the equal variance t test is used. If the assumption fails, the unequal variance t test is used.
This approach is not necessary, though, since the t test is “robust” to the equal variances assumption. This robustness topic is covered in Chapter 5-10. Although I do not advocate it as a needed step, I will now show how to test the homogeneity of variance assumption, just so you know what others are talking about when they report doing it.

In SPSS, both t tests are output at the same time, along with Levene’s test for equality of variances, just to make the “advocated” process easier. In Stata, you have to ask for all three tests separately.

An example dataset, the coronary artery data, which is on the SPSS distribution CD, contains the following variables:

Variable   Label
Time       Treadmill Time
Group      Study Group (1=healthy, 2=disease)

Comparing treadmill time between the two study groups results in the following independent sample t test output in SPSS.

Group Statistics

TIME   GROUP    N     Mean     Std. Deviation   Std. Error Mean
       1         8    928.50   138.121          48.833
       2        10    764.60   213.750          67.594

Independent Samples Test

                         Levene's Test           t-test for Equality of Means
                         F      Sig.    t      df      Sig.        Mean        Std. Error   95% CI of the Difference
                                                       (2-tailed)  Difference  Difference   Lower      Upper
TIME  Equal variances
      assumed            .137   .716    1.873  16      .080        163.90      87.524       -21.642    349.442
      Equal variances
      not assumed                       1.966  15.439  .068        163.90      83.388       -13.398    341.198
To perform the same analysis in Stata, we use the following commands:

ttest depvar, by(groupvar)
    -- independent groups t test with the equal variances assumption,
       with confidence intervals
robvar depvar, by(groupvar)
    -- Levene's test for equality of variances
ttest depvar, by(groupvar) unequal
    -- independent groups t test without the equal variances assumption
       (uses Satterthwaite's degrees of freedom approximation),
       with confidence intervals

Reading in the data,

File
  Open
    Find the directory where you copied the course CD:
    Change to the subdirectory: datasets & do-files
    Single click on coronary artery data.dta
Open

use "C:\Documents and Settings\u0032770.SRVR\Desktop\
  Biostats & Epi With Stata\Section 2 Biostatistics\
  datasets & do-files\coronary artery data.dta", clear

* which must be all on one line, or use:

cd "C:\Documents and Settings\u0032770.SRVR\Desktop\"
cd "Biostats & Epi With Stata\Section 2 Biostatistics\"
cd "datasets & do-files"
use "coronary artery data.dta", clear

We might first verify the t test’s assumption of equal variances, using Levene’s test for equality of variances. In introductory statistics textbooks, you will find the F test for equality of variances (the sdtest command in Stata). The F test is sensitive to the normality assumption, so if the data are skewed, it gives an inaccurate p value. Levene’s test, on the other hand, is robust to the normality assumption, so it provides an accurate p value even if the data are skewed. Therefore, always use Levene’s test rather than the F test.

Statistics
  Summaries, tables & tests
    Classical tests of hypotheses
      Robust equal variance test
        Main tab: Variable: time
                  Variable defining two comparison groups: group
OK

robvar time, by(group)

        |       Summary of TIME
  GROUP |      Mean   Std. Dev.      Freq.
--------+----------------------------------
      1 |     928.5   138.12106          8
      2 |     764.6    213.7497         10
--------+----------------------------------
  Total | 837.44444   197.65306         18

W0  = .1368483    df(1, 16)    Pr > F = .71628551   <- W0 is Levene's test
W50 = .17792242   df(1, 16)    Pr > F = .67877762   <- ignore this test
W10 = .0650524    df(1, 16)    Pr > F = .80193108   <- ignore this test

Notice that the robvar command gives two alternative tests for equality of variances (W50 and W10), which you can ignore.

Just by visual inspection, the standard deviations (and hence the variances) seem quite different (138 vs. 214). Still, Levene’s test for equality of variances was not significant (p = 0.716), so we cannot reject the hypothesis of equal variances (there is not sufficient evidence in the data to conclude that the equal variances assumption is not justified). Using the “advocated” approach of confirming the assumptions, we have justification, then, to use the equal variances t test, which we compute next.

Statistics
  Summaries, tables & tests
    Classical tests of hypotheses
      Two-group mean-comparison test
        Main tab: Variable name: time
                  Group variable name: group
OK

ttest time, by(group)

Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       1 |       8       928.5    48.83317    138.1211    813.0279    1043.972
       2 |      10       764.6    67.59359    213.7497    611.6927    917.5073
---------+--------------------------------------------------------------------
combined |      18    837.4444    46.58727    197.6531    739.1539     935.735
---------+--------------------------------------------------------------------
    diff |               163.9    87.52394               -21.64246    349.4425
------------------------------------------------------------------------------
    diff = mean(1) - mean(2)                                      t =   1.8726
Ho: diff = 0                                     degrees of freedom =       16

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.9602         Pr(|T| > |t|) = 0.0795          Pr(T > t) = 0.0398

We could stop at this point if we wanted. However, since the p value is so close to 0.05 and was not significant, we might be especially nervous about the equal variances assumption. So, we next compute the unequal variances t test, just to see what we get.

Statistics
  Summaries, tables & tests
    Classical tests of hypotheses
      Two-group mean-comparison test
        Main tab: Variable name: time
                  Group variable name: group
                  Unequal variances
OK

ttest time, by(group) unequal

Two-sample t test with unequal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       1 |       8       928.5    48.83317    138.1211    813.0279    1043.972
       2 |      10       764.6    67.59359    213.7497    611.6927    917.5073
---------+--------------------------------------------------------------------
combined |      18    837.4444    46.58727    197.6531    739.1539     935.735
---------+--------------------------------------------------------------------
    diff |               163.9    83.38808               -13.39825    341.1983
------------------------------------------------------------------------------
    diff = mean(1) - mean(2)                                      t =   1.9655
Ho: diff = 0                     Satterthwaite's degrees of freedom =  15.4391

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.9662         Pr(|T| > |t|) = 0.0676          Pr(T > t) = 0.0338

We still did not get statistical significance, but notice that the p value is smaller. In general, that is not supposed to happen, since the equal variances t test is the more powerful test. A close look at the reported standard deviations reveals that the larger standard deviation goes with the smaller mean. Normally, we would expect the mean-to-standard-deviation ratio to be similar in both groups, so this leads us to suspect that one group is more skewed than the other.

The t test itself has another assumption, that the data in each group have an approximately Normal distribution. However, this assumption is not as critical for the t test, because the distribution of means (and similarly of mean differences), which is what is actually being compared, is guaranteed by the Central Limit Theorem to be normally distributed for “sufficiently” large sample sizes.

Altman (1991, p.199) states these two assumptions and advises, “The use of the t test is based on the assumption that the data for each group (with independent samples) or the differences (with paired samples) have an approximately Normal distribution, and for the two sample case we also require the two groups to have similar variances.
We sometimes find that at least one requirement is not met. When the data are skewed we can either use a non-parametric method, or try a transformation of the raw data."

Not all statisticians agree with Altman, who is taking a very conservative approach to the assumptions. Other statisticians simply trust in the robustness of the t test to both non-normality and unequal variances (see Chapter 5-10; I am also nearly finished with a new chapter on this specific subject, with far more citations and some simulations).

Although I do not personally advocate bothering with the normality assumption, it can be tested using the Shapiro-Wilk test for normality. NOTE: always test for normality separately for each group (if there is a difference in means, then the total sample distribution will look like a bimodal distribution, having two modes, which is clearly not normal).

Statistics
  Summaries, tables & tests
    Distributional plots & tests
      Shapiro-Wilk normality test
        Main tab: Variables: time
        by/if/in tab: Repeat command by groups:
                      Variables that define groups: group
        OK

by group, sort : swilk time
<or>
bysort group: swilk time

_______________________________________________________________
-> group = 1

                   Shapiro-Wilk W test for normal data

    Variable |    Obs        W          V         z     Prob>z
-------------+--------------------------------------------------
        time |      8    0.92428      1.055     0.086   0.46559

_______________________________________________________________
-> group = 2

                   Shapiro-Wilk W test for normal data

    Variable |    Obs        W          V         z     Prob>z
-------------+--------------------------------------------------
        time |     10    0.74104      3.991     2.773   0.00278

We see that the Shapiro-Wilk test identified non-normality in the second group.
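For readers who want to see where the unequal variances statistics come from, here is a minimal sketch in Python (outside Stata; the variable names are my own) that reproduces t = 1.9655 and Satterthwaite's degrees of freedom, 15.4391, from the summary statistics in the output above.

```python
import math

# Summary statistics from the two-sample output above
n1, m1, sd1 = 8, 928.5, 138.1211     # group 1 (healthy)
n2, m2, sd2 = 10, 764.6, 213.7497    # group 2 (diseased)

v1, v2 = sd1**2 / n1, sd2**2 / n2    # squared standard errors of each mean
se = math.sqrt(v1 + v2)              # standard error of the mean difference
t = (m1 - m2) / se

# Satterthwaite's approximation to the degrees of freedom
df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

print(round(t, 4), round(df, 4))
```

Notice that the Welch/Satterthwaite df (15.44) is smaller than the equal variances df (16), which is part of why the unequal variances test is usually the less powerful of the two.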
Examining the data with a boxplot,

Graphics
  Box plot
    Main tab: Variables: time
    By tab: Draw subgraphs for unique values of variables: group
    OK

graph box time, by(group)

[boxplot of time, shown separately for groups 1 and 2; y axis runs from 600 to 1,400]

From the boxplot, we have graphically identified an outlier in the second group (boxplots are explained in Chapter 1). An outlier is a data value that appears not to have come from the same population that the rest of the sample came from. In our example, the outlier is in the unhealthy group, which had a treadmill time even higher than the maximum value in the healthy group. Either it was a sick marathon runner, or a data coding error (or perhaps a patient attempting to impress the clinician conducting the treadmill test, even if it meant having a heart attack trying).

Let's see what happens if we use a Wilcoxon-Mann-Whitney test. After all, that test does not have the assumption of normally distributed data. Similarly, it is not affected by outliers, since it simply compares ranks, so that the outlier simply looks like one unit larger (the next higher rank) than the next largest value. To see this (just for illustration, you would never do this as part of your analysis),

egen timerank = rank(time)
list group time timerank

     +-------------------------+
     | group   time   timerank |
     |-------------------------|
  1. |     1   1014         16 |
  2. |     1    684          5 |
  3. |     1    810         10 |
  4. |     1    990         14 |
  5. |     1    840         11 |
     |-------------------------|
  6. |     1    978         13 |
  7. |     1   1002         15 |
  8. |     1   1110         17 |
  9. |     2    864         12 |
 10. |     2    636          3 |
     |-------------------------|
 11. |     2    638          4 |
 12. |     2    708          6 |
 13. |     2    786          9 |
 14. |     2    600          2 |
 15. |     2   1320         18 |  <-- we see that the new score (the rank) is
     |-------------------------|      simply one unit larger than the next
 16. |     2    750        7.5 |      largest value
 17. |     2    594          1 |
 18. |     2    750        7.5 |
     +-------------------------+

Computing a Wilcoxon-Mann-Whitney test,

Statistics
  Summaries, tables & tests
    Nonparametric tests of hypotheses
      Wilcoxon rank-sum test
        Main tab: Variable: time
                  Grouping variable: group
        OK

ranksum time, by(group)

Two-sample Wilcoxon rank-sum (Mann-Whitney) test

       group |      obs    rank sum    expected
-------------+---------------------------------
           1 |        8         101          76
           2 |       10          70          95
-------------+---------------------------------
    combined |       18         171         171

unadjusted variance      126.67
adjustment for ties       -0.13
                     ----------
adjusted variance        126.54

Ho: time(group==1) = time(group==2)
             z =   2.222
    Prob > |z| =   0.0263

We see that this test gives a statistically significant result. That is because it treats the outlier as if it were just barely larger than the next largest value, thus shrinking the tail of the distribution back towards the rest of the distribution.

The Wilcoxon-Mann-Whitney test is actually a comparison of ranks, which is why it shows the "rank sum" column in the above output. We can verify this by,

Statistics
  Summaries, tables & tests
    Summary and descriptive statistics
      Summary statistics
        Main tab: Variables: timerank
                  Options: Display additional statistics
        by/if/in tab: if (expression): group==1
        OK

summarize timerank if group==1, detail

                        rank of (time)
-------------------------------------------------------------
      Percentiles      Smallest
 1%            5              5
 5%            5             10
10%            5             11       Obs                   8
25%         10.5             13       Sum of Wgt.           8

50%         13.5                      Mean             12.625
                        Largest       Std. Dev.      3.889087
75%         15.5             14
90%           17             15       Variance         15.125
95%           17             16       Skewness      -.8502071
99%           17             17       Kurtosis       2.830671

return list

scalars:
            r(N)        =  8
            r(sum_w)    =  8
            r(mean)     =  12.625
            r(Var)      =  15.125
            r(sd)       =  3.889087296526011
            r(skewness) =  -.8502070865436711
            r(kurtosis) =  2.830671207079922
            r(sum)      =  101
            r(min)      =  5
            r(max)      =  17
            r(p1)       =  5
            r(p5)       =  5
            r(p10)      =  5
            r(p25)      =  10.5
            r(p50)      =  13.5
            r(p75)      =  15.5
            r(p90)      =  17
            r(p95)      =  17
            r(p99)      =  17

The summarize command does not display the sum, but we got it using return list. We see that the sum for group 1 is 101, which agrees exactly with the sum for group 1 shown in the Wilcoxon-Mann-Whitney output.

One might wonder what would happen if we omitted the outlier from the data and then performed a t test. Let's find out. We could drop the outlier from the dataset by using:

drop if time==1320

However, we might find we need that observation later, so let's keep it for now. Instead, we will use:

Statistics
  Summaries, tables & tests
    Classical tests of hypotheses
      Two-group mean-comparison test
        Main tab: Variable name: time
                  Group variable name: group
        by/if/in tab: If (expression): time~=1320
        OK

ttest time if time~=1320, by(group)

Note: The "~=", or you can use "!=", are the Stata symbols for "not equal to". We could also put the "if" expression at the end by adding an extra comma,

ttest time, by(group), if time~=1320

Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       1 |       8       928.5    48.83317    138.1211    813.0279    1043.972
       2 |       9    702.8889    30.83488    92.50465    631.7835    773.9943
---------+--------------------------------------------------------------------
combined |      17    809.0588    39.18175    161.5505    725.9972    892.1204
---------+--------------------------------------------------------------------
    diff |            225.6111    56.38805                105.4228    345.7994
------------------------------------------------------------------------------
    diff = mean(1) - mean(2)                                      t =   4.0010
Ho: diff = 0                                     degrees of freedom =       15

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.9994         Pr(|T| > |t|) = 0.0012          Pr(T > t) = 0.0006

We see that we now get significance with the t test.

Just for illustration, let's also look at the t test with unequal variances,

ttest time, unequal by(group), if time~=1320

Two-sample t test with unequal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       1 |       8       928.5    48.83317    138.1211    813.0279    1043.972
       2 |       9    702.8889    30.83488    92.50465    631.7835    773.9943
---------+--------------------------------------------------------------------
combined |      17    809.0588    39.18175    161.5505    725.9972    892.1204
---------+--------------------------------------------------------------------
    diff |            225.6111    57.75352                99.80301    351.4192
------------------------------------------------------------------------------
    diff = mean(1) - mean(2)                                      t =   3.9064
Ho: diff = 0                    Satterthwaite's degrees of freedom =  12.0224

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.9990         Pr(|T| > |t|) = 0.0021          Pr(T > t) = 0.0010

Comparing the two t test results, we see that the equal variance t test is more powerful than the unequal variance t test, as it should be if the assumptions are met.
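Looping back to the rank-based analysis for a moment: the z statistic in the ranksum output can be verified by hand. Here is a sketch in Python (outside Stata; the function and variable names are my own), including the tie correction for the two values of 750.

```python
import math

healthy  = [1014, 684, 810, 990, 840, 978, 1002, 1110]           # group 1
diseased = [864, 636, 638, 708, 786, 600, 1320, 750, 594, 750]   # group 2

pooled = sorted(healthy + diseased)

def midrank(v):
    # average rank of value v in the pooled sample (handles ties)
    lo = pooled.index(v) + 1
    hi = lo + pooled.count(v) - 1
    return (lo + hi) / 2

n1, n2 = len(healthy), len(diseased)
N = n1 + n2
W = sum(midrank(v) for v in healthy)     # rank sum for group 1 (101)
expected = n1 * (N + 1) / 2              # 76 under Ho

# variance of W with the correction for ties
counts = {v: pooled.count(v) for v in set(pooled)}
tie_adj = sum(c**3 - c for c in counts.values()) / (N * (N - 1))
var = n1 * n2 / 12 * (N + 1 - tie_adj)   # 126.54 after adjustment

z = (W - expected) / math.sqrt(var)
p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))  # normal approx.
print(W, round(z, 3), round(p, 4))
```

The rank sum 101, z = 2.222, and two-sided p = 0.0263 all match the ranksum output above.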
Indeed, the data are now sufficiently normal:

bysort group: swilk time if time~=1320

_______________________________________________________________
-> group = 1

                   Shapiro-Wilk W test for normal data

    Variable |    Obs        W          V         z     Prob>z
-------------+--------------------------------------------------
        time |      8    0.92428      1.055     0.086   0.46559

_______________________________________________________________
-> group = 2

                   Shapiro-Wilk W test for normal data

    Variable |    Obs        W          V         z     Prob>z
-------------+--------------------------------------------------
        time |      9    0.93563      0.946    -0.092   0.53677

Let's suppose that we could not justify eliminating the outlier. How should we report these data? As seen by examining the frequency table, reporting the mean does a poor job of describing the central tendency (average) of the data for group two, since 70% of the data are below the mean.

tab time if group==2

       TIME |      Freq.     Percent        Cum.
------------+-----------------------------------
        594 |          1       10.00       10.00
        600 |          1       10.00       20.00
        636 |          1       10.00       30.00
        638 |          1       10.00       40.00
        708 |          1       10.00       50.00
        750 |          2       20.00       70.00
        786 |          1       10.00       80.00
        864 |          1       10.00       90.00
       1320 |          1       10.00      100.00
------------+-----------------------------------
      Total |         10      100.00

summarize time if group==2

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
        time |        10       764.6    213.7497        594       1320

No matter which average we report, median or mean, we should report the p value from the Wilcoxon-Mann-Whitney test if we don't eliminate the outlier.

Truncation Approach to Outliers

Another approach to outliers is to set them to the highest biologically plausible value on the high side, and to the lowest biologically plausible value on the low side, leaving all of the plausible values of the variable unchanged.
In the treadmill example, the researcher might feel that the highest plausible value for this population of patients is 900, for example. The value of 1320 would simply be recoded to 900 before statistical analysis. Steyerberg (2009, p.168) describes this approach,

"…Another check is on biological plausibility. This judgment requires expert opinion, and depends on the setting. For example, a systolic blood pressure of 250 mmHg is biologically plausible in the acute care situation for traumatic brain injury patients, but may not be plausible in an ambulatory care situation. Implausible values may best be considered as errors and hence set to missing.315 For biologically possible values, various statistical approaches are subsequently possible. To reduce the influence on the regression coefficients ('leverage'), we may consider to transform the variable by 'truncation.' Very high and very low values are shifted to truncation points:

If X > Xmax then X = Xmax; if X < Xmin then X = Xmin; else X = X

Here, Xmax and Xmin are the upper and lower truncation points. These may be defined from examining distributions, e.g., with box plots and histograms, and the predictor-outcome relationship.
-----------
315 Osborne JW, Overby A. The power of outliers (and why researchers should always check for them). Pract Assess Res Eval 2004;9(6).

If the truncation approach is used, you could state it like the following in your article:

Suggestion for Statistical Methods Section

For the outcome of treadmill time, we set the one outlier of 1320 to 900, based on our judgment that a value above 900 was not biologically plausible for this type of patient. The next largest value in our data for this patient group was 864. This is known as the truncation approach to outliers (Steyerberg, 2009, p.168), which is less extreme than simply eliminating outliers from the analysis.
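The truncation rule quoted above is a one-liner in any language. As a sketch in Python (outside Stata), using the group 2 treadmill values and 900 as the assumed upper truncation point:

```python
# Group 2 treadmill times from the example; 900 is the assumed
# upper truncation point (Xmax) from the text.
time = [864, 636, 638, 708, 786, 600, 1320, 750, 594, 750]
XMAX = 900

# if X > Xmax then X = Xmax; else X = X
time2 = [min(v, XMAX) for v in time]

print(max(time2), sum(time2) / len(time2))
```

The truncated maximum is 900 and the group 2 mean drops from 764.6 to 722.6, matching the Stata check below.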
Returning to the example, we will set the 1320 to 900, and then re-run the analysis,

gen time2 = time                          // make a copy of the variable
replace time2=900 if time>900 & time~=.   // truncate to 900
bysort group: sum time time2              // check our work
ttest time, unequal by(group)             // with outlier included
ttest time2, unequal by(group)            // with outlier truncated to 900

-> group = 1

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
        time |         8       928.5    138.1211        684       1110
       time2 |         8      854.25    77.10058        684        900

-> group = 2

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
        time |        10       764.6    213.7497        594       1320
       time2 |        10       722.6    107.1989        594        900

We see that in Group 2, the 1320 maximum was correctly set to 900.

. ttest time, unequal by(group)   // with outlier included

Two-sample t test with unequal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       1 |       8       928.5    48.83317    138.1211    813.0279    1043.972
       2 |      10       764.6    67.59359    213.7497    611.6927    917.5073
---------+--------------------------------------------------------------------
combined |      18    837.4444    46.58727    197.6531    739.1539     935.735
---------+--------------------------------------------------------------------
    diff |               163.9    83.38808               -13.39825    341.1983
------------------------------------------------------------------------------
    diff = mean(1) - mean(2)                                      t =   1.9655
Ho: diff = 0                    Satterthwaite's degrees of freedom =  15.4391

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.9662         Pr(|T| > |t|) = 0.0676          Pr(T > t) = 0.0338

. ttest time2, unequal by(group)   // with outlier truncated to 900

Two-sample t test with unequal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       1 |       8      854.25    27.25917    77.10058    789.7923    918.7077
       2 |      10       722.6    33.89926    107.1989    645.9145    799.2855
---------+--------------------------------------------------------------------
combined |      18    781.1111    26.93892    114.2921     724.275    837.9473
---------+--------------------------------------------------------------------
    diff |              131.65    43.49968                39.37361    223.9264
------------------------------------------------------------------------------
    diff = mean(1) - mean(2)                                      t =   3.0265
Ho: diff = 0                    Satterthwaite's degrees of freedom =  15.8705

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.9960         Pr(|T| > |t|) = 0.0081          Pr(T > t) = 0.0040

The t test p value of 0.0676 before truncation of the outlier is now p = 0.0081 after the outlier is truncated.

How Many Decimal Places to Report

This is described very nicely in the American Medical Association (AMA) Manual of Style (Iverson et al, AMA Manual of Style, 2007, p.851). The number of decimal places in reported outcomes should match the original precision of the variable. If the variable has no decimal places, then reported measurements should be rounded to the nearest integer. If one decimal place exists in the data, which reflects the precision of that variable, then reported numbers should be rounded to one decimal place. Similarly, for mathematical calculations, the results should be rounded to the same digit of accuracy as the original variable. For means and standard deviations, no more than one significant digit beyond the accuracy of the measurement should be used.
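As a sketch of that rule in Python (the helper names and interface are mine, not from the AMA manual): treadmill time was recorded in whole seconds (0 decimal places), so a mean or standard deviation would be reported with one decimal place, and a raw measurement with none.

```python
def report_summary(value, data_decimals):
    # Means and SDs get at most one decimal place beyond the
    # measurement's precision (AMA rule as paraphrased above;
    # hypothetical helper for illustration only).
    return round(value, data_decimals + 1)

def report_measurement(value, data_decimals):
    # Raw measurements keep the variable's original precision.
    return round(value, data_decimals)

# Treadmill time was recorded in whole seconds (0 decimal places):
print(report_summary(764.6, 0),      # mean reported as 764.6
      report_summary(213.7497, 0),   # SD reported as 213.7
      report_measurement(764.6, 0))  # a raw value reported as 765
```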
Decimal Places for P Values

This is described very nicely in the American Medical Association (AMA) Manual of Style (Iverson et al, AMA Manual of Style, 2007, pp.851-52):

"Briefly, P values should be expressed to 2 digits to the right of the decimal point (regardless of whether the P value is significant), unless P < .01, in which case the P value should be expressed to 3 digits to the right of the decimal point. (One exception to this rule is when rounding P from 3 digits to 2 digits would result in P appearing nonsignificant, such as P = 0.046. In this case, expressing the P value to 3 places may be preferred by the author. The same holds true for rounding confidence intervals that are significant before rounding but nonsignificant after rounding.) The smallest P value that should be expressed is P < .001, since additional zeros do not convey useful information.37 P values should never be rounded up to 1.0 or down to 0. While such a procedure might be justified arithmetically, the results are misleading. Statistical inference is based on the assumption that events occur in a probabilistic, rather than deterministic, universe. P values may approach infinitely close to these upper and lower bounds, but never close enough to establish that the associated observation was either absolutely predestined (P = 1.0) or absolutely impossible (P = 0) to occur. Thus, very large and very small P values should be expressed as P > .99 and P < .001, respectively."
-----
37 Bailar JC, Mosteller F. Medical Uses of Statistics. 2nd ed. Boston, MA: NEJM Books, 1992.

Exercise. Look at the article by Brady et al (JAMA, 2000). In their Table 3, you will see examples of this style of reporting p values: p=.003, p=.01, p=.04, p=.07. In their Table 1, you'll see p>.99. In their text you will find p<.001.

Reporting Styles for Two-Sample Continuous Outcome Comparisons

Here are some example reporting styles (rounding p=0.026 to p=0.03):

1.
The diseased group had a significantly shorter treadmill time than the healthy group (mean ± SEM, seconds; diseased: 765 ± 65, healthy: 928 ± 49; p=0.03).

2. The diseased group had a significantly shorter treadmill time than the healthy group (diseased: mean 765 seconds, 95% CI, 612-918; healthy: mean 929 seconds, 95% CI, 813-1044; p=0.03).

3. The diseased group had a significantly shorter treadmill time than the healthy group [mean difference, 164 seconds; 95% CI, -22 to 349; p=0.03].

For examples 1, 2, and 3, all statistics can be found on the t test output.

4. The diseased group had a significantly shorter treadmill time than the healthy group [median (interquartile range), seconds; diseased: 729 (627 - 806), healthy: 984 (818 - 1011); p=0.03].

5. The diseased group had a significantly shorter treadmill time than the healthy group [diseased: median 729, 95% CI, 612-839; healthy: median 984, 95% CI, 769-1045; p=0.03].

6. The diseased group had a significantly shorter treadmill time than the healthy group [median difference, 222 seconds; 95% CI, 48-360; p=0.03].

For examples 4 and 5, the p value comes from the Wilcoxon-Mann-Whitney output, and the median, interquartile range, and CI for the median come from the following centile command:

Statistics
  Summaries, tables & tests
    Summary and descriptive statistics
      Centiles with CIs
        Main tab: Variables: time
                  Centiles: 25 50 75
        by/if/in tab: Repeat command by groups:
                      Variables that define groups: group
        OK

by group, sort : centile time, centile(25 50 75)
<or>
bysort group: centile time, centile(25 50 75)

--------------------------------------------------------------------------------
-> group = 1
                                                       -- Binom. Interp. --
    Variable |    Obs   Percentile      Centile       [95% Conf. Interval]
-------------+-------------------------------------------------------------
        time |      8          25         817.5            684    991.1952*
             |                 50           984         769.05       1045.2
             |                 75          1011       964.2548        1110*

* Lower (upper) confidence limit held at minimum (maximum) of sample

--------------------------------------------------------------------------------
-> group = 2
                                                       -- Binom. Interp. --
    Variable |    Obs   Percentile      Centile       [95% Conf. Interval]
-------------+-------------------------------------------------------------
        time |     10          25           627            594    746.2082*
             |                 50           729         611.68     838.6933
             |                 75         805.5       711.7918        1320*

* Lower (upper) confidence limit held at minimum (maximum) of sample

In example 6, the 95% CI for the median difference comes from the following cendif command,

cendif time, by(group)

but you first have to update your Stata to include the commands somersd and cendif before you can use it. Do this, while connected to the internet, using,

findit somersd

which will display

SJ-6-4  snp15_7 . CIs for rank stat: Percentile slopes, differences,
        & ratios . . . . . . . . . . . . . . . . . . . . . R. Newson
        (help cendif, censlope, censlope_iteration,
        mata bcsf_bracketing(), mata blncdtree(), mata somdtransf(),
        mata u2jackpseud(), somersd, somersd_mata if installed)
        Q4/06   SJ 6(4):497--520
        calculates confidence intervals for generalized Theil-Sen
        median (and other percentile) slopes (and per-unit ratios)
        of Y with respect to X; help files also document supporting
        Mata functions

and then click on the snp15_7 link to see INSTALLATION FILES (click here to install), and click on this link to install it.
If it crashes on you the first time you run it, then repeat this installation step and you will see

-----------------------------------------------------------------------------------
package installation
-----------------------------------------------------------------------------------
package name:  snp15_7.pkg
        from:  http://www.stata-journal.com/software/sj6-4/

checking snp15_7 consistency and verifying not already installed...

the following files already exist and are different:
    c:\ado\plus\c\cendif.ado
    c:\ado\plus\c\cendif.hlp

no files installed or copied

Possible things to do:

1. Forget it (best choice if any of the above files were written by you and just happen to have the same name or you do not want the originals changed)
2. Look for an already-installed package of the same name (which you might then choose to uninstall)
3. Search installed packages for the duplicate file(s) by clicking on the file names above
4. Force installation replacing already-installed files (if this is an update, this would be a safe choice; you will end up with the original and the update apparently installed, but it doesn't matter; you can even uninstall the original later)

Choose option 4. This will overwrite the couple of files that caused it to crash. After installing the update, run the following to get the 95% CI for the median difference in Example 6,

cendif time, by(group)

Y-variable: time (TIME)
Grouped by: group (GROUP)
Group numbers:

      GROUP |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          8       44.44       44.44
          2 |         10       55.56      100.00
------------+-----------------------------------
      Total |         18      100.00

Transformation: Fisher's z
95% confidence interval(s) for percentile difference(s)
between values of time in first and second groups:

   Percent    Pctl_Dif    Minimum    Maximum
        50         222         48        360
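For intuition about where a median difference like 222 can come from (cendif's actual algorithm differs in its details), the Hodges-Lehmann estimator, the median of all 8 × 10 pairwise between-group differences, also gives 222 for these data. A Python sketch, outside Stata:

```python
import statistics

healthy  = [1014, 684, 810, 990, 840, 978, 1002, 1110]
diseased = [864, 636, 638, 708, 786, 600, 1320, 750, 594, 750]

# Hodges-Lehmann estimate of the location shift between the groups:
# the median of every healthy-minus-diseased pairwise difference.
diffs = [h - d for h in healthy for d in diseased]
hl = statistics.median(diffs)
print(len(diffs), hl)
```

Note that this is not the difference of the two sample medians (984 − 729 = 255), which is exactly the discrepancy discussed next.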
The command cendif gives a difference in medians of 222, which is different from the difference of the observed medians, 984 – 729 = 255. This is because its computation of the median difference follows a different formula than the ordinary medians, which is rather too complex to get into here. If you use this median difference and confidence interval, then it is best not to report the individual group medians, because readers will think you made a mistake if they notice the inconsistency.

Protocol

You could state (although I do not recommend it),

Comparisons between two groups for continuous variables will be performed using independent groups t tests if the equal variance assumption is met; otherwise, independent groups t tests with unequal variances (Satterthwaite's method) will be used (Rosner, 1995). If the data for either group are skewed sufficiently to not meet the normality assumption, then a Wilcoxon-Mann-Whitney test will be used. The equality of variance assumption will be tested using Levene's test for equality of variances, and the normality assumption will be tested using the Shapiro-Wilk test for normality.

The t test is very robust to the assumptions of normality and homogeneity of variance, so this approach is unnecessary. It is sufficient to just say,

Comparisons between two groups for continuous variables will be performed using an independent groups t test.

Fisher-Pitman Permutation Test for Independent Samples

In the above example, we found that the distribution for treadmill time (a continuous variable) was skewed for one of the two groups being compared. Thus, the normality assumption was not met for the independent samples t test. In this example, the sample size was too small to rely on the central limit theorem to provide asymptotic normality of the sample means. The approach we took was to use the nonparametric Wilcoxon-Mann-Whitney test, which only requires the data to be on an ordinal scale.
A more powerful approach is to use a nonparametric test that requires an interval scale. It would be more powerful because it would use an additional property of the measurement, the equal interval property. Such a test is the Fisher-Pitman permutation test for independent samples (Siegel and Castellan, 1988, pp.151-155; Kaiser, 2007), which just became available in Stata. It uses all the information in the data, all of the interval scale properties, but does not make any assumptions about the distribution. That is, it does not assume a normal distribution or equal variances, which the t test assumes.

The first time you use it, you have to update your Stata to include it, since it is a user-contributed procedure. Use the following command, and then click on the st0134 link to install.

findit permtest2

SJ-7-3  st0134 . . Fisher-Pitman perm. tests for paired rep. & indep. samples
        (help permtest1, permtest2 if installed) . . . . . . . . . J. Kaiser
        Q3/07   SJ 7(3):402-412
        exact and Monte Carlo proposals to the nonparametric
        Fisher-Pitman tests for paired replicates and
        independent samples

INSTALLATION FILES                           (click here to install)
    st0134/permtest1.ado
    st0134/permtest1.hlp
    st0134/permtest2.ado
    st0134/permtest2.hlp

This installs two commands,

permtest1  <- Fisher-Pitman permutation test for paired replicates
permtest2  <- Fisher-Pitman permutation test for two independent samples

You can easily verify you are using the right version of the test by looking at the help file,

help permtest2

Using the Fisher-Pitman permutation test for independent samples to analyze the treadmill data,

permtest2 time, by(group)

Fisher-Pitman permutation test for two independent samples

       group |      obs        mean     std.dev.
-------------+----------------------------------
           1 |        8       928.5    138.12106
           2 |       10       764.6     213.7497
-------------+----------------------------------
    combined |       18   837.44444    197.65306

mode of operation: Montecarlo simulation (200000 runs)
Progress: |........................................|

Test of hypothesis Ho: time(group==1) >= time(group==2) : p=.95916  (one-tailed)
Test of hypothesis Ho: time(group==1) <= time(group==2) : p=.041865 (one-tailed)
Test of hypothesis Ho: time(group==1) == time(group==2) : p=.08373  (two-tailed)

In general, this test is more powerful than the Wilcoxon-Mann-Whitney test, because it does arithmetic directly on the observations themselves, rather than on the ranks. Using this additional information in the data, it has greater power. In this example, however, it had a larger p value than the Wilcoxon-Mann-Whitney test (Fisher-Pitman, p = 0.08373; Wilcoxon-Mann-Whitney, p = 0.0263). The outlier was in a direction that made the two groups look more alike, so the Wilcoxon-Mann-Whitney test was to our advantage. The Fisher-Pitman test treated the outlier like a much larger number, making the two groups more alike on average, while the Wilcoxon-Mann-Whitney test treated the outlier as if it were just a tiny bit bigger than the next smaller number, keeping the two groups separated.

Like Fisher's exact test, the Fisher-Pitman test is a permutation test. It constructs all the ways to combine the data into two groups, with the same sample sizes as the original sample, and then defines the p value as the proportion of all combinations that are more extreme than the data observed. That is, letting X be the variable that contains the observations for the first group and Y be the variable that contains the observations for the second group, it computes the difference of the group sums,

ΣXi − ΣYj

It does this for every possible way to combine the observations into the two groups. Observations that were originally in group X can switch to group Y, and vice versa.
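The scheme just described can be sketched outside Stata as well. Here is a Monte Carlo version in Python (a sketch only, not permtest2 itself; with a fixed group size, comparing sums is equivalent to comparing means, which is used below).

```python
import random

healthy  = [1014, 684, 810, 990, 840, 978, 1002, 1110]
diseased = [864, 636, 638, 708, 786, 600, 1320, 750, 594, 750]

def mean_diff(a, b):
    return abs(sum(a) / len(a) - sum(b) / len(b))

observed = mean_diff(healthy, diseased)   # 163.9 seconds

pooled = healthy + diseased
n1 = len(healthy)
random.seed(7)                            # fixed seed for reproducibility

reps, extreme = 20000, 0
for _ in range(reps):
    random.shuffle(pooled)                # random reassignment to the groups
    if mean_diff(pooled[:n1], pooled[n1:]) >= observed:
        extreme += 1

p_two = extreme / reps                    # two-tailed Monte Carlo p value
print(p_two)
```

With enough replications, the two-tailed p lands near the .08373 reported by permtest2 above.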
The p value is the proportion of times that this difference of the sums is more extreme than the observed difference of the sums. It is intuitive that an outlier contributes more to this test than when using ranks, as in the Wilcoxon-Mann-Whitney test.

Not many researchers are familiar with this test, so if you use it, always provide a citation.

Article Statistical Methods Section Suggestion

Here is some suggested wording for your statistical methods section, when you use the Fisher-Pitman permutation test for independent samples. You should always provide a citation for this test, since it is not well known. (The blue sentence is the minimum you should say. Adding the green line is recommended, since the reader will likely be unfamiliar with the test and will want to know why you chose to use it.)

For univariable group comparisons, unordered categorical variables were compared using a chi-square test, or Fisher's exact test, as appropriate. For ordered categorical variables, a Wilcoxon-Mann-Whitney test was used. For continuous variables, an independent samples Student t test was used if the data were approximately normally distributed. If the continuous variable was skewed, a nonparametric Fisher-Pitman permutation test for independent samples was used (Siegel and Castellan, 1988; Kaiser, 2007). The Fisher-Pitman test, which assumes a continuous scaled variable, is as powerful as the independent groups t test, but without the distributional assumptions. In contrast, a Wilcoxon-Mann-Whitney test is less powerful, since it assumes only an ordered categorical scale, and thus discards information in continuous scaled data (Siegel and Castellan, 1988; Kaiser, 2007). For skewed continuous variables, medians and interquartile ranges (25th and 75th percentiles) are reported in place of means and standard deviations.
-------------
Kaiser J.
An exact and a Monte Carlo proposal to the Fisher-Pitman permutation tests for paired replicates and for independent samples. The Stata Journal 2007;7(3):402-412.

Siegel S and Castellan NJ Jr. Nonparametric Statistics for the Behavioral Sciences, 2nd ed. New York, McGraw-Hill, 1988, pp.151-155.

A shorter version is:

For skewed continuous variables, the two groups were compared using the Fisher-Pitman permutation test for independent samples. The Fisher-Pitman test, which assumes a continuous, or interval, scaled variable, is as powerful as the independent groups t test, but without the distributional assumptions. In contrast, a Wilcoxon-Mann-Whitney test is less powerful, since it assumes only an ordered categorical, or ordinal, scale and thus discards information in continuous scaled data (Siegel and Castellan, 1988; Kaiser, 2007).
-------------
Kaiser J. An exact and a Monte Carlo proposal to the Fisher-Pitman permutation tests for paired replicates and for independent samples. The Stata Journal 2007;7(3):402-412.

Siegel S and Castellan NJ Jr. Nonparametric Statistics for the Behavioral Sciences, 2nd ed. New York, McGraw-Hill, 1988, pp.151-155.

Confidence Intervals

Confidence intervals were used in the reporting styles shown above; they will now be defined. When we compute an effect, such as the difference between two means, we call that the point estimate of the effect. The point estimate is our best guess of what the true population effect is. A confidence interval is called an interval estimate: an interval (lower bound, upper bound) that we can be confident covers, or straddles, the true population effect with some stated level of confidence.

How to Interpret a Confidence Interval

A subtlety that statisticians are careful to maintain is that the population effect, or parameter, is fixed, remaining the same from sample to sample, while the endpoints of the confidence interval are subject to sampling variation.
Statisticians are careful, then, to refer to the 95% confidence interval as "covering" or "containing" the population effect or parameter, thereby implying that only the endpoints vary from sample to sample. Statisticians avoid saying there is a 95% probability that the population effect is contained within the interval, which would imply that the population effect varies from sample to sample while the interval is fixed.

Next, explanations of a confidence interval from three different statistics textbooks are provided.

Meyer (1970, pp.303-304) gives a formula for a confidence interval around a population mean, based on the standard normal distribution, which is a normal distribution with mean 0 and standard deviation equal to 1. In this formula, you can think of 2Φ(z) − 1 as being equal to 0.95, or 95%, the usual case:

    P( X̄ − zσ/√n  <  µ  <  X̄ + zσ/√n )  =  2Φ(z) − 1

"This last probability statement must be interpreted very carefully. It does not mean the probability of the parameter µ falling into the specified interval equals 2Φ(z) − 1; µ is a parameter and either is or is not in the above interval. Rather, the above should be interpreted as follows: 2Φ(z) − 1 equals the probability that the random interval ( X̄ − zσ/√n , X̄ + zσ/√n ) contains µ. Such an interval is called a confidence interval for the parameter µ. Since z is at our disposal, we may choose it so that the above probability equals, say 1 – α."

Chow and Liu (2000, pp.83-84), using the subscript T for test group and R for referent group, with α = 0.025 for a traditional 95% confidence interval around a mean difference, explain, "A (1 – 2α) × 100% confidence interval for µT − µR is a random interval and its associated confidence limits are, in fact, random variables.
The fundamental concept of a (1 – 2α) × 100% confidence interval for µT − µR is that if the same study can be repeatedly carried out many times, say B, then (1 – 2α) × 100% times of the B constructed random intervals will cover µT − µR (Bickel and Doksum, 1977). In other words, in the long run, a (1 – 2α) × 100% confidence interval will have at least a 1 – 2α chance to cover the true mean difference…."

Bain and Engelhardt (1992, p.359) give further clarification of why statisticians use "confidence" rather than "probability" to describe this random interval, in order for the terminology to be precise. They use an example of a 95% confidence interval around a population parameter, denoted θ, which you can think of as the population mean, or else the population mean difference, for purposes of the present discussion, where the interval has already been computed to be (69.9, 130.3):

"…We will refer to this interval as a 95% confidence interval for θ. Because the estimated interval has known endpoints, it is not appropriate to say that it contains the true value of θ with probability 0.95. That is, the parameter θ, although unknown, is a constant, and this particular interval either does or does not contain θ. However, the fact that the associated random interval had probability 0.95 prior to estimation might lead us to assert that we are '95% confident' that 69.9 < θ < 130.3."

Relationship of Confidence Interval and Significance Testing

There is a direct relationship between testing at the 0.05 level (looking for p < 0.05) and constructing a 95% confidence interval. If the null effect (a mean difference of 0, in this case) is contained within the confidence interval, then the test statistic will not be statistically significant. This is true because the confidence interval is algebraically equivalent to the "reference range" inequality around the test statistic.
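This equivalence is easy to verify numerically. The sketch below is in Python rather than the Stata used in this manual, with made-up data for two independent groups (n₁ = 8 and n₂ = 10, so the degrees of freedom and critical value match the chapter's example): it computes the pooled-variance t statistic and the 95% confidence interval from the same ingredients, and checks that |t| exceeds the critical value exactly when the interval excludes 0.

```python
import math

# Hypothetical data for two independent groups (not the chapter's dataset)
g1 = [92, 105, 110, 98, 101, 95, 88, 107]
g2 = [90, 84, 99, 91, 78, 95, 88, 80, 85, 92]

n1, n2 = len(g1), len(g2)
m1 = sum(g1) / n1
m2 = sum(g2) / n2
v1 = sum((x - m1) ** 2 for x in g1) / (n1 - 1)
v2 = sum((x - m2) ** 2 for x in g2) / (n2 - 1)

# Pooled variance and standard error of the mean difference
sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
se = math.sqrt(sp2 / n1 + sp2 / n2)

t_obs = (m1 - m2) / se
t_crit = 2.1199  # t(0.975) with n1 + n2 - 2 = 16 df, as in the chapter

lo = (m1 - m2) - t_crit * se
hi = (m1 - m2) + t_crit * se

# Equivalence: |t| > t_crit (i.e., p < 0.05) exactly when the CI excludes 0
significant = abs(t_obs) > t_crit
ci_excludes_null = not (lo <= 0 <= hi)
assert significant == ci_excludes_null

print(round(t_obs, 3), (round(lo, 2), round(hi, 2)))
```

The assertion holds for any data, because both conditions are the same inequality rearranged, which is the algebraic equivalence the text describes.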
The formula for this t test (when equal variances are assumed) is:

    t  =  ( x̄₁ − x̄₂ ) / s.e.( x̄₁ − x̄₂ )

where

    s.e.( x̄₁ − x̄₂ )  =  √( s²p/n₁ + s²p/n₂ )   and   s²p  =  [ (n₁ − 1)s₁² + (n₂ − 1)s₂² ] / ( n₁ + n₂ − 2 )

Statisticians know how this test statistic will be distributed for any given sample size (by applying the central limit theorem). Using a "reference range" logic, we construct an inequality:

    −t(1−α/2), n−2 df  ≤  t observed  ≤  t(1−α/2), n−2 df

that is,

    −t(1−α/2), n−2 df  ≤  ( x̄₁ − x̄₂ ) / s.e.( x̄₁ − x̄₂ )  ≤  t(1−α/2), n−2 df

Multiplying each term by the denominator of the middle term, and centering the resulting interval on the observed mean difference,

    ( x̄₁ − x̄₂ ) − t(1−α/2), n−2 df × s.e.( x̄₁ − x̄₂ )   ,   ( x̄₁ − x̄₂ ) + t(1−α/2), n−2 df × s.e.( x̄₁ − x̄₂ )

is the formula for the confidence interval around the mean difference. When we choose the criterion p < 0.05 for significance, we say we are testing at the alpha = 0.05 level of significance. To construct the confidence interval, then, we need the value of t at the 97.5th percentile, or 1 − 0.05/2, with n − 2 degrees of freedom. For n = 18, this is

display abs(invttail(16, .975))
2.1199053

Above, we calculated an independent groups t test:

Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       1 |       8       928.5    48.83317    138.1211    813.0279    1043.972
       2 |      10       764.6    67.59359    213.7497    611.6927    917.5073
---------+--------------------------------------------------------------------
combined |      18    837.4444    46.58727    197.6531    739.1539     935.735
---------+--------------------------------------------------------------------
    diff |               163.9    87.52394               -21.64246    349.4425
------------------------------------------------------------------------------
    diff = mean(1) - mean(2)                                      t =   1.8726
Ho: diff = 0                                     degrees of freedom =       16

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.9602         Pr(|T| > |t|) = 0.0795          Pr(T > t) = 0.0398

The 95% confidence interval (−21.64246, 349.4425) around the mean difference of 163.9, where the standard error of the difference was 87.52394, is computed as

display "( " 163.9-2.1199053*87.52394 " , " 163.9+2.1199053*87.52394 " )"
( -21.642464 , 349.44246 )

This interval covers the null effect value of a 0 difference (H₀: µ₁ = µ₂, or µ₁ − µ₂ = 0), so our p value is likewise non-significant (p > 0.05). Note that we do not have to compute this 95% confidence interval with the display command, which was done here only to demonstrate the formula. Just take the confidence interval from the t test output.

Statistical Tests to Identify Outliers

There is an entire class of statistical tests for outlier detection. For example, Rosner (1995, pp.277-282) describes the Extreme Studentized Deviate (or ESD) statistic, which is not available in Stata. Statisticians generally avoid these tests, because it is difficult to argue that the rules used in them are not to some extent arbitrary. If you can verify that an outlying value was due to a data coding error or a laboratory error, then absolutely, you can eliminate the outlier without any mention of it, because it was in actuality an error rather than an outlying value. The FDA Guidance Document provides some excellent approaches to dealing with outliers.
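As a concrete illustration of the kind of rule these tests apply, here is a minimal Python sketch of the single-outlier ESD idea: the most extreme observation's absolute deviation from the sample mean, divided by the sample standard deviation. The data are hypothetical, and Rosner's tabulated critical values, which decide whether the candidate is declared an outlier, are not reproduced here; choosing that cutoff is precisely where the arbitrariness mentioned above comes in.

```python
import math

def esd_statistic(data):
    """Extreme Studentized Deviate for the single most extreme value:
    the largest |x_i - mean| divided by the sample standard deviation."""
    n = len(data)
    mean = sum(data) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    extreme = max(data, key=lambda x: abs(x - mean))
    return abs(extreme - mean) / sd, extreme

# Hypothetical sample with one suspicious value
values = [4.1, 3.8, 5.0, 4.6, 4.3, 3.9, 4.4, 12.7]
stat, candidate = esd_statistic(values)
print(round(stat, 2), candidate)  # prints 2.45 12.7
```

The statistic would then be compared against a tabulated critical value for the chosen significance level and sample size; Rosner's generalized version repeats this after removing each candidate in turn.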
Exercise

Look at the FDA Guidance Document E9 Statistical Principles for Clinical Trials, section 5.3, paragraph 2.

Discussing Outliers in Articles

It is rare to see someone discuss outlier exclusion in an article, because the author is concerned about making readers and editors uncomfortable with the analysis. This practice seems to be reversing somewhat now in the N Engl J Med.

McWilliams et al. (N Engl J Med, 2007) analyzed biennial (every two years) survey data from the Health and Retirement Study. The study hypothesis was that previously uninsured adults who enroll in Medicare at the age of 65 years may have greater morbidity, requiring more intensive and costlier care over subsequent years, than they would if they had been previously insured. In their Statistical Methods section they state, "We excluded a small number (<0.1%) of biennial observations that were extreme outliers (≥50 hospitalizations, ≥300 doctor visits, or total expenditures ≥$2 million)."

Another paper, investigating needlestick injuries among surgical residents, reported in its Results section (Makary et al, N Engl J Med, 2007), "One respondent was excluded from the analysis as an outlier for reporting a range of more than 100 injuries, and two did not report the number of needlestick injuries."

Prespecification of Analysis

The exercise for the interval variable was a good example of the issue of prespecification of analysis, including the Protocol Suggestion (which was a prespecification of analysis).

Exercise

Look at the FDA Guidance Document E9 Statistical Principles for Clinical Trials, section 5.1 Prespecification of Analysis.

Exercise

Look at the Fonseca reprint, Statistical Methods section, second paragraph, where they talk about parametric and nonparametric analysis.
In this paragraph, Fonseca is stating a "prespecification of analysis" to convince the reader that bias was not introduced by the choice of statistics. (That is, without this presentation, the reader might think that Fonseca simply chose to use whichever tests produced a significant result.)

References

Abramson JH, Gahlinger PM. (2001). Computer Programs for Epidemiologists: PEPI Version 4.0. Salt Lake City, UT, Sagebrush Press.
Agresti A. (1990). Categorical Data Analysis. New York, John Wiley & Sons.
Altman DG. (1991). Practical Statistics for Medical Research. New York, Chapman & Hall/CRC.
Bain LJ, Engelhardt M. (1992). Introduction to Probability and Mathematical Statistics, 2nd ed. Pacific Grove CA, Duxbury.
Bergmann R, Ludbrook J, and Spooren WPJM. (2000). Different outcomes of the Wilcoxon-Mann-Whitney test from different statistics packages. The American Statistician 54(1):72-77.
Bickel PJ, Doksum AD. (1977). Mathematical Statistics. San Francisco CA, Holden-Day.
Borenstein M. (1997). Hypothesis testing and effect size estimation in clinical trials. Annals of Allergy, Asthma, & Immunology 78:5-11.
Borenstein M, Rothstein H, Cohen J. (2001). SamplePower® 2.0. Chicago, SPSS Inc. The software can be purchased at http://www.spss.com
Brady K, Pearlstein T, Asnis GM, et al. (2000). Efficacy and safety of sertraline treatment of posttraumatic stress disorder: a randomized controlled trial. JAMA 283(14):1837-1844.
Breslow NE, Day NE. (1980). Statistical Methods in Cancer Research: Volume 1 – The Analysis of Case-Control Studies. Lyon, France, International Agency for Research on Cancer (IARC Scientific Publications No. 32).
Brown KM, Kondeatis E, Vaughan RW, et al. (2006). Influence of donor C3 allotype on late renal-transplantation outcome. N Engl J Med 354(19):2014-23.
Chow S-C, Liu J-P. (2000). Design and Analysis of Bioavailability and Bioequivalence Studies, 2nd ed. New York, Marcel Dekker.
Cochran WG. (1954).
Some methods for strengthening the common χ² tests. Biometrics 10:417-451.
Conover WJ. (1980). Practical Nonparametric Statistics, 2nd ed. New York, John Wiley & Sons, pp.165-169.
Cuchel M, Bloedon LT, Szapary PO, et al. (2007). Inhibition of microsomal triglyceride transfer protein in familial hypercholesterolemia. N Engl J Med 356(2):148-156.
Cytel. (2001). StatXact 5® Statistical Software for Exact Nonparametric Inference, User Manual Volume 2. Cambridge MA, CYTEL Software Corporation.
Daniel WW. (1995). Biostatistics: A Foundation for Analysis in the Health Sciences, 6th ed. New York, John Wiley & Sons.
Fischer B, Lassen U, Mortensen J, et al. (2009). Preoperative staging of lung cancer with combined PET-CT. N Engl J Med 361(1):32-39.
Gonzalez-Martinez JA, Gupta A, Kotagal P, et al. (2005). Hemispherectomy for catastrophic epilepsy in infants. Epilepsia 46(9):1518-25.
Greenland S. (1991). On the logical justification of conditional tests for two-by-two contingency tables. The American Statistician 45(3):248-251.
International Conference on Harmonisation E9 Expert Working Group. (1999). ICH harmonised tripartite guideline: statistical principles for clinical trials. Stat Med 18(15):1905-42. Freely available as a guidance document on the FDA website (word-for-word same content): Guidance for industry: E9 statistical principles for clinical trials. http://www.fda.gov/downloads/Drugs/GuidanceComplianceRegulatoryInformation/Guidances/ucm073137.pdf
Iverson C, Christiansen S, Flanagin A, et al. (2007). AMA Manual of Style: A Guide for Authors and Editors, 10th ed. New York, Oxford University Press.
Kaiser J. (2007). An exact and a Monte Carlo proposal to the Fisher-Pitman permutation tests for paired replicates and for independent samples. The Stata Journal 7(3):402-412.
Makary MA, Al-Attar A, Holzmueller CG, et al. (2007). Needlestick injuries among surgeons in training. N Engl J Med 356(26):2693-9.
Matthews DE, Farewell V.
(1985). Using and Understanding Medical Statistics. New York, Karger.
McWilliams JM, Meara E, Zaslavsky AM, Ayanian JZ. (2007). Use of health services by previously uninsured Medicare beneficiaries. N Engl J Med 357(2):143-53.
Meyer PL. (1970). Introductory Probability and Statistical Applications, 2nd ed. Reading MA, Addison-Wesley Publishing Company.
Rice JA. (1988). Mathematical Statistics and Data Analysis. Pacific Grove CA, Wadsworth & Brooks/Cole.
Rosner B. (1995). Fundamentals of Biostatistics, 4th ed. Belmont CA, Duxbury Press.
Rosner B. (2006). Fundamentals of Biostatistics, 6th ed. Belmont CA, Duxbury Press.
Rothman KJ. (2002). Epidemiology: An Introduction. New York, Oxford University Press.
Siegel S and Castellan NJ Jr. (1988). Nonparametric Statistics for the Behavioral Sciences, 2nd ed. New York, McGraw-Hill.
Steyerberg EW. (2009). Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating. New York, Springer.
Stoddard GJ, Ring WH. (1993). How to evaluate study methodology in published clinical research. J Intravenous Nursing 16(2):110-117.