Biostats VHM 801/802 Courses, Fall 2007, Atlantic Veterinary College, PEI
Henrik Stryhn

Notes on sample size calculations

These notes are intended to supplement, not replace, material in statistics textbooks ([1], [4]) about sample size determination. Their purpose is twofold: 1) to review ways of, and arguments for, choosing sample size, and 2) to show how non-standard sample size questions may often be rephrased as simple questions in designs with accessible answers (formulae or software). These purposes are separated into sections 1 and 2. The sole statistical prerequisites of the notes are basic notions of confidence intervals and test statistics. However, for section 2 to be of interest, acquaintance with the partly complex designs and models reviewed there by example is required, probably corresponding to the level of an "advanced" biostatistics course (VHM 802).

1. Introduction to sample size calculations

The primary purpose of sample size calculations is the planning of experiments or studies where statistical data analysis is contemplated: how many "subjects" or "experimental units" are needed to achieve a reasonable (desired) precision? The specific meaning of "subjects" depends on the actual study, and in complex designs there may be more than one type of "experimental unit" (multi-level designs, discussed in section 2). In simple designs, the meaning is usually straightforward; for example, in a study comparing treatments given to dogs to alleviate a disease, the experimental units are the dogs.

There are several ways of quantifying or specifying "precision", the most common of which follow:
– the length of a confidence interval (of chosen coverage, say 0.95) for a parameter of interest, or, almost equivalently, the standard error of the corresponding parameter estimate. This corresponds to situations where primary interest is in the parameter and its estimate, and not necessarily in a particular test of a hypothesis involving the parameter;
– the power of a statistical test to show a significant difference of a certain magnitude. Specifically, the power is the probability that the outcome of the statistical test is significant at a prescribed significance level (e.g., 0.05), given that the null hypothesis being tested is false. This probability naturally depends on what is actually true (intuitively, how far the null hypothesis is off the true situation), which must therefore be specified to perform any power calculations. For example, if the null hypothesis is that a treatment and a control group have equal means, power calculations require specification of the true difference between the means in the two groups.

We review these two approaches in the context of normal distribution models for the basic experimental designs: one and two samples. For non-normal data, the concepts are similar, but in most cases all calculations require specialized software or computation (briefly reviewed in Note 1.4 below).

Example 1.1: One-sample model/design

Consider a statistical model which assumes observations y1, ..., yn to be independent and distributed according to N(µ, σ²). The primary interest is usually in the mean parameter µ, although both the mean and the standard deviation σ are assumed unknown. A (1−α) confidence interval for µ takes the well-known form:

   µ : ȳ ± t(1−α/2, n−1) s/√n,   (1)

where t(1−α/2, n−1) is the (1−α/2)-percentile of a t-distribution with n−1 degrees of freedom.
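As a concrete illustration, here is a minimal Python sketch (using scipy; the data values are invented) computing the interval (1) for a single sample:

```python
# Compute the t-based confidence interval (1) for a single sample.
import numpy as np
from scipy import stats

y = np.array([12.1, 9.8, 11.4, 10.3, 13.0, 10.9])  # illustrative data
n = len(y)
ybar, s = y.mean(), y.std(ddof=1)                   # estimate ybar and sample sd s
alpha = 0.05
t = stats.t.ppf(1 - alpha / 2, df=n - 1)            # t(1-alpha/2, n-1)
margin = t * s / np.sqrt(n)                         # margin of error
print(f"95% CI for mu: {ybar:.2f} +/- {margin:.2f}")
```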
In order to achieve a confidence interval of a certain, prescribed length (say, not exceeding L), the obvious procedure is to solve the equation L ≥ 2 t(1−α/2, n−1) s/√n with respect to n. Note that the "2" on the right-hand side stems from the length of the interval being twice its margin of error (t(1−α/2, n−1) s/√n). Solving instead for the margin of error (say M = L/2), where the equation is M ≥ t(1−α/2, n−1) s/√n, gives of course the same result. To make the mechanics of solving these equations a bit easier, one usually approximates the t-distribution percentile by the corresponding value from a N(0,1) distribution. This approximation is valid only when n is large, and the calculation should be redone with a suitable t-distribution percentile if the resulting n is not large. Also, the value of s is unknown and must be substituted by an assumed/estimated/guessed value of the true standard deviation (for simplicity also denoted σ), usually an estimate from previous data or a literature search. The (approximate) equation for n becomes

   n ≥ [z(1−α/2) σ/M]².   (2)

Choice of sample size based on power calculations is usually less natural for single-sample models because of the lack of interesting null hypotheses. We therefore defer the discussion to the next example below on matched samples, which are effectively treated as a single sample.

The formula (1) is the simplest example of a general method for computing confidence intervals in normal distribution models (more precisely, the method applies to confidence intervals for mean parameters in models with a single error term assumed to follow N(0, σ²)). For a parameter of interest (Par, in the example above: µ), an estimate of the parameter (Est, above: ȳ), a standard error of the estimate (SE(Est), above: s/√n) and the degrees of freedom for the variance estimate in the model (df, above: n−1), a (1−α) confidence interval takes the form (using the notation from [1]):

   Par : Est ± t(1−α/2, df) SE(Est).   (3)

Furthermore, the standard error SE(Est) takes the form s × constant, where s is the estimated (error) standard deviation in the model, and the constant depends only on the design and the estimate used. In the above example, the constant equals 1/√n. Generally, the constant will involve the dimensions of the design. Therefore, the method of inverting the confidence interval with respect to n applies generally to models of this type, once the constant has been figured out.
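As a sketch of the mechanics, the following Python code solves (2) with the normal approximation and then redoes the calculation with t-percentiles, as recommended above. The helper name n_for_margin is our own; the values σ = 10 and M = 3 anticipate the blood-pressure example in Example 1.2 below.

```python
# Sample size for a one-sample (or paired-difference) confidence interval,
# formula (2), followed by the t-percentile refinement described in the text.
import math
from scipy import stats

def n_for_margin(sigma, M, alpha=0.05):
    # Step 1: normal approximation, n >= [z(1-alpha/2) * sigma / M]^2
    z = stats.norm.ppf(1 - alpha / 2)
    n = math.ceil((z * sigma / M) ** 2)
    # Step 2: redo with the t-percentile until n stabilizes
    while True:
        t = stats.t.ppf(1 - alpha / 2, df=n - 1)
        n_new = math.ceil((t * sigma / M) ** 2)
        if n_new <= n:
            return n_new
        n = n_new

# The z approximation alone gives 43; the t-percentile check raises this to 46.
print(n_for_margin(sigma=10, M=3))
```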
Example 1.2: Two paired (matched, dependent) samples

One standard way of analyzing paired samples is to create differences, say di = y1i − y2i, where y1i and y2i constitute the pair of values in the ith pair, and to assume the differences form a single sample of independent observations from N(µ, σ²). The pairs may be two values from the same subject (examples are abundant: measurements of the left and right legs, arms, eyes or lungs from human or animal patients, or measurements before/after an intervention), and interest lies in comparing the two values in the pair because they have been treated differently, usually to give a treatment and a control measurement in each pair. The parameter of interest is µ, the difference in means between the first and second value in a pair, and the null hypothesis of interest is H0 : µ = 0.

Sample size may be selected based on a desired margin of error (M) for a confidence interval for the mean difference, using formula (2). In our context, n is the number of pairs, and σ is the standard deviation of differences within a pair. Our test statistic for H0 is t = d̄/(s_d/√n), and the power for a test at significance level α, given a true difference in population means of δ, is the probability Pr_δ(|t| ≥ t(1−α/2, n−1)), where the subscript δ refers to the true distribution of d1, ..., dn being N(δ, σ²). Unfortunately, under this assumption the distribution of t is somewhat complex (a so-called non-central t-distribution), requiring special tables or software to compute probabilities. Therefore, one typically resorts to statistical software for power calculations and for determining the sample size required to obtain a desired power.

A small numerical example illustrates the concepts. Assume differences in blood pressure before and after an intervention to be normally distributed with an unknown mean and a standard deviation which is guessed to be 10 units (mm Hg). To achieve a 95% confidence interval for the mean difference with a margin of error of at most 3 units requires at least (1.96 · 10/3)² = 42.7 ≈ 43 subjects. With 43 subjects, the probability/power of detecting a true difference before and after the intervention of 3 units, using a 5% significance level, is 0.485, and to achieve a power of 0.80 would require a minimum sample size of 90 subjects (values obtained from Minitab).
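The same power values can be computed directly from the non-central t-distribution; here is a minimal sketch using scipy (the function name paired_t_power is ours), reproducing the numbers quoted above:

```python
# Power of the paired t-test via the non-central t-distribution,
# for the blood-pressure example (delta = 3, sigma = 10, alpha = 0.05).
from scipy import stats

def paired_t_power(delta, sigma, n, alpha=0.05):
    df = n - 1
    ncp = delta / (sigma / n ** 0.5)         # non-centrality parameter
    tcrit = stats.t.ppf(1 - alpha / 2, df)   # two-sided critical value
    # P(|t| >= tcrit) under the non-central t with the given ncp
    return (1 - stats.nct.cdf(tcrit, df, ncp)) + stats.nct.cdf(-tcrit, df, ncp)

print(round(paired_t_power(3, 10, 43), 3))   # approx 0.485, as in the text

# smallest n giving power >= 0.80
n = 2
while paired_t_power(3, 10, n) < 0.80:
    n += 1
print(n)                                     # about 90, matching the Minitab value
```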
Example 1.3: Two independent samples

Consider a statistical model with independent samples (y11, ..., y1n1) and (y21, ..., y2n2) from normal distributions N(µ1, σ1²) and N(µ2, σ2²), respectively. If the standard deviations in the two populations are equal (i.e., σ1 = σ2 = σ), the highest precision for inference about the mean difference is achieved by taking equally many observations in the two samples (i.e., n1 = n2 = n). Furthermore, similar procedures as above apply to selecting the sample size, based either on a desired confidence interval size for the mean difference or on the power of a t-test for the null hypothesis H0 : µ1 = µ2. The relevant quantities for the confidence interval are Par = µ1 − µ2, Est = ȳ1· − ȳ2· and SE(Est) = s × √(2/n) (because Var(ȳ1· − ȳ2·) = σ² × 2/n). The resulting (approximate) formula for n, analogous to (2), is

   n ≥ 2 [z(1−α/2) σ/M]².   (4)

Beware again the different interpretations of σ: here σ is the guessed, common value of the standard deviations in the two populations. Calculations of power, or of the sample size to achieve a desired power, additionally require values of a true (non-zero) difference between the population means and of the significance level (α).

Note 1.4: Software and Simulation

A plethora of statistical software exists for power and sample size calculations ([10]). Most major statistical packages like Stata and Minitab have built-in routines for the simplest designs (one- and two-sample models for discrete and continuous data); in particular, the Minitab implementation is user-friendly. We do not attempt a full review here, but refer to the following links (as many routines are available on the Internet for free download or free use):
– http://calculators.stat.ucla.edu/powercalc/ (UCLA power calculators; as of September 2007 these are disabled, and it is not clear whether they will be made available again),
– http://www.stat.uiowa.edu/~rlenth/Power/index.html (applets; active researcher),
– http://www.bio.ri.ccf.org/power.html (SAS procedures, incl. the UnifyPow macro [6]),
– http://statpages.org/index.html#Power (list of interactive stats webpages).
These lists of available software, plus the built-in and add-on facilities of statistical packages such as Minitab, Stata, and SAS, cover most reasonably standard situations. Applied reviews of software and methodology have been published in the literature ([5], [6], [10]). Criticism of the widespread (mis-)use of power calculations after the study has been carried out, so-called retrospective or post-hoc power calculations, has also been published ([3], [8]).

A general approach for any particular question related to any specific model is to explore the model's properties by simulation: use simulated random data to mimic outcomes from the model, and evaluate the performance of test statistics (or other statistics) by their outcomes among the simulated data sets (see the sketch below). The practical limitation of this approach is that it must be possible/feasible/easy to simulate random data from the model; some tips on simulating random effects models for discrete data are collected in the conference paper [9]. The paper [2] describes a simulation approach using Stata.
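To make the simulation idea concrete, the following minimal sketch estimates the power of the two-sample t-test of Example 1.3 by simulation. The values δ = 5, σ = 10 and 64 per group are invented for illustration; with these, the estimate should come out near 0.80.

```python
# Simulation-based power estimate for the two-sample t-test (Example 1.3),
# illustrating the general simulation approach described in Note 1.4.
import numpy as np
from scipy import stats

rng = np.random.default_rng(12345)

def simulated_power(delta, sigma, n, alpha=0.05, nsim=10_000):
    hits = 0
    for _ in range(nsim):
        y1 = rng.normal(0.0, sigma, size=n)    # control group
        y2 = rng.normal(delta, sigma, size=n)  # treatment group, true difference delta
        p = stats.ttest_ind(y1, y2).pvalue     # equal-variance two-sample t-test
        hits += (p < alpha)
    return hits / nsim

print(simulated_power(delta=5, sigma=10, n=64))  # close to 0.80
```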
2. An approach to nonstandard sample size calculations

As has hopefully become apparent from the discussion in the previous section, the essential tool in sample size calculations is the ability to calculate the standard error of certain estimates, involved in either confidence intervals or test statistics. Such calculations are possible in a much wider range of models and designs than those reviewed so far, allowing us to handle such models and designs in much the same way as the basic designs.

Example 2.1: Balanced ANOVA model without interactions

Consider as an example the additive 2-way ANOVA model,

   yijk = µ + αi + βj + εijk,  i = 1, ..., a; j = 1, ..., b; k = 1, ..., c,   (5)

with indices i and j corresponding to two factors A and B, and with index k corresponding to replications. If interest is in comparing two levels of factor A (e.g., if A has only 2 levels), the appropriate statistic is Est = ȳ1·· − ȳ2··. The variance of Est is σ² × 2/(bc), where σ is the standard deviation of the error terms εijk in the model. Therefore, sample size calculations based on the precision of the two levels of factor A are almost entirely similar to two-sample calculations (more precisely, similar except for the impact of the degrees of freedom for the estimated variance in the model; this can usually be ignored as long as the resulting value is checked afterwards and is not critically small), only with the number of observations in each group (bc) in place of the sample size n, and the standard deviation of the error terms in place of the sample standard deviation. However, the calculations will be based on an additional assumption, the lack of interaction in (5), and the error standard deviation may be much smaller than the standard deviations among observations in each of the two groups (if factor B has a large effect). If it is desired to base sample size calculations on more than two levels of factor A, the reduction from model (5) is to a 1-way ANOVA. Models with more factors and unbalanced designs are dealt with along similar lines.

Example 2.2: Balanced ANOVA model with interaction

We extend the model (5) with an interaction between factors A and B,

   yijk = µ + αi + βj + γij + εijk,  i = 1, ..., a; j = 1, ..., b; k = 1, ..., c.   (6)

Assume that interest is in the interaction and, for simplicity, that both factors A and B have only 2 levels (a = b = 2). Then the interaction has only a single degree of freedom and can be represented by the estimate (contrast) Est = ȳ11· + ȳ22· − ȳ12· − ȳ21· (in a parameterization with the parameter restrictions Σi αi = 0, Σj βj = 0 and Σi γij = Σj γij = 0, which imply γ11 = γ22 = −γ12 = −γ21, the contrast Est estimates γ11 + γ22 − γ12 − γ21). The interpretation of Est is that it estimates the difference between factor A differences at the two levels of factor B (recall that the presence of interaction means exactly that factor A differences are not the same at the different levels of factor B). For sample size calculations based on Est, we calculate Var(Est) = σ² × 4/c. The formula tells us how to base sample size on a desired confidence interval length for the interaction. For power calculations, we compare the formula to the two-sample situation: it has a "4" instead of a "2", because the interaction is estimated from 4 groups of size c instead of the usual 2 groups. However, we can "fix" this by a little rewriting: Var(Est) = (√2 σ)² × 2/c. This shows that sample size calculations may be based on two-sample formulae/software, if we take as the standard deviation the error standard deviation multiplied by √2, and as the group size the number of replications (c) within each group of the combined factor A×B.
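To illustrate the mapping, here is a minimal sketch using statsmodels' TTestIndPower in the role of generic two-sample power software; the values of σ and δ are invented for illustration.

```python
# Example 2.2 mapped onto two-sample power software: take sqrt(2)*sigma as the
# standard deviation and the number of replications c as the per-group size.
import math
from statsmodels.stats.power import TTestIndPower

sigma = 8.0   # assumed error standard deviation (illustrative value)
delta = 6.0   # size of the interaction contrast to detect (illustrative value)

# effect size for the two-sample formulation: delta / (sqrt(2) * sigma)
es = delta / (math.sqrt(2) * sigma)
c = TTestIndPower().solve_power(effect_size=es, power=0.80, alpha=0.05)
print(math.ceil(c))  # replications needed per cell of the combined factor AxB
```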
We next turn to random effects models, and consider as our example a two-level data structure which could correspond to modelling data for animals within herds. It can be said generally that with an increasing number of random effects, the computation of variances for the estimates of interest becomes more difficult, and, more importantly, the assessment of estimates of the variation in the data becomes more speculative (and really needs to be based on a previous study of a similar type). The basic statistical model is (using multiple indices here to denote the two levels of the model instead of the different factor levels):

   yij = β0 + β1 x1ij + ... + βr xrij + Ai + εij,  i = 1, ..., a; j = 1, ..., m,   (7)

where the random effects Ai are assumed to follow N(0, σ_A²) and the errors εij are assumed to follow N(0, σ_ε²). In the context of animals within herds, the index i corresponds to herds and the index j corresponds to animals. For simplicity, we have taken the number of animals in each herd to be the same for all herds (m). The fixed part of the model, involving possibly both factors and regression variables, is contained in the expression β0 + β1 x1ij + ... + βr xrij, where β0 is usually termed the "intercept", and β1, ..., βr are the regression coefficients for the explanatory variables (possibly dummy variables) x1, ..., xr.

One important feature of two-level models is that explanatory variables may vary at different levels (in a designed experiment, treatments may be applied at different levels). For example, treatments may be applied to animals, but herd management factors are applied to herds, not animals. Another way of saying this is that there are different experimental units in the design/model: the lowest-level units (animals) and the highest-level units (herds). When carrying out sample size calculations for multi-level models, it is crucial to correctly identify the experimental unit for the factor of interest. If the experimental unit is at the lowest level, the upper level(s) may be ignored for sample size calculations, and the same precision is obtained whether replications are obtained at the lowest or at higher levels. In our example, it thus makes no difference for a comparison of animal treatments, based on model (7), whether replications involve more animals per herd or more herds. For explanatory variables at a higher level, the specific assumptions of the model (7) become important (see the examples below).

Explanatory variables may also both vary at the lowest level and show considerable clustering at higher levels (typically in observational studies); in such situations, sample size calculations become quite difficult, and procedures will depend on the specific circumstances of the study design (and simulation may become an attractive option).

Example 2.3: Two-level model with same correlations within a group

For model (7), if we assume that the random variables (Ai) and (εij) are all independent, the model effectively assumes all observations within a level/group to be (positively) correlated and to the same degree (the technical name for this assumption is a compound symmetry or exchangeable correlation structure). This would seem a reasonable model for our example with animals within a herd if there are no a priori reasons to expect some animals to be more strongly correlated than others, after the fixed effects in the model are taken into account. Herd factor comparisons will be based on herd averages, so in order to assess precision we need to compute their variances. Considering the first herd, for ease of notation, we have from the model formula (7)

   ȳ1· = β0 + β1 x̄11· + ... + βr x̄r1· + A1 + ε̄1· = µ1 + A1 + ε̄1·,

writing just µ1 for the fixed part of the model. By the assumption of all random variables being independent, we compute

   Var(ȳ1·) = Var(A1) + Var(ε̄1·) = σ_A² + σ_ε²/m,   (8)

where m is the number of animals in the herd. The formula shows that taking more animals within a herd will only partly improve the precision of herd averages, and if the variation between herds (σ_A) is large relative to the variation within herds, it has only little impact. Therefore, the relevant parameter to adjust to achieve the desired precision of herd comparisons is the number of herds, and the appropriate standard deviation to use is given by (the square root of) formula (8). To exemplify, if two types of herds are to be compared, sample size calculations for a two-sample model apply to determine the number of herds of each type. Values of σ_A and σ_ε need to be assessed, possibly from previous studies. It is also possible to rewrite the right-hand side of (8) as σ²(1 + ρ(m−1))/m, where σ² = σ_A² + σ_ε² is the total variance of each observation, and ρ is the (intraclass) correlation between two animals in the same herd. It may be easier or more intuitive to assess these values than the two variance components.

Example 2.4: Two-level model with autoregressive correlations over time

Model (7) is also useful for repeated measures data with a series of observations (typically over time) on a number of subjects (say, animals). Then the upper level (index i) corresponds to animals, and the lower level (index j) corresponds to time. It is often of interest to extend the model from Example 2.3 with additional correlations between the error terms εij within each subject, and we consider here only the simplest of these, an autoregressive correlation structure (shown here for ni = 4):

   Corr(εi1, εi2, εi3, εi4) =
       [ 1              ]
       [ ρ    1         ]
       [ ρ²   ρ    1    ]
       [ ρ³   ρ²   ρ   1 ]

The model contains ρ as an additional parameter: the correlation between errors of two observations adjacent in time. The variance of subject averages is still of the same form as in (8),

   Var(ȳ1·) = σ_A² + Var(ε̄1·),

and for the autoregressive correlation structure the last term can be computed to yield

   Var(ε̄1·) = σ_ε²/m + 2σ_ε²ρ (m − 1 − mρ + ρ^m) / (m²(1−ρ)²).   (9)

The remaining part of the sample size calculations follows the same route as in Example 2.3. To illustrate the formulae, the following table shows estimated values of the variance components and the resulting subject mean variances; the data are milk yields with 6 measures per cow, from [7].

                           Same correlations    Autoregressive
   Model estimates         σ̂_A² = 21.4          σ̂_A² = 18.7
                           σ̂_ε² = 17.9          σ̂_ε² = 20.1
                                                ρ̂ = 0.239
   Subject mean variance   24.4                 23.7
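As a check, the following minimal sketch recomputes the two subject mean variances in the table above directly from formulas (8) and (9), using the quoted estimates and m = 6:

```python
# Subject mean variances from formulas (8) and (9) for the milk-yield table.
def var_mean_exchangeable(var_a, var_e, m):
    # formula (8): Var(ybar) = sigma_A^2 + sigma_e^2 / m
    return var_a + var_e / m

def var_mean_ar1(var_a, var_e, rho, m):
    # formula (9): Var(ybar) = sigma_A^2 + sigma_e^2/m
    #   + 2*sigma_e^2*rho*(m - 1 - m*rho + rho**m) / (m**2 * (1 - rho)**2)
    var_ebar = var_e / m + (2 * var_e * rho * (m - 1 - m * rho + rho ** m)
                            / (m ** 2 * (1 - rho) ** 2))
    return var_a + var_ebar

print(round(var_mean_exchangeable(21.4, 17.9, 6), 1))  # 24.4
print(round(var_mean_ar1(18.7, 20.1, 0.239, 6), 1))    # 23.7
```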
References

[1] Christensen, R. (1996), Analysis of Variance, Design and Regression, Chapman & Hall / CRC Press, Boca Raton.
[2] Feiveson, A. H. (2002), Power by simulation, The Stata Journal 2, 107–124.
[3] Hoenig, J. M. & Heisey, D. M. (2001), The abuse of power: the pervasive fallacy of power calculations for data analysis, The American Statistician 55, 19–24.
[4] Moore, D. S. & McCabe, G. P. (2005), Introduction to the Practice of Statistics, 5th ed., W. H. Freeman and Company, New York.
[5] Lenth, R. V. (2001), Some practical guidelines for effective sample size determination, The American Statistician 55, 187–193.
[6] Lewis, K. P. (2006), Statistical power, sample sizes, and the software to calculate them easily, BioScience 56, 607–612.
[7] Nødtvedt, A., Dohoo, I. R., Sanchez, J., Conboy, G., DesCôteaux, L. & Keefe, G. (2002), Increase in milk yield following eprinomectin treatment at calving in pastured dairy cattle, Vet. Parasitol. 105, 191–206.
[8] Smith, A. H. & Bates, M. N. (1992), Confidence limit analyses should replace power calculations in the interpretation of epidemiologic studies, Epidemiology 3, 449–452.
[9] Stryhn, H., Dohoo, I. R., Tillard, E. & Hagedorn-Olsen, T. (2000), Simulation as a tool of validation in hierarchical generalised linear models, IXth International Conference of Veterinary Epidemiology and Economics, Breckenridge, Colorado, August 2000.
[10] Thomas, L. & Krebs, C. J. (1997), A review of statistical power analysis software, Bull. Ecol. Soc. Am. 78, 126–139.