Biostats VHM 801/802 Courses
Fall 2007, Atlantic Veterinary College, PEI
Henrik Stryhn
Notes on sample size calculations
These notes are intended to supplement, not replace, material in statistics textbooks ([1],[4]) about
sample size determination. Their purpose is twofold:
1) to review ways of, and arguments for, choosing a sample size,
2) to show how non-standard sample size questions may often be rephrased as simple
questions in designs with accessible answers (formulae or software).
These purposes are addressed in sections 1 and 2, respectively. The sole statistical prerequisites of the notes
are basic notions of confidence intervals and test statistics. However, for section 2 to be of interest, acquaintance with the partly complex designs and models reviewed there by example, probably
corresponding to the level of an “advanced” biostatistics course (VHM 802), would be required.
1. Introduction to sample size calculations
The primary purpose of sample size calculations is for planning of experiments or studies where
statistical data analysis is contemplated: how many “subjects” or “experimental units” are needed to
achieve a reasonable (desired) precision? The specific meaning of “subjects” depends on the actual
study, and in complex designs there may be more than one type of “experimental unit” (multi-level
designs, discussed in section 2). In simple designs, the meaning is usually straightforward; for example,
in a study comparing treatments given to dogs to alleviate a disease, the experimental units are the
dogs. There are several ways of quantifying or specifying “precision”; the two most common
follow:
– the length of a confidence interval (of chosen coverage, say 0.95) for a parameter of interest, or
almost equivalently the standard error of the corresponding parameter estimate, corresponding
to situations where primary interest is in the parameter and its estimate, and not necessarily
in a particular test for a hypothesis involving the parameter,
– the power of a statistical test to show a significant difference of a certain magnitude. Specifically,
the power is the probability that the statistical test is significant at a prescribed
significance level (e.g., 0.05), given that the null hypothesis being tested is false. This probability
naturally depends on what is actually true — intuitively, how far the null hypothesis is off the
true situation — which must therefore be specified to perform any power calculations. For
example, if the null hypothesis is that a treatment and a control group have equal means, power
calculations require specification of the true difference between the means in the two groups.
We review these two approaches in the context of normal distribution models for the basic experimental designs: one and two samples. For non-normal data, the concepts are similar but in most cases
all calculations require specialized software or computation (briefly reviewed in Note 1.4 below).
Example 1.1: One-sample model/design
Consider a statistical model which assumes observations $y_1, \ldots, y_n$ to be independent and distributed
according to $N(\mu, \sigma^2)$. The primary interest is usually in the mean parameter $\mu$, although both the
mean and the standard deviation $\sigma$ are assumed unknown. A $(1-\alpha)$ confidence interval for $\mu$ takes
the well-known form:
\[ \mu:\quad \bar{y} \pm t(1-\tfrac{\alpha}{2},\, n-1)\; s/\sqrt{n}, \tag{1} \]
where $t(1-\tfrac{\alpha}{2}, n-1)$ is the $(1-\tfrac{\alpha}{2})$-percentile in a t-distribution with $n-1$ degrees of freedom. In
order to achieve a confidence interval of a certain prescribed length (say, not exceeding $L$), the
obvious procedure is to solve the inequality $L \ge 2\, t(1-\tfrac{\alpha}{2}, n-1)\, s/\sqrt{n}$ with respect to $n$. Note that
the “2” on the right hand side stems from the length of the interval being twice its margin of error,
$t(1-\tfrac{\alpha}{2}, n-1)\, s/\sqrt{n}$. Solving for the margin of error (say $M = L/2$) of course gives the same result;
the inequality is then $M \ge t(1-\tfrac{\alpha}{2}, n-1)\, s/\sqrt{n}$. To make the mechanics of solving these inequalities
a bit easier, one usually approximates the t-distribution percentile by the corresponding value from
a $N(0,1)$ distribution. This approximation is valid only when $n$ is large, and the calculation should
be redone with a suitable t-distribution percentile if the resulting $n$ is not large. Also, the value of $s$
is unknown at the planning stage and must be replaced by an assumed/estimated/guessed value of the true
standard deviation (for simplicity denoted also by $\sigma$), usually an estimate from previous data or a
literature search. The (approximate) inequality for $n$ becomes
\[ n \ge \left[ z(1-\tfrac{\alpha}{2})\, \sigma / M \right]^2 . \tag{2} \]
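As a minimal sketch of formula (2), the following Python function returns the smallest n that satisfies the normal-approximation inequality. The function name `n_for_margin` and the example values (a guessed σ of 10 and a margin of error M of 3) are illustrative assumptions, not part of the notes; as stated above, the result should be rechecked with a t-distribution percentile when n turns out to be small.

```python
import math
from statistics import NormalDist


def n_for_margin(sigma, margin, alpha=0.05):
    """Smallest n with z(1 - alpha/2) * sigma / sqrt(n) <= margin, as in formula (2)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # standard normal percentile
    return math.ceil((z * sigma / margin) ** 2)


# Illustrative (assumed) values: guessed sigma = 10, desired margin of error M = 3
n_example = n_for_margin(sigma=10, margin=3)
print(n_example)
```

Halving the margin of error roughly quadruples the required n, since M enters formula (2) squared.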
Choice of sample size based on power calculations is usually less natural for single-sample models
because of the lack of interesting null hypotheses. We therefore defer the discussion to the next example
on matched samples, which are effectively treated as a single sample.
The formula (1) is the simplest example of a general method for computing confidence intervals in
normal distribution models¹. For a parameter of interest ($\mathrm{Par}$; in the example above, $\mu$), an estimate
of the parameter ($\mathrm{Est}$; above, $\bar{y}$), a standard error of the estimate ($SE(\mathrm{Est})$; above, $s/\sqrt{n}$) and the
degrees of freedom for the variance estimate in the model ($\mathrm{df}$; above, $n-1$), a $(1-\alpha)$ confidence
interval takes the form (using the notation from [1]),
\[ \mathrm{Par}:\quad \mathrm{Est} \pm t(1-\tfrac{\alpha}{2},\, \mathrm{df})\; SE(\mathrm{Est}). \tag{3} \]
Furthermore, the standard error $SE(\mathrm{Est})$ takes the form $s \times \sqrt{\text{constant}}$, where $s$ is the estimated
(error) standard deviation in the model, and the constant depends only on the design and the estimate
used. In the above example, the constant equals $1/n$. Generally, the constant will involve the
dimensions of the design. Therefore, the method of inverting the confidence interval with respect to $n$
applies generally to models of this type, when the constant has been figured out.
Example 1.2: Two paired (matched, dependent) samples
One standard way of analyzing paired samples is to create differences, say $d_i = y_{1i} - y_{2i}$, where $y_{1i}$ and
$y_{2i}$ constitute the pair of values in the $i$th pair, and to assume the differences form a single sample
of independent observations from $N(\mu, \sigma^2)$. The pairs may be two values from the same subject²,
and interest lies in comparing the two values in the pair because they have been treated differently,
usually to give a treatment and a control measurement in each pair. The parameter of interest is $\mu$,
the difference in means between the first and second value in a pair, and the null hypothesis of interest
is $H_0: \mu = 0$. Sample size may be selected based on a desired margin of error ($M$) for a confidence
interval for the mean difference, using formula (2). In our context, $n$ is the number of pairs, and
$\sigma$ is the standard deviation of differences within a pair. Our test statistic for $H_0$ is $t = \bar{d}/(s_d/\sqrt{n})$,
and the power for a test at significance level $\alpha$ given a true difference in population means of $\delta$ is
the probability $\Pr_\delta(|t| \ge t(1-\tfrac{\alpha}{2}, n-1))$, where subscript $\delta$ refers to the true distribution of
$d_1, \ldots, d_n$ being $N(\delta, \sigma^2)$. Unfortunately, under this assumption the distribution of $t$ is somewhat
complex: a so-called non-central t-distribution, requiring special tables or software to compute probabilities.
Therefore, one typically resorts to statistical software for power calculations and for determining
the required sample size to obtain a desired power. A small numerical example illustrates the concepts.
Assume differences in blood pressure before and after an intervention to be normally distributed with
an unknown mean and a standard deviation which is guessed to be 10 units (mm Hg). Achieving a
95% confidence interval for the mean difference with a margin of error of at most 3 units requires at
least $(1.96 \cdot 10/3)^2 = 42.7$, rounded up to 43, subjects. With 43 subjects, the probability/power of
detecting a true difference before and after the intervention of 3 units, using a 5% significance level,
is 0.485, and to achieve a power of 0.80 would require a minimum sample size of 90 subjects (values
obtained from Minitab).

¹ More precisely, it applies to confidence intervals for mean parameters in models with a single error term assumed
to follow $N(0, \sigma^2)$.
² Examples are abundant: measurements of the left and right legs (arms, eyes, lungs) from human or animal patients,
or measurements before/after an intervention.
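The exact power quoted above rests on the non-central t-distribution. As a rough cross-check only, the sketch below replaces it with a normal approximation (treating σ as known); this is an assumption made here and not the method used by Minitab, and the helper names are hypothetical. For the blood-pressure example it gives a power of about 0.50 with 43 pairs and 88 pairs for a power of 0.80, slightly more optimistic than the t-based values 0.485 and 90.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def approx_power(n, delta, sigma, alpha=0.05):
    """Normal-approximation power of a two-sided one-sample (paired) test."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = delta * math.sqrt(n) / sigma  # standardized true difference
    # Rejection probability in the upper tail plus the (tiny) lower tail
    return STD_NORMAL.cdf(shift - z_crit) + STD_NORMAL.cdf(-shift - z_crit)


def approx_n_for_power(power, delta, sigma, alpha=0.05):
    """Approximate number of pairs for a target power (lower tail ignored)."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_beta = STD_NORMAL.inv_cdf(power)
    return math.ceil(((z_alpha + z_beta) * sigma / delta) ** 2)


power_43 = approx_power(n=43, delta=3, sigma=10)
n_80 = approx_n_for_power(power=0.80, delta=3, sigma=10)
print(round(power_43, 3), n_80)
```

The gap between these numbers and the exact ones shrinks as n grows, because the t-distribution then approaches the normal distribution.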
Example 1.3: Two independent samples
Consider a statistical model with independent samples $(y_{11}, \ldots, y_{1n_1})$ and $(y_{21}, \ldots, y_{2n_2})$ from normal
distributions $N(\mu_1, \sigma_1^2)$ and $N(\mu_2, \sigma_2^2)$, respectively. If the standard deviations in the two populations
are equal (i.e., $\sigma_1 = \sigma_2 = \sigma$), highest precision for inference about the mean difference, for a given total
number of observations, is achieved by taking equally many observations in the two samples (i.e., $n_1 = n_2 = n$).
Furthermore, procedures similar to those above apply to selecting the sample size based either on a desired
confidence interval length for the mean difference, or on the power of a t-test for the null hypothesis
$H_0: \mu_1 = \mu_2$. The relevant quantities for the confidence interval are $\mathrm{Par} = \mu_1 - \mu_2$,
$\mathrm{Est} = \bar{y}_{1\cdot} - \bar{y}_{2\cdot}$ and $SE(\mathrm{Est}) = s \times \sqrt{2/n}$
(because $\mathrm{Var}(\bar{y}_{1\cdot} - \bar{y}_{2\cdot}) = \sigma^2 \times 2/n$). The resulting (approximate) formula for $n$,
the number of observations in each sample, analogous to (2), is
\[ n \ge 2 \left[ z(1-\tfrac{\alpha}{2})\, \sigma / M \right]^2 . \tag{4} \]
Beware again the different interpretations of $\sigma$: here $\sigma$ is the guessed, common value of the standard
deviations in the two populations. Calculations of power, or of the sample size needed to achieve a desired
power, additionally require values of a true (non-zero) difference between population means and of the
significance level ($\alpha$).
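Formula (4) differs from formula (2) only by the factor 2. A minimal sketch, in which the function name and the values σ = 10 and M = 3 are assumptions chosen for illustration:

```python
import math
from statistics import NormalDist


def n_per_group_for_margin(sigma, margin, alpha=0.05):
    """Observations per sample so that the margin of error for mu1 - mu2 is <= margin."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return math.ceil(2 * (z * sigma / margin) ** 2)


# Illustrative (assumed) values: common sigma = 10, margin of error M = 3
n_each = n_per_group_for_margin(sigma=10, margin=3)
print(n_each, "per group,", 2 * n_each, "in total")
```

With the same σ and M, each of the two samples needs about twice as many observations as formula (2) gives for a single sample, so the total is about four times as large.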
Note 1.4: Software and Simulation
A plethora of different statistical software exists for power and sample size calculations ([10]). Most
major statistical packages like Stata and Minitab have built-in routines for the simplest designs (one-
and two-sample models for discrete and continuous data), and the Minitab implementation in particular
is user-friendly. We do not attempt a full review here, but refer to the following
links (as many routines are available on the Internet for free download or free use):
– http://calculators.stat.ucla.edu/powercalc/ (UCLA power calculators³),
– http://www.stat.uiowa.edu/~rlenth/Power/index.html (applets; active researcher),
– http://www.bio.ri.ccf.org/power.html (SAS procedures, incl. UnifyPow macro [6]),
– http://statpages.org/index.html#Power (list of interactive stats webpages).
These lists of available software, plus the built-in and add-on facilities of statistical packages such
as Minitab, Stata, and SAS, cover most reasonably standard situations. Applied reviews of software
and methodology have been published in the literature ([5], [6], [10]). Criticism of the widespread
(mis-)use of power calculations after the study has been carried out, so-called retrospective or post-hoc
power calculations, has also been published ([3],[8]).
A general approach for any particular question related to any specific model is to explore the model’s
properties by simulation: use simulated random data to mimic outcomes from the model, and evaluate
the performance of test statistics (or other statistics) by their outcomes among the simulated data
sets. The practical limitation of this approach is that it must be possible/feasible/easy to simulate
random data from the model; some tips on simulation of random effects models for discrete data are
collected in the conference paper [9]. The paper [2] describes a simulation approach using Stata.

³ As of September 2007, these are disabled and it is not clear whether they will be made available again.
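A minimal sketch of the simulation idea, applied to the paired design of Example 1.2: draw many data sets of differences from N(δ, σ²), apply the t-test to each, and report the fraction of rejections as the estimated power. Everything named here is an assumption of the example: the function names, the seed, the number of simulated data sets, and in particular the t-percentile, which is approximated by a two-term Cornish-Fisher expansion because the Python standard library has no t-distribution. In practice a statistical package would supply the exact percentile.

```python
import math
import random
from statistics import NormalDist


def t_percentile_approx(p, df):
    """Approximate t-distribution percentile (Cornish-Fisher expansion, two terms)."""
    z = NormalDist().inv_cdf(p)
    g1 = (z**3 + z) / 4
    g2 = (5 * z**5 + 16 * z**3 + 3 * z) / 96
    return z + g1 / df + g2 / df**2


def simulated_power(n, delta, sigma, alpha=0.05, n_sim=10000, seed=2007):
    """Fraction of simulated paired t-tests that reject H0: mu = 0."""
    rng = random.Random(seed)
    t_crit = t_percentile_approx(1 - alpha / 2, n - 1)
    rejections = 0
    for _ in range(n_sim):
        d = [rng.gauss(delta, sigma) for _ in range(n)]  # simulated differences
        mean = sum(d) / n
        s = math.sqrt(sum((x - mean) ** 2 for x in d) / (n - 1))
        t_stat = mean / (s / math.sqrt(n))
        if abs(t_stat) >= t_crit:
            rejections += 1
    return rejections / n_sim


# Values from the blood-pressure example: 43 pairs, true difference 3, sigma 10
power_est = simulated_power(n=43, delta=3, sigma=10)
print(power_est)
```

With these inputs the estimate should land close to the 0.485 reported from Minitab, up to Monte Carlo error (roughly ±0.01 with 10,000 simulated data sets). The same loop extends to models without a closed-form power by swapping in a different data generator and test statistic.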
2. An approach to nonstandard sample size calculations
As has hopefully become apparent from the discussion in the previous section, the essential tool in
sample size calculations is the ability to calculate the standard error of certain estimates — involved
in either confidence intervals or test statistics. Such calculations are possible in a much wider range
of models and designs than those reviewed so far, allowing us to handle such models and designs in
much the same way as the basic designs.
Example 2.1: Balanced ANOVA model without interactions
Consider as an example the additive 2-way ANOVA model,
\[ y_{ijk} = \mu + \alpha_i + \beta_j + \varepsilon_{ijk}, \qquad i = 1, \ldots, a;\; j = 1, \ldots, b;\; k = 1, \ldots, c, \tag{5} \]
with indices $i$ and $j$ corresponding to two factors A and B, and with index $k$ corresponding to replications.
If interest is in comparing two levels of factor A (e.g. if A has only 2 levels), the appropriate
statistic is $\mathrm{Est} = \bar{y}_{1\cdot\cdot} - \bar{y}_{2\cdot\cdot}$. The variance of $\mathrm{Est}$ is $\sigma^2 \times 2/(bc)$, where $\sigma$ is the standard deviation of
the error terms $\varepsilon_{ijk}$ in the model. Therefore, sample size calculations based on the precision of the
comparison of two levels of factor A are almost entirely similar⁴ to two-sample sample size calculations, only
replacing the sample size $n$ by the number of observations in each group ($bc$) and the sample standard
deviation by the standard deviation of the error terms. However, the calculations will be based on
an additional assumption, the lack of interaction in (5), and the error standard deviation may
be much less than the standard deviations among observations in each of the two groups (if factor B
has a large effect). If it is desired to base sample size calculations on more than two levels of factor
A, the reduction from model (5) is to a 1-way ANOVA. Models with more factors and unbalanced
designs are dealt with along similar lines.

⁴ More precisely, they are similar except for the impact of the degrees of freedom for the estimated variance in the
model. This can usually be ignored for the calculations as long as the resulting value is checked afterwards and is not
critically small.
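Since bc takes the place of n in formula (4), the number of replications c follows once b is fixed. A minimal sketch; the function name and the values σ = 10, M = 3 and b = 4 are illustrative assumptions, with σ here standing for the error standard deviation of model (5):

```python
import math
from statistics import NormalDist


def replications_for_factor_a(sigma, margin, b, alpha=0.05):
    """Replications c per cell for comparing two levels of factor A within a given margin.

    Uses Var(Est) = sigma^2 * 2 / (b * c), i.e. formula (4) with n replaced by b * c.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    n_per_level = 2 * (z * sigma / margin) ** 2  # required value of b * c
    return math.ceil(n_per_level / b)


# Illustrative (assumed) values: error sd = 10, margin M = 3, factor B with b = 4 levels
c_needed = replications_for_factor_a(sigma=10, margin=3, b=4)
print(c_needed, "replications per cell,", 4 * c_needed, "observations per level of A")
```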
Example 2.2: Balanced ANOVA model with interaction
We extend the model (5) with an interaction between factors A and B,
\[ y_{ijk} = \mu + \alpha_i + \beta_j + \gamma_{ij} + \varepsilon_{ijk}, \qquad i = 1, \ldots, a;\; j = 1, \ldots, b;\; k = 1, \ldots, c. \tag{6} \]
Assume that interest is in the interaction and, for simplicity, that both factors A and B have
only 2 levels ($a = b = 2$). Then the interaction has only a single degree of freedom and can be
represented by the estimate (contrast) $\mathrm{Est} = \bar{y}_{11\cdot} + \bar{y}_{22\cdot} - \bar{y}_{12\cdot} - \bar{y}_{21\cdot}$⁵. The interpretation of $\mathrm{Est}$ is
that it estimates the difference between factor A differences at the two levels of factor B (recall that
the presence of interaction means exactly that factor A differences are not the same at the different
levels of factor B). For sample size calculations based on $\mathrm{Est}$, we calculate $\mathrm{Var}(\mathrm{Est}) = \sigma^2 \times 4/c$.
The formula tells us how to base sample size on a desired confidence interval length for the interaction.
For power calculations, we compare the formula to the two-sample situation: it has a “4” instead
of a “2”, because the interaction is estimated from 4 groups of size $c$ instead of the usual 2 groups.
However, we can “fix” this by a little rewriting: $\mathrm{Var}(\mathrm{Est}) = (\sqrt{2}\,\sigma)^2 \times 2/c$. This shows that sample
size calculations may be based on two-sample formulae/software, if we take as the standard deviation
the error standard deviation multiplied by $\sqrt{2}$ and as the group size the number of replications ($c$)
within each group of the combined factor A×B.
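The variance of the contrast follows because the four cell means are independent, each with variance σ²/c, so the four variances add. The sketch below, with hypothetical function names and illustrative values σ = 10 and M = 3, computes c for a desired margin of error of the interaction contrast in two ways: directly from Var(Est) = 4σ²/c, and through the two-sample formula (4) with σ replaced by √2·σ. The two routes agree.

```python
import math
from statistics import NormalDist


def two_sample_n(sigma, margin, alpha=0.05):
    """Formula (4): group size for a two-sample comparison, before rounding up."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return 2 * (z * sigma / margin) ** 2


def interaction_c(sigma, margin, alpha=0.05):
    """Replications per cell from Var(Est) = sigma^2 * 4 / c, before rounding up."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return 4 * (z * sigma / margin) ** 2


# Illustrative (assumed) values: error sd = 10, margin of error M = 3
direct = interaction_c(sigma=10, margin=3)
via_two_sample = two_sample_n(sigma=math.sqrt(2) * 10, margin=3)
print(math.ceil(direct), math.ceil(via_two_sample))
```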
We next turn to random effects models, and consider as our example a two-level data structure
which could correspond to modelling data for animals within herds. It can be said generally that with
an increasing number of random effects, the computation of variances for the estimates of interest
becomes more difficult, and more importantly the assessment of estimates of the variation in the data
becomes more speculative (and really needs to be based on a previous study of a similar type). The
basic statistical model is (using multiple indices here to denote the two levels of the model instead of
the different factor levels),
\[ y_{ij} = \beta_0 + \beta_1 x_{1ij} + \ldots + \beta_r x_{rij} + A_i + \varepsilon_{ij}, \qquad i = 1, \ldots, a;\; j = 1, \ldots, m, \tag{7} \]
where the random effects $A_i$ are assumed to follow $N(0, \sigma_A^2)$ and the errors $\varepsilon_{ij}$ are assumed to follow
$N(0, \sigma_\varepsilon^2)$. In the context of animals within herds, the index $i$ corresponds to herds and the index $j$
corresponds to animals. For simplicity, we have taken the number of animals in each herd to be the
same for all herds ($m$). The fixed part of the model, involving possibly both factors and regression
variables, is contained in the formalism $\beta_0 + \beta_1 x_{1ij} + \ldots + \beta_r x_{rij}$, where $\beta_0$ is usually termed the
“intercept”, and $\beta_1, \ldots, \beta_r$ are the regression coefficients for the explanatory variables (possibly dummy
variables) $x_1, \ldots, x_r$.
One important feature of two-level models is that explanatory variables may vary at different levels
(in a designed experiment, treatments may be applied at different levels). For example, treatments
may be applied to animals but herd management factors are applied to herds, not animals. Another
way of saying this is that there are different experimental units in the design/model — the lowest level
units (animals) and the highest level units (herds). When carrying out sample size calculations for
multi-level models, it is crucial to correctly identify the experimental unit for the factor of interest.
If the experimental unit is at the lowest level, the upper level(s) may be ignored for sample size
calculations, and the same precision is obtained if replications are obtained at the lowest or at higher
levels. Again in our example, it thus makes no difference for a comparison of animal treatments, based
on model (7), if replications involve more animals per herd or more herds. For explanatory variables at
a higher level, the specific assumptions of the model (7) become important (see the examples below).
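⁵ In a parameterization with the parameter restrictions $\sum_i \alpha_i = 0$, $\sum_j \beta_j = 0$ and
$\sum_i \gamma_{ij} = \sum_j \gamma_{ij} = 0$, we have $\gamma_{11} = \gamma_{22} = -\gamma_{12} = -\gamma_{21}$, and the contrast
$\mathrm{Est}$ estimates $\gamma_{11} + \gamma_{22} - \gamma_{12} - \gamma_{21} = 4\gamma_{11}$.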
Explanatory variables may also both vary at the lowest level and show considerable clustering at
higher levels (typically in observational studies); in such situations, sample size calculations become
quite difficult, and procedures will depend on the specific circumstances of the study design (and
simulation may become an attractive option).
Example 2.3: Two-level model with same correlations within a group
For model (7), if we assume that the random variables ($A_i$) and ($\varepsilon_{ij}$) are all independent, the model
effectively assumes all observations within a level/group to be (positively) correlated and to the same
degree.⁶ This would seem a reasonable model for our example with animals within a herd if there are
no a priori reasons to expect some animals to be more strongly correlated than others, after the fixed
effects in the model are taken into account. Herd factor comparisons will be based on herd averages,
so in order to assess precision we need to compute their variances. Considering the first herd, for ease
of notation, we have from the model formula (7)
\[ \bar{y}_{1\cdot} = \beta_0 + \beta_1 \bar{x}_{11\cdot} + \ldots + \beta_r \bar{x}_{r1\cdot} + A_1 + \bar{\varepsilon}_{1\cdot} = \mu_1 + A_1 + \bar{\varepsilon}_{1\cdot}, \]
writing just $\mu_1$ for the fixed part of the model. By the assumption of all random variables being
independent, we compute
\[ \mathrm{Var}(\bar{y}_{1\cdot}) = \mathrm{Var}(A_1) + \mathrm{Var}(\bar{\varepsilon}_{1\cdot}) = \sigma_A^2 + \sigma_\varepsilon^2/m, \tag{8} \]
where $m$ is the number of animals in the herd. The formula shows that taking more animals within
a herd will only partly reduce the variance of herd averages, and if the variation between herds
($\sigma_A$) is large relative to the variation within herds, it has only little impact. Therefore, the relevant
parameter to adjust to achieve the desired precision of herd comparisons is the number of herds, and
the appropriate standard deviation to use is given by (the square root of) formula (8). To exemplify,
if two types of herds are to be compared, sample size calculations for a two-sample model apply to
determine the number of herds of each type. Values of $\sigma_A$ and $\sigma_\varepsilon$ need to be assessed, possibly from
previous studies. It is also possible to rewrite the right hand side of (8) as $\sigma^2 (1 + \rho(m-1))/m$, where
$\sigma^2 = \sigma_A^2 + \sigma_\varepsilon^2$ is the total variance of each observation, and $\rho = \sigma_A^2/\sigma^2$ is the (intraclass)
correlation between two animals in the same herd. It may be easier or more intuitive to assess these
values than the two variance components.
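Formula (8) and its intraclass-correlation form can be checked numerically. In this sketch the variance components 21.4 and 17.9 with m = 6 are borrowed from the milk-yield estimates reported in Example 2.4, purely as illustrative inputs; the function names and the margin of error M = 3 are assumptions.

```python
import math
from statistics import NormalDist


def herd_mean_variance(var_a, var_e, m):
    """Formula (8): variance of a herd average with m animals per herd."""
    return var_a + var_e / m


def herd_mean_variance_icc(var_total, rho, m):
    """Same quantity written with the total variance and the intraclass correlation."""
    return var_total * (1 + rho * (m - 1)) / m


def herds_per_type(var_a, var_e, m, margin, alpha=0.05):
    """Two-sample formula (4) with herds as units and sd = square root of formula (8)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    sd = math.sqrt(herd_mean_variance(var_a, var_e, m))
    return math.ceil(2 * (z * sd / margin) ** 2)


var_a, var_e, m = 21.4, 17.9, 6  # illustrative inputs, see lead-in
v1 = herd_mean_variance(var_a, var_e, m)
v2 = herd_mean_variance_icc(var_a + var_e, var_a / (var_a + var_e), m)
print(round(v1, 1), round(v2, 1), herds_per_type(var_a, var_e, m, margin=3))
```

With these inputs, doubling m from 6 to 12 lowers the variance of a herd average only from about 24.4 to about 22.9, which illustrates why the number of herds is the parameter to adjust.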
Example 2.4: Two-level model with autoregressive correlations over time
Model (7) is also useful for repeated measures data with a series of observations (typically over time)
on a number of subjects (say, animals). Then the upper level (index $i$) corresponds to animals, and
the lower level (index $j$) corresponds to time. It is often of interest to extend the model from Example
2.3 with additional correlations between the error terms $\varepsilon_{ij}$ within each subject, and we consider here
only the simplest of these, an autoregressive correlation structure (shown here for $m = 4$ observations
per subject),
\[ \mathrm{Corr}(\varepsilon_{i1}, \varepsilon_{i2}, \varepsilon_{i3}, \varepsilon_{i4}) =
\begin{pmatrix} 1 & & & \\ \rho & 1 & & \\ \rho^2 & \rho & 1 & \\ \rho^3 & \rho^2 & \rho & 1 \end{pmatrix}. \]
The model contains as an additional parameter $\rho$, the correlation between errors of two observations
adjacent in time. The variance of subject averages is still of the same form as in (8),
\[ \mathrm{Var}(\bar{y}_{1\cdot}) = \sigma_A^2 + \mathrm{Var}(\bar{\varepsilon}_{1\cdot}), \]
and the last term can be computed to yield for the autoregressive correlation structure:
\[ \mathrm{Var}(\bar{\varepsilon}_{1\cdot}) = \frac{\sigma_\varepsilon^2}{m} + \frac{2 \sigma_\varepsilon^2\, \rho\, (m - 1 - m\rho + \rho^m)}{m^2 (1-\rho)^2}. \tag{9} \]
The remaining part of sample size calculations follows the same route as in Example 2.3. To illustrate
the formulae, we show in the following table some estimated values of variations and the resulting
subject mean variances; data are on milk yield with 6 measures per cow, from [7].

Model               Estimates                                                                        Subject mean variance
Same correlations   $\hat{\sigma}_A^2 = 21.4$, $\hat{\sigma}_\varepsilon^2 = 17.9$                        24.4
Autoregressive      $\hat{\sigma}_A^2 = 18.7$, $\hat{\sigma}_\varepsilon^2 = 20.1$, $\hat{\rho} = 0.239$   23.7

⁶ The technical name for this assumption is a compound symmetry or exchangeable correlation structure.
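The two subject mean variances in the table can be reproduced from formulae (8) and (9). The sketch below uses hypothetical function names and also evaluates the autoregressive variance by direct summation over the covariance matrix, as a check on the closed form.

```python
def ar1_mean_error_variance(var_e, rho, m):
    """Formula (9): variance of the mean of m errors with AR(1) correlation rho."""
    extra = 2 * var_e * rho * (m - 1 - m * rho + rho**m) / (m**2 * (1 - rho) ** 2)
    return var_e / m + extra


def ar1_mean_error_variance_direct(var_e, rho, m):
    """Same variance by summing all entries var_e * rho^|j-k| of the covariance matrix."""
    total = sum(var_e * rho ** abs(j - k) for j in range(m) for k in range(m))
    return total / m**2


m = 6  # measures per cow
same_corr = 21.4 + 17.9 / m  # formula (8) with the "same correlations" estimates
autoreg = 18.7 + ar1_mean_error_variance(20.1, 0.239, m)
check = 18.7 + ar1_mean_error_variance_direct(20.1, 0.239, m)
print(round(same_corr, 1), round(autoreg, 1), round(check, 1))
```

The printed values are 24.4 for the model with same correlations and 23.7 for the autoregressive model, by both routes.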
References
[1] Christensen, R. (1996), Analysis of Variance, Design and Regression, Chapman & Hall / CRC
Press, Boca Raton.
[2] Feiveson, A. H. (2002), Power by simulation, The Stata Journal 2, 107–124.
[3] Hoenig, J. M. & Heisey, D. M. (2001), The abuse of power: the pervasive fallacy of power
calculations for data analysis, The American Statistician 55, 19–24.
[4] Moore, D. S. & McCabe, G. P. (2005), Introduction to the Practice of Statistics, 5th ed., W. H.
Freeman and Company, New York.
[5] Lenth, R. V. (2001), Some practical guidelines for effective sample size determination, The American Statistician 55, 187–193.
[6] Lewis, K. P. (2006), Statistical power, sample sizes, and the software to calculate them easily,
BioScience 56, 607–612.
[7] Nødtvedt A., Dohoo I. R., Sanchez J., Conboy G., DesCôteaux L. & Keefe G. (2002), Increase in
milk yield following eprinomectin treatment at calving in pastured dairy cattle, Vet. Parasitol.
105, 191–206.
[8] Smith, A. H. & Bates, M. N. (1992), Confidence limit analyses should replace power calculations
in the interpretation of epidemiologic studies, Epidemiology 3, 449–452.
[9] Stryhn, H., Dohoo, I. R., Tillard, E. & Hagedorn-Olsen, T. (2000), Simulation as a tool of
validation in hierarchical generalised linear models, IXth International Conference of Veterinary
Epidemiology and Economics, Breckenridge, Colorado, August 2000.
[10] Thomas, L. & Krebs, C. J. (1997), A review of statistical power analysis software, Bull. Ecol.
Soc. Am. 78, 126–139.