Laboratory 9: Introduction to Sample Size Calculation
"How many do I need?" is one of the most common questions addressed to an epidemiologist, and the epidemiologist answers with "What question are you attempting to answer?" Sample size depends on the purpose of the study. More often than not, the investigator has not precisely determined what question is to be answered, and it is essential that this be done before sample size calculations can be performed. There are several key reasons to perform sample size calculations:

1) it forces a specification of expected outcomes,
2) it leads to a stated recruitment goal,
3) it encourages development of appropriate timetables and budgets,
4) it discourages the conduct of small, inconclusive trials, and perhaps most importantly,
5) it reduces the unnecessary use of animals in animal experiments.

When you read studies, you will come across common mistakes related to sample size. These include:

1) failure to discuss sample size at all,
2) unrealistic assumptions about disease incidence and similar inputs,
3) failure to explore sample size over a range of input values,
4) failure to state power for a completed study with negative results (often referred to as post-hoc power, which opens up another major debate), and
5) failure to account for attrition by increasing the sample size above the calculated size.

Variability is an important consideration when calculating sample sizes. The greater the variability among subjects, the harder it is to demonstrate a significant difference. Making multiple measurements on subjects and averaging them can increase precision and may help decrease variability (by decreasing random error). Paired measurements (e.g., baseline and after treatment) can be used to measure change, which can reduce the variability of the outcome and enable a smaller sample size.

Hypothesis Testing

A common use of statistics is to test whether the data we observed are consistent with the null hypothesis. Of course, we never expect the data to match the null hypothesis exactly, as discussed in relation to the Central Limit Theorem. However, if the data we observe would be extremely rare under the distribution proposed by the null hypothesis, then we reject the null hypothesis as being inconsistent with the data and instead accept an alternative hypothesis.

In the application of hypothesis testing procedures, one must bear in mind that a result declared to be statistically significant may not necessarily be of practical significance. In particular, if the sample size is large, the difference between an observed sample statistic and the hypothesized population parameter may be highly significant even though the actual difference between them is negligible. For example, a difference between an observed proportion of $\hat{p} = 0.84$ and a hypothesized value of 0.80 would be significant at the 5% level if the number of observations exceeds 385. By the same token, a non-significant difference may be of practical importance. This might be the case in a medical research program concerning the treatment of a life-threatening disease, where any improvement in the probability of survival beyond a certain time will be important, no matter how small. Generally speaking, the result of a statistical test of the validity of H0 will often be only part of the evidence against H0. This evidence will usually need to be combined with other, non-statistical, evidence before an appropriate course of action can be determined.
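To see where the figure of 385 comes from, here is a minimal Python sketch (not part of the original handout) of the one-sample z-test for a proportion under the normal approximation; the threshold follows from solving $0.04/\sqrt{0.8 \times 0.2/n} \ge 1.96$ for n:

```python
import math

def z_test_proportion(p_hat, p0, n):
    """Two-sided one-sample z-test for a proportion (normal approximation)."""
    se = math.sqrt(p0 * (1 - p0) / n)   # standard error under H0
    z = (p_hat - p0) / se
    # Two-sided p-value from the standard Normal CDF
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# The claim in the text: p_hat = 0.84 vs p0 = 0.80 becomes significant
# at the 5% level once n exceeds roughly 385.
for n in (384, 385, 386):
    z, p = z_test_proportion(0.84, 0.80, n)
    print(f"n = {n}: z = {z:.3f}, p = {p:.4f}")
```

Running this shows the p-value crossing below 0.05 right at that sample size, even though a 4-percentage-point difference may be of no practical consequence.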
Type I and Type II Errors

                          Possible condition of null hypothesis
Possible action           True                     False
Fail to reject H0         Significance (1 - α)     Type II error (β)
Reject H0                 Type I error (α)         Power (1 - β)

Significance (1 - α): also called the confidence level, it refers to the probability that an observed difference in a study reflects a true difference in the population under the assumptions of the null hypothesis, i.e., that the observed difference is not due to chance given that the null hypothesis is true.

Type I error (α): the probability of rejecting a true null hypothesis, or incorrectly concluding that a difference exists (that the null hypothesis is not appropriate) when in fact no difference exists in the population. A false positive decision.

Power (1 - β): the probability of correctly rejecting the null hypothesis when it is in fact false, and thus the probability of observing a difference in the sample when an equal or greater difference is present in the population.

Type II error (β): the probability of accepting a false null hypothesis, or incorrectly concluding that a difference does not exist (that the null hypothesis is appropriate) when in fact a difference really does exist in the population. A false negative decision.

Precision and Validity

In order to make unbiased inferences about the associations between putative causes and outcomes, we must measure the various factors with as little error as possible. The true value of the factor in the population is known as the parameter. This population value is not typically measurable, so instead we obtain an estimate of the parameter through sampling of the population. The overall goal of an epidemiologic study is accuracy in estimation: to estimate the value of the parameter that is the object of measurement with little error. Sources of error can be classified as either random or systematic. Our goal is to reduce both types of error.

[Figure: a 2 × 2 grid contrasting high and low validity with high and low precision.]

Precision (lack of random error)

Precision is a reduction in random error, the variation attributable to sampling. Precision is also referred to as reliability or repeatability, depending on the book you read. I prefer to use precision (and possibly reliability) because it more clearly describes the nature of the problem: how far from the true value the randomly sampled data points fall, strictly due to random error. Repeatability, at least to me, implies whether two observers would generate the same results when measuring the same samples (for instance, comparing interpretations of clinical pathology slides).

As an example of precision, consider the prevalence of a disease in a herd of 1000 animals. Let the true prevalence be 10%, or 100 out of the 1000. If we randomly sample 10 individuals in the herd, will we always get 1 out of 10 with the disease (and therefore an estimate of prevalence equal to 10%)? How much variability will there be in our estimate due strictly to random error?
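A short simulation (not from the handout) makes this concrete: it repeatedly draws samples of various sizes from the hypothetical herd above and reports how much the prevalence estimate varies purely by chance.

```python
import random

random.seed(1)

# The herd from the example: 1000 animals, true prevalence 10%.
herd = [1] * 100 + [0] * 900

def spread_of_estimates(sample_size, n_trials=10_000):
    """Sample without replacement repeatedly; return mean and SD of estimates."""
    estimates = [
        sum(random.sample(herd, sample_size)) / sample_size
        for _ in range(n_trials)
    ]
    mean = sum(estimates) / n_trials
    sd = (sum((e - mean) ** 2 for e in estimates) / n_trials) ** 0.5
    return mean, sd

for n in (10, 50, 200):
    mean, sd = spread_of_estimates(n)
    print(f"n = {n:3d}: mean estimate = {mean:.3f}, SD of estimates = {sd:.3f}")
```

With n = 10, estimates of 0% or 20% are common; the spread shrinks steadily as the sample grows, which motivates the first remedy discussed next.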
How do we improve the precision of our study?

1. Increase the sample size. By increasing the sample size, we reduce random error because we gain confidence that our sample is more representative of the population from which it was taken. There are formulas that relate the size of a study to the study design, the study population, and the desired precision (or power). We will discuss these methods in more detail later in the course.

2. Increase the efficiency of the study. What do we mean by this? We want to maximize the information that each individual in the study contributes to our inferences. For example, suppose we are planning a prospective study to determine the influence of exposure X on disease Y. We randomly enroll the subjects, and these subjects ultimately fall into two groups, those exposed to X and those unexposed to X. The goal is to determine precisely the influence of X on Y. Is it better to enroll more subjects? Not necessarily. Suppose we enroll 10,000 subjects, but it turns out that only 50 were exposed to X (because of the way in which subjects were enrolled). We do not have much information about exposure, and therefore our ability to determine the relationship between X and Y will be imprecise. When evaluating the efficiency of a study, we must consider both the amount of information that each subject provides and the cost of acquiring this information.

Validity (lack of systematic error)

Validity is the ability to measure what we actually think we are measuring. Bias, then, refers to the presence of a systematic error. It does not include the error associated with sampling variability; instead, it is generally considered a flaw in the design, the analysis, or both. It is important to recognize that only rarely can data that are biased by the sampling design be adjusted for in the analysis. It is therefore critical to anticipate and control for biases at the outset of the study. When discussing validity, we usually separate it into two components, internal validity and external validity.

P-value

The p-value represents the probability of obtaining an outcome at least as extreme as the one actually observed, given that the null hypothesis is correct. All of the testing that is performed is based entirely on the assumption that the null hypothesis is correct. We are using the distribution implied by the null hypothesis to determine the probability of observing the data that we actually saw; thus, we can only reject or fail to reject the null hypothesis, but we can never prove that it is true. We are assuming it is true! In addition, the p-value tells us nothing about the probability of other alternative hypotheses being more appropriate.

Confidence Intervals

A confidence interval is a range that we construct in order to capture the desired parameter value with a certain probability. The higher this probability of containing the true parameter value, the wider the interval must be for the same data. The interpretation of a confidence interval is based on the idea of repeated sampling. Suppose that repeated samples of n binary observations are taken, and a 95% confidence interval for p is obtained for each of the data sets. Then 95% of these intervals would be expected to contain the true success probability. It is important to recognize that this does not mean that there is a probability of 0.95 that p lies in the confidence interval. This is because p is fixed in any particular case, so it either lies in the confidence interval or it does not; the limits of the interval are the random variables. Therefore, the correct view of a confidence interval is based on the proportion of calculated intervals that would be expected to contain p, rather than on the probability that p is contained within any particular interval.
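The repeated-sampling interpretation is easy to check by simulation. The following sketch (an illustration, not from the handout) builds 10,000 Wald intervals from samples of n = 300 and counts how many cover the true proportion, here set to 0.41 to match the survey example that follows:

```python
import math
import random

random.seed(2)

def wald_ci(successes, n, z=1.96):
    """95% Wald confidence interval for a binomial proportion."""
    p_hat = successes / n
    half = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

p_true, n, covered = 0.41, 300, 0
for _ in range(10_000):
    successes = sum(random.random() < p_true for _ in range(n))
    lo, hi = wald_ci(successes, n)
    covered += lo <= p_true <= hi

print(f"Proportion of intervals covering p: {covered / 10_000:.3f}")  # ~0.95
```

Each interval either contains p or it does not; it is the long-run proportion of intervals that is close to 0.95.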
As an example, suppose a survey of dairy farmers is conducted to estimate the proportion of herds that vaccinate against BVD. Within a specific geographic region, suppose that out of 300 farms, 123 said that they vaccinate. The 95% confidence interval would be calculated as follows. First, we need a point estimate of the population parameter. In this case, we use:

$\hat{p} = \frac{123}{300} = 0.41$

The variance is then calculated as:

$\operatorname{Var}(\hat{p}) = \operatorname{Var}\!\left(\frac{y}{n}\right) = \frac{1}{n^2}\operatorname{Var}(y) = \frac{1}{n^2}\,np(1-p) = \frac{\hat{p}(1-\hat{p})}{n} = \frac{0.41 \times 0.59}{300}$

Using the normal approximation, we then have the confidence interval:

$\hat{p} \pm z_{1-\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$, or $0.41 \pm 1.96(0.028) = (0.36,\ 0.46)$.

How would we interpret this? A primary motivation for believing that Bayesian thinking is important is that it facilitates a common-sense interpretation of statistical conclusions. For instance, a Bayesian (probability) interval for an unknown quantity of interest can be directly regarded as having high probability of containing the unknown quantity, in contrast to a frequentist (confidence) interval, which may strictly be interpreted only in relation to a sequence of similar inferences that might be made in repeated practice.

Central Limit Theorem (CLT)

In a skewed distribution, the frequency distribution is asymmetrical, with some values being disproportionately more or less frequent. This would seem to preclude the use of the Normal distribution to model a given skewed variable. The sampling distribution of the mean, however, circumvents the apparent problem: regardless of the population's distribution, the distribution of sample means drawn from that population will be approximately Normal. The CLT states that the distribution of sample means from repeated samples of n independent observations approaches a Normal distribution with mean $\mu$ and variance $\sigma^2/n$:

$E(\bar{X}) = \mu \quad\text{and}\quad \operatorname{Var}(\bar{X}) = \frac{\sigma^2}{n}$

This implies that as n increases, the variance of $\bar{X}$ decreases. These parameters can be transformed back to the Standard Normal, such that

$\frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$ is N(0, 1).

Notice that this theorem requires no assumption about the form of the original distribution; it works for any configuration. What makes this theorem so important is that it holds regardless of the shape or form of the population distribution: if the sample size is large enough, the sampling distribution of the means is always approximately Normal. The major importance of this result is for statistical inference, as the CLT provides us with the ability to make even more precise statements. Instead of relying on Chebyshev's theorem, which ensures only that at least 75% of the observations are within 2 standard deviations of the mean, we know from mathematical statistics that the proportion of observations falling within 2 or 3 standard deviations of the mean of a Normal distribution is 0.95 or 0.997, respectively.

One place where the Normal approximation can be used is with the Binomial distribution. As n becomes large, Binomial probabilities can become difficult to calculate. Under the CLT, the Normal can be used to approximate the Binomial, and this approximation improves as n increases. As a general rule, the approximation should only be used when np (and nq) is greater than 5. The CLT states that the distribution of

$W = \frac{Y - np}{\sqrt{np(1-p)}} = \frac{\hat{p} - p}{\sqrt{p(1-p)/n}}$

is N(0, 1). Thus, if n is sufficiently large, the distribution of Y (the number of successes) is approximately N[np, np(1 − p)].

Remember that the disease outcomes we are measuring are often rare. This may be especially true if we are studying mortality due to some factor. Thus, the Normal approximation may not be the ideal way in which to approximate the Binomial, and instead exact methods may be required. This will become clear when we discuss confidence intervals later in this section.
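A quick check of how the approximation improves with n (an illustrative sketch, not from the handout), comparing the exact Binomial cumulative probability with its Normal approximation at a point near the mean:

```python
import math

def binom_cdf(k, n, p):
    """Exact Binomial P(Y <= k)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def normal_cdf(x, mu, sigma):
    """Normal P(Y <= x)."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

p = 0.3
for n in (10, 50, 200):
    k = int(n * p)                            # a point near the mean
    mu, sigma = n * p, math.sqrt(n * p * (1 - p))
    exact = binom_cdf(k, n, p)
    approx = normal_cdf(k + 0.5, mu, sigma)   # continuity correction
    print(f"n = {n:3d}: exact = {exact:.4f}, normal approx = {approx:.4f}")
```

For small n or very small p (rare outcomes), the disagreement is exactly why exact methods may be preferred.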
Sample size to estimate a proportion (e.g., prevalence)

When one wants to investigate the presence of disease, we are dealing with a presence/absence (Binomial) situation, because each selected element is either infected or not infected. One property of the Binomial distribution is that the variance of a proportion equals p(1 − p), where p is expressed as a proportion (or p(100 − p) when p is expressed as a percentage). The standard deviation is, of course, the square root of p(1 − p). We can then use the following formula to determine the sample size, with the finite population correction shown on the right:

$n = \frac{Z_{1-\alpha/2}^2\,\hat{P}\hat{Q}}{L^2} \approx \frac{N Z_{1-\alpha/2}^2\,\hat{P}\hat{Q}}{L^2 (N-1) + Z_{1-\alpha/2}^2\,\hat{P}\hat{Q}}$

n = estimated sample size for the study.
Z_{1−α/2} = the value of Z that puts α/2 in each tail of the Normal curve for a 2-tailed test. If a 1-tailed test is used, this should be Z_{1−α}. If α, the type I error, is 0.05, then the 2-tailed Z is 1.96. α specifies the probability of declaring a difference to be statistically significant when no real difference exists in the population.
P̂ = the best guess of the prevalence of the disease in the population.
Q̂ = 1 − P̂.
L = allowable error or required precision.
N = population size. If the population size is large, the correction is irrelevant (equivalent to sampling with replacement).

A property of the variance of a proportion is that it is symmetric around its maximum at p = 0.5 (or 50%). Hence, the sample size is maximal when p is estimated to be 50%, and this value should be used when there is no idea of the actual proportion. Be careful with the value you select for L, especially at low or high prevalences. For example, one is usually not interested in whether the prevalence equals 4% ± 10%, but rather 4% ± 2%.

Example: a farmer raising veal calves asks you to determine the prevalence of salmonellosis due to Salmonella dublin in veal calves. It is a large unit with well over 1000 calves. The prevalence on such farms is known to range from 0 to 80%. It is prudent to put the estimated prevalence at 50%, because nothing is known about the actual prevalence except that it is between 0 and 80%; by choosing 50%, you will end up with the largest possible sample size for the chosen values of L and Z. Now you need to put values on L and Z, so assume you choose 5% and 1.96. Thus, you are trying to estimate, with 95% confidence, a true prevalence of 50% plus or minus 5%. Now calculate n. Answer: 385 (or 278 if you use the finite population correction with N = 1000).
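A small helper (an illustrative sketch, not from the handout) reproduces both answers; rounding is always upward, since a fraction of an animal cannot be sampled:

```python
import math

def n_for_proportion(p, L, z=1.96, N=None):
    """Sample size to estimate a proportion p to within +/- L.
    If a population size N is given, the finite population correction is applied."""
    if N is None:
        n = z**2 * p * (1 - p) / L**2
    else:
        n = N * z**2 * p * (1 - p) / (L**2 * (N - 1) + z**2 * p * (1 - p))
    return math.ceil(n)

print(n_for_proportion(0.5, 0.05))           # 385
print(n_for_proportion(0.5, 0.05, N=1000))   # 278 with the correction
```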
Sample size to estimate differences between proportions

Suppose you want to compare two antibiotics. Two groups of animals are infected with an appropriate pathogen, and the percentages of recovery in the two groups are compared. Question: how many animals should be included in each group? The following formula can be used (Fleiss, 1981):

$n = \frac{\left(Z_{1-\alpha/2}\sqrt{2\bar{P}\bar{Q}} + Z_{1-\beta}\sqrt{P_e Q_e + P_c Q_c}\right)^2}{(P_e - P_c)^2}$

n = estimated sample size for each of the exposed and unexposed groups.
Z_{1−α/2} = the value of Z that puts α/2 in each tail of the Normal curve for a 2-tailed test. If a 1-tailed test is used, this should be Z_{1−α}. If α, the type I error, is 0.05, then the 2-tailed Z is 1.96. α specifies the probability of declaring a difference to be statistically significant when no real difference exists in the population.
Z_{1−β} = the value of Z that puts β in one tail of the Normal curve. If β, the type II error, is 0.2, then Z = 0.84. β specifies the probability of declaring a difference to be statistically nonsignificant when there is a real difference in the population.
P_e = estimate of the response rate in the exposed group, or the exposure rate in cases.
P_c = estimate of the response rate in the unexposed group, or the exposure rate in noncases.
P̄ = (P_e + P_c)/2; Q̄ = 1 − P̄.

Example: a pharmaceutical company has developed a brand new antibiotic against pathogen X. No other antibiotics are available, so no comparison can be made with existing antibiotic treatments. However, it is known from field data that 70% of animals recover from the disease (although the effects on production are tremendous). It is expected (hoped) that after use of the antibiotic 95% of the animals will improve and that the duration of the disease will be shorter as well. Thus, P_e = 0.95 and P_c = 0.70. Choosing a two-sided confidence of 95% and a power of 80%, the values for Z_{1−α/2} and Z_{1−β} are 1.96 and 0.84, respectively. Using the formula, calculate n for each group. Answer: 36 (or 43 in EpiInfo, which applies a continuity correction).
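The calculation as a short Python sketch (the function name is mine, not from the handout):

```python
import math

def n_per_group(pe, pc, z_alpha=1.96, z_beta=0.84):
    """Fleiss (1981) sample size per group for comparing two proportions."""
    p_bar = (pe + pc) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(pe * (1 - pe) + pc * (1 - pc))) ** 2
    return math.ceil(numerator / (pe - pc) ** 2)

# Antibiotic example: 95% vs 70% recovery, two-sided alpha = 0.05, power = 80%.
print(n_per_group(0.95, 0.70))   # 36 per group
```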
Sample size to detect a disease in a population

Suppose one is interested in the percentage of farms that are infected with a pathogen X (prevalence based on whether or not the pathogen is present on a farm). A farm is considered infected when at least one of its animals is infected. First, the appropriate number of farms (the units of concern) is randomly chosen; second, the status of each farm (infected or not) is determined. The proportion of infected animals per farm is not of major interest. Ideally, we would screen all the animals on a farm, but often this is not necessary. Suppose it is known that if the disease is present, about 50% of the animals are likely to be positive. By sampling one animal, you have a 50% probability of correctly concluding that a truly positive farm is positive. In general, one aims at a higher probability of classifying a positive farm correctly (e.g., 95%). By selecting 2 animals, the probability increases to 75% (25% of drawings show two negatives); 3 animals yield 87.5%, 4 animals 93.75%, and 5 animals 96.875%. Thus, between 4 and 5 animals should be sampled to detect positive farms with a probability of 95%, if 50% of the animals are truly diseased. The calculations can be put into a general formula (Cannon and Roe, 1982):

$n = \left(1 - (1-p)^{1/d}\right)\left(N - \frac{d-1}{2}\right)$

n = sample size
p = probability of finding at least one case (= confidence level, e.g., 0.95)
d = number of (detectable) cases in the population
N = population size

If the test used to evaluate the status of the animals is not 100% sensitive, d is equal to the number of diseased or infected animals multiplied by the sensitivity of the test. It is assumed that no false positives are present, or that they are ruled out by confirmatory tests.

Example: suppose you want to detect whether or not a flock of N = 1000 animals is positive for pathogen X. If X is present, you suspect that about 50% of the animals will be infected, so d = 500. Setting p to 0.95, n equals $(1 - 0.05^{1/500})(1000 - 499/2) = 4.48$, rounded up to 5. Is this number much affected by N? (No, because the prevalence is rather high.)

A similar formula can be used to determine the maximum number of positives (d) in a population, given that all n samples tested negative:

$d = \left(1 - (1-p)^{1/n}\right)\left(N - \frac{n-1}{2}\right)$

Example: suppose that 1,000 slaughter cows tested negative for E. coli O157:H7, and the total number of cows slaughtered amounted to 1 million. What is the maximal prevalence in the 'population' of slaughter cows? What is the maximal prevalence if the sensitivity of the test is only 85%? Answers: if p is set to 0.95, then d equals 2,989, which is about 0.3%; correcting for sensitivity, 0.3/0.85 ≈ 0.35%.
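Both directions of the Cannon and Roe formula in a short sketch (function names are mine); it reproduces the flock and slaughter-cow answers:

```python
import math

def n_to_detect(N, d, p=0.95):
    """Animals to sample to find at least one case with probability p,
    given d detectable cases in a population of size N (Cannon and Roe, 1982)."""
    return math.ceil((1 - (1 - p) ** (1 / d)) * (N - (d - 1) / 2))

def max_positives(N, n, p=0.95):
    """Maximum number of cases consistent with all n samples testing negative."""
    return math.floor((1 - (1 - p) ** (1 / n)) * (N - (n - 1) / 2))

print(n_to_detect(1000, 500))           # 5, as in the flock example
print(max_positives(1_000_000, 1000))   # 2989 slaughter cows, about 0.3%
```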
Sample size formula to estimate a mean

To calculate a sample size for a mean obtained from a Normal distribution, we first need estimates of the mean and its standard deviation (S). Second, we need to define an interval L, indicating that with probability (1 − α) the true population mean will lie within the sample mean ± L; this interval is an indication of the precision of the estimate. Third, we have to decide on (1 − α). There is no general rule for this decision, but 95% is commonly used. From this we can determine the corresponding Z value: if (1 − α) = 95%, then the significance level is 5% and the two-tailed Z is 1.96. This value indicates that 95% of all sample means will fall within the interval $\mu \pm 1.96\,S/\sqrt{n}$. The confidence interval (CI) for a mean, obtained by drawing elements from a Normal distribution, can be written as:

$CI = \bar{x} \pm Z_{1-\alpha/2}\,\frac{S}{\sqrt{n}}$

If the part after the ± sign is denoted L, then n is calculated as:

$n = \frac{Z_{1-\alpha/2}^2\,\hat{S}^2}{L^2}$

n = estimated sample size for the study.
Z_{1−α/2} = the value of Z that puts α/2 in each tail of the Normal curve for a 2-tailed test. If a 1-tailed test is used, this should be Z_{1−α}. If α, the type I error, is 0.05, then the 2-tailed Z is 1.96. α specifies the probability of declaring a difference to be statistically significant when no real difference exists in the population.
Ŝ = estimate of the standard deviation in the population.
L = allowable error or required precision.

Example: suppose you want to estimate growth per day in veal calves. Growth should be around 1000 g/day, and an estimate of S is 250 g. How many calves should be weighed to check whether or not the mean growth of a large unit is between 950 and 1050 g/day? A confidence level of 95% is required. (If you want more practice, also try calculating the sample size required when (1) the SD increases to 375, (2) the confidence level decreases to 90%, and (3) L is changed from 50 to 100.) The answers you should get are: 96; 216; 67; 24.

Sample size to estimate differences between means

Sometimes an investigator is interested in differences between groups of animals, e.g., differences in milk production between mastitic cows and healthy cows. Suppose milk production is a Normally distributed parameter. Here we can perform a one-tailed test, because we know that mastitic cows will show reduced milk production. This has an impact on the value of Z: the one-sided Z for α = 0.05 is equal to the two-sided Z for α = 0.10. In order to make the sample size calculation we need: an estimate of the difference between the groups, $\delta = \bar{X}_e - \bar{X}_c$; the standard deviation of the trait; the significance level α; and the power of the test, i.e., the probability (1 − β) of obtaining a significant result if the true difference equals δ. The sample size formula is written as:

$n = 2\left(\frac{(Z_{1-\alpha/2} + Z_{1-\beta})\,S}{\bar{X}_e - \bar{X}_c}\right)^2$

n = estimated sample size for each of the exposed and unexposed groups.
Z_{1−α/2} = the value of Z that puts α/2 in each tail of the Normal curve for a 2-tailed test, or α in one tail for a 1-tailed test. If α, the type I error, is 0.05, then the 2-tailed Z is 1.96. α specifies the probability of declaring a difference to be statistically significant when no real difference exists in the population.
Z_{1−β} = the value of Z that puts β in one tail of the Normal curve. If β, the type II error, is 0.2, then Z = 0.84. β specifies the probability of declaring a difference to be statistically nonsignificant when there is a real difference in the population.
S = estimate of the standard deviation common to both the exposed and unexposed groups.
X̄_e = estimate of the mean outcome in the exposed group or in cases.
X̄_c = estimate of the mean outcome in the unexposed group or in noncases.

The constant 2 in the formula arises from the assumption that S is equal in both groups. The calculated sample size is the number of individuals per group.

Example: suppose daily milk production in healthy cows is 25 liters, and in mastitic cows it is assumed to be reduced by 10%, so δ = 2.5 liters. The standard deviation of daily milk production is known to be 6 liters. Given a one-sided α of 0.05, Z = 1.64; a power of 80% gives Z = 0.84. Determine the sample size per group. (For more practice, vary the values of S, δ, and α to your liking.) Answer for the example: 72 per group.

In conclusion, remember that formulae do exist to estimate sample sizes, and they should always be applied in prospective studies. These sample size calculations are only guidelines to how many units should be investigated; the result is not a strict number, as the assumptions underlying the calculations will almost never exactly mirror the true values. Many other formulae for the calculation of sample sizes exist (e.g., for a different hypothesis, a somewhat different design, or different types of data). However, the general principle is always the same, and therefore only the most basic and most frequently used sample size formulae were presented here.