Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Interpreting Statistics in the Urological Literature Charles D. Scales, Jr., Bercedis Peterson and Philipp Dahm* From the Departments of Surgery (Division of Urology) and Biostatistics and Bioinformatics (BP), Duke University Medical Center, Durham, North Carolina Purpose: Knowledge of statistical terminology and the ability to critically interpret research findings are critical skills in the practice of evidence based medicine. Materials and Methods: We provide a series of nontechnical explanations of basic statistical concepts commonly encountered in the urological literature. In addition, we provide examples of common statistical pitfalls to increase awareness of limitations to consider when applying research findings to practice. Results: Statistical goals encountered in the urological literature can be broadly categorized as summarizing outcome variables, comparing 2 or more groups, measuring association among variables or predicting 1 variable from another. Errors frequently include the use of an inappropriate test for the data type of interest or using statistical testing in a manner that increases the likelihood of false-positive results. Such errors pose a threat to the validity of research findings and they may undermine study conclusions. Conclusions: Editors and reviewers alike should strive for high standards of statistical analysis and reporting, and promote the publication of high quality evidence in the urological literature. The understanding of basic statistical concepts and the principles of the hypothesis testing framework is essential to the critical appraisal process and, therefore, important to all urologists. Statistical literacy should be fostered through educational materials and courses in the urological community. Key Words: urology, statistics, evidence-based medicine nowledge of statistical terminology and methods has a pivotal role in the conduct of clinical research and it is an essential tool for the urologist critically appraising the literature.1– 4 With the growing importance of evidence based practice the producers and consumers of biomedical research have a strong interest in statistical literacy.5 Curricular trends in undergraduate and graduate medical education now emphasize a basic knowledge of clinical research and statistical methods as part of a broader preparation for evidence based patient care.6 – 8 Despite these efforts recent evidence suggests that statistical literacy and a knowledge of clinical research methods among urologists may be suboptimal, as among other physicians. Studies in the urological literature and other subspecialties frequently contain statistical errors or they are underpowered.9 –12 These errors threaten to undermine the validity of published studies and flawed investigations may influence medical practice in undesirable ways.13,14 Given this context, we provide nontechnical explanations of basic statistical concepts commonly encountered by urologists. To increase awareness of the limitations of statistical testing we provide practical examples of common pitfalls that urologists should consider when applying research findings to patient care. K Submitted for publication September 23, 2005. * Correspondence: Department of Urology, University of Florida, College of Medicine, Gainesville, Florida 32610 (telephone: 352-2736815; FAX: 352-392-8846; e-mail: [email protected]). 0022-5347/06/1765-1938/0 THE JOURNAL OF UROLOGY® Copyright © 2006 by AMERICAN UROLOGICAL ASSOCIATION RESULTS Outcome Measures Three types of outcome measures or variables are common in biomedical research, including continuous, categorical and time to event.15 Continuous variables have values, such that the distance between the values 3 and 5, for example, is the same as the distance between 20 and 22 or the distance between 66 and 68. Examples of continuous measures are age in years, tumor size in cm and length of hospital stay in days. Categorical variables, of which dichotomous and ordinal are special cases, are measures that have 2 or more categories with no intrinsic numerical value, eg renal cell carcinoma histology (clear cell, papillary and other). Dichotomous (binary) variables have only 2 categories, eg side right/left or stone recurrence yes/no. Many specialized statistical procedures, such as logistic regression, are used for the analysis of dichotomous data. Ordinal variables are categorical variables in which the categories can be ordered. Examples in the urology literature are Gleason score and the visual analog pain scale. Time-to-event data are frequently found in oncological studies of mortality or disease recurrence. Examples are time to death or biochemical failure as well as time to stone passage and time to urethral stricture recurrence.16,17 Summarizing Continuous Data The first question that should arise in the analysis of a continuous outcome variable is whether the data have a normal (also called Gaussian) distribution. Since normal distributions have a bell-shaped symmetry, an approximate assessment of normality can be made by making a histo- 1938 Vol. 176, 1938-1945, November 2006 Printed in U.S.A. DOI:10.1016/j.juro.2006.07.001 INTERPRETING STATISTICS IN UROLOGICAL LITERATURE gram (fig. 1). The assessment of normality determines the appropriate measures for summarizing the data. The most commonly used summary statistics are measures of central tendency, eg the mean, and measures of spread, eg the SD. These statistics are most appropriately used when there are at least 20 data points.15 The mean is a way to characterize the center of a distribution and it is calculated as the sum of individual values divided by the number of observations. SD or , often called sigma, is a measure of the variability (scatter) of a distribution and it describes the degree to which individual values vary from the sample mean. SD is independent of sample size and it is large if the data are highly scattered. When the data are normally distributed, all values within 1 SD encompass 68% of the data and all values within 2 SD encompass 95% of the data. For example, a mean of 100 and an SD of 15 tell us that 68% of the observations are between 85 and 115. It also tells us that values more extreme than 70 and 130 (2 SDs from the mean) are uncommon. In contrast, SEM measures variability in the distribution of sample means from the population mean.15 In other words, if several study samples were available, it describes how the sample means would vary for a given SD. SEM can be calculated from SD using the formula, SEM ⫽ SD/公n, where n represents sample size. It is important to realize that SD and SEM are 2 distinctly different measures. SD describes the scatter of the data in a given sample, while SEM estimates the scatter of sample means in a population.18,19 A common pitfall is the use of SEM instead of SD to describe a data set, thereby suggesting a lesser degree of data dispersion. However, SEM is not a measure of data scatter and, therefore, it is misleading. Generally speaking SDs are most appropriately presented along with means in the table of results that describes the study population. SEMs are used to construct CIs and they are presented in the context of statistical hypothesis testing along with sample means, as described. Summarizing Nonnormal Continuous Data and Ordinal Data For nonnormal continuous data or ordinal data the median and IQR are often the preferred descriptive statistics. In these cases the median is preferred over the mean because the mean may be affected by outliers or a skewed distribution. For example, let us consider a simple case of 5 men with prostate cancer who have PSA 3.5, 3.5, 4.0, 4.5 and 4.5 ng/ml, respectively. In this case the mean and median are identical (4.0 ng/ml). However, if instead we have a group of men with PSA 3.5, 3.5, 4.0, 4.5 and 15.0 ng/ml, respectively, the mean becomes 6.1, while the median remains unchanged (4.0 ng/ml). In this case when an outlier is present, the median is less affected than the mean. IQR is the range of values from the 25th to the 75th percentile of the group of subjects and it gives the reader a sense of the variability in the data. Although SD characterizes the variance of distribution even for nonnormal distributions, it cannot be interpreted in the same way. SD values are easily interpretable only if the distribution is normal. B Normal with N=1000 Normal with N=200 0 0 50 10 20 100 30 150 A 1939 20 30 40 50 30 35 40 AGE Skewed to right with N=1000 50 Bi-modal with N=1000 D 0 0 50 50 100 100 150 150 C 45 AGE 40 45 50 55 AGE 60 65 70 20 30 40 50 60 70 AGE FIG. 1. Histograms show hypothetical distributions. Of 2 histograms with normal distributions (A and B) 1 appears more normal (A) only because sample size is larger. Asymmetrical distribution is said to be skewed to right (C). Bimodal distribution has high point at around age 35 years and another at around age 65 years (D). 1940 INTERPRETING STATISTICS IN UROLOGICAL LITERATURE Summarizing Unordered Categorical Data Unordered categorical data are often summarized using a frequency table, as exemplified by the 2 ⫻ 2 table (fig. 2). In this table the frequency of acute urinary retention (yes/no) is compared between 2 interventions. The frequency table completely describes the data, and reporting absolute numbers and percents, rather than percents alone, is encouraged by medical journal editors.20 –22 A final step in summarizing dichotomous outcomes is to give a CI around estimated percents, eg around the estimated percent of patients with acute urinary retention in each of 2 intervention groups. How to Interpret a CI CIs are used to describe the estimated range of values that is likely to include an unknown population parameter, such as a mean, proportion or HR.23 Thus, a 95% CI gives a range of values that would include the true population parameter in 95% of random samples of the same size from the population of interest. CIs make sense only if the sample is representative of a larger population, which is ensured if the sample is truly randomly selected. The greater the dispersion of the data and the smaller the sample size, the wider the CI. A wide CI indicates that the sample data are insufficient to precisely estimate the outcome measure in the overall population.23 How to Interpret RRs While the display and analysis of continuous outcomes, eg operative time, are relatively intuitive, the interpretation of percents for dichotomous end points, ie recurrence (yes/no), can be challenging.24 Results of dichotomous end points are frequently presented in table form and further described using the terms RR, ARR, RRR, OR and number needed to treat. In studies of therapy understanding these closely related measures of treatment effects greatly aids in the interpretation of study results.25,26 To review these terms consider an example from PLESS, which investigated the effect of finasteride on the risk of acute urinary retention in patients with benign prostatic hyperplasia.27 In this double-blind study patients were randomized with equal probability to receive finasteride (treatment) or placebo (control) and after 4 years acute retention episodes in each group were tabulated (fig. 2). The most interesting result in this trial is the ARR or the difference in the proportion of patients expe- Exposure Finasteride Placebo Outcome (Acute Urinary Retention) Yes No 42 (a) 1461 (b) 99 (c) 1404 (d) 141 2865 Total 1513 (n1) 1503 (n2) Absolute Risk (AR) Finasteride= a/n1 = 0.028 Absolute Risk (AR) Placebo= c/n2 = 0.066 Absolute Risk Reduction (ARR)Finasteride= c/n2 – a/n1 = [ARPlacebo-ARFinasteride] = 0.038 Relative Risk (RR)Finasteride = (a/n1)/(c/n2) = [ARFinasteride/ARPlacebo] = 0.42 Relative Risk Reduction (RRR)Finasteride = (c/n2) – ((a/n1)/(c/n2)) =[ARPlacebo-ARFinasteride/ ARPlacebo] = 0.58 Number Needed To Treat (NNT)Finasteride = 1/ARR = 26.3 OddsFinasteride= a/b = 42/1461 = 0.0287 OddsPlacebo= c/d = 99/1404 = 0.0705 Odds Ratio (OR) = (a/b) ?(c/d) = ac / bd = (42*1404) ?(1461*99) = 0.41 FIG. 2. PLESS results27 riencing retention in each group. In this example the ARR was 3.8%. Another way to describe the difference in outcomes is with the RR, which is the probability of experiencing the outcome in the treatment group divided by the probability of experiencing the outcome in the control group, which was 0.42 in this trial. The RRR in this case was 58%. The problem with the RR and RRR is that they do not reflect the risk of the event without therapy and, therefore, they cannot discriminate large treatment effects from small ones. In contrast, ARR preserves the baseline risk. Thus, it is often considered a more meaningful measure of treatment effect than RRR. Finally, the number of patients who must be treated to prevent 1 additional patient from having the outcome of interest may be calculated as the inverse of the ARR, which in this case was 26.3 or 27 patients. It is helpful for the reader to be familiar with these measures and be able to convert one into another. Odds, OR and the Comparison with RR Odds are defined as the probability of experiencing an outcome divided by the probability of not experiencing the outcome (fig. 2).28 In the PLESS study the odds of acute urinary retention in the finasteride and placebo arms were 0.029 and 0.071, respectively.27 An OR is calculated by dividing the odds of patients in the study group by the odds of patients in the control group or vice versa. In our example the OR for acute urinary retention was 0.41 in the treatment arm, which was similar to the calculated RR. However, ORs and RRs are only similar if the overall event rate is low.29 Therefore, only if an event is relatively rare can RRs and ORs be used interchangeably. Figure 3 shows how probabilities and odds diverge as the event rate increases. ORs are the preferred method of displaying the results of case-control studies, meta-analyses and logistic regression analyses. How to Interpret a KM Curve KM curves are common in the urological literature, particularly in studies of the prognosis of and therapy for malignancies.9 A KM curve represents the estimated distribution of a time-to-event outcome.30 –32 These survival curves show not only time to death, but also time to any event. For example, a KM curve can be used to estimate time to PSA recurrence in patients with prostate cancer or time to spontaneous ureteral calculous passage. The starting point or time zero marks the beginning of the observation period. Usually the start of a KM curve is set to 100% since all patients are alive or recurrence-free initially. In some cases KM curves start at 0% to illustrate the time to a desired outcome, ie at the beginning of the observation period all patients have stones. The specific advantage of a KM curve is that it allows study subjects who do not experience the event of interest to contribute information to the estimation of the time-to-event curve.6 In addition, it allows patients with incomplete followup or who experience competing events to contribute to the curve. These patients are censored at the time that followup ceases. Censoring should be noninformative, that is censored patients should be no more or less likely to experience the event than noncensored patients.33 INTERPRETING STATISTICS IN UROLOGICAL LITERATURE Assume a hypothetical surgical series comparing patients with node positive and node negative renal cell carcinoma (fig. 4). The KM curve remains horizontal when a patient is censored. However, the number of patients at risk decreases and, therefore, the next event results in a larger percent decrease in survival than the previous one. In general each step down on a survival curve represents the occurrence of an event. If 2 patients experience the event of interest, the step down is twice as large as for 1 event. KM curves should show the number of patients at risk for each time interval and indicate when a patient is being censored, usually with a vertical tick mark on the curve. Showing censored subjects enables the reader to deduce how the number of patients at risk has decreased, particularly when 2 or more groups are compared. When patient groups become small, the proportion of event-free patients become less reliable.34 Statistical Hypothesis Testing, Type I and II Errors, and Sample Size Calculations Statistical hypothesis testing provides a formalized framework to evaluate scientific assumptions.2,6,35 Hypothesis testing involves formulating an H0, which is the statement that the investigator seeks to disprove, and an HA. For example, suppose an investigator wishes to compare the means of 2 groups. H0 would state that there is no difference between the means and HA would state that there is a difference. It is important to realize that, based on this framework, H0 is never proven to be true. It is only rejected or not rejected in favor of HA. The hypothesis testing framework allows the investigator to determine whether observed differences are greater than would be expected by chance alone, ie a statistically significant difference. The Appendix shows a court analogy to help illustrate this point. Investigators apply statistical methods to determine whether it is reasonable to reject H0 based on the prespecified threshold probability ␣. The ␣ is the threshold probability of obtaining a false-positive result, ie to erroneously reject H0 and accept HA, although in reality H0 is true. This type of mistake is referred to as type I error. Its probability increases with the number of statistical tests that are performed or so-called multiple testing. While ␣ is by convention commonly set at 0.05, it is important to realize that the choice of this particular value is arbitrary and a smaller, eg ␣ ⫽ 0.01, or greater, eg ␣ ⫽ 0.10, threshold may be appropriate under certain circumstances. The other incorrect conclusion investigators can draw from a study is to fail to reject H0 when in reality the null is not true, ie a false-negative result.36,37 This is called a type II error and it is referred to as . The statistical power of a study (1 ⫺ ) is the probability of finding a statistically significant result, ie of rejecting H0, when H0 is indeed truly false. It is commonly set at 80% ( ⫽ 0.20). The power of a study depends on the variance (scatter) of the outcome variable, the chosen level of ␣, sample size and effect size. For example, effect size in a RCT that compares the means of 2 different treatments is the minimum difference between group means that would be considered clinically meaningful. Underpowered studies, ie studies enrolling too few patients to detect a clinically meaningful difference, may be considered a waste of resources or even unethical since they potentially expose subjects to the risks of a treatment of unproven efficacy but are unlikely to yield definitive results.13,38 – 40 When a reader encounters a negative study or one in which no statistically significant difference was found between groups, the risk of a type II error should immediately come to mind. Authors should provide calculations to demonstrate the minimal detectable difference when presenting a negative study to inform the reader about the risk of a type II error. Table 1 shows an example of sample size calculation. Consider an RCT to compare the incidence of urinary leakage in patients with partial nephrectomy treated with a CSD or PR. The desired ␣, the statistical power and the effect size influence the necessary sample size (table 1). Specifically choosing a smaller ␣, a smaller , ie greater power, or a smaller effect size increases the sample size requirement. A noninferiority trial requires a larger sample size than a superiority trial, although they have exactly the same ␣ and 1.0 .9 Proportion of Patients FIG. 3. Hypothetical example shows how odds and probabilities of dying (filled circles) vs not dying (open circles) of prostate cancer are calculated. In several hypothetical series of 20 patients differing in number of cancer related deaths, as long as event rate is low (Series #1 and #2), calculated odds and probabilities are similar. With increasing event rates odds and probabilities diverge. Probabilities of 0.5 and 0.75, ie 50% and 75% of patients dying of prostate cancer (Series #4 and #5), correspond to odds of 1 and 3, respectively. Odds can be calculated from probabilities and vice versa. 1941 Node Negative (n=15) .8 .7 .6 .5 .4 .3 Node Positive (n=15) .2 .1 0.0 0 15 At Risk: 15 6 15 15 12 15 12 18 13 10 24 10 7 30 6 6 36 2 4 42 0 3 48 Months 0 Pts 1 Pts FIG. 4. Hypothetical survival curve compares 2 groups of patients with renal cell carcinoma, including time to death in 15 each with node negative (solid curve) and node positive (dotted curve) disease. In node negative group first patient dies after 12 months. At that point all patients in that group are still followed and curve drops by 1/15 to 0.93, ie 1.0 – 0.07. Second patient dies after 25 months, at which time 5 in that group have been censored (vertical tick marks). Since only 9 patients are followed at that point, death of 1 patient causes curve to drop by 1/9 from 0.93 to 0.82, ie 0.93 – 0.11. Survival curve then continues horizontally. No further patient dies and additional 8 are censored for total of 13 censored. In node positive group 7 patients die and each causes survival curve to drop to relatively greater extent due to smaller number followed. Median survival time in node positive group is calculated as 25 months, whereas it cannot be calculated in node negative group since fewer than half of patients are dead at study end. 1942 INTERPRETING STATISTICS IN UROLOGICAL LITERATURE about statistical significance, and so more complicated methods are often preferred, particularly when the number of hypotheses is large. Subset analyses are similar to multiple comparisons in that they increase the probability of spurious results and require cautious interpretation. They are common in the urological literature. Ideally subset analyses should be prespecified and appropriately powered. For example, if the overall outcome of a trial shows a difference between study groups, subset analysis might be performed to identify patients who may particularly benefit. However, subset analyses are frequently performed even if the overall study outcome is negative (no statistically significant difference between groups) to show a potential benefit in at least a certain subset of patients. This practice implies testing multiple hypotheses, which risks a high rate of false-positive findings. On the other hand, concluding from subset analyses that there is no difference between these subsets may also be erroneous because studies are rarely designed and powered to adequately detect these differences. Therefore, subset analyses should be regarded as exploratory and interpreted with caution.45 TABLE 1. Sample size calculation CSD vs PR Urinary Leakage Effect Size (No. pts/arm)*  % Power 0.08 vs 0.04 0.08 vs 0.02 0.08 vs 0.01 80 90 99 539 721 1,260 187 251 438 113 181 264 80 90 99 801 1,021 1,648 279 355 573 168 214 346 ␣ ⫽ 0.05: 0.20 0.10 0.01 ␣ ⫽ 0.01: 0.20 0.10 0.01 Assume a hypothetical RCT in patients undergoing partial nephrectomy to investigate whether PR is associated with a lower incidence of urinary leakage than a CSD. Study outcome of interest is the proportion of patients per arm with leakage. Assume that a retrospective review of patients with a CSD yields an 8% leakage rate. Shown are calculated sample sizes for 3 effect sizes, ie differences in leakage rates of 4%, 6% and 7% between arms. As estimated sample sizes indicate, the smaller the effect size, the larger the sample size. Sample size estimates also increase if a smaller ␣ or a greater power (smaller ) is selected. * Sample size calculations based on the arcsin approximation with 2-sided ␣. power, since the effect size must be set to a smaller value than in a superiority trial. Choosing an Appropriate Statistical Test Selecting an appropriate statistical method requires consideration of the analytical goal and the type of data under consideration (table 2).46 Factors that determine the choice of a statistical test are the type of outcome variable (categorical, ordinal, continuous and time-to-event), whether data are paired or independent and the assumed data distribution. Statistical tests commonly encountered in the urology literature can be grouped into 2 categories, including parametric tests, which rely on a specified data distribution (usually normal), and nonparametric tests, which do not. Parametric tests are generally more powerful than nonparametric tests if the underlying assumptions about data distribution are met. While there are specific statistical tests to analyze whether the assumptions of a normal distribution are met, histograms are helpful for making this determination. Nonparametric tests are often used if the outcome variable is ordinal, sample size is small (fewer than 20), there are multiple outliers or there is an obviously nonnormal distribution. When in doubt, it is preferable to proceed with nonparametric testing, which will yield a more conservative p value. Examples of nonparametric tests are the Beware of Multiple Comparisons Based on the hypothesis testing framework setting ␣ to 0.05 indicates that the investigator is willing to accept a 5% chance of a falsely rejecting H0 (type I error). If the investigator tests multiple independent H0s, the probability that 1 statistical test is significant by chance alone increases.41,42 It is common in the urology literature to see multiple hypotheses being tested, thereby increasing the probability of achieving a statistically significant result well beyond the 5% threshold.9 For example, if a study tests 10 H0s, the probability of obtaining at least 1 statistically significant result at an ␣ of 0.05 is 40% (1 ⫺ 0.9510), for 50 H0 it is 92% (1 ⫺ 0.9550) and for 100 H0 it is greater than 99% (1 ⫺ 0.95100).3 Several statistical methods are available to adjust for multiple comparisons.15,43 The Bonferroni method is the simplest and most commonly used method.44 This correction involves dividing ␣ by the number of independent hypotheses to be tested. For example, if an investigator tests 10 hypotheses, the level of ␣ to determine statistical significance would decrease from 0.05 to 0.05/10 ⫽ 0.005. The Bonferroni correction results in conservative determinations TABLE 2. Statistical road map Outcome Variable Type Analysis Goal Describe 1 group Compare 2 unpaired groups Compare 2 paired groups Compare 3 or more groups Examine association between 2 variables Predict 1 value from another Continuous Ordinal Categorical Mean (central tendency), SD (spread) CI for difference in means, 2-sample t test Median (central tendency), IQR (spread) CI for difference in medians, Mann-Whitney (Wilcoxon) test Proportions, frequency table KM survival curve Survival Time Log rank test CI for paired differences, paired t test 1-Way ANOVA Wilcoxon signed rank test CI for proportions, chisquare ⫹ Fisher’s exact tests, OR NcNemar’s test Kruskal-Wallis test Pearson correlation Linear regression Not available Proportional hazards regression Spearman correlation Cochran Mantel-Haenszel ⫹ chi-square tests Chi-square test Ordinal logistic regression Logistic regression Proportional hazards regression INTERPRETING STATISTICS IN UROLOGICAL LITERATURE 1943 TABLE 3. Hypothetical study of association between coffee drinking and bladder cancer Coffee: Ca No Ca No coffee: Ca No Ca OR (Ca vs coffee) Ca ⫹ no Ca: Coffee No coffee OR (coffee vs smoking) Coffee ⫹ no coffee: Ca No Ca OR (Ca vs smoking) No. Smokers No. Nonsmokers 80 40 10 20 20 10 (80 ⫻ 10)/(20 ⫻ 40) ⫽ 1 40 80 (10 ⫻ 80)/(40 ⫻ 20) ⫽ 1 120 30 30 120 100 50 ⬍ (120 ⫻ 120)/(30 ⫻ 30) ⫽ 16 ⬎ ⬍ (100 ⫻ 100)/(50 ⫻ 50) ⫽ 4 ⬎ 50 100 Of 150 coffee and 150 noncoffee drinkers 90 and 60, respectively, have bladder cancer, from which OR of 2.25 is calculated for association of coffee drinking and bladder cancer. However, when stratified by smoking status, coffee drinking is not associated with bladder cancer. Smoking is a confounder because it is strongly associated with coffee drinking and bladder cancer. Mann-Whitney (Wilcoxon rank sum) and Kruskal-Wallis tests (table 2). While many studies compare 2 groups that have no intrinsic association, there are situations when the outcome in a subject in 1 group is expected to be more similar to that in a particular subject in the second group than to that in a random subject. In this case we refer to paired data. An example of paired data is measurements that are performed in the same individual before and after intervention, ie the International Prostate Symptom Score, before and after transurethral resection of the prostate. Other examples of paired data in the urology literature are studies that recruit subjects in pairs, such as patients with prostate cancer who are matched in age, PSA and prostatectomy Gleason score. Paired tests account for the lack of independence between the groups and they should be used to obtain accurate p values. What is a Confounding Variable? Many studies are designed to demonstrate a link between a predictor variable, eg radiation treatment for prostate cancer with or without androgen ablation, and an outcome variable, eg recurrence-free survival. Under the ideal circumstances of a double-blind RCT one could assume that the only difference between the 2 groups would be the form of treatment. However, in some studies groups may differ in other ways, particularly in retrospective, observational designs. For example, patients in 1 of the 2 treatment groups may have had more locally advanced disease to begin with, which could explain a difference in survival. In this case the results are said to be confounded. In statistics confounding has a specific meaning. A confounding variable is one that is associated with the predictor variable as well as the outcome variable.15 Table 3 shows an example related to coffee drinking and the risk for bladder cancer. In this example smoking is a confounder that must be accounted for to interpret the data in a meaningful way. Although confounders are best controlled for at the design stage, they can be adjusted for in the analysis phase.47 The best way to control for confounders is to randomly assign patients to study groups. How to Interpret Multivariate Analyses Multivariate analysis seeks to assess the relationship between a predictor variable and an outcome while adjusting for the influence of 1 or more covariates. The most frequently used models are linear regression analysis for continuous outcomes that are preferably normally distributed, logistic regression analysis for dichotomous or ordinal outcomes and proportional hazards regression analysis for time-to-event data.48 Each model yields  coefficients that indicate the magnitude of the change in the outcome as a function of a change in the predictor variable after adjusting for all other covariates in the model. It is important to realize that multivariate analysis can account only for known confounder variables but cannot adjust for a lack of randomization.49,50 Also, if a study is inadequately powered to demonstrate an effect on univariate analysis, the same will be true when multivariate analysis is performed. In general large SEs and large CIs are clues to an inadequate sample size.3 Finally, each variable included in the model requires a certain number of events, such as the number of successes when the outcome is dichotomous. This issue is particularly relevant to logistic and proportional hazards regression models because only a subset of study subjects experiences the outcome of interest. As a rule of thumb, 10 to 15 events are necessary for each variable included in the model to solve the equation and allow the model to converge.49 Including too many variables in a multivariate model is referred to as overfitting and it is a common occurrence in the urological literature.9 CONCLUSIONS Basic knowledge of statistical concepts and testing is essential to the understanding and critical interpretation of the urological literature. Therefore, it is an integral component of an evidence based practice. We describe the core principles of applied statistics that are relevant to urology and we sought to illustrate common pitfalls. While this article can only provide a brief introduction, this topic is relevant to a large audience of practicing urologists. Statistical literacy should continue to be fostered through educational materials and courses from within the urological community. ACKNOWLEDGMENTS Dr. Ulrich Guller and Elizabeth R. DeLong inspired this review.3 1944 INTERPRETING STATISTICS IN UROLOGICAL LITERATURE APPENDIX Analogy Between Trial by Jury and Statistical Hypothesis Testing Trial by Jury Start with the presumption that the defendant is innocent Innocence: the defendant did not counterfeit money Guilt: the defendant counterfeited the money Listen to factual evidence presented in the trial. Do not consider outside evidence, such as newspaper stories that you have read Evaluate whether you believe the witnesses. Ignore testimony from witnesses whom you think have lied Standard for rejecting innocence: beyond reasonable doubt Think about whether the evidence is consistent with the assumption of innocence Correct judgments that a jury can make If evidence is inconsistent with the assumption of innocence, then reject the assumption of innocence and declare the defendant to be guilty (convict a counterfeiter) If evidence is consistent with the assumption of innocence, then declare the defendant to be innocent (acquit an innocent person) Incorrect judgments that a jury can make Convict an innocent person Acquit a counterfeiter Statistical Analogy Start with the presumption that the null hypothesis (H0) is true H0: there is no association between the incidence of perinephric abscesses and the type of drain used Alternative hypothesis (HA): there is an association between the incidence of perinephric abscesses and the type of drain used Base your conclusions only on data from this 1 analysis. For the purpose of the statistical analysis ignore whether the hypothesis is scientifically or clinically plausible and do not consider any other data Evaluate whether the study was performed appropriately. Ignore flawed data Standard for rejecting H0: level of statistical significance (␣, typically 0.05) Calculate a p value Correct inferences If the p value is lower than the preset threshold of ␣, conclude that the data are inconsistent with H0 and declare the difference to be statistically significant If the p value is equal or greater than the preset threshold (typically 0.05), conclude that the data are consistent with H0 and declare the difference to be not statistically significant Incorrect inferences Type I error: conclude that there is an association between abscess incidence and drain type when there actually is none Type II error: conclude that there is no association between abscess incidence and drain type when there actually is one 9. Abbreviations and Acronyms ARR BT CSD H0 HA KM PLESS ⫽ ⫽ ⫽ ⫽ ⫽ ⫽ ⫽ PR PSA RR RRR ⫽ ⫽ ⫽ ⫽ absolute risk reduction bladder tumor closed suction drain null hypothesis alternate hypothesis Kaplan-Meier Proscar™ Long-Term Efficacy and Safety Study Penrose drain prostate specific antigen risk ratio or relative risk relative risk reduction 10. 11. 12. 13. 14. REFERENCES 1. 2. 3. 4. 5. 6. 7. 8. Gup, D. I., Province, M. A. and Lepor, H.: A practical review of statistical analysis. AUA Update Series, vol. 9, lesson 24, 1990 Davis, J. W., Chang, S. S. and Schellhammer, P. F.: Clinical trials methodology. AUA Update Series, vol. 23, lesson 18, 2004 Guller, U. and DeLong, E. R.: Interpreting statistics in medical literature: a vade mecum for surgeons. J Am Coll Surg, 198: 441, 2004 Albertsen, P. C., Hanley, J. A. and Murphy-Setzko, M.: Statistical considerations when assessing outcomes following treatment for prostate cancer. J Urol, 162: 439, 1999 Montori, V. M., Jaeschke, R., Schunemann, H. J., Bhandari, M., Brozek, J. L., Devereaux, P. J. et al: Users’ guide to detecting misleading claims in clinical research reports. BMJ, 329: 1093, 2004 Kocher, M. S. and Zurakowski, D.: Clinical epidemiology and biostatistics: a primer for orthopaedic surgeons. J Bone Joint Surg Am, 86-A: 607, 2004 Altman, D. G.: Statistics and ethics in medical research. Misuse of statistics is unethical. Br Med J, 281: 1182, 1980 Ambrosius, W. T. and Manatunga, A. K.: Intensive short courses in biostatistics for fellows and physicians. Stat Med, 21: 2739, 2002 15. 16. 17. 18. 19. 20. 21. 22. 23. Scales, C. D., Jr., Norris, R. D., Peterson, B. L., Preminger, G. M. and Dahm, P.: Clinical research and statistical methods in the urology literature. J Urol, 174: 1374, 2005 Breau, R. H., Carnat, T. A. and Gaboury, I.: Inadequate statistical power of negative clinical trials in urology literature. J Urol, 176: 263, 2006 Altman, D. G.: Misleading interpretation of results from a small study. Urology, 43: 411, 1994 Ferraris, V. A. and Ferraris, S. P.: Assessing the medical literature: let the buyer beware. Ann Thorac Surg, 76: 4, 2003 Pocock, S. J., Hughes, M. D. and Lee, R. J.: Statistical problems in the reporting of clinical trials. A survey of three medical journals. N Engl J Med, 317: 426, 1987 Zaugg, C. E.: Common biostatistical errors in clinical studies. Schweiz Rundsch Med Prax, 92: 218, 2003 Motulsky, H.: Intuitive Biostatistics. Oxford, United Kingdom: Oxford University Press, p. 386, 1995 Dellabella, M., Milanese, G. and Muzzonigro, G.: Randomized trial of the efficacy of tamsulosin, nifedipine and phloroglucinol in medical expulsive therapy for distal ureteral calculi. J Urol, 174: 167, 2005 Steenkamp, J. W., Heyns, C. F. and de Kock, M. L.: Internal urethrotomy versus dilation as treatment for male urethral strictures: a prospective, randomized comparison. J Urol, 157: 98, 1997 Brown, G. W.: Standard deviation, standard error. Which “standard” should we use? Am J Dis Child, 136: 937, 1982 Zaugg, C. E.: Standard error as standard? Circulation, 107: e89, 2003 Altman, D. G., Gore, S. M., Gardner, M. J. and Pocock, S.J.: Statistical guidelines for contributors to medical journals. Br Med J (Clin Res Ed), 286: 1489, 1983 Altman, D. G., Schulz, K. F., Moher, D. et al: The revised CONSORT statement for reporting randomized trials: explanation and elaboration. Ann Intern Med, 134: 663, 2001 Altman, D. G.: Endorsement of the CONSORT statement by high impact medical journals: survey of instructions for authors. BMJ, 330: 1056, 2005 Montori, V. M., Kleinbart, J., Newman, T. B. et al: Tips for learners of evidence-based medicine: 2. Measures of precision (confidence intervals). CMAJ, 171: 611, 2004 INTERPRETING STATISTICS IN UROLOGICAL LITERATURE 24. 25. 26. 27. 28. 29. 30. 31. 32. 33. 34. 35. 36. Jaeschke, R., Guyatt, G., Shannon, H., Walter, S., Cook, D. and Heddle, N.: Basic statistics for clinicians: 3. Assessing the effects of treatment: measures of association. CMAJ, 152: 351, 1995 Greenhalgh, T.: How to read a paper. Statistics for the nonstatistician. II: “Significant” relations and their pitfalls BMJ, 315: 422, 1997 Barratt, A., Wyer, P. C., Hatala, R., McGinn, T., Dans, A. L., Keitz, S. et al: Tips for learners of evidence-based medicine: 1. Relative risk reduction, absolute risk reduction and number needed to treat. CMAJ, 171: 353, 2004 McConnell, J. D., Bruskewitz, R., Walsh, P., Andriole, G., Lieber, M., Holtgrewe, H. L. et al: The effect of finasteride on the risk of acute urinary retention and the need for surgical treatment among men with benign prostatic hyperplasia. Finasteride Long-Term Efficacy and Safety Study Group. N Engl J Med, 338: 557, 1998 Bland, J. M. and Altman, D. G.: Statistics notes. The odds ratio. BMJ, 320: 1468, 2000 Altman, D. G., Deeks, J. J. and Sackett, D. L.: Odds ratios should be avoided when events are common. BMJ, 317: 1318, 1998 Kaplan, E. L. and Meier, P.: Nonparametric estimation from incomplete observations. J Am Stat Assoc, 53: 457, 1958 Bland, J. M. and Altman, D. G.: Survival probabilities (the Kaplan-Meier method). BMJ, 317: 1572, 1998 Clark, T. G., Bradburn, M. J., Love, S. B. and Altman, D.G.: Survival analysis part I: basic concepts and first analyses. Br J Cancer, 89: 232, 2003 Mathew, A., Pandey, M. and Murthy, N. S.: Survival analysis: caveats and pitfalls. Eur J Surg Oncol, 25: 321, 1999 Pocock, S. J., Clayton, T. C. and Altman, D. G.: Survival plots of time-to-event outcomes in clinical trials: good practice and pitfalls. Lancet, 359: 1686, 2002 Albertsen, P. C.: Outcomes research: a primer for urologists. AUA Update Series, vol. 17, lesson 14, 1998 Berwick, D. M.: Experimental power: the other side of the coin. Pediatrics, 65: 1043, 1980 37. 38. 39. 40. 41. 42. 43. 44. 45. 46. 47. 48. 49. 50. 1945 Freiman, J. A., Chalmers, T. C., Smith, H., Jr. and Kuebler, R. R.: The importance of beta, the type II error and sample size in the design and interpretation of the randomized control trial. Survey of 71 “negative” trials. N Engl J Med, 299: 690, 1978 Altman, D. G.: Statistics and ethics in medical research: III. How large a sample? Br Med J, 281: 1336, 1980 Moher, D., Dulberg, C. S. and Wells, G. A.: Statistical power, sample size, and their reporting in randomized controlled trials. JAMA, 272: 122, 1994 Guller, U. and Oertli, D.: Sample size matters: a guide for surgeons. World J Surg, 29: 601, 2005 Lee, J. T.: Statistics in medical literature. J Am Coll Surg, 199: 170, 2004 Sterne, J. A. and Davey Smith, G.: Sifting the evidence— what’s wrong with significance tests? BMJ, 322: 226, 2001 Bender, R. and Lange, S.: Adjusting for multiple testing— when and how? J Clin Epidemiol, 54: 343, 2001 Bland, J. M. and Altman, D. G.: Multiple significance tests: the Bonferroni method. BMJ, 310: 170, 1995 Assmann, S. F., Pocock, S. J., Enos, L. E. and Kasten, L. E.: Subgroup analysis and other (mis)uses of baseline data in clinical trials. Lancet, 355: 1064, 2000 Greenhalgh, T.: How to read a paper. Statistics for the nonstatistician. I: Different types of data need different statistical tests BMJ, 315: 364, 1997 Mullner, M., Matthews, H. and Altman, D. G.: Reporting on statistical methods to adjust for confounding: a cross-sectional survey. Ann Intern Med, 136: 122, 2002 Clark, T. G., Bradburn, M. J., Love, S. B. and Altman, D.G.: Survival analysis part IV: further concepts and methods in survival analysis. Br J Cancer, 89: 781, 2003 Katz, M. H.: Multivariate Analysis: A Practical Guide for Clinicians. Cambridge, United Kingdom: Cambridge University Press, p. 192, 1999 Concato, J., Feinstein, A. R. and Holford, T. R.: The risk of determining risk with multivariable models. Ann Intern Med, 118: 201, 1993