Download Interpreting Statistics in the Urological Literature

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Categorical variable wikipedia , lookup

Foundations of statistics wikipedia , lookup

Time series wikipedia , lookup

Statistical inference wikipedia , lookup

History of statistics wikipedia , lookup

Misuse of statistics wikipedia , lookup

Transcript
Interpreting Statistics in the Urological Literature
Charles D. Scales, Jr., Bercedis Peterson and Philipp Dahm*
From the Departments of Surgery (Division of Urology) and Biostatistics and Bioinformatics (BP), Duke University Medical Center,
Durham, North Carolina
Purpose: Knowledge of statistical terminology and the ability to critically interpret research findings are critical skills in the
practice of evidence based medicine.
Materials and Methods: We provide a series of nontechnical explanations of basic statistical concepts commonly encountered in the urological literature. In addition, we provide examples of common statistical pitfalls to increase awareness of
limitations to consider when applying research findings to practice.
Results: Statistical goals encountered in the urological literature can be broadly categorized as summarizing outcome
variables, comparing 2 or more groups, measuring association among variables or predicting 1 variable from another. Errors
frequently include the use of an inappropriate test for the data type of interest or using statistical testing in a manner that
increases the likelihood of false-positive results. Such errors pose a threat to the validity of research findings and they may
undermine study conclusions.
Conclusions: Editors and reviewers alike should strive for high standards of statistical analysis and reporting, and promote
the publication of high quality evidence in the urological literature. The understanding of basic statistical concepts and the
principles of the hypothesis testing framework is essential to the critical appraisal process and, therefore, important to all
urologists. Statistical literacy should be fostered through educational materials and courses in the urological community.
Key Words: urology, statistics, evidence-based medicine
nowledge of statistical terminology and methods has
a pivotal role in the conduct of clinical research and
it is an essential tool for the urologist critically appraising the literature.1– 4 With the growing importance of
evidence based practice the producers and consumers of
biomedical research have a strong interest in statistical
literacy.5 Curricular trends in undergraduate and graduate
medical education now emphasize a basic knowledge of clinical research and statistical methods as part of a broader
preparation for evidence based patient care.6 – 8
Despite these efforts recent evidence suggests that statistical literacy and a knowledge of clinical research methods among urologists may be suboptimal, as among other
physicians. Studies in the urological literature and other
subspecialties frequently contain statistical errors or they
are underpowered.9 –12 These errors threaten to undermine
the validity of published studies and flawed investigations
may influence medical practice in undesirable ways.13,14
Given this context, we provide nontechnical explanations of
basic statistical concepts commonly encountered by urologists. To increase awareness of the limitations of statistical
testing we provide practical examples of common pitfalls
that urologists should consider when applying research findings to patient care.
K
Submitted for publication September 23, 2005.
* Correspondence: Department of Urology, University of Florida,
College of Medicine, Gainesville, Florida 32610 (telephone: 352-2736815; FAX: 352-392-8846; e-mail: [email protected]).
0022-5347/06/1765-1938/0
THE JOURNAL OF UROLOGY®
Copyright © 2006 by AMERICAN UROLOGICAL ASSOCIATION
RESULTS
Outcome Measures
Three types of outcome measures or variables are common
in biomedical research, including continuous, categorical
and time to event.15 Continuous variables have values, such
that the distance between the values 3 and 5, for example, is
the same as the distance between 20 and 22 or the distance
between 66 and 68. Examples of continuous measures are
age in years, tumor size in cm and length of hospital stay in
days. Categorical variables, of which dichotomous and ordinal are special cases, are measures that have 2 or more
categories with no intrinsic numerical value, eg renal cell
carcinoma histology (clear cell, papillary and other). Dichotomous (binary) variables have only 2 categories, eg side
right/left or stone recurrence yes/no. Many specialized statistical procedures, such as logistic regression, are used for
the analysis of dichotomous data. Ordinal variables are categorical variables in which the categories can be ordered.
Examples in the urology literature are Gleason score and
the visual analog pain scale. Time-to-event data are frequently found in oncological studies of mortality or disease
recurrence. Examples are time to death or biochemical failure as well as time to stone passage and time to urethral
stricture recurrence.16,17
Summarizing Continuous Data
The first question that should arise in the analysis of a
continuous outcome variable is whether the data have a
normal (also called Gaussian) distribution. Since normal
distributions have a bell-shaped symmetry, an approximate
assessment of normality can be made by making a histo-
1938
Vol. 176, 1938-1945, November 2006
Printed in U.S.A.
DOI:10.1016/j.juro.2006.07.001
INTERPRETING STATISTICS IN UROLOGICAL LITERATURE
gram (fig. 1). The assessment of normality determines the
appropriate measures for summarizing the data. The most
commonly used summary statistics are measures of central
tendency, eg the mean, and measures of spread, eg the SD.
These statistics are most appropriately used when there are
at least 20 data points.15 The mean is a way to characterize
the center of a distribution and it is calculated as the sum of
individual values divided by the number of observations. SD
or ␴, often called sigma, is a measure of the variability
(scatter) of a distribution and it describes the degree to
which individual values vary from the sample mean. SD is
independent of sample size and it is large if the data are
highly scattered. When the data are normally distributed,
all values within 1 SD encompass 68% of the data and all
values within 2 SD encompass 95% of the data. For example,
a mean of 100 and an SD of 15 tell us that 68% of the
observations are between 85 and 115. It also tells us that
values more extreme than 70 and 130 (2 SDs from the mean)
are uncommon.
In contrast, SEM measures variability in the distribution
of sample means from the population mean.15 In other
words, if several study samples were available, it describes
how the sample means would vary for a given SD. SEM can
be calculated from SD using the formula, SEM ⫽ SD/公n,
where n represents sample size. It is important to realize
that SD and SEM are 2 distinctly different measures. SD
describes the scatter of the data in a given sample, while
SEM estimates the scatter of sample means in a population.18,19 A common pitfall is the use of SEM instead of SD
to describe a data set, thereby suggesting a lesser degree of
data dispersion. However, SEM is not a measure of data
scatter and, therefore, it is misleading. Generally speaking
SDs are most appropriately presented along with means in
the table of results that describes the study population.
SEMs are used to construct CIs and they are presented in
the context of statistical hypothesis testing along with sample means, as described.
Summarizing Nonnormal
Continuous Data and Ordinal Data
For nonnormal continuous data or ordinal data the median
and IQR are often the preferred descriptive statistics. In
these cases the median is preferred over the mean because
the mean may be affected by outliers or a skewed distribution. For example, let us consider a simple case of 5 men with
prostate cancer who have PSA 3.5, 3.5, 4.0, 4.5 and 4.5
ng/ml, respectively. In this case the mean and median are
identical (4.0 ng/ml). However, if instead we have a group of
men with PSA 3.5, 3.5, 4.0, 4.5 and 15.0 ng/ml, respectively,
the mean becomes 6.1, while the median remains unchanged
(4.0 ng/ml). In this case when an outlier is present, the
median is less affected than the mean.
IQR is the range of values from the 25th to the 75th
percentile of the group of subjects and it gives the reader a
sense of the variability in the data. Although SD characterizes the variance of distribution even for nonnormal distributions, it cannot be interpreted in the same way. SD values
are easily interpretable only if the distribution is normal.
B
Normal with N=1000
Normal with N=200
0
0
50
10
20
100
30
150
A
1939
20
30
40
50
30
35
40
AGE
Skewed to right with N=1000
50
Bi-modal with N=1000
D
0
0
50
50
100
100
150
150
C
45
AGE
40
45
50
55
AGE
60
65
70
20
30
40
50
60
70
AGE
FIG. 1. Histograms show hypothetical distributions. Of 2 histograms with normal distributions (A and B) 1 appears more normal (A) only
because sample size is larger. Asymmetrical distribution is said to be skewed to right (C). Bimodal distribution has high point at around age
35 years and another at around age 65 years (D).
1940
INTERPRETING STATISTICS IN UROLOGICAL LITERATURE
Summarizing Unordered Categorical Data
Unordered categorical data are often summarized using a
frequency table, as exemplified by the 2 ⫻ 2 table (fig. 2).
In this table the frequency of acute urinary retention
(yes/no) is compared between 2 interventions. The frequency table completely describes the data, and reporting
absolute numbers and percents, rather than percents
alone, is encouraged by medical journal editors.20 –22 A
final step in summarizing dichotomous outcomes is to give
a CI around estimated percents, eg around the estimated
percent of patients with acute urinary retention in each of
2 intervention groups.
How to Interpret a CI
CIs are used to describe the estimated range of values that
is likely to include an unknown population parameter, such
as a mean, proportion or HR.23 Thus, a 95% CI gives a range
of values that would include the true population parameter
in 95% of random samples of the same size from the population of interest. CIs make sense only if the sample is
representative of a larger population, which is ensured if the
sample is truly randomly selected. The greater the dispersion of the data and the smaller the sample size, the wider
the CI. A wide CI indicates that the sample data are insufficient to precisely estimate the outcome measure in the
overall population.23
How to Interpret RRs
While the display and analysis of continuous outcomes, eg
operative time, are relatively intuitive, the interpretation of
percents for dichotomous end points, ie recurrence (yes/no),
can be challenging.24 Results of dichotomous end points are
frequently presented in table form and further described
using the terms RR, ARR, RRR, OR and number needed to
treat. In studies of therapy understanding these closely related measures of treatment effects greatly aids in the interpretation of study results.25,26
To review these terms consider an example from
PLESS, which investigated the effect of finasteride on the
risk of acute urinary retention in patients with benign
prostatic hyperplasia.27 In this double-blind study patients were randomized with equal probability to receive
finasteride (treatment) or placebo (control) and after 4
years acute retention episodes in each group were tabulated (fig. 2). The most interesting result in this trial is the
ARR or the difference in the proportion of patients expe-
Exposure
Finasteride
Placebo
Outcome (Acute Urinary Retention)
Yes
No
42 (a)
1461 (b)
99 (c)
1404 (d)
141
2865
Total
1513 (n1)
1503 (n2)
Absolute Risk (AR) Finasteride= a/n1 = 0.028
Absolute Risk (AR) Placebo= c/n2 = 0.066
Absolute Risk Reduction (ARR)Finasteride= c/n2 – a/n1 = [ARPlacebo-ARFinasteride] = 0.038
Relative Risk (RR)Finasteride = (a/n1)/(c/n2) = [ARFinasteride/ARPlacebo] = 0.42
Relative Risk Reduction (RRR)Finasteride = (c/n2) – ((a/n1)/(c/n2)) =[ARPlacebo-ARFinasteride/ ARPlacebo] = 0.58
Number Needed To Treat (NNT)Finasteride = 1/ARR = 26.3
OddsFinasteride= a/b = 42/1461 = 0.0287
OddsPlacebo= c/d = 99/1404 = 0.0705
Odds Ratio (OR) = (a/b) ?(c/d) = ac / bd = (42*1404) ?(1461*99) = 0.41
FIG. 2. PLESS results27
riencing retention in each group. In this example the ARR
was 3.8%. Another way to describe the difference in outcomes is with the RR, which is the probability of experiencing the outcome in the treatment group divided by the
probability of experiencing the outcome in the control
group, which was 0.42 in this trial. The RRR in this case
was 58%. The problem with the RR and RRR is that they
do not reflect the risk of the event without therapy and,
therefore, they cannot discriminate large treatment effects from small ones. In contrast, ARR preserves the
baseline risk. Thus, it is often considered a more meaningful measure of treatment effect than RRR. Finally, the
number of patients who must be treated to prevent 1
additional patient from having the outcome of interest
may be calculated as the inverse of the ARR, which in this
case was 26.3 or 27 patients. It is helpful for the reader to
be familiar with these measures and be able to convert one
into another.
Odds, OR and the Comparison with RR
Odds are defined as the probability of experiencing an
outcome divided by the probability of not experiencing the
outcome (fig. 2).28 In the PLESS study the odds of acute
urinary retention in the finasteride and placebo arms
were 0.029 and 0.071, respectively.27 An OR is calculated
by dividing the odds of patients in the study group by the
odds of patients in the control group or vice versa. In our
example the OR for acute urinary retention was 0.41 in
the treatment arm, which was similar to the calculated
RR. However, ORs and RRs are only similar if the overall
event rate is low.29 Therefore, only if an event is relatively
rare can RRs and ORs be used interchangeably. Figure 3
shows how probabilities and odds diverge as the event
rate increases. ORs are the preferred method of displaying
the results of case-control studies, meta-analyses and logistic regression analyses.
How to Interpret a KM Curve
KM curves are common in the urological literature, particularly in studies of the prognosis of and therapy for malignancies.9 A KM curve represents the estimated distribution
of a time-to-event outcome.30 –32 These survival curves show
not only time to death, but also time to any event. For
example, a KM curve can be used to estimate time to PSA
recurrence in patients with prostate cancer or time to spontaneous ureteral calculous passage. The starting point or
time zero marks the beginning of the observation period.
Usually the start of a KM curve is set to 100% since all
patients are alive or recurrence-free initially. In some cases
KM curves start at 0% to illustrate the time to a desired
outcome, ie at the beginning of the observation period all
patients have stones.
The specific advantage of a KM curve is that it allows
study subjects who do not experience the event of interest to
contribute information to the estimation of the time-to-event
curve.6 In addition, it allows patients with incomplete followup or who experience competing events to contribute to
the curve. These patients are censored at the time that
followup ceases. Censoring should be noninformative, that is
censored patients should be no more or less likely to experience the event than noncensored patients.33
INTERPRETING STATISTICS IN UROLOGICAL LITERATURE
Assume a hypothetical surgical series comparing patients with node positive and node negative renal cell
carcinoma (fig. 4). The KM curve remains horizontal when
a patient is censored. However, the number of patients at
risk decreases and, therefore, the next event results in a
larger percent decrease in survival than the previous one.
In general each step down on a survival curve represents
the occurrence of an event. If 2 patients experience the
event of interest, the step down is twice as large as for 1
event. KM curves should show the number of patients at
risk for each time interval and indicate when a patient is
being censored, usually with a vertical tick mark on the
curve. Showing censored subjects enables the reader to
deduce how the number of patients at risk has decreased,
particularly when 2 or more groups are compared. When
patient groups become small, the proportion of event-free
patients become less reliable.34
Statistical Hypothesis Testing,
Type I and II Errors, and Sample Size Calculations
Statistical hypothesis testing provides a formalized framework to evaluate scientific assumptions.2,6,35 Hypothesis
testing involves formulating an H0, which is the statement
that the investigator seeks to disprove, and an HA. For
example, suppose an investigator wishes to compare the
means of 2 groups. H0 would state that there is no difference
between the means and HA would state that there is a
difference. It is important to realize that, based on this
framework, H0 is never proven to be true. It is only rejected
or not rejected in favor of HA. The hypothesis testing framework allows the investigator to determine whether observed
differences are greater than would be expected by chance
alone, ie a statistically significant difference. The Appendix
shows a court analogy to help illustrate this point.
Investigators apply statistical methods to determine whether
it is reasonable to reject H0 based on the prespecified threshold
probability ␣. The ␣ is the threshold probability of obtaining a
false-positive result, ie to erroneously reject H0 and accept HA,
although in reality H0 is true. This type of mistake is referred to
as type I error. Its probability increases with the number of
statistical tests that are performed or so-called multiple testing.
While ␣ is by convention commonly set at 0.05, it is important to
realize that the choice of this particular value is arbitrary and a
smaller, eg ␣ ⫽ 0.01, or greater, eg ␣ ⫽ 0.10, threshold may be
appropriate under certain circumstances. The other incorrect conclusion investigators can draw from a study is to fail to reject H0
when in reality the null is not true, ie a false-negative result.36,37
This is called a type II error and it is referred to as ␤.
The statistical power of a study (1 ⫺ ␤) is the probability
of finding a statistically significant result, ie of rejecting H0,
when H0 is indeed truly false. It is commonly set at 80% (␤ ⫽
0.20). The power of a study depends on the variance (scatter)
of the outcome variable, the chosen level of ␣, sample size
and effect size. For example, effect size in a RCT that compares the means of 2 different treatments is the minimum
difference between group means that would be considered
clinically meaningful. Underpowered studies, ie studies enrolling too few patients to detect a clinically meaningful
difference, may be considered a waste of resources or even
unethical since they potentially expose subjects to the risks
of a treatment of unproven efficacy but are unlikely to yield
definitive results.13,38 – 40 When a reader encounters a negative study or one in which no statistically significant difference was found between groups, the risk of a type II error
should immediately come to mind. Authors should provide
calculations to demonstrate the minimal detectable difference when presenting a negative study to inform the reader
about the risk of a type II error.
Table 1 shows an example of sample size calculation.
Consider an RCT to compare the incidence of urinary leakage in patients with partial nephrectomy treated with a CSD
or PR. The desired ␣, the statistical power and the effect size
influence the necessary sample size (table 1). Specifically
choosing a smaller ␣, a smaller ␤, ie greater power, or a
smaller effect size increases the sample size requirement. A
noninferiority trial requires a larger sample size than a
superiority trial, although they have exactly the same ␣ and
1.0
.9
Proportion of Patients
FIG. 3. Hypothetical example shows how odds and probabilities of
dying (filled circles) vs not dying (open circles) of prostate cancer are
calculated. In several hypothetical series of 20 patients differing in
number of cancer related deaths, as long as event rate is low (Series
#1 and #2), calculated odds and probabilities are similar. With
increasing event rates odds and probabilities diverge. Probabilities
of 0.5 and 0.75, ie 50% and 75% of patients dying of prostate cancer
(Series #4 and #5), correspond to odds of 1 and 3, respectively. Odds
can be calculated from probabilities and vice versa.
1941
Node Negative (n=15)
.8
.7
.6
.5
.4
.3
Node Positive (n=15)
.2
.1
0.0
0
15
At Risk:
15
6
15
15
12
15
12
18
13
10
24
10
7
30
6
6
36
2
4
42
0
3
48 Months
0 Pts
1 Pts
FIG. 4. Hypothetical survival curve compares 2 groups of patients
with renal cell carcinoma, including time to death in 15 each with
node negative (solid curve) and node positive (dotted curve) disease.
In node negative group first patient dies after 12 months. At that
point all patients in that group are still followed and curve drops by
1/15 to 0.93, ie 1.0 – 0.07. Second patient dies after 25 months, at
which time 5 in that group have been censored (vertical tick marks).
Since only 9 patients are followed at that point, death of 1 patient
causes curve to drop by 1/9 from 0.93 to 0.82, ie 0.93 – 0.11. Survival
curve then continues horizontally. No further patient dies and additional 8 are censored for total of 13 censored. In node positive
group 7 patients die and each causes survival curve to drop to
relatively greater extent due to smaller number followed. Median
survival time in node positive group is calculated as 25 months,
whereas it cannot be calculated in node negative group since fewer
than half of patients are dead at study end.
1942
INTERPRETING STATISTICS IN UROLOGICAL LITERATURE
about statistical significance, and so more complicated
methods are often preferred, particularly when the number
of hypotheses is large.
Subset analyses are similar to multiple comparisons in
that they increase the probability of spurious results and
require cautious interpretation. They are common in the
urological literature. Ideally subset analyses should be prespecified and appropriately powered. For example, if the
overall outcome of a trial shows a difference between study
groups, subset analysis might be performed to identify patients who may particularly benefit. However, subset analyses are frequently performed even if the overall study outcome is negative (no statistically significant difference
between groups) to show a potential benefit in at least a
certain subset of patients. This practice implies testing multiple hypotheses, which risks a high rate of false-positive
findings. On the other hand, concluding from subset analyses that there is no difference between these subsets may
also be erroneous because studies are rarely designed and
powered to adequately detect these differences. Therefore,
subset analyses should be regarded as exploratory and interpreted with caution.45
TABLE 1. Sample size calculation
CSD vs PR Urinary Leakage Effect Size
(No. pts/arm)*
␤
% Power
0.08 vs 0.04
0.08 vs 0.02
0.08 vs 0.01
80
90
99
539
721
1,260
187
251
438
113
181
264
80
90
99
801
1,021
1,648
279
355
573
168
214
346
␣ ⫽ 0.05:
0.20
0.10
0.01
␣ ⫽ 0.01:
0.20
0.10
0.01
Assume a hypothetical RCT in patients undergoing partial nephrectomy to
investigate whether PR is associated with a lower incidence of urinary
leakage than a CSD. Study outcome of interest is the proportion of patients
per arm with leakage. Assume that a retrospective review of patients with
a CSD yields an 8% leakage rate. Shown are calculated sample sizes for 3
effect sizes, ie differences in leakage rates of 4%, 6% and 7% between arms.
As estimated sample sizes indicate, the smaller the effect size, the larger
the sample size. Sample size estimates also increase if a smaller ␣ or a
greater power (smaller ␤) is selected.
* Sample size calculations based on the arcsin approximation with 2-sided ␣.
power, since the effect size must be set to a smaller value
than in a superiority trial.
Choosing an Appropriate Statistical Test
Selecting an appropriate statistical method requires consideration of the analytical goal and the type of data under
consideration (table 2).46 Factors that determine the choice
of a statistical test are the type of outcome variable (categorical, ordinal, continuous and time-to-event), whether
data are paired or independent and the assumed data distribution.
Statistical tests commonly encountered in the urology
literature can be grouped into 2 categories, including parametric tests, which rely on a specified data distribution
(usually normal), and nonparametric tests, which do not.
Parametric tests are generally more powerful than nonparametric tests if the underlying assumptions about data distribution are met. While there are specific statistical tests to
analyze whether the assumptions of a normal distribution
are met, histograms are helpful for making this determination. Nonparametric tests are often used if the outcome
variable is ordinal, sample size is small (fewer than 20),
there are multiple outliers or there is an obviously nonnormal distribution. When in doubt, it is preferable to proceed
with nonparametric testing, which will yield a more conservative p value. Examples of nonparametric tests are the
Beware of Multiple Comparisons
Based on the hypothesis testing framework setting ␣ to 0.05
indicates that the investigator is willing to accept a 5%
chance of a falsely rejecting H0 (type I error). If the investigator tests multiple independent H0s, the probability that 1
statistical test is significant by chance alone increases.41,42
It is common in the urology literature to see multiple hypotheses being tested, thereby increasing the probability of
achieving a statistically significant result well beyond the
5% threshold.9 For example, if a study tests 10 H0s, the
probability of obtaining at least 1 statistically significant
result at an ␣ of 0.05 is 40% (1 ⫺ 0.9510), for 50 H0 it is 92%
(1 ⫺ 0.9550) and for 100 H0 it is greater than 99% (1 ⫺
0.95100).3
Several statistical methods are available to adjust for
multiple comparisons.15,43 The Bonferroni method is the
simplest and most commonly used method.44 This correction
involves dividing ␣ by the number of independent hypotheses to be tested. For example, if an investigator tests 10
hypotheses, the level of ␣ to determine statistical significance would decrease from 0.05 to 0.05/10 ⫽ 0.005. The
Bonferroni correction results in conservative determinations
TABLE 2. Statistical road map
Outcome Variable Type
Analysis Goal
Describe 1 group
Compare 2 unpaired
groups
Compare 2 paired
groups
Compare 3 or more
groups
Examine association
between 2 variables
Predict 1 value from
another
Continuous
Ordinal
Categorical
Mean (central tendency),
SD (spread)
CI for difference in means,
2-sample t test
Median (central tendency), IQR
(spread)
CI for difference in medians,
Mann-Whitney (Wilcoxon) test
Proportions, frequency table
KM survival curve
Survival Time
Log rank test
CI for paired differences,
paired t test
1-Way ANOVA
Wilcoxon signed rank test
CI for proportions, chisquare ⫹ Fisher’s exact
tests, OR
NcNemar’s test
Kruskal-Wallis test
Pearson correlation
Linear regression
Not available
Proportional hazards
regression
Spearman correlation
Cochran Mantel-Haenszel ⫹
chi-square tests
Chi-square test
Ordinal logistic regression
Logistic regression
Proportional hazards
regression
INTERPRETING STATISTICS IN UROLOGICAL LITERATURE
1943
TABLE 3. Hypothetical study of association between coffee drinking and bladder cancer
Coffee:
Ca
No Ca
No coffee:
Ca
No Ca
OR (Ca vs coffee)
Ca ⫹ no Ca:
Coffee
No coffee
OR (coffee vs smoking)
Coffee ⫹ no coffee:
Ca
No Ca
OR (Ca vs smoking)
No. Smokers
No. Nonsmokers
80
40
10
20
20
10
(80 ⫻ 10)/(20 ⫻ 40) ⫽ 1
40
80
(10 ⫻ 80)/(40 ⫻ 20) ⫽ 1
120
30
30
120
100
50
⬍ (120 ⫻ 120)/(30 ⫻ 30) ⫽ 16 ⬎
⬍ (100 ⫻ 100)/(50 ⫻ 50) ⫽ 4 ⬎
50
100
Of 150 coffee and 150 noncoffee drinkers 90 and 60, respectively, have bladder cancer, from which OR of 2.25 is calculated for association of coffee drinking
and bladder cancer. However, when stratified by smoking status, coffee drinking is not associated with bladder cancer. Smoking is a confounder because it is
strongly associated with coffee drinking and bladder cancer.
Mann-Whitney (Wilcoxon rank sum) and Kruskal-Wallis
tests (table 2).
While many studies compare 2 groups that have no intrinsic association, there are situations when the outcome in
a subject in 1 group is expected to be more similar to that in
a particular subject in the second group than to that in a
random subject. In this case we refer to paired data. An
example of paired data is measurements that are performed
in the same individual before and after intervention, ie the
International Prostate Symptom Score, before and after
transurethral resection of the prostate. Other examples of
paired data in the urology literature are studies that recruit
subjects in pairs, such as patients with prostate cancer who
are matched in age, PSA and prostatectomy Gleason score.
Paired tests account for the lack of independence between
the groups and they should be used to obtain accurate p
values.
What is a Confounding Variable?
Many studies are designed to demonstrate a link between a
predictor variable, eg radiation treatment for prostate cancer with or without androgen ablation, and an outcome
variable, eg recurrence-free survival. Under the ideal circumstances of a double-blind RCT one could assume that the
only difference between the 2 groups would be the form of
treatment. However, in some studies groups may differ in
other ways, particularly in retrospective, observational designs. For example, patients in 1 of the 2 treatment groups
may have had more locally advanced disease to begin with,
which could explain a difference in survival. In this case the
results are said to be confounded. In statistics confounding
has a specific meaning. A confounding variable is one that is
associated with the predictor variable as well as the outcome
variable.15 Table 3 shows an example related to coffee drinking and the risk for bladder cancer. In this example smoking
is a confounder that must be accounted for to interpret the
data in a meaningful way. Although confounders are best
controlled for at the design stage, they can be adjusted for in
the analysis phase.47 The best way to control for confounders
is to randomly assign patients to study groups.
How to Interpret Multivariate Analyses
Multivariate analysis seeks to assess the relationship between a predictor variable and an outcome while adjusting
for the influence of 1 or more covariates. The most frequently
used models are linear regression analysis for continuous
outcomes that are preferably normally distributed, logistic
regression analysis for dichotomous or ordinal outcomes and
proportional hazards regression analysis for time-to-event
data.48 Each model yields ␤ coefficients that indicate the
magnitude of the change in the outcome as a function of a
change in the predictor variable after adjusting for all other
covariates in the model.
It is important to realize that multivariate analysis can
account only for known confounder variables but cannot
adjust for a lack of randomization.49,50 Also, if a study is
inadequately powered to demonstrate an effect on univariate analysis, the same will be true when multivariate analysis is performed. In general large SEs and large CIs are
clues to an inadequate sample size.3
Finally, each variable included in the model requires a
certain number of events, such as the number of successes
when the outcome is dichotomous. This issue is particularly
relevant to logistic and proportional hazards regression
models because only a subset of study subjects experiences
the outcome of interest. As a rule of thumb, 10 to 15 events
are necessary for each variable included in the model to
solve the equation and allow the model to converge.49 Including too many variables in a multivariate model is referred to as overfitting and it is a common occurrence in the
urological literature.9
CONCLUSIONS
Basic knowledge of statistical concepts and testing is essential to the understanding and critical interpretation of the
urological literature. Therefore, it is an integral component
of an evidence based practice. We describe the core principles of applied statistics that are relevant to urology and we
sought to illustrate common pitfalls. While this article can
only provide a brief introduction, this topic is relevant to a
large audience of practicing urologists. Statistical literacy
should continue to be fostered through educational materials and courses from within the urological community.
ACKNOWLEDGMENTS
Dr. Ulrich Guller and Elizabeth R. DeLong inspired this
review.3
1944
INTERPRETING STATISTICS IN UROLOGICAL LITERATURE
APPENDIX
Analogy Between Trial by Jury and Statistical Hypothesis Testing
Trial by Jury
Start with the presumption that the defendant is innocent
Innocence: the defendant did not counterfeit money
Guilt: the defendant counterfeited the money
Listen to factual evidence presented in the trial. Do not consider
outside evidence, such as newspaper stories that you have read
Evaluate whether you believe the witnesses. Ignore testimony from
witnesses whom you think have lied
Standard for rejecting innocence: beyond reasonable doubt
Think about whether the evidence is consistent with the
assumption of innocence
Correct judgments that a jury can make
If evidence is inconsistent with the assumption of innocence, then
reject the assumption of innocence and declare the defendant to
be guilty (convict a counterfeiter)
If evidence is consistent with the assumption of innocence, then
declare the defendant to be innocent (acquit an innocent person)
Incorrect judgments that a jury can make
Convict an innocent person
Acquit a counterfeiter
Statistical Analogy
Start with the presumption that the null hypothesis (H0) is true
H0: there is no association between the incidence of perinephric abscesses
and the type of drain used
Alternative hypothesis (HA): there is an association between the incidence
of perinephric abscesses and the type of drain used
Base your conclusions only on data from this 1 analysis. For the purpose
of the statistical analysis ignore whether the hypothesis is scientifically
or clinically plausible and do not consider any other data
Evaluate whether the study was performed appropriately. Ignore flawed
data
Standard for rejecting H0: level of statistical significance (␣, typically 0.05)
Calculate a p value
Correct inferences
If the p value is lower than the preset threshold of ␣, conclude that the
data are inconsistent with H0 and declare the difference to be
statistically significant
If the p value is equal or greater than the preset threshold (typically 0.05),
conclude that the data are consistent with H0 and declare the difference
to be not statistically significant
Incorrect inferences
Type I error: conclude that there is an association between abscess
incidence and drain type when there actually is none
Type II error: conclude that there is no association between abscess
incidence and drain type when there actually is one
9.
Abbreviations and Acronyms
ARR
BT
CSD
H0
HA
KM
PLESS
⫽
⫽
⫽
⫽
⫽
⫽
⫽
PR
PSA
RR
RRR
⫽
⫽
⫽
⫽
absolute risk reduction
bladder tumor
closed suction drain
null hypothesis
alternate hypothesis
Kaplan-Meier
Proscar™ Long-Term Efficacy and Safety
Study
Penrose drain
prostate specific antigen
risk ratio or relative risk
relative risk reduction
10.
11.
12.
13.
14.
REFERENCES
1.
2.
3.
4.
5.
6.
7.
8.
Gup, D. I., Province, M. A. and Lepor, H.: A practical review of
statistical analysis. AUA Update Series, vol. 9, lesson 24,
1990
Davis, J. W., Chang, S. S. and Schellhammer, P. F.: Clinical
trials methodology. AUA Update Series, vol. 23, lesson 18,
2004
Guller, U. and DeLong, E. R.: Interpreting statistics in medical
literature: a vade mecum for surgeons. J Am Coll Surg, 198:
441, 2004
Albertsen, P. C., Hanley, J. A. and Murphy-Setzko, M.: Statistical considerations when assessing outcomes following
treatment for prostate cancer. J Urol, 162: 439, 1999
Montori, V. M., Jaeschke, R., Schunemann, H. J., Bhandari,
M., Brozek, J. L., Devereaux, P. J. et al: Users’ guide to
detecting misleading claims in clinical research reports.
BMJ, 329: 1093, 2004
Kocher, M. S. and Zurakowski, D.: Clinical epidemiology and
biostatistics: a primer for orthopaedic surgeons. J Bone
Joint Surg Am, 86-A: 607, 2004
Altman, D. G.: Statistics and ethics in medical research. Misuse of statistics is unethical. Br Med J, 281: 1182, 1980
Ambrosius, W. T. and Manatunga, A. K.: Intensive short
courses in biostatistics for fellows and physicians. Stat
Med, 21: 2739, 2002
15.
16.
17.
18.
19.
20.
21.
22.
23.
Scales, C. D., Jr., Norris, R. D., Peterson, B. L., Preminger,
G. M. and Dahm, P.: Clinical research and statistical methods in the urology literature. J Urol, 174: 1374, 2005
Breau, R. H., Carnat, T. A. and Gaboury, I.: Inadequate statistical power of negative clinical trials in urology literature. J Urol, 176: 263, 2006
Altman, D. G.: Misleading interpretation of results from a
small study. Urology, 43: 411, 1994
Ferraris, V. A. and Ferraris, S. P.: Assessing the medical
literature: let the buyer beware. Ann Thorac Surg, 76: 4,
2003
Pocock, S. J., Hughes, M. D. and Lee, R. J.: Statistical problems in the reporting of clinical trials. A survey of three
medical journals. N Engl J Med, 317: 426, 1987
Zaugg, C. E.: Common biostatistical errors in clinical studies.
Schweiz Rundsch Med Prax, 92: 218, 2003
Motulsky, H.: Intuitive Biostatistics. Oxford, United Kingdom:
Oxford University Press, p. 386, 1995
Dellabella, M., Milanese, G. and Muzzonigro, G.: Randomized
trial of the efficacy of tamsulosin, nifedipine and phloroglucinol in medical expulsive therapy for distal ureteral calculi. J Urol, 174: 167, 2005
Steenkamp, J. W., Heyns, C. F. and de Kock, M. L.: Internal
urethrotomy versus dilation as treatment for male urethral
strictures: a prospective, randomized comparison. J Urol,
157: 98, 1997
Brown, G. W.: Standard deviation, standard error. Which
“standard” should we use? Am J Dis Child, 136: 937, 1982
Zaugg, C. E.: Standard error as standard? Circulation, 107:
e89, 2003
Altman, D. G., Gore, S. M., Gardner, M. J. and Pocock, S.J.:
Statistical guidelines for contributors to medical journals.
Br Med J (Clin Res Ed), 286: 1489, 1983
Altman, D. G., Schulz, K. F., Moher, D. et al: The revised
CONSORT statement for reporting randomized trials: explanation and elaboration. Ann Intern Med, 134: 663, 2001
Altman, D. G.: Endorsement of the CONSORT statement by
high impact medical journals: survey of instructions for
authors. BMJ, 330: 1056, 2005
Montori, V. M., Kleinbart, J., Newman, T. B. et al: Tips for
learners of evidence-based medicine: 2. Measures of precision (confidence intervals). CMAJ, 171: 611, 2004
INTERPRETING STATISTICS IN UROLOGICAL LITERATURE
24.
25.
26.
27.
28.
29.
30.
31.
32.
33.
34.
35.
36.
Jaeschke, R., Guyatt, G., Shannon, H., Walter, S., Cook, D. and
Heddle, N.: Basic statistics for clinicians: 3. Assessing the
effects of treatment: measures of association. CMAJ, 152:
351, 1995
Greenhalgh, T.: How to read a paper. Statistics for the nonstatistician. II: “Significant” relations and their pitfalls
BMJ, 315: 422, 1997
Barratt, A., Wyer, P. C., Hatala, R., McGinn, T., Dans, A. L.,
Keitz, S. et al: Tips for learners of evidence-based medicine:
1. Relative risk reduction, absolute risk reduction and number needed to treat. CMAJ, 171: 353, 2004
McConnell, J. D., Bruskewitz, R., Walsh, P., Andriole, G.,
Lieber, M., Holtgrewe, H. L. et al: The effect of finasteride
on the risk of acute urinary retention and the need for
surgical treatment among men with benign prostatic hyperplasia. Finasteride Long-Term Efficacy and Safety
Study Group. N Engl J Med, 338: 557, 1998
Bland, J. M. and Altman, D. G.: Statistics notes. The odds
ratio. BMJ, 320: 1468, 2000
Altman, D. G., Deeks, J. J. and Sackett, D. L.: Odds ratios
should be avoided when events are common. BMJ, 317:
1318, 1998
Kaplan, E. L. and Meier, P.: Nonparametric estimation from
incomplete observations. J Am Stat Assoc, 53: 457, 1958
Bland, J. M. and Altman, D. G.: Survival probabilities (the
Kaplan-Meier method). BMJ, 317: 1572, 1998
Clark, T. G., Bradburn, M. J., Love, S. B. and Altman, D.G.:
Survival analysis part I: basic concepts and first analyses.
Br J Cancer, 89: 232, 2003
Mathew, A., Pandey, M. and Murthy, N. S.: Survival analysis:
caveats and pitfalls. Eur J Surg Oncol, 25: 321, 1999
Pocock, S. J., Clayton, T. C. and Altman, D. G.: Survival plots
of time-to-event outcomes in clinical trials: good practice
and pitfalls. Lancet, 359: 1686, 2002
Albertsen, P. C.: Outcomes research: a primer for urologists.
AUA Update Series, vol. 17, lesson 14, 1998
Berwick, D. M.: Experimental power: the other side of the coin.
Pediatrics, 65: 1043, 1980
37.
38.
39.
40.
41.
42.
43.
44.
45.
46.
47.
48.
49.
50.
1945
Freiman, J. A., Chalmers, T. C., Smith, H., Jr. and Kuebler,
R. R.: The importance of beta, the type II error and sample
size in the design and interpretation of the randomized
control trial. Survey of 71 “negative” trials. N Engl J Med,
299: 690, 1978
Altman, D. G.: Statistics and ethics in medical research: III.
How large a sample? Br Med J, 281: 1336, 1980
Moher, D., Dulberg, C. S. and Wells, G. A.: Statistical power,
sample size, and their reporting in randomized controlled
trials. JAMA, 272: 122, 1994
Guller, U. and Oertli, D.: Sample size matters: a guide for
surgeons. World J Surg, 29: 601, 2005
Lee, J. T.: Statistics in medical literature. J Am Coll Surg, 199:
170, 2004
Sterne, J. A. and Davey Smith, G.: Sifting the evidence—
what’s wrong with significance tests? BMJ, 322: 226, 2001
Bender, R. and Lange, S.: Adjusting for multiple testing—
when and how? J Clin Epidemiol, 54: 343, 2001
Bland, J. M. and Altman, D. G.: Multiple significance tests: the
Bonferroni method. BMJ, 310: 170, 1995
Assmann, S. F., Pocock, S. J., Enos, L. E. and Kasten, L. E.:
Subgroup analysis and other (mis)uses of baseline data in
clinical trials. Lancet, 355: 1064, 2000
Greenhalgh, T.: How to read a paper. Statistics for the nonstatistician. I: Different types of data need different statistical tests BMJ, 315: 364, 1997
Mullner, M., Matthews, H. and Altman, D. G.: Reporting on
statistical methods to adjust for confounding: a cross-sectional survey. Ann Intern Med, 136: 122, 2002
Clark, T. G., Bradburn, M. J., Love, S. B. and Altman, D.G.:
Survival analysis part IV: further concepts and methods in
survival analysis. Br J Cancer, 89: 781, 2003
Katz, M. H.: Multivariate Analysis: A Practical Guide for Clinicians. Cambridge, United Kingdom: Cambridge University Press, p. 192, 1999
Concato, J., Feinstein, A. R. and Holford, T. R.: The risk of
determining risk with multivariable models. Ann Intern
Med, 118: 201, 1993