ICRO Course
Statistics Refresher
DR. INDRANIL MALLICK
TATA MEDICAL CENTER, KOLKATA
NOV 2016, BANGALORE
What are we discussing today?
Not everything about medical statistics!
A few important concepts:
A) Population and Sampling
B) Hypothesis testing, the p-value and ‘significance’
C) Choosing a statistical test
D) Survival analysis
A free question round… ask me what you want.
Populations and samples
Answer these questions
1. What is the mean age of breast cancer patients in India?
2. What is the mean dose of radiotherapy to the left parotid gland in
the head and neck cancer patients treated in your hospital in the last 20
years?
3. What proportion of breast cancer patients in India are Her-2-neu
positive?
4. What is the mean ADC value in DW-MRI of cervical cancer
primaries?
We may not have data on all patients
Populations and samples
Population – entire group of individuals in
whom we are interested
Sample – a smaller group of individuals who
are representative of the population. We often
calculate summary measures from a sample to
draw conclusions about the population
Sampling error – an error introduced in the
‘point estimate’ or sample statistic by using a
sample instead of a population
The error can be kept small by taking a
random sample, or a representative sample
Estimating the population from the sample
We can estimate several measures of the population from a
sample
◦ Mean or Proportion
◦ Standard deviation
The value for a sample is never identical to the population
How do we know how precisely we have estimated?
If we can estimate the sampling error, then we can estimate
where the actual population summary measures could lie.
◦ Standard error of mean
◦ Standard error of proportion
Standard errors
Standard error of mean
◦ Assumption: the sample, whether large or small, follows a normal distribution; the
means of several samples then also follow a normal distribution, and the
standard deviation of those means gives an estimate of the standard error
◦ $\mathrm{SEM} = \sigma/\sqrt{n}$, estimated from the sample as $s/\sqrt{n}$ ($\sigma$ = population SD, $s$ = sample SD, $n$ = sample size)
◦ Factors affecting the standard error: it gets larger when the SD increases or the sample size decreases
Standard error of proportion
◦ $\mathrm{SE}(p) = \sqrt{p(1-p)/n}$
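The two standard errors can be computed in a few lines. This is a minimal sketch using only the Python standard library; the function names and the age values are made up for illustration.

```python
import math
import statistics

def standard_error_of_mean(values):
    """SEM = s / sqrt(n), where s is the sample standard deviation."""
    s = statistics.stdev(values)  # sample SD (n - 1 in the denominator)
    return s / math.sqrt(len(values))

def standard_error_of_proportion(p, n):
    """SE(p) = sqrt(p * (1 - p) / n) for a sample proportion p."""
    return math.sqrt(p * (1 - p) / n)

ages = [42, 55, 61, 48, 50, 67, 39, 58]  # hypothetical ages in years
print("SEM of age:", standard_error_of_mean(ages))
print("SE of a 25% proportion, n=100:", standard_error_of_proportion(0.25, 100))
```

Quadrupling the sample size halves either standard error, because n sits under a square root.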
Standard deviation vs. Standard Error
Standard deviation (SD)
◦ variability of data values in a dataset
◦ we are not concerned with estimating anything about a larger population
Standard error (SE)
◦ precision of mean
◦ Concerned about estimation of the mean/proportion of a larger population
Confidence intervals
95% Confidence intervals of the mean
◦ = sample mean +/- 1.96*SEM
95% confidence intervals of the proportion
◦ Sample proportion +/- 1.96*SE(p)
Interpretation:
◦ How wide is it?
◦ Is it clinically important?
◦ Does it contain the hypothesized value?
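As a rough illustration of the two formulas above, the sketch below builds 95% intervals from summary statistics. The function names and the example numbers are hypothetical, and the 1.96 multiplier assumes a reasonably large sample.

```python
import math

Z_95 = 1.96  # multiplier used on the slide for a 95% interval

def ci_mean(sample_mean, sd, n, z=Z_95):
    """Sample mean +/- z * SEM."""
    sem = sd / math.sqrt(n)
    return sample_mean - z * sem, sample_mean + z * sem

def ci_proportion(p, n, z=Z_95):
    """Sample proportion +/- z * SE(p), clipped to the range 0 to 1."""
    se = math.sqrt(p * (1 - p) / n)
    return max(0.0, p - z * se), min(1.0, p + z * se)

# Hypothetical numbers: mean age 52 years (SD 10) in 100 patients,
# and 20% of 400 patients positive for some marker.
print(ci_mean(52, 10, 100))
print(ci_proportion(0.20, 400))
```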
Confidence intervals
A tricky concept!
When we try to estimate a population parameter (e.g. mean, proportion,
difference between values) from a sample estimate, the estimate is often
described with a ‘confidence interval’.
The mean weight of a random sample of 100 students from a group of
10,000 is 34 kg (95% CI = 31 to 38 kg)
95% of the students' weights will be between 31 and 38 kg
There is 95% probability that the true mean weight of the whole group is
between 31 and 38 kg
If we take many such samples and calculate interval estimates, then 95%
of interval estimates will contain the true mean of the population.
Confidence intervals are a comment on the sampling method – not the
‘probability’ of the true mean
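The repeated-sampling interpretation can be checked with a small simulation. This is a sketch under assumed values (a normally distributed population with true mean 34 and SD 10, samples of 100, a fixed random seed); none of these numbers come from the slides except the mean of 34.

```python
import math
import random

def coverage(true_mean=34.0, sd=10.0, n=100, repeats=2000, seed=1):
    """Fraction of 95% intervals, one per simulated sample, that contain the true mean."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(repeats):
        sample = [rng.gauss(true_mean, sd) for _ in range(n)]
        mean = sum(sample) / n
        s = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
        half_width = 1.96 * s / math.sqrt(n)
        if mean - half_width <= true_mean <= mean + half_width:
            hits += 1
    return hits / repeats

print("Share of intervals containing the true mean:", coverage())
```

The printed share should be close to 0.95: the 95% describes how often the procedure captures the true mean, not the chance that any single interval has done so.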
Probability and Hypothesis testing
Probability
Basically: How likely is an event?
Properties of probability:
◦Lies between 0 and 1
◦When an outcome can never happen, p=0
◦When an outcome must happen, p=1
Hypothesis testing
What is a hypothesis?
A theory (based on observations/ expectations)
Expressing the hypothesis
Null hypothesis (H0)= assumes no effect in the population (difference in
means = 0)
◦ Exercising does not change blood cholesterol levels
Alternative hypothesis (H1) = when the null hypothesis is false (usually
what we are trying to investigate)
◦ Exercising changes blood cholesterol levels
◦ ‘one tailed’ vs ‘two tailed’ testing
Steps of hypothesis testing
Define the null and alternative hypothesis
Obtain the values from the control and test populations/situations
Calculate the test statistic
Compare the test statistic to a table of known probability distributions
Obtain and interpret the p values
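The steps above can be sketched for the simplest case, a comparison of two means. This example assumes large samples so that a z test is acceptable (a t test would be the usual choice for small samples), and the cholesterol figures are invented for illustration.

```python
import math
from statistics import NormalDist

def two_sample_z_test(mean1, sd1, n1, mean2, sd2, n2):
    """Large-sample z test of H0: the two population means are equal."""
    # Test statistic: observed difference divided by its standard error
    se_diff = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    z = (mean1 - mean2) / se_diff
    # Compare with the standard normal distribution (two-tailed p value)
    p_two_tailed = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_two_tailed

# Hypothetical data: cholesterol (mg/dL) in exercising vs non-exercising groups
z, p = two_sample_z_test(190, 30, 100, 200, 30, 100)
print("z =", round(z, 3), "p =", round(p, 4))
```

`NormalDist().cdf` plays the role of the printed probability table mentioned in the steps.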
The p value
The mean duration of hospital stay for open vs. robot assisted
radical prostatectomy was 12 days vs 9 days (p=0.03)
The incidence of grade 3/4 haematological toxicity was
◦ 15% with RT alone vs. 20% with RT + cetuximab (p=0.002)
◦ 16% with RT alone vs. 22% with RT + cisplatin (p=0.02)
The median survival with RT alone was 23 months vs 30
months with chemoradiotherapy (p=0.09)
The p value
The probability of obtaining these results, or something more
extreme, if the null hypothesis is true
Does not quantify difference
Is not the same as the probability of the null hypothesis being true.
The null hypothesis is either true or false; based on the test we either reject it or do not reject it.
Using the p value to derive conclusions
Can we reject the null hypothesis?
An arbitrary cut-off of 5% (0.05) is often used.
◦ If p <0.05 – the probability of getting this result/difference if the null
hypothesis is true is <5% - we reject H0 (at a significance level of 5%)
◦ If p>0.05 – ‘we cannot reject the null hypothesis’ (not the same as there is no
difference)
The cut-off value can be changed (made stricter) by using 1%
(p<0.01) or 0.1% (p<0.001)
◦ If the implications of rejecting the null hypothesis are very severe
◦ Multiple comparisons are being made
◦ Must be decided before the data are collected
What does a non-significant p value mean?
The median survival with RT alone was 23 months vs 30
months with chemoradiotherapy (p=0.09)
It does not mean that the two groups being studied are the same (or
that there is no difference)
It simply means that from the results obtained – we cannot conclude
that there is a difference (we cannot reject the null hypothesis)
One common reason: inadequate sample size (low power)
Are we slaves to the p value?
Using confidence intervals when comparing groups
Hypothesis test - make a decision and provide an exact p-value.
Confidence interval –
◦ quantifies the effect of interest (e.g. the difference in means)
◦ enables us to assess the clinical implications of the results.
Provides a range of plausible values for the true effect - can also be
used to make a decision about the null hypothesis even though the exact P-value is not provided.
◦ For example, if the hypothesized value for the effect (e.g. zero) lies outside
the 95% confidence interval then we believe the hypothesized value is
implausible and would reject H0. In this instance, we know that the P-value
is less than 0.05 but do not know its exact value
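The link between the interval and the decision can be shown with a short sketch. It assumes large samples (so 1.96 is an acceptable multiplier); the helper names and the numbers are hypothetical.

```python
import math

def ci_difference_in_means(mean1, sd1, n1, mean2, sd2, n2, z=1.96):
    """95% confidence interval for (mean1 - mean2), large-sample approximation."""
    se_diff = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    diff = mean1 - mean2
    return diff - z * se_diff, diff + z * se_diff

def excludes(interval, hypothesized=0.0):
    """True if the hypothesized value lies outside the interval (reject H0 at 5%)."""
    low, high = interval
    return hypothesized < low or hypothesized > high

large = ci_difference_in_means(190, 30, 100, 200, 30, 100)
small = ci_difference_in_means(190, 30, 50, 200, 30, 50)
print(large, "excludes zero:", excludes(large))
print(small, "excludes zero:", excludes(small))
```

Both hypothetical studies observe the same 10-unit difference, but only the larger one yields an interval that excludes zero.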
Statistical significance vs clinical significance
Association is not causation
Errors in hypothesis testing
Errors in hypothesis testing
Null hypothesis is true (no difference in groups)
◦ Null hypothesis accepted (non-significant p value): correct interpretation
◦ Null hypothesis rejected (significant p value): Type I error (α), a false positive result
Null hypothesis is false (there is actually a difference)
◦ Null hypothesis accepted (non-significant p value): Type II error (β), a false negative result
◦ Null hypothesis rejected (significant p value): correct interpretation, Power (1-β)
Acceptable error (commonly used)
Type I = 5% or 0.05
Type II = 20% or 0.2
Power = ‘ability to detect a difference if there is one’
Factors affecting power
The sample size
Larger sample = higher power
The variability
Larger SD = less power
The effect size
Larger effect size = higher power
The significance level
Stricter level (e.g. p<0.01 instead of p<0.05) = less power
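How each factor moves the power can be seen with a small calculation. This is a sketch for a two-sided comparison of two means with equal group sizes, using a normal approximation that ignores the negligible chance of rejecting in the wrong direction; the function name and all numbers are assumptions for illustration.

```python
import math
from statistics import NormalDist

def power_two_means(effect, sd, n_per_group, alpha=0.05):
    """Approximate power to detect a true difference `effect` between two means."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)          # critical value for the chosen alpha
    se_diff = sd * math.sqrt(2 / n_per_group)   # SE of the difference in means
    return nd.cdf(abs(effect) / se_diff - z_crit)

base = power_two_means(effect=10, sd=30, n_per_group=100)
print("baseline:            ", round(base, 3))
print("larger sample (200): ", round(power_two_means(10, 30, 200), 3))
print("larger SD (40):      ", round(power_two_means(10, 40, 100), 3))
print("larger effect (15):  ", round(power_two_means(15, 30, 100), 3))
print("stricter alpha 0.01: ", round(power_two_means(10, 30, 100, alpha=0.01), 3))
```

A larger sample or a larger effect raises the power, while a larger SD or a stricter significance level lowers it.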
Principle of sample size calculations
Calculation of an appropriate sample size in studies is crucial
The methodology used depends on the type of
estimation/comparison
Example:
◦ Difference of means between two groups:
◦ Expected means in the two groups (includes effect size)
◦ Standard deviation
◦ Type I error and Power
Online calculator – example
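A calculator for the difference-of-means case typically implements something like the following. This sketch uses the standard normal-approximation formula n = 2 × ((z for α + z for power) × SD / difference)² per group, assuming equal group sizes, a common SD and a two-sided test; the function name and example values are hypothetical, and real calculators may add corrections.

```python
import math
from statistics import NormalDist

def sample_size_two_means(effect, sd, alpha=0.05, power=0.80):
    """Patients needed per group to detect `effect` with the given alpha and power."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)  # two-sided type I error
    z_power = nd.inv_cdf(power)          # power = 1 - type II error
    n = 2 * ((z_alpha + z_power) * sd / effect) ** 2
    return math.ceil(n)                  # round up to whole patients

# Hypothetical: detect a 10-unit difference when the SD is 30
print(sample_size_two_means(effect=10, sd=30))
```

Halving the expected difference roughly quadruples the required sample size, which is why the choice of a clinically important effect size matters so much.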
Equivalence and non-inferiority trials
When?
◦ New treatment is less toxic/simpler/less
expensive
◦ Bio-equivalence studies of drugs
What's the difference: The traditional null
and alternative hypotheses do not hold.
(their roles are essentially reversed)
The significance test or p value is of
limited use
The confidence intervals are important
Clinically important effect size is
important
Multiple comparisons – what is wrong?
Examples
Very common in research
Subgroup testing
Multiple comparisons – between 2+ groups, different timepoints
Multiple outcome variables
Interim analyses
Greatly increases the chance of false positive results
If =0.05, then the rate of acceptable false positive results is 5%. If we make
multiple comparisons then the false positive rate is much higher – > 60% for
20 comparisons
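The figure for 20 comparisons follows from a one-line calculation. The sketch below assumes the comparisons are independent, which real subgroup and interim analyses often are not; the function name is made up.

```python
def family_wise_error_rate(alpha, comparisons):
    """Chance of at least one false positive across independent comparisons."""
    return 1 - (1 - alpha) ** comparisons

for k in (1, 5, 10, 20):
    print(k, "comparisons:", round(family_wise_error_rate(0.05, k), 3))
```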
Multiple comparisons - solutions
Define a stricter type I error (α) threshold e.g. 0.01
Correct the p value obtained by multiplying it with the number of
comparisons carried out (Bonferroni correction)
◦ E.g. p value obtained = 0.02, no of comparisons made = 6, corrected p value
= 0.02 x 6 = 0.12
Plan a subgroup analysis a priori and make sure that it is
adequately powered to detect a significant difference.
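The Bonferroni correction described above is easy to apply by hand or in code. In this sketch the first p value (0.02) is the one from the example; the other five are invented to make up the six comparisons.

```python
def bonferroni(p_values):
    """Multiply each p value by the number of comparisons, capping at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# 0.02 is the p value from the example; the rest are hypothetical
raw = [0.02, 0.50, 0.30, 0.04, 0.20, 0.60]
print(bonferroni(raw))
```

The corrected value for 0.02 is 0.12, so it no longer counts as significant at the 5% level. The cap at 1 is needed because a probability cannot exceed 1.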
Choosing the right statistical test
Tests for comparison of two or more groups
What kind of data is it?
◦ Numerical or categorical
◦ Numerical: likely to be normally distributed or very skewed?
◦ Categorical: are the categories nominal or ordinal?
Who is being compared?
◦ Same group at different times/circumstances
◦ Different groups
◦ How many groups?
What type of data is this?
◦ Age
◦ Sex
◦ Tumor size (maximum dimension)
◦ N stage
◦ Type of treatment: Surgery vs. surgery + RT
◦ Moderate dose RT (60Gy) vs High dose RT (70Gy)
◦ Severity of reactions: Grade 1, Grade 2, Grade 3, Grade 4
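Numerical data
1 group (1 sample)
◦ Sign test
2 groups, paired
◦ Paired t test (normally distributed data)
◦ Wilcoxon signed rank test (skewed data)
2 groups, unpaired
◦ Unpaired t test (normally distributed data)
◦ Mann-Whitney U test (skewed data)
>2 groups, unpaired
◦ ANOVA (normally distributed data)
◦ Kruskal-Wallis test (skewed data)
Categorical - Mann-Whitney U test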
Categorical data
Categorical – 2 categories
1 group (1 sample)
◦ Z-test for proportion
◦ Sign test
2 groups, paired
◦ McNemar test
2 groups, unpaired
◦ Chi-square test
◦ Fisher exact test
◦ Mann-Whitney U test
>2 groups, unpaired
◦ Chi-square test
◦ Chi-square test for trend
Chi-square test - proportions
◦2 x 2 or r x c contingency table
◦Observed and expected values
◦Chi-square value
◦Degrees of freedom
◦Chi-square table
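The items above map onto a short calculation. This is a minimal sketch without a continuity correction; the function name and the 2 x 2 counts (toxicity yes/no in two treatment arms) are hypothetical.

```python
def chi_square(observed):
    """Chi-square statistic, degrees of freedom and expected counts for an r x c table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    # Expected count in each cell = row total * column total / grand total
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof, expected

# Hypothetical counts: rows = treatment arm, columns = toxicity yes / no
table = [[30, 70],
         [45, 55]]
stat, dof, expected = chi_square(table)
print("chi-square =", stat, "degrees of freedom =", dof)
print("expected counts:", expected)
```

With 1 degree of freedom the 5% critical value in a chi-square table is 3.84, so a statistic of 4.8 would correspond to p<0.05. All expected counts here are well above 5, so the Fisher exact test is not needed.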
Chi-square test – important variants
Chi-square test for trend (ordinal values)
McNemar test (paired value)
Fisher exact test = when the expected value in a cell of
the 2 x 2 contingency table is less than 5
Correlation and Regression
Correlation: Measures the degree of association between two
variables (x and y).
Regression: Measures how one variable (x) is affected by one or
more other variables (y). Tries to predict what x will be based on
the value of y (creates an equation)
Correlation
Pearson’s correlation – 2
numerical variables
◦ Assumptions: Paired,
Linearity, outliers, bivariate
normality
Spearman’s correlation –
numerical or ordinal
variables
◦ Assumptions: Monotonic,
paired
Interpreting correlation
Coefficient: -1 to +1. Zero means no correlation.
Positive and negative correlation
r increases when the range of measurements of x and y increases
r² = what percentage of the variability in y is explained by its
linear relationship with x
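Pearson's coefficient and r² can be computed directly from paired measurements. This is a sketch with made-up data; the function name is an assumption.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient for paired numerical values."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical paired measurements
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = pearson_r(x, y)
print("r =", round(r, 3), "r squared =", round(r ** 2, 3))
```

Here r is about 0.77 and r² is 0.6, meaning 60% of the variability in y is explained by its linear relationship with x.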
Linear Regression
Simple linear regression: predicting x from y
Multiple linear regression: predicting x from y1, y2, y3, y4
Logistic regression
Logistic regression: when x is a dichotomous variable (yes/no)
Output is in the form of odds ratios with confidence intervals for
each variable
Survival analysis
Summarizing and comparing survival
How do we summarize and compare survival (in months) between
two groups of patients, randomized to radiotherapy vs.
chemoradiotherapy for advanced cervical cancer?
a) Paired T test
b) Unpaired T test
c) Wilcoxon signed rank test
d) Mann-Whitney U test
Time-to-event data
Described by
• Event
• The time of the event
• Not all patients have had the event
at the time of analysis
Understanding censoring
Censoring can occur by virtue of patients lost to follow up or end of study.
The event in question could have happened to the individual after the last follow-up or end of study.
[Figure: follow up and outcomes for each patient]
The Kaplan-Meier Curve
• Cumulative survival
probability plotted over time
• Step ladder pattern
• Step occurs at each event
time point
• The size of the step is
determined by the number of
events and the number of
cases at risk at that time
point
Can you read this curve?
Tabulating and Plotting a KM survival curve
Survival starts at 1 (or 100%)
Note each time-point (ti where i=1 to n)
At each time-point ti note how many events (di) and number of
individuals at risk (ni)
The probability of surviving past each event time-point is 1-di/ni
The cumulative survival probability is the running product of these
probabilities.
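The tabulation can be sketched in code before plotting anything. The follow-up times and event flags below are invented; patients censored at the same time as an event are counted as still at risk at that time, which is the usual convention.

```python
def kaplan_meier(times, events):
    """Return (time, at risk, events, cumulative survival) at each event time.

    times  : follow-up time for each patient
    events : 1 if the event occurred at that time, 0 if the patient was censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0  # survival starts at 1 (100%)
    table = []
    i = 0
    while i < len(data):
        t = data[i][0]
        d = 0        # events at this time-point
        leaving = 0  # everyone whose follow-up ends at this time-point
        while i < len(data) and data[i][0] == t:
            d += data[i][1]
            leaving += 1
            i += 1
        if d > 0:
            survival *= 1 - d / at_risk  # running product of (1 - di/ni)
            table.append((t, at_risk, d, survival))
        at_risk -= leaving
    return table

# Hypothetical follow-up in months for 8 patients
months = [2, 3, 3, 5, 8, 9, 12, 12]
event = [1, 1, 0, 1, 0, 1, 1, 0]
for row in kaplan_meier(months, event):
    print(row)
```

Each printed row is one step of the curve; censored patients produce no step but reduce the number at risk for later steps.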
Let’s plot a survival curve!
Comparing time-to-event data
Testing one factor: Log rank test (univariate
analysis)
◦ Prospective comparative study – treatment
◦ Retrospective study – treatment, prognostic factor
Testing many factors simultaneously: Cox regression
(multivariate analysis)
◦ Cox proportional hazards model
◦ Tests if a factor affects the rate of the event occurring after the influence of
other factors has been accounted for
◦ Shows how much more likely an event is based on the change of a factor
Hazard ratio
Represents the increased risk of an event happening if the
independent factor changes
For categorical factors
For numerical factors
Your questions now…
Educational activities at TMC Kolkata
FRCR (Clinical Oncology)
◦ Examinations – Part 1 and 2a – Spring and Autumn
◦ Part 1 course (with Christie Hospital, Manchester) – Jan 2017
◦ Part 2a course (with Leeds Oncology Center) – Dec 2016
www.igrtonline.com – 1100 participants from 53 countries certification
2 year Clinical Oncology Fellowship – modelled on the
FRCR requirements