Primer on Statistics
for Interventional
Cardiologists
Giuseppe Sangiorgi, MD
Pierfrancesco Agostoni, MD
Giuseppe Biondi-Zoccai, MD
What you will learn
• Introduction
• Basics
• Descriptive statistics
• Probability distributions
• Inferential statistics
• Finding differences in mean between two groups
• Finding differences in mean between more than 2 groups
• Linear regression and correlation for bivariate analysis
• Analysis of categorical data (contingency tables)
• Analysis of time-to-event data (survival analysis)
• Advanced statistics at a glance
• Conclusions and take home messages
What you will learn
• Descriptive statistics
– frequency distributions
– contingency tables
– measures of location: mean, median, mode
– measures of dispersion: variance, standard deviation, range, interquartile range
– coefficient of variation
– graphical presentation: histogram, box-plot, scatter plot
– correlation
Counting and displaying data
Cardiology
After we have collected our data, we need to display
them (tables, graphics and figures)
Raw enumeration (eg lesion length by visual estimation in
patients treated in Endeavor II trial: 14-27 mm)
15 14 15 18 24 23 17 16 23 14
26 15 16 15 17 25 27 19 14 23
25 18 16 14 15 18 24 19 18 15
14 19 25 24 17 15 18 24 20 18
15 26 21 14 18 21 15 18 24 23
20 21 15 18 24 18 15 16 17 22
23 14 21 15 18 24 20 18 15 24
25 15 20 19 15 18 24 24 16 25
17 14 23 16 17 15 18 24 25 15
23 17 15 24 18 20 16 18 24 26
…
Tabular display
Example: DELAYED RRISC, JACC 2007
Types of variables
Variables
– CATEGORY
   – nominal
   – ordinal: ordered categories, ranks
– QUANTITY
   – discrete: counting
   – continuous: measuring
Counting and displaying data
Create a database!

Variable type                Nominal     Ordinal         Continuous
Patient ID                   Diabetes    AHA/ACC Type    Lesion Length
1                            Y           A               18
2                            N           B1              24
3                            N           A               17
4                            N           C               25
5                            Y           B2              23
6                            N           A               15
7                            N           A               16
8                            Y           B2              18
9                            N           B1              21
10                           Y           B2              19
11                           N           B1              14
12                           Y           C               22
13                           N           C               27
Frequency distribution
A frequency distribution is a list of the values that a variable takes in a sample. It is usually a list, ordered by quantity, showing the number of times each value appears

Diabetes (n=13)
Yes    5    38.5%
No     8    61.5%

AHA/ACC Type (n=13)
A      4    30.8%
B1     3    23.1%
B2     3    23.1%
C      3    23.1%

This introduces the concept of percentage or rate
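The counts and percentages above can be reproduced with a few lines of Python. This is only a minimal sketch: the lists are typed in from the 13-patient example database, and the names `diabetes`, `lesion_type` and `frequency_table` are illustrative choices, not part of the original slides.

```python
from collections import Counter

# Values typed in from the 13-patient example database (patients 1 to 13).
diabetes = ["Y", "N", "N", "N", "Y", "N", "N", "Y", "N", "Y", "N", "Y", "N"]
lesion_type = ["A", "B1", "A", "C", "B2", "A", "A", "B2", "B1", "B2", "B1", "C", "C"]


def frequency_table(values):
    """Return {value: (count, percent)}, percent rounded to one decimal."""
    counts = Counter(values)
    n = len(values)
    return {v: (c, round(100 * c / n, 1)) for v, c in sorted(counts.items())}


diabetes_freq = frequency_table(diabetes)
type_freq = frequency_table(lesion_type)

for label, table in (("Diabetes", diabetes_freq), ("AHA/ACC type", type_freq)):
    print(f"{label} (n={len(diabetes)})")
    for value, (count, pct) in table.items():
        print(f"  {value:<3}{count:>3}{pct:>7.1f}%")
```

Rounding to one decimal gives 30.8% for type A (4 of 13).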
Frequency distribution
ENDEAVOR III, JACC 2006
Frequency distribution
This simple tabulation has drawbacks. When a variable can take continuous values instead of discrete values or when the number of possible values is too large, the table construction is cumbersome, if not impossible

Lesion length (n=13)
14    1     7.7%
15    1     7.7%
16    1     7.7%
17    1     7.7%
18    2    15.4%
19    1     7.7%
21    1     7.7%
22    1     7.7%
23    1     7.7%
24    1     7.7%
25    1     7.7%
27    1     7.7%
Frequency distribution
A slightly different tabulation scheme based on the range of values can be a solution in such cases

Lesion length (n=13)
14-20 mm    7    53.8%
21-27 mm    6    46.2%

However, better solutions are coming later…
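Grouping a continuous variable into ranges can be sketched as follows. The two inclusive ranges are the ones shown in the table; the function name `grouped_frequencies` and the variable names are assumptions made for this illustration.

```python
# Lesion lengths (mm) of patients 1 to 13 from the example database.
lesion_length = [18, 24, 17, 25, 23, 15, 16, 18, 21, 19, 14, 22, 27]

# Inclusive ranges used in the table above.
ranges = [(14, 20), (21, 27)]


def grouped_frequencies(values, ranges):
    """Count how many values fall in each inclusive (low, high) range."""
    n = len(values)
    rows = []
    for low, high in ranges:
        count = sum(low <= v <= high for v in values)
        rows.append((f"{low}-{high} mm", count, round(100 * count / n, 1)))
    return rows


grouped = grouped_frequencies(lesion_length, ranges)
for label, count, pct in grouped:
    print(f"{label:<10}{count:>3}{pct:>7.1f}%")
```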
Counting and displaying data
Contingency tables are used to record and analyse the relationship between two (or more) variables, most usually categorical variables

                 AHA/ACC type
DIABETES     A     B1    B2    C     Total
no           3     3     0     2     8
yes          1     0     3     1     5
Total        4     3     3     3     13
Counting and displaying data
The same contingency table with row percentages (% within DIABETES)

                                     AHA/ACC type
DIABETES                         A        B1       B2       C        Total
no       Count                   3        3        0        2        8
         % within DIABETES       37.5%    37.5%    0.0%     25.0%    100.0%
yes      Count                   1        0        3        1        5
         % within DIABETES       20.0%    0.0%     60.0%    20.0%    100.0%
Total    Count                   4        3        3        3        13
         % within DIABETES       30.8%    23.1%    23.1%    23.1%    100.0%

Is there a difference between diabetics and non-diabetics in the rate of AHA/ACC type lesions?
The answer will follow…
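A cross-tabulation with row percentages can be built by counting (diabetes, lesion type) pairs. The sketch below uses only the standard library and the 13-patient example database; `pair_counts` and `row` are hypothetical helper names, and no statistical test is performed here.

```python
from collections import Counter

# Values typed in from the 13-patient example database (patients 1 to 13).
diabetes = ["Y", "N", "N", "N", "Y", "N", "N", "Y", "N", "Y", "N", "Y", "N"]
lesion_type = ["A", "B1", "A", "C", "B2", "A", "A", "B2", "B1", "B2", "B1", "C", "C"]
types = ["A", "B1", "B2", "C"]

# One count per (diabetes status, lesion type) cell of the table.
pair_counts = Counter(zip(diabetes, lesion_type))


def row(status):
    """Return (cell counts, row total, row percentages) for one diabetes status."""
    counts = [pair_counts[(status, t)] for t in types]
    total = sum(counts)
    percents = [round(100 * c / total, 1) for c in counts]
    return counts, total, percents


for status, label in (("N", "no"), ("Y", "yes")):
    counts, total, percents = row(status)
    cells = "  ".join(f"{c} ({p}%)" for c, p in zip(counts, percents))
    print(f"{label:<4}{cells}  total={total}")
```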
Measures of central tendency: rationale
We need to describe the kind of values that we have
(eg lesion length by visual estimation in patients treated in Endeavor
II trial: 14-27 mm)
Mean (arithmetic)
$\bar{x} = \frac{\sum x}{N}$
Characteristics:
- summarises information well
- discards a lot of information (dispersion??)
Assumptions:
- data are not skewed
  – skew distorts the mean
  – outliers make the mean very different
- measured on a measurement scale
  – cannot find the mean of a categorical measure
  – ‘average’ stent diameter may be meaningless
Mean (arithmetic)
$\bar{x} = \frac{\sum x}{N} = \frac{14+15+16+17+18+18+19+21+22+23+24+25+27}{13} = \frac{259}{13}$
Mean = 19.92
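The same arithmetic can be checked in Python. This is a minimal sketch using the 13 lesion lengths of the example database; the variable names are illustrative.

```python
from statistics import mean

# Lesion lengths (mm) of patients 1 to 13 from the example database.
lesion_length = [18, 24, 17, 25, 23, 15, 16, 18, 21, 19, 14, 22, 27]

total = sum(lesion_length)   # numerator: sum of all values
n = len(lesion_length)       # denominator: number of observations
mean_length = total / n

print(f"sum = {total}, N = {n}, mean = {mean_length:.2f}")

# The standard library gives the same result.
assert abs(mean(lesion_length) - mean_length) < 1e-9
```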
Mean (arithmetic)
TAPAS, Lancet 2008
Median
What is it?
– The one in the middle
– Place values in order
– Median is central
Definition:
– The value that splits the ordered data in half: as many values lie below it as above it
Used for:
– Ordinal data
– Skewed data / outliers
Median
Variable type: Continuous

Original order                   Sorted by lesion length
Patient ID    Lesion Length      Patient ID    Lesion Length
1             18                 11            14
2             24                 6             15
3             17                 7             16
4             25                 3             17
5             23                 1             18
6             15                 8             18
7             16                 10            19    ← median
8             18                 9             21
9             21                 12            22
10            19                 5             23
11            14                 2             24
12            22                 4             25
13            27                 13            27

Median = 19 (the 7th of the 13 ordered values)
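The "place values in order, take the middle one" recipe can be sketched as a small function. The name `median_of` is an illustrative choice; for an even number of values the sketch averages the two central ones, which is the usual convention and an addition to what the slides show.

```python
# Lesion lengths (mm) of patients 1 to 13 from the example database.
lesion_length = [18, 24, 17, 25, 23, 15, 16, 18, 21, 19, 14, 22, 27]


def median_of(values):
    """Sort the values and return the central one (or the mean of the two central ones)."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


print(sorted(lesion_length))
print("median =", median_of(lesion_length))
```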
Mode
What is it?
Definition:
– The most common value
Used (rarely) for:
– Discrete non-interval data
– E.g. stent length, stent diameter…
– MicroDriver is only available in 2.25, 2.50, 2.75 mm: reporting the mean is meaningless
Mode

Lesion length (n=13)
14    1     7.7%
15    1     7.7%
16    1     7.7%
17    1     7.7%
18    2    15.4%
19    1     7.7%
21    1     7.7%
22    1     7.7%
23    1     7.7%
24    1     7.7%
25    1     7.7%
27    1     7.7%

Mode = 18 (the only value that appears twice)
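Finding the most common value amounts to counting occurrences and taking the largest count. A minimal sketch with the example lesion lengths follows; the variable names are illustrative.

```python
from collections import Counter

# Lesion lengths (mm) of patients 1 to 13 from the example database.
lesion_length = [18, 24, 17, 25, 23, 15, 16, 18, 21, 19, 14, 22, 27]

counts = Counter(lesion_length)
# most_common(1) returns a list with the single (value, count) pair of highest count.
mode_value, mode_count = counts.most_common(1)[0]

print(f"mode = {mode_value} (appears {mode_count} times)")
```

In this sample only one value is repeated, so the mode is unique; with ties, `most_common(1)` would report just one of the tied values.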
Comparing Measures of central tendency
Mean is usually best
– If it works
– Useful properties (with standard deviation [SD])
– But…

Lesion length    Driver    Endeavor
                 17        21
                 19        21
                 19        21
                 17        21
                 18        6
Mean             18        18
Median           18        21
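The "But…" can be checked numerically: a single outlier pulls the mean but leaves the median almost untouched. The sketch below assumes the five values per group as read from the table above (Driver: 17, 19, 19, 17, 18; Endeavor: 21, 21, 21, 21, 6).

```python
from statistics import mean, median

# Lesion lengths per group, as read from the table above (assumed grouping).
driver = [17, 19, 19, 17, 18]
endeavor = [21, 21, 21, 21, 6]   # the 6 is the outlier

for name, values in (("Driver", driver), ("Endeavor", endeavor)):
    print(f"{name:<9} mean = {mean(values)}  median = {median(values)}")
```

Both groups share the same mean, yet four of the five Endeavor lesions are longer than every Driver lesion, which only the median reveals.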
Comparing Measures of central tendency
It also depends on the underlying distribution…
Symmetric? mean = median = mode
[Figure: symmetric distribution, frequency vs value]
Comparing Measures of central tendency
It also depends on the underlying distribution…
Asymmetric? mean ≠ median ≠ mode
[Figure: histogram of frequency (0 to 30) vs number of Endeavor implanted per patient (0 to 9), with mode, median and mean marked]
Median
Agostoni et al, AJC 2007
Measures of dispersion: rationale
Central tendency doesn’t tell us everything
– We need to know about the spread, or dispersion, of the scores

Group       Late loss (mm)
Endeavor    0.61
Driver      1.03

Is there a difference? And if yes, how big is it?
We can only tell if we know data dispersion
ENDEAVOR II, Circulation 2006
Measures of dispersion: examples
[Figure, three slides: frequency curves of late loss (axis 0 to 1.50) for Endeavor and Driver]
Shape of distribution
[Figure: frequency vs value] Gaussian, normal or “parametric” distribution
Departing from normality
[Figure: frequency vs value] Non-normal, right-skewed
Departing from normality
[Figure: frequency vs value] Non-normal, left-skewed
Departing from normality
[Figure: frequency vs value] Outliers
Measures of dispersion: types
• Standard deviation (SD)
– Used with mean
– Parametric tests
• Range
– First to last value
– Not commonly used
• Interquartile range
– Used with median
– 25th (1/4) to 75th (3/4) percentile
– Non-parametric tests
Standard deviation
Standard deviation (SD):
$SD = \sqrt{\frac{\sum (x - \bar{x})^2}{N-1}}$ (the quantity under the square root is the variance)
– approximates population σ as N increases
Advantages:
– with mean enables powerful synthesis
mean±1*SD 68% of data
mean±2*SD 95% of data (exactly ±1.96 SD)
mean±3*SD 99.7% of data (99% corresponds to ±2.58 SD)
Disadvantages:
– is based on normal assumptions
Standard deviation
Lesion Length (continuous), patients 1 to 13: 18, 24, 17, 25, 23, 15, 16, 18, 21, 19, 14, 22, 27
Mean = 19.92
$\text{Variance} = \frac{\sum (x - \bar{x})^2}{N-1} = \frac{(18-19.92)^2 + (24-19.92)^2 + (17-19.92)^2 + \dots + (27-19.92)^2}{12} = 16.58$
$SD = \sqrt{16.58} = 4.07$
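The variance and SD above can be recomputed step by step. This is a minimal sketch with the example lesion lengths; it uses the sample formula with N-1 in the denominator, as on the slide, and the variable names are illustrative.

```python
from math import sqrt
from statistics import stdev

# Lesion lengths (mm) of patients 1 to 13 from the example database.
lesion_length = [18, 24, 17, 25, 23, 15, 16, 18, 21, 19, 14, 22, 27]

n = len(lesion_length)
mean_length = sum(lesion_length) / n

# Squared distance of every value from the mean.
squared_deviations = [(x - mean_length) ** 2 for x in lesion_length]

sample_variance = sum(squared_deviations) / (n - 1)   # divide by N-1, not N
sample_sd = sqrt(sample_variance)

print(f"variance = {sample_variance:.2f}, SD = {sample_sd:.2f}")

# statistics.stdev also uses the N-1 denominator.
assert abs(stdev(lesion_length) - sample_sd) < 1e-9
```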
Mean ± Standard deviation
[Figure: normal curve with the area around the mean shaded]
mean ± 1 SD: 68% of data
mean ± 2 SD: 95% of data
mean ± 3 SD: over 99% of data (about 99.7%)
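For a normal distribution these percentages follow from the cumulative distribution function. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper names `coverage` and `multiplier` are assumptions made for this illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist()   # mean 0, SD 1


def coverage(k):
    """Fraction of a normal distribution lying within mean ± k SD."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


def multiplier(fraction):
    """Number of SDs around the mean needed to cover the given central fraction."""
    return standard_normal.inv_cdf(0.5 + fraction / 2)


for k in (1, 2, 3):
    print(f"mean ± {k} SD covers {100 * coverage(k):.1f}% of data")

for fraction in (0.95, 0.99):
    print(f"{100 * fraction:.0f}% of data lie within ± {multiplier(fraction):.2f} SD")
```

These figures hold only when the data are approximately normal.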
Standard deviation
TAPAS, Lancet 2008
Standard deviation
TAPAS, NEJM 2008
Why not mean ± SD?
TAPAS, NEJM 2008
Testing normality assumptions
Rules of thumb
1. Refer to previous data or analyses (eg landmark articles, large databases)
2. Inspect tables and graphs (eg outliers, histograms)
3. Check rough equality of mean, median, mode
4. Perform ad hoc statistical tests
• Levene’s test for equality of variances
• Kolmogorov-Smirnov test
• …
Range
First to last value
Lesion length (n=13): Range = 14 – 27 or Range = 13
Range
RRISC, JACC 2006
Interquartile range
25th to 75th percentile
or
1st to 3rd quartile

Variable type: Continuous
Patient ID    Lesion Length
11            14
6             15
7             16
                               ← 1st quartile = 16.5
3             17
1             18
8             18
10            19               ← Median
9             21
12            22
5             23
                               ← 3rd quartile = 23.5
2             24
4             25
13            27

Interquartile Range = 16.5 – 23.5
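Quartiles depend on the interpolation rule used by the software. As a sketch, the "exclusive" method of `statistics.quantiles` (Python 3.8+) places the p-th quantile at position p × (N+1) in the ordered data, which matches the 16.5 and 23.5 shown here; other packages or methods may give slightly different values.

```python
from statistics import quantiles

# Lesion lengths (mm) of patients 1 to 13 from the example database.
lesion_length = [18, 24, 17, 25, 23, 15, 16, 18, 21, 19, 14, 22, 27]

# n=4 asks for the three cut points that split the data into quarters.
q1, q2, q3 = quantiles(lesion_length, n=4, method="exclusive")
iqr_width = q3 - q1

print(f"Q1 = {q1}, median = {q2}, Q3 = {q3}")
print(f"interquartile range = {q1} to {q3} (width {iqr_width})")
```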
Interquartile range
Agostoni et al, AJC 2007
Lesion Length

Valid    Frequency    Percent    Valid Percent    Cumulative Percent
14.00    1            7.7        7.7              7.7
15.00    1            7.7        7.7              15.4
16.00    1            7.7        7.7              23.1
17.00    1            7.7        7.7              30.8
18.00    2            15.4       15.4             46.2
19.00    1            7.7        7.7              53.8
21.00    1            7.7        7.7              61.5
22.00    1            7.7        7.7              69.2
23.00    1            7.7        7.7              76.9
24.00    1            7.7        7.7              84.6
25.00    1            7.7        7.7              92.3
27.00    1            7.7        7.7              100.0
Total    13           100.0      100.0
Statistics
Lesion Length

N                Valid      13
                 Missing    0
Mean                        19.9231
Median                      19.0000
Mode                        18.00
Std. Deviation              4.07148
Range                       13.00
Minimum                     14.00
Maximum                     27.00
Percentiles      25         16.5000
                 50         19.0000
                 75         23.5000
Reporting data
If parametric: Mean and Standard Deviation
– Mean ± SD, eg Age (y): 63 ± 13
– Mean (SD), eg Age (y): 63 (13)
If non-parametric: Median and InterQuartile Range
– Median [IQR], eg NIH vol (mm³): 1.3 [0–13.1]
Mode and Range less commonly used
Coefficient of Variation
•The coefficient of variation (CV) is a normalized measure
of dispersion of a probability distribution. It is defined as
the ratio of the standard deviation to the mean
•This is only defined for non-zero mean, and is most useful
for variables that are always positive. The coefficient of
variation should only be computed for continuous data
•A given standard deviation indicates a high or low degree
of variability only in relation to the mean value
•It is easier to get an idea of variability in a distribution by
dividing the standard deviation by the mean
Coefficient of Variation
•Advantages
•The CV is a dimensionless number
•The CV is particularly useful when comparing dispersion
in datasets with markedly different means or different
units of measurement
•Distributions with CV<1 are considered low-variance,
while those with CV>1 are considered high-variance
•Disadvantages
•When the mean is near zero, the CV is sensitive to
small changes in the mean, limiting its usefulness
•Unlike the standard deviation, it cannot be used to
construct confidence intervals for the mean
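As a worked example, the CV of the 13 example lesion lengths can be computed from the SD and mean already obtained. This is a minimal sketch; the function name `coefficient_of_variation` is an illustrative choice, and it refuses a zero mean because the ratio is undefined there.

```python
from statistics import mean, stdev

# Lesion lengths (mm) of patients 1 to 13 from the example database.
lesion_length = [18, 24, 17, 25, 23, 15, 16, 18, 21, 19, 14, 22, 27]


def coefficient_of_variation(values):
    """Sample SD divided by the mean; undefined when the mean is zero."""
    m = mean(values)
    if m == 0:
        raise ValueError("CV is undefined for a zero mean")
    return stdev(values) / m


cv = coefficient_of_variation(lesion_length)
print(f"CV = {cv:.3f} ({100 * cv:.1f}%)")

# Changing the unit (mm to cm) leaves the CV unchanged: it is dimensionless.
cv_cm = coefficient_of_variation([x / 10 for x in lesion_length])
```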
Histograms
Very good for categorical variables
[Figure: bar chart of counts (axis 0 to 10) by DIABETES (1 = yes, 0 = no)]
ENDEAVOR II, Circulation 2006
Histograms
Not so good for continuous variables, but…
[Figure: two histograms of Lesion Length, one with narrow bars (axis ticks 1 to 27) and one grouped into the classes 0-6, 6-12, 12-18, 18-24 and 24-30]
Histograms
[Figure, panel A: histogram of late loss (mm, axis -0.8 to 3.2; frequency axis 0 to 50) for both restenotic and non-restenotic SES]
Agostoni et al, AJC 2007
Shape of distributions
[Figure, panel A: histogram of late loss (mm, axis -0.8 to 3.2; frequency axis 0 to 50) for non-restenotic SES, highlighting the shape of the distribution]
Agostoni et al, AJC 2007
Box (& whiskers) plots
[Figure: box plot of Lesion Length, axis 0 to 30]
Elements, from top to bottom:
– Upper whisker: Max (Q4) or Q3+1.5(IQR)
– Top of the box: Q3
– Line inside the box: Median (Q2)
– Bottom of the box: Q1
– Lower whisker: Min (Q0) or Q1-1.5(IQR)
– Height of the box: interquartile range
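The numbers behind a box plot can be computed without drawing anything. The sketch below assumes the common convention that whiskers stop at the most extreme values still inside Q1-1.5(IQR) and Q3+1.5(IQR), with anything beyond flagged as an outlier; quartiles use the "exclusive" method of `statistics.quantiles`, and all variable names are illustrative.

```python
from statistics import quantiles

# Lesion lengths (mm) of patients 1 to 13 from the example database.
lesion_length = [18, 24, 17, 25, 23, 15, 16, 18, 21, 19, 14, 22, 27]

ordered = sorted(lesion_length)
q1, med, q3 = quantiles(ordered, n=4, method="exclusive")
iqr_width = q3 - q1

# Fences: the furthest a whisker is allowed to reach.
lower_fence = q1 - 1.5 * iqr_width
upper_fence = q3 + 1.5 * iqr_width

inside = [x for x in ordered if lower_fence <= x <= upper_fence]
lower_whisker, upper_whisker = inside[0], inside[-1]
outliers = [x for x in ordered if x < lower_fence or x > upper_fence]

print(f"box: Q1 = {q1}, median = {med}, Q3 = {q3}")
print(f"whiskers: {lower_whisker} to {upper_whisker}, outliers: {outliers}")
```

Here no value falls outside the fences, so the whiskers simply reach the minimum and maximum.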
Box (& whiskers) plots
Margheri, Biondi Zoccai, et al, AJC 2008
Scatter plots
A scatter plot is a type of display using Cartesian
coordinates to display values for two variables for a set of
data. The data is displayed as a collection of points, each
having the value of one variable determining the position
on the horizontal axis and the value of the other variable
determining the position on the vertical axis
Usually it is done with 2 continuous variables to visually
assess the degree of correlation between them
But it can also be used with one categorical variable and
one continuous variable (mainly if sample size is small)
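The last case (one categorical and one continuous variable, small sample) can be sketched as a text strip plot: one row per diabetes group, one character per millimetre of lesion length. The data come from the 13-patient example database; the layout and the helper name `strip_row` are assumptions made for this illustration, not a reproduction of any figure in the slides.

```python
# Values typed in from the 13-patient example database (patients 1 to 13).
diabetes = ["Y", "N", "N", "N", "Y", "N", "N", "Y", "N", "Y", "N", "Y", "N"]
lesion_length = [18, 24, 17, 25, 23, 15, 16, 18, 21, 19, 14, 22, 27]

# Split the continuous variable by the categorical one.
groups = {"Y": [], "N": []}
for status, length in zip(diabetes, lesion_length):
    groups[status].append(length)


def strip_row(values, low=14, high=27):
    """One character per mm from low to high: a digit counts the points, '.' means none."""
    return "".join(str(values.count(v)) if v in values else "." for v in range(low, high + 1))


print("lesion length (mm), 14 to 27")
for status, label in (("Y", "diabetes yes"), ("N", "diabetes no ")):
    print(f"{label}  {strip_row(groups[status])}")
```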
Scatter plots
Abbate, Biondi Zoccai, et al, Circulation 2002
Scatter plots
Mintz, et al, AJC 2005
Scatter plots
Agostoni, et al, IJC 2007
Thank you for your attention
For any correspondence:
[email protected]
For further slides on these topics feel
free to visit the metcardio.org website:
http://www.metcardio.org/slides.html