Primer on Statistics for Interventional Cardiologists

Giuseppe Sangiorgi, MD
Pierfrancesco Agostoni, MD
Giuseppe Biondi-Zoccai, MD

What you will learn
• Introduction
• Basics
• Descriptive statistics
• Probability distributions
• Inferential statistics
• Finding differences in mean between two groups
• Finding differences in mean between more than 2 groups
• Linear regression and correlation for bivariate analysis
• Analysis of categorical data (contingency tables)
• Analysis of time-to-event data (survival analysis)
• Advanced statistics at a glance
• Conclusions and take home messages

What you will learn
• Descriptive statistics
  – frequency distributions
  – contingency tables
  – measures of location: mean, median, mode
  – measures of dispersion: variance, standard deviation, range, interquartile range
  – coefficient of variation
  – graphical presentation: histogram, box-plot, scatter plot
  – correlation

Counting and displaying data
After we have collected our data, we need to display them (tables, graphics and figures).
Raw enumeration (eg lesion length by visual estimation in patients treated in the Endeavor II trial: 14-27 mm):
15 14 15 18 24 23 17 16 23 14 26 15 16 15 17 25 27 19 14 23 25 18 16 14 15 18 24 19 18 15 14 19 25 24 17 15 18 24 20 18 15 26 21 14 18 21 15 18 24 23 20 21 15 18 24 18 15 16 17 22 23 14 21 15 18 24 20 18 15 24 25 15 20 19 15 18 24 24 16 25 17 14 23 16 17 15 18 24 25 15 23 17 15 24 18 20 16 18 24 26 …

Tabular display
[Slide image: example table from DELAYED RRISC, JACC 2007]

Types of variables
• Category
  – nominal
  – ordinal (ordered categories, ranks)
• Quantity
  – discrete (counting)
  – continuous (measuring)

Counting and displaying data
Create a database!

Patient ID   Diabetes    AHA/ACC type   Lesion length (mm)
             (nominal)   (ordinal)      (continuous)
1            Y           A              18
2            N           B1             24
3            N           A              17
4            N           C              25
5            Y           B2             23
6            N           A              15
7            N           A              16
8            Y           B2             18
9            N           B1             21
10           Y           B2             19
11           N           B1             14
12           Y           C              22
13           N           C              27

Frequency distribution
A frequency distribution is a list of the values that a variable takes in a sample. It is usually an ordered list showing the number of times each value appears.

Diabetes (n=13)
Yes   5   38.5%
No    8   61.5%

AHA/ACC type (n=13)
A     4   30.8%
B1    3   23.1%
B2    3   23.1%
C     3   23.1%

This introduces the concept of percentage or rate.

Frequency distribution
[Slide image: example from ENDEAVOR III, JACC 2006]

Frequency distribution (lesion length)
This simple tabulation has drawbacks.
When a variable can take continuous values instead of discrete values, or when the number of possible values is too large, constructing the table is cumbersome, if not impossible.

Lesion length (mm), n=13
14    1    7.7%
15    1    7.7%
16    1    7.7%
17    1    7.7%
18    2    15.4%
19    1    7.7%
21    1    7.7%
22    1    7.7%
23    1    7.7%
24    1    7.7%
25    1    7.7%
27    1    7.7%

Frequency distribution
A slightly different tabulation scheme, based on ranges of values, can be a solution in such cases.

Lesion length, n=13
14-20 mm   7   53.8%
21-27 mm   6   46.2%

However, better solutions are coming later…

Counting and displaying data
Contingency tables are used to record and analyse the relationship between two (or more) variables, most usually categorical variables.

AHA/ACC type by diabetes: count (% within diabetes)

Diabetes   A           B1          B2          C           Total
No         3 (37.5%)   3 (37.5%)   0 (0.0%)    2 (25.0%)   8 (100.0%)
Yes        1 (20.0%)   0 (0.0%)    3 (60.0%)   1 (20.0%)   5 (100.0%)
Total      4 (30.8%)   3 (23.1%)   3 (23.1%)   3 (23.1%)   13 (100.0%)

Is there a difference between diabetics and nondiabetics in the rate of AHA/ACC type lesions?
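The cross-tabulation of diabetes against AHA/ACC type can be reproduced from the 13-patient example database. The following is a minimal sketch using only the Python standard library; the names `patients`, `crosstab` and `row_percentages` are illustrative and do not come from the slides.

```python
from collections import Counter

# (diabetes, AHA/ACC type) for patients 1 to 13 of the example database
patients = [
    ("Y", "A"), ("N", "B1"), ("N", "A"), ("N", "C"), ("Y", "B2"),
    ("N", "A"), ("N", "A"), ("Y", "B2"), ("N", "B1"), ("Y", "B2"),
    ("N", "B1"), ("Y", "C"), ("N", "C"),
]
LESION_TYPES = ["A", "B1", "B2", "C"]


def crosstab(rows):
    """Count patients in each (diabetes, lesion type) cell."""
    return Counter(rows)


def row_percentages(cells, diabetes):
    """Percentage of each lesion type within one diabetes group."""
    total = sum(cells[(diabetes, t)] for t in LESION_TYPES)
    return {t: 100 * cells[(diabetes, t)] / total for t in LESION_TYPES}


cells = crosstab(patients)
for group in ("N", "Y"):
    pct = row_percentages(cells, group)
    # Print count and row percentage for every lesion type in this group
    print(group, {t: f"{cells[(group, t)]} ({pct[t]:.1f}%)" for t in LESION_TYPES})
```

Row percentages (within each diabetes group) are used here because the question being asked compares diabetics with nondiabetics; column percentages would answer a different question.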
The answer will follow…

Measures of central tendency: rationale
We need to describe the kind of values that we have (eg lesion length by visual estimation in patients treated in the Endeavor II trial: 14-27 mm).
Raw enumeration:
15 14 15 18 24 23 17 16 23 14 26 15 16 15 17 25 27 19 14 23 25 18 16 14 15 18 24 19 18 15 14 19 25 24 17 15 18 24 20 18 15 26 21 14 18 21 15 18 24 23 20 21 15 18 24 18 15 16 17 22 23 14 21 15 18 24 20 18 15 24 25 15 20 19 15 18 24 24 16 25 17 14 23 16 17 15 18 24 25 15 23 17 15 24 18 20 16 18 24 26 …

Mean (arithmetic)
$$\bar{x} = \frac{\sum x}{N}$$
Characteristics:
– summarises information well
– discards a lot of information (dispersion?)
Assumptions:
– data are not skewed
  – skew distorts the mean
  – outliers make the mean very different
– measured on a measurement scale
  – the mean of a categorical measure cannot be found
  – an ‘average’ stent diameter may be meaningless

Mean (arithmetic)
Lesion length (mm), n=13:
$$\bar{x} = \frac{14+15+16+17+18+18+19+21+22+23+24+25+27}{13} = 19.92$$

Mean (arithmetic)
[Slide image: example from TAPAS, Lancet 2008]

Median
What is it?
– The one in the middle
– Place values in order
– Median is central
Definition:
– The value with as many observations below it as above it
Used for:
– Ordinal data
– Skewed data / outliers

Median
Place the 13 lesion lengths in order; the 7th value is the median (19 mm).

Patient ID   Lesion length (mm)
11           14
6            15
7            16
3            17
1            18
8            18
10           19   (median)
9            21
12           22
5            23
2            24
4            25
13           27

Mode
What is it?
Definition:
– The most common value
Used (rarely) for:
– Discrete non-interval data
– Eg stent length, stent diameter
– MicroDriver is only available in 2.25, 2.50, 2.75: reporting the mean is meaningless

Mode
In the lesion length example (n=13), 18 mm appears twice (15.4%) and every other value appears once (7.7%), so the mode is 18 mm.

Comparing measures of central tendency
Mean is usually best
– If it works
– Useful properties (with standard deviation [SD])
– But…

Lesion length   Driver   Endeavor
                17       21
                19       21
                19       21
                17       21
                18       6
Mean            18       18
Median          18       21

A single outlier (6) pulls the Endeavor mean down to the Driver mean, while the median is unaffected.

Comparing measures of central tendency
It also depends on the underlying distribution…
– Symmetric distribution: mean = median = mode
  [Figure: frequency against value for a symmetric distribution]
– Asymmetric distribution: mean ≠ median ≠ mode
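The three measures of central tendency can be checked against the 13-lesion example with Python's standard `statistics` module. This is a minimal sketch; the `driver` and `endeavor` lists are the values as read from the comparison slide, and the variable names are illustrative.

```python
import statistics

# Lesion length (mm) for patients 1 to 13 of the example database
lesion_length = [18, 24, 17, 25, 23, 15, 16, 18, 21, 19, 14, 22, 27]

mean = statistics.mean(lesion_length)      # sum of the values divided by N
median = statistics.median(lesion_length)  # middle value once sorted
mode = statistics.mode(lesion_length)      # most frequent value

print(f"mean={mean:.2f} median={median} mode={mode}")

# Two groups with the same mean but different medians
driver = [17, 19, 19, 17, 18]
endeavor = [21, 21, 21, 21, 6]  # 6 acts as an outlier
for name, values in (("Driver", driver), ("Endeavor", endeavor)):
    print(name, "mean:", statistics.mean(values), "median:", statistics.median(values))
```

The second part illustrates why the median is preferred with outliers: the single low value changes the Endeavor mean but not its median.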
[Figure: histogram of the number of Endeavor stents implanted per patient (x axis 0 to 9, frequency 0 to 30), with mode, median and mean marked]

Median
[Slide image: example from Agostoni et al, AJC 2007]

Measures of dispersion: rationale
Central tendency doesn’t tell us everything
– We need to know about the spread, or dispersion, of the scores

Group      Late loss (mm)
Endeavor   0.61
Driver     1.03

Is there a difference? And if yes, how big is it?
We can only tell if we know the data dispersion.
ENDEAVOR II, Circulation 2006

Measures of dispersion: examples
[Figures: frequency curves of late loss (x axis 0 to 1.50) for Endeavor and Driver]

Shape of distribution
[Figure: frequency against value] Gaussian, normal or “parametric” distribution

Departing from normality
[Figure: frequency against value] Non-normal, right-skewed
[Figure: frequency against value] Non-normal, left-skewed
[Figure: frequency against value] Outliers

Measures of dispersion: types
• Standard deviation (SD)
  – Used with mean
  – Parametric tests
• Range
  – First to last value
  – Not commonly used
• Interquartile range
  – Used with median
  – 25th (1/4) to 75th (3/4) percentile
  – Non-parametric tests

Standard deviation
$$SD = \sqrt{\frac{\sum (x - \bar{x})^2}{N-1}}$$
The quantity under the square root is the variance.
Standard deviation (SD):
– the sample SD approximates the population σ as N increases
Advantages:
– with the mean it enables a powerful synthesis (for normally distributed data)
  mean ± 1 SD: 68% of data
  mean ± 2 SD: 95% of data (exactly 1.96 SD)
  mean ± 3 SD: 99.7% of data (99% lies within 2.58 SD)
Disadvantages:
– is based on normal assumptions

Standard deviation
Lesion length (mm), n=13, mean 19.92:
$$\text{Variance} = \frac{\sum (x - \bar{x})^2}{N-1} = \frac{(18-19.92)^2+(24-19.92)^2+(17-19.92)^2+\dots+(27-19.92)^2}{12} = 16.58$$
$$SD = \sqrt{16.58} = 4.07$$

Mean ± Standard deviation
[Figures: normal curve with about 68% of data within ±1 SD, about 95% within ±2 SD, and about 99.7% within ±3 SD of the mean]

Standard deviation
[Slide image: example from TAPAS, Lancet 2008]
[Slide image: example from TAPAS, NEJM 2008]

Why not mean ± SD?
[Slide image: example from TAPAS, NEJM 2008]

Testing normality assumptions
Rules of thumb
1. Refer to previous data or analyses (eg landmark articles, large databases)
2. Inspect tables and graphs (eg outliers, histograms)
3. Check rough equality of mean, median, mode
4. Perform ad hoc statistical tests
   • Levene’s test (for equality of variances)
   • Kolmogorov-Smirnov test
   • …

Range
First to last value
Lesion length (n=13): range = 14-27 mm, or range = 27 − 14 = 13 mm

Range
[Slide image: example from RRISC, JACC 2006]

Interquartile range
25th to 75th percentile, or 1st to 3rd quartile
Lesion length (n=13), sorted: 14 15 16 17 18 18 19 21 22 23 24 25 27
– 1st quartile = 16.5
– Median = 19
– 3rd quartile = 23.5
Interquartile range = 16.5-23.5

Interquartile range
[Slide image: example from Agostoni et al, AJC 2007]

Lesion length: frequency table (statistical software output)

Value   Frequency   Percent   Valid percent   Cumulative percent
14.00   1           7.7       7.7             7.7
15.00   1           7.7       7.7             15.4
16.00   1           7.7       7.7             23.1
17.00   1           7.7       7.7             30.8
18.00   2           15.4      15.4            46.2
19.00   1           7.7       7.7             53.8
21.00   1           7.7       7.7             61.5
22.00   1           7.7       7.7             69.2
23.00   1           7.7       7.7             76.9
24.00   1           7.7       7.7             84.6
25.00   1           7.7       7.7             92.3
27.00   1           7.7       7.7             100.0
Total   13          100.0     100.0

Lesion length: statistics (statistical software output)

N valid          13
N missing        0
Mean             19.9231
Median           19.0000
Mode             18.00
Std. deviation   4.07148
Range            13.00
Minimum          14.00
Maximum          27.00
Percentile 25    16.5000
Percentile 50    19.0000
Percentile 75    23.5000

Reporting data
– If parametric: mean and standard deviation, written as mean ± SD or mean (SD)
  eg Age (y): 63 ± 13, or Age (y): 63 (13)
– If non-parametric: median and interquartile range, written as median [IQR]
  eg NIH vol (mm³): 1.3 [0–13.1]
– Mode and range are less commonly used

Coefficient of Variation
• The coefficient of variation (CV) is a normalized measure of dispersion of a probability distribution. It is defined as the ratio of the standard deviation to the mean
• It is only defined for a non-zero mean, and is most useful for variables that are always positive. The coefficient of variation should only be computed for continuous data
• A given standard deviation indicates a high or low degree of variability only in relation to the mean value
• It is easier to get an idea of the variability in a distribution by dividing the standard deviation by the mean

Coefficient of Variation
• Advantages
  – The CV is a dimensionless number
  – The CV is particularly useful when comparing dispersion in datasets with markedly different means or different units of measurement
  – Distributions with CV<1 are considered low-variance, while those with CV>1 are considered high-variance
• Disadvantages
  – When the mean is near zero, the CV is sensitive to small changes in the mean, limiting its usefulness
  – Unlike the standard deviation, it cannot be used to construct confidence intervals for the mean

Histograms
Very good for categorical variables
[Figure: bar chart of diabetes (yes/no) counts, y axis 0 to 10]
ENDEAVOR II, Circulation 2006

Histograms
Not so good for continuous variables, but…
[Figures: histogram of lesion length with one bar per value (x axis 1 to 27); histogram of lesion length with bins 0-6, 6-12, 12-18, 18-24, 24-30]

Histograms
[Figure, panel A: histogram of late loss (mm) for both restenotic and non-restenotic SES; Agostoni et al, AJC 2007]

Shape of distributions
[Figure, panel A: histogram of late loss (mm) for non-restenotic SES, an example of the shape of a distribution; Agostoni et al, AJC 2007]

Box (& whiskers) plots
[Figure: box plot of lesion length, y axis 0 to 30]
– Upper whisker: Max (Q4) or Q3 + 1.5 × IQR
– Top of box: Q3
– Line inside box: Median (Q2)
– Bottom of box: Q1
– Lower whisker: Min (Q0) or Q1 − 1.5 × IQR
– Height of box: interquartile range

Box (& whiskers) plots
[Slide image: example from Margheri, Biondi Zoccai, et al, AJC 2008]

Scatter plots
A scatter plot is a type of display that uses Cartesian coordinates to show the values of two variables for a set of data.
The data are displayed as a collection of points, with the value of one variable determining the position on the horizontal axis and the value of the other variable determining the position on the vertical axis.
It is usually drawn with 2 continuous variables, to visually assess the degree of correlation between them.
But it can also be used with one categorical variable and one continuous variable (mainly if the sample size is small).

Scatter plots
[Slide image: example from Abbate, Biondi Zoccai, et al, Circulation 2002]
[Slide image: example from Mintz, et al, AJC 2005]
[Slide image: example from Agostoni, et al, IJC 2007]

Thank you for your attention
For any correspondence: [email protected]
For further slides on these topics feel free to visit the metcardio.org website: http://www.metcardio.org/slides.html
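As a closing check on the dispersion measures reported for the 13-lesion example (variance 16.58, SD 4.07, range 13, quartiles 16.5 and 23.5), the sketch below recomputes them with Python's standard `statistics` module. The variable names are illustrative, and the choice of the default `exclusive` quantile method is an assumption that happens to reproduce the quartiles shown on the statistics slide; other interpolation conventions can give slightly different quartiles.

```python
import statistics

# Lesion length (mm) for patients 1 to 13 of the example database
lesion_length = [18, 24, 17, 25, 23, 15, 16, 18, 21, 19, 14, 22, 27]

mean = statistics.mean(lesion_length)
variance = statistics.variance(lesion_length)  # sample variance, divides by N-1
sd = statistics.stdev(lesion_length)           # square root of the sample variance
value_range = max(lesion_length) - min(lesion_length)

# Quartiles; the default method ("exclusive") interpolates at position p*(N+1)
q1, q2, q3 = statistics.quantiles(lesion_length, n=4)

# Coefficient of variation: SD relative to the mean (dimensionless)
cv = sd / mean

print(f"variance={variance:.2f} SD={sd:.2f} range={value_range}")
print(f"Q1={q1} median={q2} Q3={q3} CV={cv:.2f}")
```

A CV of about 0.2 falls in the range the slides describe as low-variance (CV<1).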