Statistics for non-statisticians

Marco Pavesi
Lead Statistician, Liver Unit – Hospital Clínic i Provincial

Ferran Torres
Statistics and Methodology Support Unit, Hospital Clínic Barcelona
Biostatistics Unit, School of Medicine, Universitat Autònoma Barcelona (UAB)

Outline
• Why Statistics?
• Descriptive Statistics. Populations and Samples. Type of errors
• Inferential Statistics. Hypothesis testing. Statistical errors. p-value. Confidence Intervals
• Multiplicity issues. Type of tests. Sample size
• Multivariate analysis. More on p-values
• Conclusion: “little shop of horrors”

Intro. Why should we learn statistics?

Induction and Truth
Bertrand Russell presents… The inductivist turkey

Troubles for the plain researchers:
• Induction and statistics ARE NOT a method to get a sort of mathematical demonstration of Truth
• The results observed for a population sample are not necessarily true for the whole population

Smart turkeys / researchers…
1) …are aware that the relevance (weight) of statistical inferences always depends on the sample size
2) …do know that we can only model / estimate the real world with a specific approximation error
3) …understand that true hypotheses do not exist, and we can only reject or keep a hypothesis based on the available evidence

What is statistics?
• “I know (I’m making the assumption) that these dice are fair: what is the probability of always getting a 1 in 15 runs?” ==> Probability mathematics
• “I have always got a 1 in 15 runs. Are these dice fair?” ==> Inferential STATISTICS

So, why statistics? To account for chance & variability!

Why is Statistics needed?
• Statistics tells us whether events are likely to have happened simply by chance
• Statistics is needed because we always work with sample observations (variability) and never with whole populations
• Statistics is the only means to predict what is more likely to happen in new situations, and it helps us to make decisions

Introduction to descriptive statistics

Population and Samples
Sample → Study Population → Target Population

Random vs Systematic error
[Figure: Example: Systolic Blood Pressure (mm Hg). Two panels, “Random” and “Systematic (Bias)”, each showing five measurements (01 to 05) on a 130–170 mm Hg axis relative to the True Value.]

What Statistics?
• Descriptive Statistics
  – Position statistics (central tendency measures): mean, median
  – Dispersion statistics: variance, standard deviation, standard error
  – Shape statistics: symmetry, skewness and kurtosis measures

The mean and the median
• Arithmetic mean (average): $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} x_i$
• Median: 50% of sample individuals have a value higher than or equal to the median
  – 1, 3, 3, 4, 6, 13, 14, 14, 18 → Median = 6
  – 1, 3, 3, 4, 6, 13, 14, 14, 17, 18 → middle values 6 and 13, Median = (6+13)/2 = 9.5
• Unlike the median, the mean is affected by outliers
• Especially relevant for specific distributions (survival times)
[Figure: effect of a new outlier on the mean (Mean 1, Mean 2) and on the median (Median 1, Median 2).]

Dispersion measures
• The Variance is the mean of squared differences from the distribution mean (with n − 1 in the denominator for a sample): $\sigma^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{X})^2$
• The Standard Deviation is the square root of the Variance: $SD = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{X})^2}$
• The Standard Error is generally expressed as the square root of the ratio between the Variance and the sample size: $SE = \sqrt{\sigma^2 / n} = SD / \sqrt{n}$
  It can be regarded as the SD of the sample mean (or of another estimated parameter)

Inference & tests
• Inferential Statistics
  – Draw conclusions (inferences) from incomplete (sample) data.
  – Allow us to make predictions about the target population based on the results observed in the sample
  – Are computed in hypothesis testing
• Examples: 95% CI, t-test, chi-square test, ANOVA, regression

Basic pattern of statistical tests
Test statistic = (Observed − Expected) / Variability
• Based on the total number of observations and the size of the test statistic, one can determine the P value.

How many noise units?
Test statistic = Signal / Noise
• Test statistic & sample size (degrees of freedom) convert to a probability or P value.

Overall hypothesis testing flow chart
Test statistic value
→ Corresponding P-value (from known distribution)
→ Comparison with significance level α (previously defined)
  – P < α: reject null hypothesis
  – P >= α: keep null hypothesis

Introduction to inferential statistics

The role of statistics
“Thus statistical methods are no substitute for common sense and objectivity. They should never aim to confuse the reader, but instead should be a major contributor to the clarity of a scientific argument.”
Pocock SJ. The role of statistics. Br J Psychiat 1980; 137:188-190

Extrapolation
[Diagram: Sample → Study results → Inferential analysis (statistical tests, confidence intervals) → Population “conclusions”.]

Statistical Inference
• Statistical Tests => p-value
• Confidence Intervals

Valid samples?
[Diagram: samples drawn from the Population, labelled “Likely to occur” and “Unlikely to occur”; an invalid sample leads to invalid conclusions.]

P-value
• The p-value is a “tool” to answer the question: Could the observed results have occurred by chance*?
• p < .05: “statistically significant”
Remember:
– Decision given the observed results in a SAMPLE
– Extrapolating results to POPULATION
*: accounts exclusively for the random error, not bias

An intuitive definition
• The p-value is the probability of observing data at least as extreme as ours when the null hypothesis is true
• Steps:
1) Calculate the treatment differences in the sample (A − B)
2) Assume that both treatments are equal (A = B) and then…
3) …calculate the probability of obtaining differences at least as large as those observed, given assumption 2
4) Conclude according to the probability:
   a. p < 0.05: the differences are unlikely to be explained by chance; we assume that the treatment explains the differences
   b. p > 0.05: the differences could be explained by chance; we assume that chance explains the differences

HYPOTHESIS TESTING
• Testing two hypotheses
  – H0: A = B (Null hypothesis: no difference)
  – H1: A ≠ B (Alternative hypothesis)
• Calculate the test statistic based on the assumption that H0 is true (i.e. there is no real difference)
• The test will give us a p-value: how likely are the collected data if H0 is true
• If this is unlikely (small p-value), we reject H0

RCT from a statistical point of view
[Diagram: a sample is drawn from the population (1 homogeneous population) and randomised to Treatment A or Treatment B (control), which yields 2 distinct populations.]

Statistical significance / Confidence
• A > B, p < 0.05 means:
• “I can conclude that the higher values observed with treatment A vs treatment B are linked to the treatment rather than to chance, with a risk of error of less than 5%”

Factors influencing statistical significance
• Signal: Difference
• Noise (background): Variance (SD)
• Quantity: Quantity of data

P-value
• A “very low” p-value does NOT imply:
  – Clinical relevance (NO!!!)
  – Magnitude of the treatment effect (NO!!)
  With larger n or smaller variability, p becomes smaller
• Please never compare p-values!! (NO!!!)

P-value
• A “statistically significant” result (p < .05) tells us NOTHING about clinical or scientific importance.
  It tells us only that the results are unlikely to be due to chance alone.
• A p-value does NOT account for bias, only for random error

STAT REPORT: THE BASIC IDEA
• Statistics can never PROVE anything beyond any doubt, just beyond reasonable doubt!!
• … because we work with samples and random error

Type I & II Error & Power

Conclusion (sample) | Reality (population): A=B | Reality (population): A≠B
“A=B” (p > 0.05)    | OK                        | Type II error (β)
“A≠B” (p < 0.05)    | Type I error (α)          | OK

• Type I Error (α)
  – False positive
  – Rejecting the null hypothesis when in fact it is true
  – Standard: α = 0.05
  – In words, chance of finding statistical significance when in fact there truly was no effect
• Type II Error (β)
  – False negative
  – Accepting (failing to reject) the null hypothesis when in fact the alternative is true
  – Standard: β = 0.20 or 0.10
  – In words, chance of not finding statistical significance when in fact there was an effect
• Power
  – 1 − Type II Error (β)
  – Usually in percentage: 80% or 90% (for β = 0.2 or 0.1, respectively)
  – In words, chance of finding statistical significance when in fact there is an effect

Conclusion (sample) | Reality (population): A=B | Reality (population): A≠B
“A=B” (p > 0.05)    | OK                        | Type II error (β)
“A≠B” (p < 0.05)    | Type I error (α)          | POWER

95% CI
• Better than p-values…
  – …use the data collected in the trial to give an estimate of the treatment effect size, together with a measure of how certain we are of our estimate
• A CI is a range of values within which the “true” treatment effect is believed to be found, with a given level of confidence.
  – A 95% CI is built so that, over many repetitions of the study, the interval would contain the “true” treatment effect 95% of the time
• Generally, the 95% CI is calculated as: Sample Estimate ± 1.96 × Standard Error

Interval Estimation
A probability that the population parameter falls somewhere within the interval.
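As a minimal sketch of the “Sample Estimate ± 1.96 × Standard Error” rule above, the following Python snippet computes an approximate 95% CI for a mean. The data are the nine values from the earlier median example; the function name `ci95_mean` is hypothetical, and the normal quantile 1.96 is kept only because the slide uses it (with so few observations a t quantile would usually be preferred).

```python
import math
import statistics


def ci95_mean(values):
    """Approximate 95% CI for the mean: estimate ± 1.96 × standard error."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)   # sample SD (n - 1 in the denominator)
    se = sd / math.sqrt(n)          # standard error of the mean
    half_width = 1.96 * se          # assumption: normal approximation
    return mean - half_width, mean, mean + half_width


sample = [1, 3, 3, 4, 6, 13, 14, 14, 18]   # values from the median example
lower, estimate, upper = ci95_mean(sample)
print(f"mean = {estimate:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

The interval is symmetric around the point estimate and narrows as the sample size grows, because the standard error shrinks with the square root of n.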
[Figure: confidence interval, running from the lower confidence limit through the sample statistic (point estimate) to the upper confidence limit.]

Superiority study
[Figure: 95% CI plotted on the axis of the difference d between test and control. d < 0: negative effect (control better); d = 0: no differences; d > 0: positive effect (test better).]

Multiplicity
“To say it colloquially, torture the data until they speak...”
Lancet 2005; 365: 1591–95

Torturing data…
Investigators examine additional endpoints, manipulate group comparisons, do many subgroup analyses, and undertake repeated interim analyses. Investigators should report all analytical comparisons implemented. Unfortunately, they sometimes hide the complete analysis, handicapping the reader’s understanding of the results.
Lancet 2005; 365: 1591–95

[Diagram: Design → Conduct → Results.]

Multiplicity
K independent hypotheses: H01, H02, ..., H0K
S significant results (p < α)
$\Pr(S \ge 1 \mid H_{01} \cap H_{02} \cap \dots \cap H_{0K} = H_{0\cdot}) = 1 - \Pr(S = 0 \mid H_{0\cdot}) = 1 - (1 - \alpha)^K$

K  | Pr(S>=1|H0.) | K  | Pr(S>=1|H0.)
1  | 0.0500       | 10 | 0.4013
2  | 0.0975       | 15 | 0.5367
3  | 0.1426       | 20 | 0.6415
4  | 0.1855       | 25 | 0.7226
5  | 0.2262       | 30 | 0.7854

Sources of multiplicity in RCT
• Multiple assessment criteria (variables)
• Multiple times of assessment (repeated measurements)
• Multiple inspections (interim analyses)
• Multiple comparisons (more than two treatments)
• Multiple subsets and subgroups

Some examples

       | Variables | Times | Subgroups | Comparisons | Total | False positive rate
case A | 2         | 2     | 2         | 1           | 8     | 33.66%
case B | 5         | 4     | 3         | 1           | 60    | 95.39%
case C | 5         | 4     | 3         | 3           | 180   | 99.99%

Multiplicity
• Bonferroni correction (simplified version)
  – K tests with a global level of significance of α
  – Each test can be tested at the α/K level
• Example: 5 independent tests
  – Global level of significance = 5%
  – Each test should be tested at the 1% level: 5% / 5 => 1%

Interim Analyses in the CDP
[Figure: Z value (−2 to +2) against month of follow-up, 10 to 100 (Month 0 = March 1966, Month 100 = July 1974). Coronary Drug Project Mortality Surveillance. Circulation. 1973;47:I-1]
http://clinicaltrials.gov/ct/show/NCT00000483;jsessionid=C4EA2EA9C3351138F8CAB6AFB723820A?order=23
Lancet 2005; 365: 1657–61

Sample Size
• The planned number of participants is calculated on the basis of:
  – Expected effect of treatment(s): ↗ effect, ↘ number
  – Variability of the chosen endpoint: ↗ variability, ↗ number
  – Accepted risks in conclusion: ↗ risk, ↘ number

[Figure: three histograms of height (“ALTURA”), frequency on the vertical axis, each with N = 2000 and mean 165.0 to 165.1; standard deviations 25.54, 26.94 and 32.27.]

Conclusion (sample) | Reality (population): A=B | Reality (population): A≠B
“A=B” (p > 0.05)    | OK                        | Type II error (β)
“A≠B” (p < 0.05)    | Type I error (α)          | POWER

Which statistical test

Normal vs. Skewed Distributions
• Parametric statistical tests can be used to assess variables that have a “normal” or symmetrical bell-shaped distribution curve for a histogram.
• Nonparametric statistical tests can be used to assess variables that are skewed or non-normal.
• “Inferential tests” vs. visual inspection: look at a histogram to decide.
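The sample-size slides above name three inputs: expected effect, variability and accepted risks. A minimal Python sketch of how they combine, assuming the usual normal-approximation formula for comparing two proportions with a pooled variance under the null hypothesis (the formula and the function name `n_per_arm` are assumptions, not taken from the slides):

```python
import math
from statistics import NormalDist


def n_per_arm(p1, p2, alpha=0.05, power=0.80):
    """Participants per arm to compare two proportions (two-sided test).

    Assumed formula (normal approximation, pooled variance under H0):
    n = (z_a * sqrt(2*p*(1-p)) + z_b * sqrt(p1*(1-p1) + p2*(1-p2)))**2 / (p1-p2)**2
    where p is the mean of p1 and p2.
    """
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # quantile for the Type I risk
    z_b = NormalDist().inv_cdf(power)           # quantile for 1 - Type II risk
    p_bar = (p1 + p2) / 2
    numerator = (z_a * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_b * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p1 - p2) ** 2)


small_effect = n_per_arm(0.50, 0.52)               # 2-point difference
large_effect = n_per_arm(0.50, 0.60)               # 10-point difference
higher_power = n_per_arm(0.50, 0.52, power=0.90)   # smaller accepted Type II risk
print(small_effect, large_effect, higher_power)
```

Under these assumptions the first call gives 9,806 per arm, which agrees with the N per arm quoted in the third “Evidence and p-value” example later in the slides. A larger expected effect needs far fewer participants, and a smaller accepted risk needs more.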
Examples of Normal and Skewed
[Figure: two histograms. “35-SYSTOLIC BLOOD PRESSURE FIRST ER”: Mean = 146.9, Std. Dev = 27.74, N = 925. “44-DAYS IN ICU”: Mean = 0.9, Std. Dev = 3.99, N = 933.]

Parametric vs. Nonparametric

Parametric                                     | Nonparametric
Student’s t-test                               | Mann-Whitney U test
One-way ANOVA                                  | Kruskal-Wallis test
Paired t-test                                  | Wilcoxon signed-rank
Pearson correlation                            | Spearman’s r
Correlated F ratio (repeated-measures ANOVA)   | Friedman ANOVA

The type of inferential test depends on the data
• Repeated measures?
  – Unmatched groups: different subsets of the population in each condition: independent data (unpaired)
  – Matched groups: the same individuals in each condition: dependent data (paired)
• Type of data
  – Continuous Gaussian, metric: mean, SD, …
  – Continuous non-Gaussian, ordinal (ranks: 1,2,3,4,5,6,7,8,9,10): median, interquartile range
  – Nominal, categories (49% “yes”, 33% “no”, 18% “no opinion”): frequencies and percentages

Qualitative dependent variable
Independent variable: qualitative (nominal); No. of groups: 2 or >2
• Independent data: Fisher's test (chi-square)
• Dependent data: McNemar's test (2 groups), Cochran's Q (>2 groups)

Quantitative dependent variable, independent (unpaired) data

Independent variable  | No. of groups | Parametric: measurement (from Gaussian population) | Non-parametric: rank, score, or measurement (from non-Gaussian population)
Qualitative (nominal) | 2             | t-test                                             | Mann-Whitney test
Qualitative (nominal) | >2            | One-way ANOVA                                      | Kruskal-Wallis test
Quantitative          | 2             | Pearson correlation                                | Spearman correlation

Quantitative dependent variable, dependent (paired) data

Independent variable  | No. of groups | Parametric: measurement (from Gaussian population) | Non-parametric: rank, score, or measurement (from non-Gaussian population)
Qualitative (nominal) | 2             | t-test (paired)                                    | Wilcoxon test
Qualitative (nominal) | >2            | One-way ANOVA (paired)                             | Friedman test

Type of data (columns) by goal (rows)

Goal | Measurement (from Gaussian population) | Rank, score, or measurement (from non-Gaussian population) | Binomial (two possible outcomes) | Survival time
Describe one group | Mean, SD | Median, interquartile range | Proportion | Kaplan Meier survival curve
Compare one group to a hypothetical value | One-sample t test | Wilcoxon test | Chi-square or Binomial test** |
Compare two unpaired groups | Unpaired t test | Mann-Whitney test | Fisher's test (chi-square for large samples) | Log-rank test or Mantel-Haenszel*
Compare two paired groups | Paired t test | Wilcoxon test | McNemar's test | Conditional proportional hazards regression*
Compare three or more unmatched groups | One-way ANOVA | Kruskal-Wallis test | Chi-square test | Cox proportional hazard regression**
Compare three or more matched groups | Repeated-measures ANOVA | Friedman test | Cochran's Q** | Conditional proportional hazards regression**
Quantify association between two variables | Pearson correlation | Spearman correlation | Contingency coefficients** |
Predict value from another measured variable | Simple linear regression or Nonlinear regression | Nonparametric regression** | Simple logistic regression* | Cox proportional hazard regression*
Predict value from several measured or binomial variables | Multiple linear regression* or Multiple nonlinear regression** | | Multiple logistic regression* | Cox proportional hazard regression*

• http://statpages.org/
• http://www.microsiris.com/Statistical%20Decision%20Tree/
• http://www.socialresearchmethods.net/selstat/ssstart.htm
• http://www.wadsworth.com/psychology_d/templates/student_resources/workshops/stat_workshp/chose_stat/chose_stat_01.html
• http://www.graphpad.com/www/Book/Choose.htm

A Good Rule to Follow
• Always check your results with a nonparametric test (sensitivity analysis)
• If you test your null hypothesis with a Student’s t-test, also check it with a Mann-Whitney U test.
• It will only take an extra 25 seconds.
• Use common sense and prior knowledge!!

Multivariate statistics: why and when?

2 or 3 more things on p-values
• P-values only depend on the magnitude of the test statistic computed from the observed (sample) data.
• They are related to the evidence against the null hypothesis and tell us how comfortable we should feel when we reject it.
• They are not related in any way to the clinical relevance of the “signal” (or effect, or difference, or whatever result) observed!!

Clinical study design chart
• Any intervention applied & studied?
  – YES: EXPERIMENTAL STUDY (Ex. Randomized Clinical Trial)
  – NO: Repeated measurements taken?
    – YES: PROSPECTIVE STUDY (Ex. Cohort study designs)
    – NO: CROSS-SECTIONAL STUDY (Ex. Case-control study designs)

Randomization
1. Eliminates assignment bias
2. Tends to produce comparable groups for known and unknown, recorded and unrecorded factors

Design                                         | Sources of imbalance
Randomized                                     | Chance
Concurrent (prospective, non-randomized)       | Chance & selection bias
Historical (retrospective, non-randomized)     | Chance, selection bias & time bias

3. Adds validity (extrapolability) to the results of statistical tests
Reference: Byar et al (1976) NEJM

Confounding
• No randomization: lack of homogeneity between groups in the distribution of risk (protection) factors
• A potential confounder is:
  – Associated with the outcome
  – Associated with the main factor studied
  – Not involved in the causal association between factor and outcome as a midway step
[Diagram: EXPOSURE (coffee intake), OUTCOME (stroke), CONFOUNDING FACTOR (smoking).]

Interactions
• Effect modification
• Different risk (effect) estimates are associated with different strata of a specific factor.
[Figure: outcome (ex. death) associated with a specific factor “A” (stratum 1, ex. Male; stratum 2, ex. Female) across the strata of factor B (stratum 1, ex. Age < 65; stratum 2, ex. Age ≥ 65); plotted values 7%, 10% and 20%.]

Multivariate analysis and statistical models
• A model is “a simplified representation (usually mathematical) used to explain the workings of a real world system or event” (Wikipedia)
• Two types of statistical models are used in clinical research / epidemiology:
  – Predictive models
  – Explanatory models
• Both are fitted by means of multivariate analysis techniques

Predictive models
• Used when we are interested in predicting the probability of a specific outcome or the value of a specific dependent variable
• Focused on selection of the best subset of predictors and highest precision of estimates
• The selection of predictors is based on their contribution to the predictive ability of the model (i.e., on p-values)
• Ex. Framingham equations to predict the probability of developing coronary events at 10 years (http://www.framinghamheartstudy.org/risk/index.html)

Framingham predictive equation for CHD
Estimated Coefficients Underlying CHD Prediction Sheets Using Total Cholesterol Categories

Variable                                         | Men      | Women
Age, y                                           | 0.04826  | 0.33766
Age squared, y                                   |          | -0.00268
TC, mg/dL: < 160                                 | -0.65945 | -0.26138
TC, mg/dL: 160-199                               | Referent | Referent
TC, mg/dL: 200-239                               | 0.17692  | 0.20771
TC, mg/dL: 240-279                               | 0.50539  | 0.24385
TC, mg/dL: >= 280                                | 0.65713  | 0.53513
HDL-C, mg/dL: < 35                               | 0.49744  | 0.84312
HDL-C, mg/dL: 35-44                              | 0.2431   | 0.37796
HDL-C, mg/dL: 45-49                              | Referent | 0.19785
HDL-C, mg/dL: 50-59                              | -0.05107 | Referent
HDL-C, mg/dL: >= 60                              | -0.4866  | -0.42951
Blood pressure: Optimal                          | -0.00226 | -0.53363
Blood pressure: Normal                           | Referent | Referent
Blood pressure: High-normal                      | 0.2832   | -0.06773
Blood pressure: Stage I hypertension             | 0.52168  | 0.26288
Blood pressure: Stage II-IV hypertension         | 0.61859  | 0.46573
Diabetes                                         | 0.42839  | 0.59626
Smoker                                           | 0.52337  | 0.29246
Baseline survival function at 10 years, S0(10)   | 0.90015  | 0.96246
Linear predictor at risk factor means            | 3.09750  | 9.92545

Explanatory models
• Study objective: to assess (estimate) the effect of a specific factor on the study outcome
• Multivariate analysis is aimed at getting the best (most valid) estimate of the studied effect
• Confounders must be accounted for in the model
• Evaluation of confounding variables is based on the change of model estimates, NOT ON STATISTICAL SIGNIFICANCE.
• Rule of thumb: add each potential confounder to the model one by one and keep only those that change the estimate of the main factor by more than 10%

Adjusting for confounders: an example

Outcome variables and statistical models: a summary table
• Continuous (normally distributed) outcome: ANOVA, ANCOVA or Linear Regression model
• Binary (YES/NO): Logistic regression
• Categorical (with a reference group): Multinomial logistic regression
• Time-to-event (different follow-up times & censored cases): Survival models (ex. Cox PH)
• Number of counts: Poisson or Negative Binomial regression

Some “take home” hints

The p-value…
• …is the probability of a result like that observed in our sample when the null hypothesis is true in the population (i.e., simply due to chance)
• …is related to the evidence against the null hypothesis and to the reliability of the observed result
• …DOES NOT TELL US ANYTHING ABOUT THE CLINICAL RELEVANCE OF THE RESULT WE HAVE OBSERVED!!

Interpretation of a p-value
• The higher the p-value, the higher the probability that a result like the observed one is due simply to chance:
  – p = 0.75: rejecting H0 here would mean a 75% probability (3 out of 4 studies) of rejecting a true H0
  – p = 0.015: rejecting H0 here would mean a 1.5% probability (15 out of 1,000 studies) of rejecting a true H0
• A “small” p-value (significance level) is established conventionally as the highest rate of false-positive results that we consider acceptable (for instance, the common 5% rate)

Evidence and p-value: an example (1)
Drug A. Efficacy rate: 22%
Drug B. Efficacy rate: 11%
…observed results:
Drug A. Efficacy rate: 2 / 9
Drug B. Efficacy rate: 1 / 9
P-value = 0.98

Evidence and p-value: an example (2)
Drug A. Efficacy rate: 22%
Drug B. Efficacy rate: 11%
…observed results:
Drug A. Efficacy rate: 35 / 154
Drug B. Efficacy rate: 18 / 158
P-value = 0.008

Evidence and p-value: an example (3)
…on the other hand…
Drug A. Known efficacy rate: 50%
Drug B. Expected efficacy rate: 52%
Δ = 2%; Type I error: 0.05; Type II error: 0.20
N (per arm): 9,806

Conclusion: little shop of horrors (1)
• “No significant difference is observed between the treatment arms. Conclusion: the treatments are equally effective…”
  …AAAAAARGH !!!!
• “Absence of evidence is not evidence of absence” (Altman DG, Bland JM. BMJ 1995;311:485)

Conclusion: little shop of horrors (2)
• “The p-value of the comparison A vs. Placebo is lower than the p-value for the comparison B vs. Placebo. Conclusion: treatment A is better than B…”
  …AAAAAARGH !!!!
• The p-value gives us a measure of the evidence against that specific null hypothesis in that specific hypothesis test.

Conclusion: little shop of horrors (3)
• A clinician speaking to the poor, helpless statistician: “Can we just test variable A vs. the rest of the variables and check if some difference is significant…?”
  …AAAAAARGH !!!!
• The overall Type I error rises quickly with the number of hypothesis tests performed: 1 test: Type I error = 5% … 5 tests: Type I error > 20%
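The growth of the Type I error with the number of tests follows from the formula 1 − (1 − α)^K given on the multiplicity slide. A minimal Python sketch, assuming K independent tests each run at level α (the function names `familywise_error` and `bonferroni_level` are hypothetical):

```python
def familywise_error(k, alpha=0.05):
    """Probability of at least one false positive among k independent tests."""
    return 1 - (1 - alpha) ** k


def bonferroni_level(k, alpha=0.05):
    """Per-test significance level under the simplified Bonferroni correction."""
    return alpha / k


for k in (1, 2, 5, 10, 20):
    print(f"K = {k:2d}: Pr(S >= 1 | H0) = {familywise_error(k):.4f}, "
          f"Bonferroni level = {bonferroni_level(k):.4f}")
```

For K = 5 this gives about 0.226, in line with the multiplicity table and with the “> 20%” quoted above, while the Bonferroni level for 5 tests is 1%.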