Short overview of statistical methods
Hein Stigum
Presentation, data and programs at: http://folk.uio.no/heins/ courses
May-17 H.S.

Agenda
• Concepts
• Bivariate analysis
  – Continuous symmetrical
  – Continuous skewed
  – Categorical
• Multivariable analysis
  – Linear regression
  – Logistic regression
Outcome variable decides analysis

CONCEPTS

Precision and bias
• Measures of populations
  – Precision: random error (statistics)
  – Bias: systematic error (epidemiology)
[Figure: estimate versus true value, illustrating precision and bias]

Precision: Estimation
[Figure: population, sample, estimate; true value and estimate with confidence interval ( | )]
95% confidence interval: 95% of repeated intervals will contain the true value

Precision: Testing
[Figure: population and sample; estimate 1 and estimate 2 against the true values of group 1 and group 2]
p-value = P(observing this difference or more, when the true difference is zero)

Precision: Significance level
Birth weight, 500 newborn, observe difference
H0: boys = girls
Ha: boys ≠ girls

  Observed difference   p-value
  10 g                  p=0.90
  50 g                  p=0.40
  100 g                 p=0.10
  130 g                 p=0.04
  150 g                 p=0.02

Significance level: p<0.05

Precision: Test situations
• 1 sample test
  – Weight = 10
• 2 independent samples
  – Weight by sex
• K independent samples
  – Weight by age groups
• 2 dependent samples
  – Weight last year = Weight today

Bias: DAGs
[Figure: DAG with E = gest age, D = birth weight, C1 = sex, C2 = parity]
• Associations: Bivariate (unadjusted)
• Causal effects: Multivariable (adjusted)
Draw your assumptions before your conclusions

WHY USE GRAPHS?

Problem example
• Lunch meals per week
  – Table of means (around 5 per week)
  – Linear regression
[Figure: histogram of lunch meals per week (1 to 7), y-axis in percent]

Problem example 2
• Iron level by sex
  – Both linear and logistic regression
  – Opposite results
[Figure: distribution of iron level in blood for girls and boys, with the group means marked]
Datatypes
• Categorical data
  – Nominal: married/ single/ divorced
  – Ordinal: small/ medium/ large
• Numerical data
  – Discrete: number of children
  – Continuous: weight

Outcome data type dictates type of analysis
• Numerical, normal data: means, T-test, linear regression
• Numerical, not normal data: medians, nonparametric tests
• Categorical: frequency table, cross table, chi-square, logistic regression

BIVARIATE ANALYSIS 1
Continuous symmetric outcome: Birth weight

Distribution
  drop if weight<2000
  kdensity weight
[Figure: kernel density plots of weight, x-axis 0 to 6000 and 2000 to 6000]

Central tendency and dispersion
• Mean and standard deviation
• Mean with confidence interval

Compare groups, equal variance?
• Equal
• Not equal
[Figure: pairs of density curves with equal and with unequal spread]

2 independent samples
Are birth weights the same for boys and girls?
[Figure: scatterplot and density plot of birth weight (2000 to 6000) by sex, boys and girls]

2 independent samples test
  ttest weight, by(sex) unequal     unequal variances
  ttest var1==var2                  paired test

K independent samples
• Is birth weight the same over parity?
[Figure: scatterplot and density plot of birth weight (2000 to 6000) by parity: 0, 1, 2-7]

K independent samples test
Equal means? Equal variances?

Continuous by continuous
• Does birth weight depend on gestational age?
[Figure: scatterplot of birth weight against gestational age, and the same scatterplot with the outlier dropped]

Continuous by continuous tests
• Cut gestational age up in groups, then use T-test or ANOVA
or
• Use linear regression with 1 covariate
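As a rough illustration of the second option, linear regression with one covariate, here is a minimal Python sketch (Python rather than Stata, so that it runs without the course dataset). Everything in it is an assumption for illustration: the function name `fit_line`, the simulated data, the true slope of 15 gram per day and the noise level are hypothetical and are not taken from the birth weight data.

```python
import random
import statistics


def fit_line(x, y):
    """Least-squares fit of y = b0 + b1*x; returns (b0, b1)."""
    mean_x = statistics.fmean(x)
    mean_y = statistics.fmean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx              # slope: change in y per one unit of x
    b0 = mean_y - b1 * mean_x   # intercept: expected y at x = 0
    return b0, b1


# Simulated (hypothetical) data: gestational age in days, birth weight in gram.
random.seed(1)
gest = [random.uniform(250, 310) for _ in range(500)]
weight = [-700 + 15 * g + random.gauss(0, 400) for g in gest]

b0, b1 = fit_line(gest, weight)
print(f"estimated slope: {b1:.1f} gram per day of gestational age")
```

The slope is the association measure: the expected increase in birth weight for one more day of gestational age. Cutting gestational age into groups and running a T-test or ANOVA addresses the same question without assuming a straight line.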
Test situations
• 1 sample test
  – ttest weight == 10
• 2 independent samples
  – ttest weight, by(sex)
• K independent samples
  – oneway weight parity
• 2 dependent samples (Paired)
  – ttest weight_last_year == weight_today

BIVARIATE ANALYSIS 2
Continuous skewed outcome: Number of sexual partners

Distribution
  kdensity partners if partners<=50
[Figure: distribution of number of lifetime partners, N=394; percentiles marked on the x-axis: 25% = 1, 50% = 4, 75% = 9, 95% = 20]

Central tendency and dispersion
• Median and percentiles

2 independent samples
Do males and females have the same number of partners?
[Figure: scatterplot and density plot of partners by gender, males and females]

2 independent samples test
Equal medians?

K independent samples
Do partners vary with age?
[Figure: scatterplot and density plot of partners by age group (agegr3): 18-29, 30-44, 45-60]

K independent samples test
Equal medians?

Table of descriptives

                               Numerical data                       Proportions
                               Normal                Skewed
  Descriptives
    Center                     Mean                  Median         p
    Dispersion                 Standard deviation    Fractiles
  Confidence intervals for center estimates
    Standard error             se(mean)                             se(p)
    95% confidence interval    mean ± 2*se(mean)                    p ± 2*se(p)

Table of tests

                             Numerical data                                           Proportions
                             Normal                       Skewed
  1 sample                   One sample T-test            Kolmogorov-Smirnov          Binomial
  2 independent samples      Independent sample T-test    Mann-Whitney U              Chi-square
  K independent samples      ANOVA                        Kruskal-Wallis              Chi-square
  2 dependent samples        Paired sample T-test         Wilcoxon signed rank test   McNemar (2x2)

Remarks:
• If unequal variance in ANOVA: use linear regression with robust variance estimation
• If N is large: may use parametric tests
• Categorical ordered: use nonparametric tests

BIVARIATE ANALYSIS 3
Categorical outcome: Being bullied

Frequency and proportion
• Frequency
• Proportion with CI
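The proportion with its approximate 95% confidence interval, p ± 2*se(p) as in the table of descriptives, can be sketched in a few lines of Python. This is only a sketch under stated assumptions: the helper name `proportion_ci` and the counts (17 of 100) are hypothetical, and the interval relies on the normal approximation, which works poorly when n is small or p is close to 0 or 1.

```python
import math


def proportion_ci(x, n, z=2.0):
    """Return (p, se, lower, upper) for x events among n subjects.

    z=2.0 follows the "± 2*se" rule of thumb; 1.96 is the exact normal value.
    """
    p = x / n
    se = math.sqrt(p * (1 - p) / n)
    return p, se, p - z * se, p + z * se


# Hypothetical counts: 17 of 100 children report being bullied.
p, se, lower, upper = proportion_ci(17, 100)
print(f"p = {p:.2f}, se = {se:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

With these made-up counts the interval runs from roughly 0.09 to 0.25, so a sample of 100 leaves the proportion quite uncertain.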
Proportion, confidence interval
x = "disease", n = total number
• proportion: $p = \frac{x}{n}$
• standard error: $se(p) = \sqrt{\frac{p(1-p)}{n}}$
• confidence interval: $CI(p) = p \pm 2 \cdot se(p)$

Crosstables
Are boys bullied as much as girls?
Equal proportions?

Ordered categories, trend
Trend? Equal proportions?

MULTIVARIABLE ANALYSIS 1
Continuous outcome: Linear regression, Birth weight

Regression idea
model: $y = b_0 + b_1 x + e$
  y = outcome
  x = covariate
  $b_1$ = coefficient, effect of x
  e = error, residual
model with many cofactors: $y = b_0 + b_1 x_1 + b_2 x_2 + e$
  $x_1, x_2$ = covariates
[Figure: scatterplot of birth weight (2500 to 5000) against gestational age (days), 250 to 310, with regression line]

Model and assumptions
• Model
  $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2)$
• Association measure
  $\beta_1$ = increase in y for one unit increase in $x_1$
• Assumptions
  – Independent errors
  – Linear effects
  – Constant error variance
• Robustness
  – Influence

Workflow
• DAG
• Scatterplots
• Bivariate analysis
• Regression
  – Model estimation
  – Test of assumptions
    • Independent errors
    • Linear effects
    • Constant error variance
  – Robustness
    • Influence
[Figure: DAG with E = gest age, D = birth weight, C1 = sex, C2 = parity; scatterplot of birth weight (gram) against gestational age (days), 200 to 700, with observation 539 marked]
Categorical covariates
• 2 categories
  – OK
• 3+ categories
  – Use "dummies"
    • "Dummies" are 0/1 variables used to create contrasts
    • Want 3 categories for parity: 0, 1 and 2-7 children
    • Choose 0 as reference
    • Make dummies for the two other categories

  generate Parity1   = (parity==1) if parity<.
  generate Parity2_7 = (parity>=2) if parity<.

Create meaningful constant
Expected birth weight:
  $E(y) = \beta_0 + \beta_1 \cdot gest + \beta_2 \cdot sex + \beta_3 \cdot Parity1 + \beta_4 \cdot Parity2\_7$
Expected birth weight at:
• gest=0, sex=0, parity=0: $\beta_0 = 1925$ g (not meaningful)
• gest=280, sex=1, parity=0: $\beta_0 + \beta_1 \cdot 280 + \beta_2 \cdot 1 = 3524$ g

Model estimation

                              coeff     95% conf. int.
  Birth weight at ref         3524.3
  Gestational age per day     6.0       (3.9 , 8.2)
  Sex
    Boy                       0
    Girl                      -139.2    (-228.9 , -49.5)
  Parity
    0                         0
    1                         232.0     (130.6 , 333.5)
    2-7                       226.0     (106.9 , 345)

Test of assumptions
• Plot residuals versus predicted y
  – Independent residuals?
  – Linear effects?
  – Constant variance?
[Figure: residuals (-1000 to 1500) against linear prediction (3200 to 4000); outlier not included]

Violations of assumptions
• Dependent residuals
  – Use mixed models or GEE
• Non linear effects
  – Add square term
• Non-constant variance
  – Use robust variance estimation
[Figure: residuals against gest (200 to 300), and residuals (res) against prediction (p), 3400 to 3800]

Influence
[Figure: birth weight against gestational age (200 to 700) with the outlier marked; regression line with outlier and regression line without outlier]

Measures of influence
Remove obs 1, see change
Remove obs 2, see change
• Measure change in:
  – Predicted outcome
  – Deviance
  – Coefficients (beta)
    • Delta beta
[Figure: change by observation Id]

Delta beta for gestational age
[Figure: delta beta (-10 to 0) against weight (2000 to 6000); observation 539 stands out]
beta for gestational age = 6.04
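The "remove one observation, see the change" idea behind delta beta can be sketched in Python for a regression with a single covariate. This is a minimal sketch, not the routine used to produce the slides: the function names and the six data points are hypothetical, and delta beta is taken here as the unscaled difference "slope with all observations minus slope without observation i" (an assumption about the sign convention).

```python
def slope(x, y):
    """Least-squares slope of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


def delta_betas(x, y):
    """For each observation: slope on all data minus slope with it removed."""
    full = slope(x, y)
    deltas = []
    for i in range(len(x)):
        x_rest = x[:i] + x[i + 1:]
        y_rest = y[:i] + y[i + 1:]
        deltas.append(full - slope(x_rest, y_rest))
    return deltas


# Hypothetical data: five ordinary births plus one outlier in gestational age.
gest = [270, 275, 280, 285, 290, 600]
weight = [3300, 3400, 3500, 3600, 3700, 3400]

deltas = delta_betas(gest, weight)
most_influential = max(range(len(deltas)), key=lambda i: abs(deltas[i]))
print("most influential observation (0-based index):", most_influential)
print("its delta beta:", round(deltas[most_influential], 2))
```

In this toy data the outlier pulls the slope from 20 down to about zero, so its delta beta is large and negative while the other five are close to zero. That is the same pattern a delta beta plot is meant to reveal.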
If obs nr 539 is removed, beta will change from 6 to 16

Removing outlier

                              Full model                  Outlier removed (final model)
                              coeff   95% conf. int.      coeff   95% conf. int.
  Birth weight at ref         3524                        3531
  Gestational age per day     6       (4 , 8)             17      (13 , 20)
  Sex
    Boy                       0                           0
    Girl                      -139    (-229 , -49)        -166    (-252 , -80)
  Parity
    0                         0                           0
    1                         232     (131 , 333)         229     (132 , 326)
    2-7                       226     (107 , 345)         225     (112 , 339)

One outlier affected two estimates

MULTIVARIABLE ANALYSIS 2
Binary outcome: Logistic regression, Being bullied

Ordered categories and model
Interval versus ordered scale:
• Interval scale: 1, 2, 3
• Ordered scale: low, medium, high

  Categories   Regression model
  2            Logistic
  3-7          Ordinal logistic
  >7           Linear (treat as interval)

Logistic model and assumptions
• Association measure
  $OR_1 = e^{\beta_1}$: odds ratio in y for 1 unit increase in $x_1$
• Assumptions
  – Independent errors
  – Linear effects on the log odds scale
• Robustness
  – Influence

Being bullied
• We want the total effect of country on being bullied.
  – The risk of being bullied depends on age and sex.
  – The age and sex distribution may differ between countries.
• Should we adjust for age and sex? No, age and sex are mediating variables
[Figure: DAG with E = country, D = bullied, C1 = age, C2 = sex]

Logistic: being bullied

  Country    N      %       OR     95% conf. int.
  Sweden     407    8.7     1
  Iceland    448    10.9    1.3    (0.8 , 2)
  Norway     379    16.2    2.0    (1.3 , 3.2)
  Finland    409    25.9    3.7    (2.4 , 5.6)
  Denmark    436    23.4    3.2    (2.1 , 4.9)
  p-value <0.001

Roughly:
• Same risk of being bullied in Iceland as in Sweden.
• 2 times the risk in Norway as in Sweden.
• 3 times the risk in Finland as in Sweden.
Prevalence of being bullied = 17%
OR ≈ RR if the outcome is rare
OR > RR (further from 1) if the outcome is common

Summing up
• DAGs
  – State prior knowledge. Guide analysis
• Plots
  – Linearity, variance, outliers
• Bivariate analysis
  – Continuous symmetrical: Mean, T-test, ANOVA
  – Continuous skewed: Median, nonparametric
  – Categorical: Freq, cross, chi-square
• Multivariable analysis
  – Continuous: Linear regression
  – Binary: Logistic regression
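As a closing check on the OR versus RR remark in the bullying example, the crude odds ratios and risk ratios can be recomputed from the percentages in the table, with Sweden as reference. The Python sketch below uses only those five published percentages; the function names are hypothetical, and because the rounded percentages are the only input, the results are approximate.

```python
# Percent bullied per country, copied from the logistic regression table.
percent_bullied = {
    "Sweden": 8.7,
    "Iceland": 10.9,
    "Norway": 16.2,
    "Finland": 25.9,
    "Denmark": 23.4,
}


def odds(p):
    """Odds corresponding to a probability p."""
    return p / (1 - p)


def or_and_rr(percent, reference_percent):
    """Odds ratio and risk ratio of one group against the reference group."""
    p = percent / 100
    p_ref = reference_percent / 100
    return odds(p) / odds(p_ref), p / p_ref


results = {
    country: or_and_rr(pct, percent_bullied["Sweden"])
    for country, pct in percent_bullied.items()
}

for country, (odds_ratio, risk_ratio) in results.items():
    print(f"{country:8s} OR = {odds_ratio:.1f}  RR = {risk_ratio:.1f}")
```

The odds ratios agree with the table to one decimal. For Finland the odds ratio is about 3.7 while the risk ratio is about 3.0, which is why the table reports 3.7 but the text says "3 times the risk": with a common outcome the OR lies further from 1 than the RR.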