ANOVA and linear regression
July 15, 2004

ANOVA: for comparing means between more than 2 groups

ANOVA (ANalysis Of VAriance)
Idea: for two or more groups, test the difference between means of a quantitative, normally distributed variable. ANOVA is just an extension of the t-test (an ANOVA with only two groups is mathematically equivalent to a t-test). Like the t-test, ANOVA is a "parametric" test: it assumes that the outcome variable is roughly normally distributed, with a mean and standard deviation (the parameters) that we can estimate.

ANOVA assumptions
- Normally distributed outcome variable
- Homogeneity of variances (as for the t-test)

The "F-test"
Is the difference in the means of the groups more than background noise (= variability within groups)?

  F = (variability between groups) / (variability within groups)

Spine bone density vs. menstrual regularity
[Figure: spine bone mineral density (g/cm2) plotted by group (amenorrheic, oligomenorrheic, eumenorrheic), illustrating within-group variability and between-group variation.]

Group means and standard deviations
Amenorrheic group (n=11):
– mean spine BMD = .92 g/cm2
– standard deviation = .10 g/cm2
Oligomenorrheic group (n=11):
– mean spine BMD = .94 g/cm2
– standard deviation = .08 g/cm2
Eumenorrheic group (n=11):
– mean spine BMD = 1.06 g/cm2
– standard deviation = .11 g/cm2

The F-test
The numerator measures between-group variation: the deviation of each group's mean from the overall mean, scaled by the size of the groups (n):

  $s^2_{between} = n \cdot s^2_{\bar{x}} = 11 \times \frac{(.92-.97)^2 + (.94-.97)^2 + (1.06-.97)^2}{3-1} = .063$

The denominator measures the average amount of variation within groups (the average of the group variances):

  $s^2_{within} = \overline{s^2} = \frac{1}{3}(.10^2 + .08^2 + .11^2) = .0095$

  $F_{2,30} = \frac{s^2_{between}}{s^2_{within}} = \frac{.063}{.0095} = 6.6$

A large F value indicates that the between-group variation exceeds the within-group variation (= the background noise).

The F-distribution
The F-distribution is a continuous probability distribution that depends on two parameters, n and m (the numerator and denominator degrees of freedom, respectively).

A ratio of sample variances follows an F-distribution:

  $\frac{s^2_{between}}{s^2_{within}} \sim F_{n,m}$

The F-test tests the hypothesis that the two sample variances are equal; F will be close to 1 if they are.

  $H_0: \sigma^2_{between} = \sigma^2_{within}$
  $H_a: \sigma^2_{between} > \sigma^2_{within}$

ANOVA table (k groups, n individuals per group)
- Between groups: d.f. = k-1; sum of squares = SSB (sum of squared deviations of group means from the grand mean); mean sum of squares = SSB/(k-1); F-statistic = [SSB/(k-1)] / [SSW/(nk-k)] ~ F_{k-1, nk-k}; p-value from the F_{k-1, nk-k} chart.
- Within groups: d.f. = nk-k; sum of squares = SSW (sum of squared deviations of observations from their group mean); mean sum of squares = s^2 = SSW/(nk-k).
- Total variation: d.f. = nk-1; sum of squares = TSS (sum of squared deviations of observations from the grand mean); TSS = SSB + SSW.

ANOVA = t-test (two groups, each of size n)
- Between groups: d.f. = 1; sum of squares = SSB (squared difference in means); mean sum of squares = squared difference in means; F-statistic = $F_{1,2n-2} = \frac{(\bar{X}-\bar{Y})^2}{s_p^2\left(\frac{1}{n}+\frac{1}{n}\right)} = (t_{2n-2})^2$; p-value from the F_{1, 2n-2} chart (notice the values are just $(t_{2n-2})^2$).
- Within groups: d.f. = 2n-2; sum of squares = SSW (equivalent to the numerator of the pooled variance); mean sum of squares = the pooled variance $s_p^2$.
- Total variation: d.f. = 2n-1; sum of squares = TSS.

ANOVA summary
A statistically significant ANOVA (F-test) only tells you that at least two of the groups differ, but not which ones. Determining which groups differ (when it's unclear) requires more sophisticated analyses to correct for the problem of multiple comparisons…
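SAS sketch: one-way ANOVA
As a companion to the worked F-test above, here is a minimal, hypothetical SAS sketch; the dataset name bonedensity and the variable names mencat and spinebmd are placeholders, not from the slides:

  /* One-way ANOVA: does mean spine BMD differ across the three menstrual groups? */
  proc glm data=bonedensity;
    class mencat;              /* grouping variable with 3 levels */
    model spinebmd = mencat;   /* overall F-test for any difference in group means */
    means mencat / tukey;      /* group means, with Tukey-adjusted pairwise comparisons (see below) */
  run;
  quit;

With equal group sizes, the overall F printed by PROC GLM should match the hand calculation above (F with 2 and 30 degrees of freedom, about 6.6).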
Question: why not just do 3 pairwise t-tests?

Answer: because, at an error rate of 5% per test, you have an overall chance of up to 1 − (.95)^3 ≈ 14% of making a type-I error (if all 3 comparisons were independent).

If you wanted to compare 6 groups, you'd have to do 6C2 = 15 pairwise t-tests, which would give you a high chance of finding something significant just by chance (if all tests were independent, with a type-I error rate of 5% each): the probability of at least one type-I error = 1 − (.95)^15 ≈ 54%.

Multiple comparisons
With 18 independent comparisons, we have a 60% chance of at least 1 false positive (1 − (.95)^18 ≈ 60%).
With 18 independent comparisons, we expect about 1 false positive (18 × .05 = 0.9).

Correction for multiple comparisons
How to correct for multiple comparisons post hoc…
- Bonferroni's correction (the most conservative adjustment; assuming all tests are independent, divide the significance cutoff alpha by the number of tests, or equivalently multiply each p-value by the number of tests)
- Holm/Hochberg (give a p-value cutoff beyond which results are not significant)
- Tukey's (adjusts p)
- Scheffe's (adjusts p)

Non-parametric ANOVA
Kruskal-Wallis one-way ANOVA: an extension of the Wilcoxon rank-sum test for 2 groups; based on ranks. PROC NPAR1WAY in SAS.

Linear regression

Outline
1. Simple linear regression and prediction
2. Multiple linear regression and multivariate analysis
3. Dummy coding categorical predictors

Review: what is "linear"?
Remember this: Y = mX + B, where m is the slope and B is the intercept.

Review: what's slope?
A slope of 2 means that every 1-unit change in X yields a 2-unit change in Y.

Example
What's the relationship between gestation time and birth weight?

Birth weight depends on gestation time (hypothetical data)
[Scatterplot: Y = birth weight (g) vs. X = gestation time (weeks), with a best-fit line of slope 100 g/wk.]
The best-fit line is chosen so that the sum of the squared (why squared?) distances of the points (the Yi's) from the line is minimized.

Linear regression equation:
  Birth weight (g) = $\alpha$ + $\beta$*(X weeks) + random variation
  Birth weight (g) = 0 + 100*(X weeks)

Prediction
If you know something about X, this knowledge helps you predict something about Y.

Baby weights at Stanford are normally distributed with a mean value of 3400 grams. Your "best guess" at a random baby's weight, given no information about the baby, is what? 3400 grams. But what if you have relevant information? Can you make a better guess?

A new baby is born that gestated for just 30 weeks. What's your best guess at the birth weight? Are you still best off guessing 3400? NO!

At 30 weeks…
[Plots: at X = 30 weeks of gestation, the observed birth weights center around 3000 g; the point (x, y) = (30, 3000) falls on the regression line.]
The babies that gestate for 30 weeks appear to center around a weight of 3000 grams. Our linear regression equation predicts that a baby of 30 weeks gestation will weigh 3000 g: expected weight (g) = 100*(30 weeks) = 3000.

And if X = 20, 30, or 40…
[Plot: birth weights (g) at gestation times of 20, 30, and 40 weeks; the group means fall on the regression line.]

Mean values fall on the line
At 40 weeks, expected weight = 4000; at 30 weeks, expected weight = 3000; at 20 weeks, expected weight = 2000. In general, expected weight = 100 grams/week * X weeks.

Assumptions (or the fine print)
Linear regression assumes that…
– 1. The relationship between X and Y is linear
– 2. Y is distributed normally at each value of X
– 3. The variance of Y at every value of X is the same (homogeneity of variances)

Non-homogeneous variance
[Figure: Y = birth weight (100 g) vs. X = gestation time (weeks), showing a spread in Y that changes with X.]
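SAS sketch: simple linear regression
To make the birth-weight example above concrete, here is a minimal, hypothetical SAS sketch; the dataset name babies and the variable names birthweight and gestweeks are placeholders:

  /* Simple linear regression of birth weight (g) on gestation time (weeks) */
  proc reg data=babies;
    model birthweight = gestweeks;   /* fits birthweight = alpha + beta*gestweeks */
  run;
  quit;

The parameter estimate printed for gestweeks is the slope (grams per week of gestation, 100 in the hypothetical example), and the intercept is alpha.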
A t-test is linear regression!
A t-test is an example of linear regression with a binary predictor. For example, if the mean difference in spine bone density between a sample of men and a sample of women is .11 g/cm2, and the women have an average value of .99 g/cm2, then the t-test for the difference in the means is mathematically equivalent to the linear regression model:

  Spine BMD (g/cm2) = .99 (intercept) + .11*(1 if male, 0 if female)

Multiple linear regression
More than one predictor…

  Y = $\alpha$ + $\beta_1$*X + $\beta_2$*W + $\beta_3$*Z

Each regression coefficient is the amount of change in the outcome variable that would be expected per one-unit change of that predictor, if all other variables in the model were held constant.

ANOVA is linear regression!
A categorical variable with more than two groups, e.g. groups 1, 2, and 3 (mutually exclusive):

  Y = $\alpha$ (= value for group 1) + $\beta_1$*(1 if in group 2) + $\beta_2$*(1 if in group 3)

This is called "dummy coding": multiple binary variables are created to represent being in each category (or not) of a categorical variable.

Example: ANOVA = linear regression
In SAS:

  data stats210.runners;
    set stats210.runners;
    if mencat=1 then amenorrheic=1; else amenorrheic=0;
    if mencat=2 then oligomenorrheic=1; else oligomenorrheic=0;
  run;

The good news is that SAS will often do this for you with a class statement! (A minimal sketch appears at the end of these notes.)

Functions of multivariate analysis:
- Control for confounders
- Test for interactions between predictors (effect modification)
- Improve predictions

Multiple linear regression caveats
Multicollinearity arises when two variables that measure the same or similar things (e.g., weight and BMI) are both included in a multiple regression model; they will, in effect, cancel each other out and generally destroy your model. Model building and diagnostics are tricky business!

Other types of multivariate regression
- Multiple linear regression is for normally distributed outcomes
- Logistic regression is for binary outcomes
- Cox proportional hazards regression is used when time-to-event is the outcome

Reading for this week: Chapters 6-8, 10
Note: midterm next week. One "cheat" sheet allowed for the in-class portion and one for the in-lab portion.
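Postscript: SAS sketch of the class statement
As a companion to the dummy-coding example above, here is a minimal sketch of the class-statement shortcut on the same stats210.runners data; the outcome variable name spinebmd is a hypothetical placeholder:

  /* ANOVA fit as a linear regression: the CLASS statement makes SAS build the dummy variables */
  proc glm data=stats210.runners;
    class mencat;                         /* 3-level menstrual-status variable */
    model spinebmd = mencat / solution;   /* SOLUTION prints the intercept and dummy coefficients */
  run;
  quit;

Note that PROC GLM's parameterization uses the last level of mencat as the reference group, so the intercept is that group's mean and each remaining coefficient is a difference from it; the slide's equation uses group 1 as the reference, but the fitted model is equivalent.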