ANOVA and linear regression
July 15, 2004

ANOVA: for comparing means between more than 2 groups

ANOVA (ANalysis Of VAriance)
Idea: for two or more groups, test the difference between means of a quantitative, normally distributed variable. ANOVA is just an extension of the t-test (an ANOVA with only two groups is mathematically equivalent to a t-test). Like the t-test, ANOVA is a "parametric" test: it assumes that the outcome variable is roughly normally distributed, with a mean and standard deviation (the parameters) that we can estimate.

ANOVA assumptions
- Normally distributed outcome variable
- Homogeneity of variances (as for the t-test)

The "F-test"
Is the difference in the means of the groups more than background noise (= variability within groups)?

  F = (variability between groups) / (variability within groups)

Spine bone density vs. menstrual regularity
[Figure: spine bone mineral density (g/cm2) plotted by group (amenorrheic, oligomenorrheic, eumenorrheic), illustrating within-group variability and between-group variation.]

Group means and standard deviations
Amenorrheic group (n=11):
– mean spine BMD = .92 g/cm2
– standard deviation = .10 g/cm2
Oligomenorrheic group (n=11):
– mean spine BMD = .94 g/cm2
– standard deviation = .08 g/cm2
Eumenorrheic group (n=11):
– mean spine BMD = 1.06 g/cm2
– standard deviation = .11 g/cm2

The F-test
The numerator measures between-group variation: the deviation of each group's mean from the overall mean, scaled by the size of the groups (n):

  $s^2_{between} = n \cdot s^2_{\bar{x}} = 11 \times \frac{(.92-.97)^2 + (.94-.97)^2 + (1.06-.97)^2}{3-1} = .063$

The denominator measures the average amount of variation within groups (the average of the group variances):

  $s^2_{within} = \overline{s^2} = \frac{1}{3}(.10^2 + .08^2 + .11^2) = .0095$

  $F_{2,30} = \frac{s^2_{between}}{s^2_{within}} = \frac{.063}{.0095} = 6.6$

A large F value indicates that the between-group variation exceeds the within-group variation (= the background noise).

The F-distribution
The F-distribution is a continuous probability distribution that depends on two parameters, n and m (the numerator and denominator degrees of freedom, respectively).

A ratio of sample variances follows an F-distribution:

  $\frac{s^2_{between}}{s^2_{within}} \sim F_{n,m}$

The F-test tests the hypothesis that the two sample variances are equal; F will be close to 1 if they are.

  $H_0: \sigma^2_{between} = \sigma^2_{within}$
  $H_a: \sigma^2_{between} > \sigma^2_{within}$

ANOVA table (k groups, n individuals per group)
- Between groups: d.f. = k-1; sum of squares = SSB (sum of squared deviations of group means from the grand mean); mean sum of squares = SSB/(k-1); F-statistic = [SSB/(k-1)] / [SSW/(nk-k)] ~ F_{k-1, nk-k}; p-value from the F_{k-1, nk-k} chart.
- Within groups: d.f. = nk-k; sum of squares = SSW (sum of squared deviations of observations from their group mean); mean sum of squares = s^2 = SSW/(nk-k).
- Total variation: d.f. = nk-1; sum of squares = TSS (sum of squared deviations of observations from the grand mean); TSS = SSB + SSW.

ANOVA = t-test (two groups, each of size n)
- Between groups: d.f. = 1; sum of squares = SSB (squared difference in means); mean sum of squares = squared difference in means; F-statistic = $F_{1,2n-2} = \frac{(\bar{X}-\bar{Y})^2}{s_p^2\left(\frac{1}{n}+\frac{1}{n}\right)} = (t_{2n-2})^2$; p-value from the F_{1, 2n-2} chart (notice the values are just $(t_{2n-2})^2$).
- Within groups: d.f. = 2n-2; sum of squares = SSW (equivalent to the numerator of the pooled variance); mean sum of squares = the pooled variance $s_p^2$.
- Total variation: d.f. = 2n-1; sum of squares = TSS.

ANOVA summary
A statistically significant ANOVA (F-test) only tells you that at least two of the groups differ, but not which ones. Determining which groups differ (when it's unclear) requires more sophisticated analyses to correct for the problem of multiple comparisons…
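SAS sketch: one-way ANOVA
As a companion to the worked F-test above, here is a minimal, hypothetical SAS sketch; the dataset name bonedensity and the variable names mencat and spinebmd are placeholders, not from the slides:

  /* One-way ANOVA: does mean spine BMD differ across the three menstrual groups? */
  proc glm data=bonedensity;
    class mencat;              /* grouping variable with 3 levels */
    model spinebmd = mencat;   /* overall F-test for any difference in group means */
    means mencat / tukey;      /* group means, with Tukey-adjusted pairwise comparisons (see below) */
  run;
  quit;

With equal group sizes, the overall F printed by PROC GLM should match the hand calculation above (F with 2 and 30 degrees of freedom, about 6.6).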
Question: why not just do 3 pairwise t-tests?

Answer: because, at an error rate of 5% per test, you have an overall chance of up to 1 − (.95)^3 ≈ 14% of making a type-I error (if all 3 comparisons were independent).

If you wanted to compare 6 groups, you'd have to do 6C2 = 15 pairwise t-tests, which would give you a high chance of finding something significant just by chance (if all tests were independent, with a type-I error rate of 5% each): the probability of at least one type-I error = 1 − (.95)^15 ≈ 54%.

Multiple comparisons
With 18 independent comparisons, we have a 60% chance of at least 1 false positive (1 − (.95)^18 ≈ 60%).
With 18 independent comparisons, we expect about 1 false positive (18 × .05 = 0.9).

Correction for multiple comparisons
How to correct for multiple comparisons post hoc…
- Bonferroni's correction (the most conservative adjustment; assuming all tests are independent, divide the significance cutoff alpha by the number of tests, or equivalently multiply each p-value by the number of tests)
- Holm/Hochberg (give a p-value cutoff beyond which results are not significant)
- Tukey's (adjusts p)
- Scheffe's (adjusts p)

Non-parametric ANOVA
Kruskal-Wallis one-way ANOVA: an extension of the Wilcoxon rank-sum test for 2 groups; based on ranks. PROC NPAR1WAY in SAS.

Linear regression

Outline
1. Simple linear regression and prediction
2. Multiple linear regression and multivariate analysis
3. Dummy coding categorical predictors

Review: what is "linear"?
Remember this: Y = mX + B, where m is the slope and B is the intercept.

Review: what's slope?
A slope of 2 means that every 1-unit change in X yields a 2-unit change in Y.

Example
What's the relationship between gestation time and birth weight?

Birth weight depends on gestation time (hypothetical data)
[Scatterplot: Y = birth weight (g) vs. X = gestation time (weeks), with a best-fit line of slope 100 g/wk.]
The best-fit line is chosen so that the sum of the squared (why squared?) distances of the points (the Yi's) from the line is minimized.

Linear regression equation:
  Birth weight (g) = $\alpha$ + $\beta$*(X weeks) + random variation
  Birth weight (g) = 0 + 100*(X weeks)

Prediction
If you know something about X, this knowledge helps you predict something about Y.

Baby weights at Stanford are normally distributed with a mean value of 3400 grams. Your "best guess" at a random baby's weight, given no information about the baby, is what? 3400 grams. But what if you have relevant information? Can you make a better guess?

A new baby is born that gestated for just 30 weeks. What's your best guess at the birth weight? Are you still best off guessing 3400? NO!

At 30 weeks…
[Plots: at X = 30 weeks of gestation, the observed birth weights center around 3000 g; the point (x, y) = (30, 3000) falls on the regression line.]
The babies that gestate for 30 weeks appear to center around a weight of 3000 grams. Our linear regression equation predicts that a baby of 30 weeks gestation will weigh 3000 g: expected weight (g) = 100*(30 weeks) = 3000.

And if X = 20, 30, or 40…
[Plot: birth weights (g) at gestation times of 20, 30, and 40 weeks; the group means fall on the regression line.]

Mean values fall on the line
At 40 weeks, expected weight = 4000; at 30 weeks, expected weight = 3000; at 20 weeks, expected weight = 2000. In general, expected weight = 100 grams/week * X weeks.

Assumptions (or the fine print)
Linear regression assumes that…
– 1. The relationship between X and Y is linear
– 2. Y is distributed normally at each value of X
– 3. The variance of Y at every value of X is the same (homogeneity of variances)

Non-homogeneous variance
[Figure: Y = birth weight (100 g) vs. X = gestation time (weeks), showing a spread in Y that changes with X.]
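SAS sketch: simple linear regression
To make the birth-weight example above concrete, here is a minimal, hypothetical SAS sketch; the dataset name babies and the variable names birthweight and gestweeks are placeholders:

  /* Simple linear regression of birth weight (g) on gestation time (weeks) */
  proc reg data=babies;
    model birthweight = gestweeks;   /* fits birthweight = alpha + beta*gestweeks */
  run;
  quit;

The parameter estimate printed for gestweeks is the slope (grams per week of gestation, 100 in the hypothetical example), and the intercept is alpha.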
A t-test is linear regression!
A t-test is an example of linear regression with a binary predictor. For example, if the mean difference in spine bone density between a sample of men and a sample of women is .11 g/cm2, and the women have an average value of .99 g/cm2, then the t-test for the difference in the means is mathematically equivalent to the linear regression model:

  Spine BMD (g/cm2) = .99 (intercept) + .11*(1 if male, 0 if female)

Multiple linear regression
More than one predictor…

  Y = $\alpha$ + $\beta_1$*X + $\beta_2$*W + $\beta_3$*Z

Each regression coefficient is the amount of change in the outcome variable that would be expected per one-unit change of that predictor, if all other variables in the model were held constant.

ANOVA is linear regression!
A categorical variable with more than two groups, e.g. groups 1, 2, and 3 (mutually exclusive):

  Y = $\alpha$ (= value for group 1) + $\beta_1$*(1 if in group 2) + $\beta_2$*(1 if in group 3)

This is called "dummy coding": multiple binary variables are created to represent being in each category (or not) of a categorical variable.

Example: ANOVA = linear regression
In SAS:

  data stats210.runners;
    set stats210.runners;
    if mencat=1 then amenorrheic=1; else amenorrheic=0;
    if mencat=2 then oligomenorrheic=1; else oligomenorrheic=0;
  run;

The good news is that SAS will often do this for you with a class statement! (A minimal sketch appears at the end of these notes.)

Functions of multivariate analysis:
- Control for confounders
- Test for interactions between predictors (effect modification)
- Improve predictions

Multiple linear regression caveats
Multicollinearity arises when two variables that measure the same or similar things (e.g., weight and BMI) are both included in a multiple regression model; they will, in effect, cancel each other out and generally destroy your model. Model building and diagnostics are tricky business!

Other types of multivariate regression
- Multiple linear regression is for normally distributed outcomes
- Logistic regression is for binary outcomes
- Cox proportional hazards regression is used when time-to-event is the outcome

Reading for this week: Chapters 6-8, 10
Note: midterm next week. One "cheat" sheet allowed for the in-class portion and one for the in-lab portion.
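Postscript: SAS sketch of the class statement
As a companion to the dummy-coding example above, here is a minimal sketch of the class-statement shortcut on the same stats210.runners data; the outcome variable name spinebmd is a hypothetical placeholder:

  /* ANOVA fit as a linear regression: the CLASS statement makes SAS build the dummy variables */
  proc glm data=stats210.runners;
    class mencat;                         /* 3-level menstrual-status variable */
    model spinebmd = mencat / solution;   /* SOLUTION prints the intercept and dummy coefficients */
  run;
  quit;

Note that PROC GLM's parameterization uses the last level of mencat as the reference group, so the intercept is that group's mean and each remaining coefficient is a difference from it; the slide's equation uses group 1 as the reference, but the fitted model is equivalent.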