* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download ANOVAmath
Survey
Document related concepts
Transcript
Math Interlude: Signs, Symbols, and ANOVA nuts and bolts Remember these? Σ= Summation = Product ! = Factorial Just shorthand!! Example Take 5 data points: 5, 6, 7, 9, 10 Represent them generally as X1 to X5 (the observations are indexed by subscripts i): n 5 X i 1 i X1 X 2 X 3 X 4 X 5 5 6 7 9 10 37 Just a shorthand way to write “add up all five data points”! Example Take 5 data points: 5, 6, 7, 9, 10 Represent them generally as X1 to X5 (the observations are indexed by subscripts i): n 5 X i 1 2 i ? 5 6 7 9 10 291 2 2 2 2 2 More summation… 10 i ? i 0 10 i 0 1 2 3 4 5 6 7 8 9 10 55 i 0 In general, n(n 1) i 1 2 3 . n - 3 n - 2 n - 1 n 2 i 1 n Working with summation… 4 10i ? i 1 4 10(1) 10(2) 10(3) 10(4) 10 [1 2 3 4] 10 i i 1 In general, n n i 1 i 1 ci c i Practice (just for fun!)… 5 i 1 i j i 1 j 0 2 ? Products… 5 data points again: 5, 6, 7, 9, 10 5 X i ? i 1 = 5(6)(7)(9)(10) = 18,900 Factorials… n i n! i 1 5! = 5(4)(3)(2)(1) note: 0! = 1 by convention Review of math symbols… X = often used to indicate the independent (predictor) variable Y = often used to indicate the dependent (or outcome) variable Xi = the ith observation of X Xij = in a table, the observation in row i and column j µy = the “true” (population or theoretical) mean of Y Y = the sample mean of Y x2 = the true variance of X / x = the true standard deviation Sx2= Sample variance/ Sx = Sample standard dev ANOVA: Concepts Review When do you use ANOVA? What assumptions are you making when you use ANOVA? What is the null hypothesis in ANOVA? What does a “significant” ANOVA mean? Why not just do a bunch of t-tests to compare means? ANOVA example Does this example meet the assumptions of ANOVA? Mean micronutrient intake from the school lunch by school Calcium (mg) Iron (mg) Folate (μg) Zinc (mg) a Mean SDe Mean SD Mean SD Mean SD S1a, n=28 117.8 62.4 2.0 0.6 26.6 13.1 1.9 1.0 S2b, n=25 158.7 70.5 2.0 0.6 38.7 14.5 1.5 1.2 S3c, n=21 206.5 86.2 2.0 0.6 42.6 15.1 1.3 0.4 School 1 (most deprived; 40% subsidized lunches). b School 2 (medium deprived; <10% subsidized). c School 3 (least deprived; no subsidization, private school). d ANOVA; significant differences are highlighted in bold (P<0.05). P-valued 0.000 0.854 0.000 0.055 FROM: Gould R, Russell J, Barker ME. School lunch menus and 11 to 12 year old children's food choice in three secondary schools in Englandare the nutritional standards being met? Appetite. 2006 Jan;46(1):86-92. ANOVA: Concepts Review On a piece of paper, represent the relationship between school group and calcium as a linear regression equation (no you do not need a computer to calculate this!) predictor=school group outcome=calcium no other predictors in the model set the most deprived school as your reference group round numbers for ease! New Stuff: The Math Behind ANOVA It’s like this: If I have three groups to compare: I could do three pair-wise ttests, but this would increase my type I error So, instead I want to look at the pairwise differences “all at once.” To do this, I can recognize that variance is a statistic that let’s me look at more than one difference at a time… The ANOVA “F-test” Is the difference in the means of the groups more than background noise (=variability within groups)? Summarizes the mean differences between all groups at once. Variabilit y between groups F Variabilit y within groups Analogous to pooled variance from a ttest. Side Note: The F-distribution The F-distribution is a continuous probability distribution that depends on two parameters n and m (numerator and denominator degrees of freedom, respectively): http://www.econtools.com/jevons/java/Graphics2D/FDist.html The F-distribution A ratio of variances follows an F-distribution: 2 between 2 within ~ Fn ,m The F-test tests the hypothesis that two variances are equal. F will be close to 1 if sample variances are equal. 2 2 H 0 : between within H a : 2 between 2 within For example… Randomize 33 subjects to three groups: 800 mg calcium supplement vs. 1500 mg calcium supplement vs. placebo. Compare the spine bone density of all 3 groups after 1 year. Group means and standard deviations Placebo group (n=11): 800 mg calcium supplement group (n=11) Mean spine BMD = .92 g/cm2 standard deviation = .10 g/cm2 Mean spine BMD = .94 g/cm2 standard deviation = .08 g/cm2 1500 mg calcium supplement group (n=11) Mean spine BMD =1.06 g/cm2 standard deviation = .11 g/cm2 Spine bone density vs. treatment 1.2 1.1 1.0 S P I N E 0.9 Within group variability Between group variation Within group variability Within group variability 0.8 0.7 PLACEBO 800mg CALCIUM 1500 mg CALCIUM The size of the groups. Between-group variation. The F-Test 2 sbetween The difference of each group’s mean from the overall mean. 2 2 2 (. 92 . 97 ) (. 94 . 97 ) ( 1 . 06 . 97 ) ns x2 11* ( ) .063 3 1 2 swithin avg s 2 1 (.102 .082 .112 ) .0095 3 F2,30 The average amount of variation within groups. 2 between 2 within s s .063 6.6 .0095 Large F value indicates Each group’s variance. that the between group variation exceeds the within group variation (=the background noise). How to calculate ANOVA’s by hand… Treatment 1 Treatment 2 Treatment 3 Treatment 4 y11 y21 y31 y41 y12 y22 y32 y42 y13 y23 y33 y43 y14 y24 y34 y44 y15 y25 y35 y45 y16 y26 y36 y46 y17 y27 y37 y47 y18 y28 y38 y48 y19 y29 y39 y49 y110 y210 y310 y410 10 y1 j 1 y 2 10 10 10 (y 1j y1 ) j 1 10 1 2 y 2j j 1 10 ( y 2 j y 2 ) 2 j 1 y 3 10 (y 3j y 3j y 4 j 1 y 10 y 3 ) j 1 10 1 k=4 groups 10 10 10 y1 j n=10 obs./group 10 1 10 2 (y 4j 4j j 1 The group means 10 y 4 ) 2 j 1 10 1 The (within) group variances Sum of Squares Within (SSW), or Sum of Squares Error (SSE) 10 10 (y 1j y1 ) 2 (y 1j 10 (y 10 y1 ) + j 1 2 3j y 3 ) 10 2 ( y 2 j y 2 ) 2 j 1 10 4j y 4 ) 2 The (within) group variances 10 1 10 1 10 4 (y j 1 j 1 10 1 10 1 (y y 2 ) j 1 j 1 10 2j 2 + ( y 3 j y 3 ) + j 3 ( y ij y i ) 2 2 10 (y 4j y 4 ) 2 j 1 Sum of Squares Within (SSW) (or SSE, for chance error) i 1 j 1 Terminology Note: “Sum of squares” is just a fancy way of saying the numerator of a variance. Sum of squares divided by degrees of freedom = variance Variance times degrees of freedom = sum of squares Sum of Squares Between (SSB), or Sum of Squares Regression (SSR) 4 Overall mean of all 40 observations (“grand mean”) y y (y i 1 ij i 1 j 1 4 10 x 10 i 40 y ) 2 Sum of Squares Between (SSB). Variability of the group means compared to the grand mean (the variability due to the treatment). Total Sum of Squares (SST) 4 10 i 1 j 1 ( y ij y ) 2 Total sum of squares(TSS). Squared difference of every observation from the overall mean. (numerator of variance of Y!) Partitioning of Variance 4 10 ( y i 1 j 1 ij y i ) 4 2 +10x i 1 ( y i y ) 4 2 = 10 i 1 j 1 SSW + SSB = TSS ( y ij y ) 2 ANOVA Table Source of variation Between (k groups) Within d.f. Sum of squares k-1 SSB F-statistic SSB/k-1 (sum of squared deviations of group means from grand mean) nk-k (n individuals per group) Total variation Mean Sum of Squares nk-1 SSW (sum of squared deviations of observations from their group mean) SSB SSW Go to k 1 nk k s2=SSW/nk-k TSS (sum of squared deviations of observations from grand mean) p-value TSS=SSB + SSW Fk-1,nk-k chart n X n Yn 2 X Yn 2 SSB n (X n ( )) n (Yn ( n )) 2 2 i 1 i 1 n ANOVA=t-test n n X n Yn 2 Y X n ( ) n ( n n )2 2 2 2 2 i 1 i 1 X n 2 Yn 2 X *Y Y X X *Y ) ( ) 2 n n ( n )2 ( n )2 2 n n ) 2 2 2 2 2 2 2 2 n( X n 2 X n * Yn Yn ) n( X n Yn ) 2 n(( Source of variation Between (2 groups) Within d.f. 1 2n-2 Sum of squares SSB Squared (squared difference difference in means in means times n multiplied by n) SSW equivalent to numerator of pooled variance Total 2n-1 variation Mean Sum of Squares TSS Pooled variance F-statistic n( X Y ) 2 sp 2 ( Go to (X Y ) sp n 2 sp n p-value ) (t 2 n 2 ) 2 2 2 F1, 2n-2 Chart notice values are just (t 2 2n-2) Numerical Example Treatment 1 Treatment 2 Treatment 3 Treatment 4 60 inches 50 48 47 67 52 49 67 42 43 50 54 67 67 55 67 56 67 56 68 62 59 61 65 64 67 61 65 59 64 60 56 72 63 59 60 71 65 64 65 Example Step 1) calculate the sum of squares between groups: Treatment 1 Treatment 2 Treatment 3 Treatment 4 60 inches 50 48 47 67 52 49 67 Mean for group 1 = 62.0 42 43 50 54 67 67 55 67 Mean for group 2 = 59.7 56 67 56 68 62 59 61 65 Mean for group 3 = 56.3 64 67 61 65 59 64 60 56 72 63 59 60 71 65 64 65 Mean for group 4 = 61.4 Grand mean= 59.85 SSB = [(62-59.85)2 + (59.7-59.85)2 + (56.3-59.85)2 + (61.4-59.85)2 ] xn per group= 19.65x10 = 196.5 Example Step 2) calculate the sum of squares within groups: (60-62) 2+(67-62) 2+ (42-62) 2+ (67-62) 2+ (56-62) 2+ (6262) 2+ (64-62) 2+ (59-62) 2+ (72-62) 2+ (71-62) 2+ (5059.7) 2+ (52-59.7) 2+ (4359.7) 2+67-59.7) 2+ (6759.7) 2+ (69-59.7) 2…+….(sum of 40 squared deviations) = 2060.6 Treatment 1 Treatment 2 Treatment 3 Treatment 4 60 inches 50 48 47 67 52 49 67 42 43 50 54 67 67 55 67 56 67 56 68 62 59 61 65 64 67 61 65 59 64 60 56 72 63 59 60 71 65 64 65 Step 3) Fill in the ANOVA table Source of variation d.f. Sum of squares Mean Sum of Squares F-statistic p-value Between 3 196.5 65.5 1.14 .344 Within 36 2060.6 57.2 Total 39 2257.1 Step 3) Fill in the ANOVA table Source of variation d.f. Sum of squares Mean Sum of Squares F-statistic p-value Between 3 196.5 65.5 1.14 .344 Within 36 2060.6 57.2 Total 39 2257.1 Coefficient of Determination: How much of the variance in height is “explained by” treatment group? R2=“Coefficient of Determination” = SSB/TSS = 196.5/2275.1=9% Coefficient of Determination SSB SSB R SSB SSE SST 2 The amount of variation in the outcome variable (dependent variable) that is “explained by” the predictor (independent variable). Beyond one-way ANOVA Often, you may want to test more than 1 treatment. ANOVA can accommodate more than 1 treatment or factor, so long as they are independent. Again, the variation partitions beautifully! TSS = SSB1 + SSB2 + SSW Calculating ANOVA from grouped data… Table 6. Mean micronutrient intake from the school lunch by school Calcium (mg) Iron (mg) Folate (μg) Zinc (mg) a Mean SDe Mean SD Mean SD Mean SD S1a, n=25 117.8 62.4 2.0 0.6 26.6 13.1 1.9 1.0 S2b, n=25 158.7 70.5 2.0 0.6 38.7 14.5 1.5 1.2 S3c, n=25 206.5 86.2 2.0 0.6 42.6 15.1 1.3 0.4 School 1 (most deprived; 40% subsidized lunches). b School 2 (medium deprived; <10% subsidized). c School 3 (least deprived; no subsidization, private school). d ANOVA; significant differences are highlighted in bold (P<0.05). P-valued 0.000 0.854 0.000 0.055 FROM: Gould R, Russell J, Barker ME. School lunch menus and 11 to 12 year old children's food choice in three secondary schools in Englandare the nutritional standards being met? Appetite. 2006 Jan;46(1):86-92. Answer Step 1) calculate the sum of squares between groups: Mean for School 1 = 117.8 Mean for School 2 = 158.7 Mean for School 3 = 206.5 Grand mean: 161 SSB = [(117.8-161)2 + (158.7-161)2 + (206.5-161)2] x25 per group= 98,113 Answer Step 2) calculate the sum of squares within groups: S.D. for S1 = 62.4 S.D. for S2 = 70.5 S.D. for S3 = 86.2 Therefore, sum of squares within is: (24)[ 62.42 + 70.5 2+ 86.22]=391,066 Answer Step 3) Fill in your ANOVA table Source of variation d.f. Sum of squares Mean Sum of Squares F-statistic p-value Between 2 98,113 49056 9 <.05 Within 72 391,066 5431 Total 74 489,179 **R2=98113/489179=20% School “explains” 20% of the variance in lunchtime calcium intake in these kids. ANOVA reminders… A statistically significant ANOVA (F-test) only tells you that at least two of the groups differ, but not which ones differ. Determining which groups differ (when it’s unclear) requires more sophisticated analyses to correct for the problem of multiple comparisons… ANOVA reminders… A statistically significant ANOVA does not imply a clinically significant difference between the groups! For example, if we had found a significant 10-mg difference in calcium, this probably would not be enough to care… ANOVA reminders… When data are not normally distributed and sample size is small, use nonparametric ANOVA equivalent… ANOVA reminders ANOVA is just linear regression!