Math Interlude: Signs, Symbols, and ANOVA nuts and bolts
Remember these?

Σ = Summation
Π = Product
! = Factorial

Just shorthand!!
Example

Take 5 data points: 5, 6, 7, 9, 10. Represent them generally as X1 to X5 (the observations are indexed by subscripts i):

$\sum_{i=1}^{n=5} X_i = X_1 + X_2 + X_3 + X_4 + X_5 = 5 + 6 + 7 + 9 + 10 = 37$

Just a shorthand way to write "add up all five data points"!
Example

Take the same 5 data points: 5, 6, 7, 9, 10.

$\sum_{i=1}^{n=5} X_i^2 = ?$

$\sum_{i=1}^{n=5} X_i^2 = 5^2 + 6^2 + 7^2 + 9^2 + 10^2 = 291$
More summation…

$\sum_{i=0}^{10} i = ?$

$\sum_{i=0}^{10} i = 0 + 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 + 10 = 55$

In general,

$\sum_{i=1}^{n} i = 1 + 2 + 3 + \dots + (n-3) + (n-2) + (n-1) + n = \frac{n(n+1)}{2}$
Working with summation…

$\sum_{i=1}^{4} 10i = ?$

$\sum_{i=1}^{4} 10i = 10(1) + 10(2) + 10(3) + 10(4) = 10[1 + 2 + 3 + 4] = 10\sum_{i=1}^{4} i$

In general, constants factor out of a summation:

$\sum_{i=1}^{n} c\,i = c\sum_{i=1}^{n} i$
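Since summation notation is just shorthand for arithmetic, the examples above are easy to check in a few lines of Python (the code is our illustration, not part of the original slides):

```python
# Checking the summation examples above with plain Python.
data = [5, 6, 7, 9, 10]

print(sum(data))                    # sum of X_i for i = 1..5  -> 37
print(sum(x**2 for x in data))      # sum of X_i^2             -> 291

# Gauss' formula: sum of i for i = 0..10 equals n(n+1)/2 with n = 10
print(sum(range(11)), 10 * 11 // 2)            # 55 55

# Constants factor out of a summation: sum(10*i) == 10 * sum(i)
print(sum(10 * i for i in range(1, 5)))        # 100
print(10 * sum(range(1, 5)))                   # 100
```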
Practice (just for fun!)…

$\sum_{i=1}^{5}\sum_{j=0}^{2} i^j = ?$
Products…

The same 5 data points again: 5, 6, 7, 9, 10.

$\prod_{i=1}^{5} X_i = ?$

$\prod_{i=1}^{5} X_i = 5(6)(7)(9)(10) = 18{,}900$
Factorials…

$\prod_{i=1}^{n} i = n!$

5! = 5(4)(3)(2)(1) = 120

note: 0! = 1 by convention
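The product and factorial examples can be checked the same way (math.prod needs Python 3.8+; again, this is our illustration):

```python
import math

data = [5, 6, 7, 9, 10]

print(math.prod(data))     # product of X_i = 5*6*7*9*10 = 18900
print(math.factorial(5))   # 5! = 5*4*3*2*1 = 120
print(math.factorial(0))   # 0! = 1 by convention
```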
Review of math symbols…

X = often used to indicate the independent (predictor) variable
Y = often used to indicate the dependent (or outcome) variable
Xi = the ith observation of X
Xij = in a table, the observation in row i and column j
µy = the "true" (population or theoretical) mean of Y
Ȳ = the sample mean of Y
σx² = the true (population) variance of X / σx = the true standard deviation of X
Sx² = the sample variance of X / Sx = the sample standard deviation of X
ANOVA: Concepts Review

When do you use ANOVA?
What assumptions are you making when you use ANOVA?
What is the null hypothesis in ANOVA?
What does a "significant" ANOVA mean?
Why not just do a bunch of t-tests to compare means?
ANOVA example

Does this example meet the assumptions of ANOVA?

Mean micronutrient intake from the school lunch by school

                Calcium (mg)      Iron (mg)       Folate (μg)      Zinc (mg)
                Mean     SD       Mean     SD     Mean     SD      Mean     SD
S1 (a), n=28    117.8    62.4     2.0      0.6    26.6     13.1    1.9      1.0
S2 (b), n=25    158.7    70.5     2.0      0.6    38.7     14.5    1.5      1.2
S3 (c), n=21    206.5    86.2     2.0      0.6    42.6     15.1    1.3      0.4
P-value (d)     0.000             0.854           0.000            0.055

(a) School 1 (most deprived; 40% subsidized lunches).
(b) School 2 (medium deprived; <10% subsidized).
(c) School 3 (least deprived; no subsidization, private school).
(d) ANOVA; significant differences are highlighted in bold (P<0.05).

FROM: Gould R, Russell J, Barker ME. School lunch menus and 11 to 12 year old children's food choice in three secondary schools in England - are the nutritional standards being met? Appetite. 2006 Jan;46(1):86-92.
ANOVA: Concepts Review

On a piece of paper, represent the relationship between school group and calcium as a linear regression equation (no, you do not need a computer to calculate this!):

predictor = school group
outcome = calcium
no other predictors in the model
set the most deprived school as your reference group
round numbers for ease!
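One way to check your hand answer afterwards: with a single dummy-coded categorical predictor, a linear regression simply reproduces the group means, so the intercept is the reference-group mean and each coefficient is a difference in means. A minimal sketch using the rounded calcium means from the table above (the variable names are ours):

```python
# Dummy-coded regression "by hand": School 1 (most deprived) is the reference group.
mean_s1, mean_s2, mean_s3 = 117.8, 158.7, 206.5   # calcium means from the table

intercept = mean_s1              # reference-group mean, ~118
beta_s2 = mean_s2 - mean_s1      # coefficient on the School 2 dummy, ~41
beta_s3 = mean_s3 - mean_s1      # coefficient on the School 3 dummy, ~89

print(f"calcium = {intercept:.0f} + {beta_s2:.0f}*I(School 2) + {beta_s3:.0f}*I(School 3)")
```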
New Stuff: The Math Behind ANOVA

It's like this: if I have three groups to compare:

I could do three pairwise t-tests, but this would increase my type I error.
So, instead, I want to look at the pairwise differences "all at once."
To do this, I can recognize that variance is a statistic that lets me look at more than one difference at a time…
The ANOVA "F-test"

Is the difference in the means of the groups more than background noise (= variability within groups)? The F-test summarizes the mean differences between all groups at once:

$F = \frac{\text{Variability between groups}}{\text{Variability within groups}}$

The denominator (within-group variability) is analogous to the pooled variance from a t-test.
Side Note: The F-distribution

The F-distribution is a continuous probability distribution that depends on two parameters, n and m (numerator and denominator degrees of freedom, respectively):
http://www.econtools.com/jevons/java/Graphics2D/FDist.html
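If the applet above is unavailable, a minimal offline sketch with scipy and matplotlib (our tooling choice, not the original's) draws the same family of curves:

```python
# Plot the F density for a few (numerator, denominator) d.f. pairs.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

x = np.linspace(0.01, 5, 400)
for n, m in [(2, 30), (3, 36), (5, 100)]:
    plt.plot(x, stats.f.pdf(x, n, m), label=f"F({n}, {m})")
plt.xlabel("F value")
plt.ylabel("density")
plt.legend()
plt.show()
```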
The F-distribution

A ratio of variances follows an F-distribution:

$\frac{\sigma^2_{between}}{\sigma^2_{within}} \sim F_{n,m}$

The F-test tests the hypothesis that two variances are equal. F will be close to 1 if the sample variances are equal.

$H_0: \sigma^2_{between} = \sigma^2_{within}$
$H_a: \sigma^2_{between} > \sigma^2_{within}$
For example…

Randomize 33 subjects to three groups: 800 mg calcium supplement vs. 1500 mg calcium supplement vs. placebo. Compare the spine bone density of all 3 groups after 1 year.

Group means and standard deviations

Placebo group (n=11):
  Mean spine BMD = .92 g/cm²; standard deviation = .10 g/cm²

800 mg calcium supplement group (n=11):
  Mean spine BMD = .94 g/cm²; standard deviation = .08 g/cm²

1500 mg calcium supplement group (n=11):
  Mean spine BMD = 1.06 g/cm²; standard deviation = .11 g/cm²
[Figure: Spine bone density (g/cm², roughly 0.7 to 1.2) plotted by treatment group (placebo, 800 mg calcium, 1500 mg calcium), with annotations marking the within-group variability of each group and the between-group variation among the group means.]
The F-Test

The between-group variance: n (the size of the groups) times the variance of the group means around the overall mean (the between-group variation):

$s^2_{between} = n s_{\bar{x}}^2 = 11 \times \frac{(.92-.97)^2 + (.94-.97)^2 + (1.06-.97)^2}{3-1} = .063$

The within-group variance: the average amount of variation within groups, i.e., the average of each group's variance:

$s^2_{within} = \text{avg } s^2 = \frac{1}{3}(.10^2 + .08^2 + .11^2) = .0095$

$F_{2,30} = \frac{s^2_{between}}{s^2_{within}} = \frac{.063}{.0095} \approx 6.6$

A large F value indicates that the between-group variation exceeds the within-group variation (= the background noise).
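The same calculation as a short Python sketch (scipy is our assumption; the slide rounds the grand mean to .97, so the results agree up to rounding):

```python
# F-test for the calcium/BMD example, from the group summary statistics.
from scipy import stats

means = [0.92, 0.94, 1.06]    # placebo, 800 mg, 1500 mg (g/cm^2)
sds = [0.10, 0.08, 0.11]
n, k = 11, 3                  # subjects per group, number of groups

grand_mean = sum(means) / k
s2_between = n * sum((m - grand_mean) ** 2 for m in means) / (k - 1)   # ~ .063
s2_within = sum(sd ** 2 for sd in sds) / k                             # ~ .0095

F = s2_between / s2_within                                             # ~ 6.6
p = stats.f.sf(F, k - 1, k * (n - 1))                                  # upper tail of F(2, 30)
print(round(s2_between, 3), round(s2_within, 4), round(F, 1), round(p, 4))
```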
How to calculate ANOVAs by hand…

Data layout, with k = 4 groups and n = 10 observations per group:

Treatment 1   Treatment 2   Treatment 3   Treatment 4
y11           y21           y31           y41
y12           y22           y32           y42
y13           y23           y33           y43
y14           y24           y34           y44
y15           y25           y35           y45
y16           y26           y36           y46
y17           y27           y37           y47
y18           y28           y38           y48
y19           y29           y39           y49
y1,10         y2,10         y3,10         y4,10

The group means:

$\bar{y}_{i\cdot} = \frac{\sum_{j=1}^{10} y_{ij}}{10}$  for i = 1, 2, 3, 4

The (within) group variances:

$s_i^2 = \frac{\sum_{j=1}^{10} (y_{ij} - \bar{y}_{i\cdot})^2}{10 - 1}$  for i = 1, 2, 3, 4
Sum of Squares Within (SSW), or Sum of Squares Error (SSE)

Take the numerators of the four (within) group variances and add them up:

$\sum_{j=1}^{10}(y_{1j}-\bar{y}_{1\cdot})^2 + \sum_{j=1}^{10}(y_{2j}-\bar{y}_{2\cdot})^2 + \sum_{j=1}^{10}(y_{3j}-\bar{y}_{3\cdot})^2 + \sum_{j=1}^{10}(y_{4j}-\bar{y}_{4\cdot})^2 = \sum_{i=1}^{4}\sum_{j=1}^{10}(y_{ij}-\bar{y}_{i\cdot})^2$

This is the Sum of Squares Within (SSW), or SSE, for chance error.
Terminology Note:

"Sum of squares" is just a fancy way of saying the numerator of a variance.
Sum of squares divided by degrees of freedom = variance.
Variance times degrees of freedom = sum of squares.
Sum of Squares Between (SSB), or Sum of Squares Regression (SSR)

Overall mean of all 40 observations ("grand mean"):

$\bar{y}_{\cdot\cdot} = \frac{\sum_{i=1}^{4}\sum_{j=1}^{10} y_{ij}}{40}$

$SSB = 10 \times \sum_{i=1}^{4}(\bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot})^2$

The Sum of Squares Between (SSB) measures the variability of the group means compared to the grand mean (the variability due to the treatment).
Total Sum of Squares (SST)

$TSS = \sum_{i=1}^{4}\sum_{j=1}^{10}(y_{ij} - \bar{y}_{\cdot\cdot})^2$

The total sum of squares (TSS) is the squared difference of every observation from the overall mean (the numerator of the variance of Y!).
Partitioning of Variance

$\sum_{i=1}^{4}\sum_{j=1}^{10}(y_{ij}-\bar{y}_{i\cdot})^2 + 10\times\sum_{i=1}^{4}(\bar{y}_{i\cdot}-\bar{y}_{\cdot\cdot})^2 = \sum_{i=1}^{4}\sum_{j=1}^{10}(y_{ij}-\bar{y}_{\cdot\cdot})^2$

SSW + SSB = TSS
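A quick numerical sanity check of this partition on made-up data, mirroring the k = 4, n = 10 layout above (numpy is our choice here):

```python
# Verify SSW + SSB = TSS on simulated data: 4 groups x 10 observations.
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(60, 5, size=(4, 10))                  # rows = groups, columns = observations

group_means = y.mean(axis=1, keepdims=True)
grand_mean = y.mean()

ssw = ((y - group_means) ** 2).sum()                 # within-group sum of squares
ssb = 10 * ((group_means - grand_mean) ** 2).sum()   # between-group sum of squares
tss = ((y - grand_mean) ** 2).sum()                  # total sum of squares

print(round(ssw + ssb, 6) == round(tss, 6))          # True
```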
ANOVA Table

Source of variation    d.f.    Sum of squares    Mean Sum of Squares    F-statistic                   p-value
Between (k groups)     k-1     SSB               SSB/(k-1)              [SSB/(k-1)] / [SSW/(nk-k)]    Go to F(k-1, nk-k) chart
Within                 nk-k    SSW               s² = SSW/(nk-k)
Total variation        nk-1    TSS = SSB + SSW

SSB = sum of squared deviations of the group means from the grand mean.
SSW = sum of squared deviations of the observations from their group means.
TSS = sum of squared deviations of the observations from the grand mean.
(k = number of groups; n = number of individuals per group.)
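The same table written as a small Python helper (a sketch of our own, assuming equal group sizes as in the layout above):

```python
# Fill in a one-way ANOVA table from raw data: a list of k groups, n obs. each.
from scipy import stats

def anova_table(groups):
    k = len(groups)
    n = len(groups[0])                       # assumes equal group sizes
    grand_mean = sum(y for g in groups for y in g) / (n * k)
    group_means = [sum(g) / n for g in groups]

    ssb = n * sum((m - grand_mean) ** 2 for m in group_means)
    ssw = sum((y - m) ** 2 for g, m in zip(groups, group_means) for y in g)
    msb, msw = ssb / (k - 1), ssw / (n * k - k)
    F = msb / msw
    p = stats.f.sf(F, k - 1, n * k - k)      # p-value from the F(k-1, nk-k) distribution
    return {"SSB": ssb, "SSW": ssw, "TSS": ssb + ssw,
            "MSB": msb, "MSW": msw, "F": F, "p": p}
```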
ANOVA = t-test

With only two groups, each of size n, the grand mean is the average of the two sample means, so:

$SSB = n\left(\bar{X}_n - \frac{\bar{X}_n+\bar{Y}_n}{2}\right)^2 + n\left(\bar{Y}_n - \frac{\bar{X}_n+\bar{Y}_n}{2}\right)^2 = n\left(\frac{\bar{X}_n-\bar{Y}_n}{2}\right)^2 + n\left(\frac{\bar{Y}_n-\bar{X}_n}{2}\right)^2 = \frac{n}{2}\left(\bar{X}_n^2 - 2\bar{X}_n\bar{Y}_n + \bar{Y}_n^2\right) = \frac{n(\bar{X}_n-\bar{Y}_n)^2}{2}$

Source of variation    d.f.    Sum of squares    Mean Sum of Squares      F-statistic                p-value
Between (2 groups)     1       SSB               SSB                      SSB / sp² = (t_{2n-2})²    Go to F(1, 2n-2) chart
Within                 2n-2    SSW               sp² (pooled variance)
Total variation        2n-1    TSS

SSB = n(X̄ - Ȳ)²/2, the squared difference in the two means scaled by the group size.
SSW is equivalent to the numerator of the pooled variance sp².

$F_{1,\,2n-2} = \frac{n(\bar{X}_n-\bar{Y}_n)^2}{2 s_p^2} = \left(\frac{\bar{X}_n-\bar{Y}_n}{\sqrt{\frac{s_p^2}{n}+\frac{s_p^2}{n}}}\right)^2 = \left(t_{2n-2}\right)^2$

Notice: the F(1, 2n-2) values are just the (t_{2n-2})² values.
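A quick demonstration that the two-group F is exactly the squared pooled t, using two small illustrative samples (scipy is our assumption):

```python
# With two groups, one-way ANOVA and the pooled two-sample t-test agree: F = t^2.
from scipy import stats

x = [60, 67, 42, 67, 56, 62, 64, 59, 72, 71]
y = [50, 52, 43, 67, 67, 59, 67, 64, 63, 65]

F, p_f = stats.f_oneway(x, y)
t, p_t = stats.ttest_ind(x, y)            # pooled-variance t-test (equal_var=True by default)

print(round(F, 4), round(t ** 2, 4))      # identical
print(round(p_f, 4), round(p_t, 4))       # identical
```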
Numerical Example

Heights (in inches) for 10 subjects in each of 4 treatment groups:

Treatment 1   Treatment 2   Treatment 3   Treatment 4
60            50            48            47
67            52            49            67
42            43            50            54
67            67            55            67
56            67            56            68
62            59            61            65
64            67            61            65
59            64            60            56
72            63            59            60
71            65            64            65
Example

Step 1) Calculate the sum of squares between groups:

Mean for group 1 = 62.0
Mean for group 2 = 59.7
Mean for group 3 = 56.3
Mean for group 4 = 61.4
Grand mean = 59.85

SSB = [(62-59.85)² + (59.7-59.85)² + (56.3-59.85)² + (61.4-59.85)²] × n per group = 19.65 × 10 = 196.5
Example

Step 2) Calculate the sum of squares within groups:

SSW = (60-62)² + (67-62)² + (42-62)² + (67-62)² + (56-62)² + (62-62)² + (64-62)² + (59-62)² + (72-62)² + (71-62)²
    + (50-59.7)² + (52-59.7)² + (43-59.7)² + (67-59.7)² + (67-59.7)² + (59-59.7)² + …
    + … (the sum of all 40 squared deviations from the group means) = 2060.6
Step 3) Fill in the ANOVA table

Source of variation    d.f.    Sum of squares    Mean Sum of Squares    F-statistic    p-value
Between                3       196.5             65.5                   1.14           .344
Within                 36      2060.6            57.2
Total                  39      2257.1
Coefficient of Determination:

How much of the variance in height is "explained by" treatment group?

R² = "Coefficient of Determination" = SSB/TSS = 196.5/2257.1 = 9%
Coefficient of Determination

$R^2 = \frac{SSB}{SSB + SSE} = \frac{SSB}{SST}$

The amount of variation in the outcome variable (dependent variable) that is "explained by" the predictor (independent variable).
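The height example can be double-checked with scipy's built-in one-way ANOVA (a sketch; the hand-rounded values above agree to within rounding):

```python
# Verify the numerical example: F, p, and R^2 = SSB/TSS.
from scipy import stats

t1 = [60, 67, 42, 67, 56, 62, 64, 59, 72, 71]
t2 = [50, 52, 43, 67, 67, 59, 67, 64, 63, 65]
t3 = [48, 49, 50, 55, 56, 61, 61, 60, 59, 64]
t4 = [47, 67, 54, 67, 68, 65, 65, 56, 60, 65]

F, p = stats.f_oneway(t1, t2, t3, t4)
print(round(F, 2), round(p, 3))          # ~1.14  ~0.344

groups = [t1, t2, t3, t4]
grand = sum(sum(g) for g in groups) / 40
ssb = sum(10 * (sum(g) / 10 - grand) ** 2 for g in groups)
tss = sum((y - grand) ** 2 for g in groups for y in g)
print(round(ssb, 1), round(tss, 1), round(ssb / tss, 2))   # ~196.5  ~2257.1  ~0.09
```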
Beyond one-way ANOVA

Often, you may want to test more than one treatment. ANOVA can accommodate more than one treatment or factor, so long as they are independent. Again, the variation partitions beautifully!

TSS = SSB1 + SSB2 + SSW
Calculating ANOVA from grouped data…

Table 6. Mean micronutrient intake from the school lunch by school

                Calcium (mg)      Iron (mg)       Folate (μg)      Zinc (mg)
                Mean     SD       Mean     SD     Mean     SD      Mean     SD
S1 (a), n=25    117.8    62.4     2.0      0.6    26.6     13.1    1.9      1.0
S2 (b), n=25    158.7    70.5     2.0      0.6    38.7     14.5    1.5      1.2
S3 (c), n=25    206.5    86.2     2.0      0.6    42.6     15.1    1.3      0.4
P-value (d)     0.000             0.854           0.000            0.055

(a) School 1 (most deprived; 40% subsidized lunches).
(b) School 2 (medium deprived; <10% subsidized).
(c) School 3 (least deprived; no subsidization, private school).
(d) ANOVA; significant differences are highlighted in bold (P<0.05).

FROM: Gould R, Russell J, Barker ME. School lunch menus and 11 to 12 year old children's food choice in three secondary schools in England - are the nutritional standards being met? Appetite. 2006 Jan;46(1):86-92.
Answer

Step 1) Calculate the sum of squares between groups:

Mean for School 1 = 117.8
Mean for School 2 = 158.7
Mean for School 3 = 206.5
Grand mean: 161

SSB = [(117.8-161)² + (158.7-161)² + (206.5-161)²] × 25 per group = 98,113
Answer

Step 2) Calculate the sum of squares within groups:

S.D. for S1 = 62.4
S.D. for S2 = 70.5
S.D. for S3 = 86.2

Therefore, the sum of squares within is:
(24)[62.4² + 70.5² + 86.2²] = 391,066
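The same grouped-data calculation as a Python sketch, working only from the published means, SDs, and group sizes (scipy is our assumption; small differences from the hand calculation above come from rounding):

```python
# One-way ANOVA from summary statistics: school-lunch calcium example.
from scipy import stats

means = [117.8, 158.7, 206.5]
sds = [62.4, 70.5, 86.2]
ns = [25, 25, 25]
k = len(means)

grand_mean = sum(n * m for n, m in zip(ns, means)) / sum(ns)
ssb = sum(n * (m - grand_mean) ** 2 for n, m in zip(ns, means))
ssw = sum((n - 1) * sd ** 2 for n, sd in zip(ns, sds))

df_b, df_w = k - 1, sum(ns) - k
F = (ssb / df_b) / (ssw / df_w)
p = stats.f.sf(F, df_b, df_w)
print(round(ssb), round(ssw), round(F, 1), round(p, 4))   # ~98545  ~391067  ~9.1, p < .05
```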
Answer

Step 3) Fill in your ANOVA table

Source of variation    d.f.    Sum of squares    Mean Sum of Squares    F-statistic    p-value
Between                2       98,113            49,056                 9              <.05
Within                 72      391,066           5,431
Total                  74      489,179
R² = 98,113/489,179 = 20%

School "explains" 20% of the variance in lunchtime calcium intake in these kids.
ANOVA reminders…

A statistically significant ANOVA (F-test) only tells you that at least two of the groups differ, but not which ones differ. Determining which groups differ (when it's unclear) requires more sophisticated analyses to correct for the problem of multiple comparisons…
ANOVA reminders…

A statistically significant ANOVA does not imply a clinically significant difference between the groups! For example, if we had found a significant 10-mg difference in calcium, this probably would not be enough to care about…
ANOVA reminders…

When data are not normally distributed and the sample size is small, use a non-parametric ANOVA equivalent (e.g., the Kruskal-Wallis test)…
ANOVA reminders

ANOVA is just linear regression!
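As a closing illustration of that point (our own sketch, assuming pandas, scipy, and statsmodels are available): regressing the outcome on a dummy-coded group variable reproduces the one-way ANOVA F-test exactly. Here it is for the height example from earlier:

```python
# One-way ANOVA vs. linear regression with a categorical predictor.
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

heights = [60, 67, 42, 67, 56, 62, 64, 59, 72, 71,   # Treatment 1
           50, 52, 43, 67, 67, 59, 67, 64, 63, 65,   # Treatment 2
           48, 49, 50, 55, 56, 61, 61, 60, 59, 64,   # Treatment 3
           47, 67, 54, 67, 68, 65, 65, 56, 60, 65]   # Treatment 4
df = pd.DataFrame({"height": heights,
                   "group": ["T1"] * 10 + ["T2"] * 10 + ["T3"] * 10 + ["T4"] * 10})

# One-way ANOVA
F, p = stats.f_oneway(*[df.loc[df.group == g, "height"] for g in ["T1", "T2", "T3", "T4"]])

# Linear regression with group as a (dummy-coded) categorical predictor
fit = smf.ols("height ~ C(group)", data=df).fit()

print(round(F, 2), round(fit.fvalue, 2))       # same F-statistic (~1.14)
print(round(p, 3), round(fit.f_pvalue, 3))     # same p-value (~0.344)
print(round(fit.rsquared, 2))                  # R^2 = SSB/TSS (~0.09)
```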