Group #4
AMS 572 – Data Analysis
Professor: Wei Zhu

Lin Wang (Lana), Xian Lin (Ben), Zhide Mo (Jeff), Miao Zhang, Yuan Bian, Juan E. Mojica, Ruofeng Wen, Hemal Khandwala, Lei Lei, Xiaochen Li (Joe)
ANCOVA: Analysis of Covariance
ANCOVA is a merger of ANOVA (Analysis of Variance) and Linear Regression.
ANOVA
• First described by R. A. Fisher to assist in the analysis of data from agricultural experiments.
• Compares the means of any number of experimental conditions without any increase in the Type 1 error rate (the probability that H0 is rejected when it is true).
ANOVA is a way of determining whether the average scores of groups differ significantly. In psychology, it is used to assess the average effect of different experimental conditions on subjects in terms of a particular dependent variable.
Sir Ronald A. Fisher (Feb. 17, 1890 – July 29, 1962): an English statistician, evolutionary biologist, and geneticist. Known for the Analysis of Variance (ANOVA), maximum likelihood, the F-distribution, etc.
Linear Regression
• Developed and applied in areas different from those of ANOVA; it grew up in biology and psychology.
• The term "regression" was coined by Francis Galton in the nineteenth century to describe a biological phenomenon.
Galton studied the heights of parents and their adult children. Short parents' children are usually shorter than average, but still taller than their parents (for example, parents around 5'4''–5'6'' with children around 5'8'', still below the 5'9'' average height). This is regression toward the mean.
Regression is applied to data obtained from correlational or non-experimental research. It helps us understand the effect of changing one independent variable on the value of the dependent variable.
Sir Francis Galton (Feb. 16, 1822 – Jan. 17, 1911): an English anthropologist, eugenicist, and statistician.
• Widely promoted regression toward the mean
• Created the statistical concept of correlation
• A pioneer in eugenics; coined the term in 1883
• The first to apply statistical methods to the study of human differences
ANCOVA
• A statistical technique that combines regression and ANOVA (analysis of variance).
• Originally developed by R. A. Fisher to increase the precision of experimental analysis.
• Applied most frequently in quasi-experimental research, which involves variables that cannot be controlled directly.
The one-way layout data (a balanced design if $n_i \equiv n$):

Treatment 1: $y_{11}, y_{12}, \ldots, y_{1n_1}$; sample mean $\bar{y}_{1.}$, sample SD $s_1$
Treatment 2: $y_{21}, y_{22}, \ldots, y_{2n_2}$; sample mean $\bar{y}_{2.}$, sample SD $s_2$
…
Treatment $a$: $y_{a1}, y_{a2}, \ldots, y_{an_a}$; sample mean $\bar{y}_{a.}$, sample SD $s_a$
• $Y_{ij} = \mu_i + \epsilon_{ij}$, where $i = 1, 2, \ldots, a$ and $j = 1, 2, \ldots, n_i$
• $Y_{ij} \sim N(\mu_i, \sigma^2)$, $\epsilon_{ij} \sim N(0, \sigma^2)$
• $\mu_i = \mu + \alpha_i$, where $\mu$ is the grand mean, so that $Y_{ij} = \mu + \alpha_i + \epsilon_{ij}$
The estimates are $\hat{\mu} = \bar{y}_{..}$ (the grand mean) and $\hat{\alpha}_i = \bar{y}_{i.} - \bar{y}_{..}$.
• The factor A sum of squares:
$$SSA = \sum_{i=1}^{a} n_i (\bar{y}_{i.} - \bar{y}_{..})^2$$
• The factor A mean square, with $a - 1$ d.f.:
$$MSA = \frac{SSA}{a-1} = \frac{\sum_{i=1}^{a} n_i (\bar{y}_{i.} - \bar{y}_{..})^2}{a-1}$$
• The total sum of squares:
$$SST = \sum_{i=1}^{a} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{..})^2$$
• The ANOVA identity, $SST = SSA + SSE$:
$$\sum_{i=1}^{a} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{..})^2 = \sum_{i=1}^{a} \sum_{j=1}^{n_i} (\bar{y}_{i.} - \bar{y}_{..})^2 + \sum_{i=1}^{a} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{i.})^2$$
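Why the identity holds (a one-step derivation added here for completeness): write each deviation as $(y_{ij} - \bar{y}_{..}) = (\bar{y}_{i.} - \bar{y}_{..}) + (y_{ij} - \bar{y}_{i.})$, square, and sum; the cross term
$$2 \sum_{i=1}^{a} (\bar{y}_{i.} - \bar{y}_{..}) \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{i.}) = 0$$
vanishes because $\sum_{j} (y_{ij} - \bar{y}_{i.}) = 0$ within every group.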
The ANOVA table:

Source of Variance: Treatments; Sum of Squares: $SSA = \sum_i n_i (\bar{y}_{i.} - \bar{y}_{..})^2$; Degrees of Freedom: $a - 1$; Mean Square: $MSA = SSA/(a-1)$; F: $F = MSA/MSE$
Source of Variance: Error; Sum of Squares: $SSE = \sum_i \sum_j (y_{ij} - \bar{y}_{i.})^2$; Degrees of Freedom: $N - a$; Mean Square: $MSE = SSE/(N-a)$
Source of Variance: Total; Sum of Squares: $SST = \sum_i \sum_j (y_{ij} - \bar{y}_{..})^2$; Degrees of Freedom: $N - 1$
The ANOVA model:
$$Y_{ij} = \mu + \alpha_i + \epsilon_{ij}$$
• $Y_{ij}$: the data, the $j$th observation of the $i$th group
• $\mu$: the grand mean of $Y$
• $\alpha_i$: the effect of the $i$th group (we focus on whether $\alpha_i = 0$, $i = 1, \ldots, a$)
• $\epsilon_{ij}$: the error, $N(0, \sigma^2)$
The linear regression model:
$$Y_{ij} = \beta_1 X_{ij} + \beta_0 + \epsilon_{ij}$$
• $Y_{ij}$: the data, the $(ij)$th observation
• $X_{ij}$: the predictor
• $\beta_1$, $\beta_0$: the slope and intercept (we focus on their estimates)
• $\epsilon_{ij}$: the error
The ANCOVA model:
$$Y_{ij} = \mu + \alpha_i + \beta(X_{ij} - \bar{X}_{..}) + \epsilon_{ij}$$
• $\alpha_i$: the effect of the $i$th group (we still focus on whether $\alpha_i = 0$, $i = 1, \ldots, a$)
• $X_{ij}$: the known covariate (what is this guy doing here?)
Given $Y_{ij} = \mu + \alpha_i + \beta(X_{ij} - \bar{X}_{..}) + \epsilon_{ij}$, define the adjusted response
$$Y_{ij}(\text{adjusted}) = Y_{ij} - \beta(X_{ij} - \bar{X}_{..})$$
so that
$$Y_{ij}(\text{adjusted}) = \mu + \alpha_i + \epsilon_{ij}$$
This is just the ANOVA model!
To estimate $\beta$ in $Y_{ij} = \mu + \alpha_i + \beta(X_{ij} - \bar{X}_{..}) + \epsilon_{ij}$: within each group, consider $\alpha_i$ a constant, so the model reduces to the simple regression $Y_{ij} = \beta_1 X_{ij} + \beta_0 + \epsilon_{ij}$; notice that we actually only need the estimate of the slope $\beta$, not of the intercept.
• Within each group, do least squares:
$$\hat{\beta}_i = \frac{\sum_j (X_{ij} - \bar{X}_{i.})(Y_{ij} - \bar{Y}_{i.})}{\sum_j (X_{ij} - \bar{X}_{i.})^2}$$
• Assume that the slopes are equal across groups: $\beta_1 = \beta_2 = \cdots = \beta_a = \beta$.
• We use the pooled estimate of $\beta$, a weighted average of the group slopes $\hat{\beta}_i$ with weights $\sum_j (X_{ij} - \bar{X}_{i.})^2$:
$$\hat{\beta} = \frac{\sum_i \hat{\beta}_i \sum_j (X_{ij} - \bar{X}_{i.})^2}{\sum_i \sum_j (X_{ij} - \bar{X}_{i.})^2} = \frac{\sum_i \sum_j (X_{ij} - \bar{X}_{i.})(Y_{ij} - \bar{Y}_{i.})}{\sum_i \sum_j (X_{ij} - \bar{X}_{i.})^2}$$
The whole ANCOVA procedure, for the model $Y_{ij} = \mu + \alpha_i + \beta(X_{ij} - \bar{X}_{..}) + \epsilon_{ij}$:
1. In each group, find the slope estimate via linear regression:
$$\hat{\beta}_i = \frac{\sum_j (X_{ij} - \bar{X}_{i.})(Y_{ij} - \bar{Y}_{i.})}{\sum_j (X_{ij} - \bar{X}_{i.})^2}$$
2. Pool them together:
$$\hat{\beta} = \frac{\sum_i \hat{\beta}_i \sum_j (X_{ij} - \bar{X}_{i.})^2}{\sum_i \sum_j (X_{ij} - \bar{X}_{i.})^2}$$
3. Get rid of the covariate:
$$Y_{ij}(\text{adjusted}) = Y_{ij} - \hat{\beta}(X_{ij} - \bar{X}_{..})$$
4. Do ANOVA on the model $Y_{ij}(\text{adjusted}) = \mu + \alpha_i + \varepsilon_{ij}$.
5. Go home and have dinner. ($\text{Yummy} = \text{Cheeseburger}^2 + \text{ice}(\text{Coke}) + \epsilon$?)
The General Linear Model (GLM) encompasses both Regression and ANOVA/ANCOVA.
$$Y = \beta_0 + \beta X + \epsilon$$
• $Y$: the response variable
• $X$: the predictor
• $\beta_0$: the intercept
• $\beta$: the slope
• $\epsilon$: the error
All of them are scalars!
In matrix form, $Y = X\beta + \epsilon$:
$$\begin{pmatrix} y_1 \\ \vdots \\ y_m \end{pmatrix} = \begin{pmatrix} x_{11} & \cdots & x_{1,(n-1)} & 1 \\ \vdots & & \vdots & \vdots \\ x_{m1} & \cdots & x_{m,(n-1)} & 1 \end{pmatrix} \begin{pmatrix} \beta_1 \\ \vdots \\ \beta_n \end{pmatrix} + \begin{pmatrix} \epsilon_1 \\ \vdots \\ \epsilon_m \end{pmatrix}$$
With a dummy variable:
$$Y_i = \beta_0 + \beta_1 Z_i + \epsilon_i$$
• $Y_i$: the outcome of the $i$th unit
• $\beta_0$: the coefficient for the intercept
• $\beta_1$: the coefficient for the slope
• $\epsilon_i$: the residual for the $i$th unit
• $Z_i$: a categorical (binary) variable: $Z_i = 1$ if the unit is in the treatment group, $Z_i = 0$ if the unit is in the control group
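A quick check of what this encodes (a derivation added here, using only the model above): taking expectations in the two groups gives
$$E[Y_i \mid Z_i = 0] = \beta_0, \qquad E[Y_i \mid Z_i = 1] = \beta_0 + \beta_1,$$
so $\beta_1$ is exactly the treatment-minus-control mean difference, and testing $H_0: \beta_1 = 0$ is the same comparison a two-group ANOVA makes. This is how the GLM absorbs ANOVA.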
Two-way ANOVA in the same framework:
$$Y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \epsilon_{ijk}$$
• $Y_{ijk}$: the response variable
• $\mu$: the overall mean response
• $\alpha_i$: the effect due to the $i$th level of factor A
• $\beta_j$: the effect due to the $j$th level of factor B
• $(\alpha\beta)_{ij}$: the effect due to any interaction between the $i$th level of A and the $j$th level of B
• $\epsilon_{ijk}$: the residual for the $(ijk)$th unit
Multiple regression with a mix of predictors:
$$y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_{p-1} X_{i,p-1} + \beta_p X_{ip} + \epsilon_i$$
where $y_i$ is the $i$th response, the $X$'s may be categorical or continuous variables, and $\epsilon_i$ is the random error. The above formula can be simply denoted as
$$Y = X\beta + \epsilon$$
What can this $X$ be? Before we see an example of $X$, note that we have learned that the General Linear Model covers (1) simple linear regression; (2) multiple linear regression; (3) ANOVA; (4) 2-way/n-way ANOVA.
$X$ in the GLM might be expanded as
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3$$
where $X_3$ in the above formula could be the INTERACTION between $X_1$ and $X_2$:
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2$$
Did you see the tricks?
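One way to read the interaction coefficient (a short note added here, using only the model above): differentiating the mean response with respect to $X_1$ gives
$$\frac{\partial\, E[Y]}{\partial X_1} = \beta_1 + \beta_3 X_2,$$
so $\beta_3$ measures how much the effect of $X_1$ changes per unit change in $X_2$; when $\beta_3 = 0$ the two predictors act additively.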
Next, let us see what assumptions shall be satisfied before using ANCOVA.
Before using ANCOVA…
1. Test the homogeneity of variance.
2. Test the homogeneity of regression: whether $H_0: \beta_1 = \cdots = \beta_i = \cdots = \beta_a$ holds.
3. Test whether there is a linear relationship between the dependent variable and the covariate.
Assumption 1 (homogeneity of variance): for each $i$, calculate
$$MSE_i = SSE_i / df = SSE_i / (n - 2)$$
Use $\max_i(MSE_i)$ and $\min_i(MSE_i)$ to do an $F_{\max}$ test, to make sure $\sigma$ is a constant across the different levels:
$$F_{\max} = \max_i(MSE_i) / \min_i(MSE_i)$$
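A minimal SAS sketch of this check (hypothetical names throughout: data set ds with group variable grp, covariate x, and response y; none of these names come from the slides). Fit the within-group regressions, square each root MSE, and form the ratio:

   proc sort data=ds;
      by grp;
   run;

   /* One regression of y on x per group; OUTEST= keeps each group's root MSE */
   proc reg data=ds outest=fits noprint;
      by grp;
      model y = x;
   run;

   /* Fmax = max(MSE_i)/min(MSE_i); compare with the Fmax critical value */
   proc sql;
      select max(_RMSE_**2) / min(_RMSE_**2) as Fmax
      from fits;
   quit;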
1  ...  i  ...  a (1)
45/85
1  ...  i  ...  a (2)
a
(1) Define SSE G   SSEi
i 1
SSE G
SSEi
Sum of Square of Errors within Groups
Is calculated based on ˆ
i
AND, SSE G is generated by the random error  .
46/85
1  ...  i  ...  a (3)
(2) SSE is generated by
• Random Error 
• Difference between distinct ˆi
We can calculate SSE based on a common ˆ
(3) Let SSB=SSE – SSEG.
SSB
Sum of Square between Groups
SSB is constituted by the difference between
different ˆ
i
47/85
1  ...  i  ...  a (4)
dfb  df e  df
G
e
 [a (n  1)  1]  a (n  2)  a  1
MSB  SSB / dfb  SSB / a  1
MSE G  SSE G / df eG  SSE G / a( n  2)
MSB
MSE G
Mean Square between Groups
Mean Square within Groups
Do F test on MSB and MSEG to see whether
we can reject our HO
F=MSB / MSEG
48/85
Assumption 3: test whether there is a linear relationship between the dependent variable and the covariate, i.e. test $H_0: \beta = 0$. How to do it? An F test on $SSR$ (the sum of squares of regression) and $SSE$.
How to calculate $SSR$ and $MSR$? From each $x_i$, obtain the fitted value $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$. $SSR$ is obtained from the summation of the squares of the differences between $\hat{y}_i$ and $\bar{y}$:
$$SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2, \qquad MSR = SSR / 1$$
How to calculate $SSE$ and $MSE$? $SSE$ is the error obtained from the summation of the squares of the differences between $y_i$ and $\hat{y}_i$:
$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad MSE = SSE / (n-2)$$
$$F = \frac{MSR}{MSE}$$
Based on this test statistic we determine whether to accept $H_0$ ($\beta = 0$) or not, assuming assumptions 1 and 2 have already passed.
• If $H_0$ is true ($\beta = 0$), we do ANOVA.
• Otherwise, we do ANCOVA.
So, any time we want to use ANCOVA, we need to test the three assumptions first!
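A matching SAS sketch for this linearity check (same hypothetical names ds, x, y as above; for a one-covariate model, the overall F in PROC REG's ANOVA table is exactly $MSR/MSE$ with 1 and $n-2$ d.f.):

   proc reg data=ds;
      model y = x;   /* the printed model F test is the test of H0: beta = 0 */
   run;
   quit;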
Example: in this hypothetical study, a sample of 36 teams (id in the data set) of 12-year-old children attending a summer camp participated in a study to determine which one of three different tree-watering techniques worked best to promote tree growth.

Technique: Watering the base with a hose; Frequency: 10 minutes once per day; Code: 1
Technique: Watering the ground surrounding (drip system); Frequency: 2 hours each day; Code: 2
Technique: Deep watering (sunk pipe); Frequency: 10 minutes every 3 days; Code: 3
• From a large set of equally sized and equally healthy fast-growing trees, each team was given a tree to plant at the start of the camp.
• Each team was responsible for the watering and general care of its tree.
• At the end of the summer, the height of each tree was measured.
It was recognized
• that some children might have had more gardening experience than others, and
• that any knowledge gained as a result of that prior experience might affect the way the tree was planted and perhaps even the way in which the children cared for the tree and carried out the watering regime.
How to approach this? Create an indicator for that knowledge (i.e., a 40-point gardening experience scale).
The real data contain a grouping variable (watering technique, coded 1, 2, 3), the dependent variable (tree growth, dv), and the covariate (gardening experience, cov):

id | watering technique | tree growth (dv) | gardening exp (cov)
1 | 1 | 39 | 24
2 | 1 | 36 | 18
3 | 1 | 30 | 21
4 | 1 | 42 | 24
… | … | … | …
32 | 3 | 36 | 15
33 | 3 | 30 | 18
34 | 3 | 39 | 18
35 | 3 | 27 | 9
36 | 3 | 24 | 6
In model terms: the response $Y_{ij}$ is the tree growth of team $j$ under watering technique $i$, and the covariate $X_{ij}$ is that team's gardening experience. The model fitted to these data is
$$Y_{ij} = \mu + \alpha_i + \beta(X_{ij} - \bar{X}_{..}) + \epsilon_{ij}$$
with the overall mean $\mu$, the technique effect $\alpha_i$, the regression coefficient parameter $\beta$, and the residual error $\epsilon_{ij}$.
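A sketch of reading these data into SAS (only the rows shown above are included; the rows elided in the source, ids 5 through 31, are left out, so this is illustrative rather than the full data set):

   data trees;
      input id technique dv cov;   /* technique = watering code, dv = growth, cov = experience */
      datalines;
   1 1 39 24
   2 1 36 18
   3 1 30 21
   4 1 42 24
   32 3 36 15
   33 3 30 18
   34 3 39 18
   35 3 27 9
   36 3 24 6
   ;
   run;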
The workflow in SAS: check the homogeneity of regression, the homogeneity of variance (with dv normal), and the linearity of regression; if they hold, proceed to the ANCOVA.
The Pearson correlation coefficient:
$$\rho_{X,Y} = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}$$
estimated from the sample by
$$r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2} \sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}}$$
The Pearson correlation coefficient between the covariate and the dependent variable is .81150.
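This number could be reproduced with a one-line SAS step (using the hypothetical trees data set sketched above):

   proc corr data=trees pearson;
      var cov dv;   /* prints the covariate-by-dv Pearson correlation */
   run;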
Assumptions: [scatter plot of dv against cov] There is clearly a strong linear component to the relationship, so the linearity of regression assumption appears to be met by the data set.
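A sketch of producing such a plot in code rather than through the menus (same hypothetical trees data set):

   proc sgplot data=trees;
      scatter x=cov y=dv / group=technique;   /* one marker style per watering technique */
      reg x=cov y=dv;                         /* overall fitted line */
   run;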
The assumption of homogeneity of regression is tested by
examining the interaction of the covariate and the independent
variable. If it is not statistically significant, as is the case here, then
the assumption is met.
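In code, this interaction check might look like (hypothetical trees data set as before; the cov*technique F test is the one to examine):

   proc glm data=trees;
      class technique;
      model dv = cov technique cov*technique;   /* nonsignificant cov*technique = homogeneous slopes */
   run;
   quit;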
The Model contains the effects of both the covariate and the independent variable; the effects of the covariate and the independent variable are separately evaluated in the summary table. [SAS output omitted in the transcript.]
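A sketch of the ANCOVA fit that would produce such a summary table (hypothetical trees data set; the solution option prints the parameter estimates, including $\hat{\beta}$ for the covariate):

   proc glm data=trees;
      class technique;
      model dv = cov technique / solution;   /* covariate and group effects tested separately */
   run;
   quit;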
Watering techniques coded as 1 (hose watering) and 3 (deep watering) are the only two groups whose means differ significantly.
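The pairwise comparison of covariate-adjusted means behind this conclusion could be requested with an LSMEANS statement (a hedged sketch; the slides do not show the exact call, and the adjust=tukey choice is an assumption):

   proc glm data=trees;
      class technique;
      model dv = cov technique;
      lsmeans technique / pdiff adjust=tukey;   /* adjusted means with pairwise p-values */
   run;
   quit;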
• We can assert that prior gardening experience and knowledge was quite influential in how well the trees fared under the attention of the young campers.
• When we statistically control for, or equate, the gardening experience and knowledge of the children, watering technique was a relatively strong factor in how much growth was seen in the trees.
• On the basis of the adjusted means, we may therefore conclude that, when we statistically control for gardening experience, deep watering is more effective than hose watering but is not significantly more effective than drip watering.
In SAS Enterprise Guide, specify the GROUP VARIABLE, DEPENDENT VARIABLE, and COVARIATE, then use Tasks -> Graph -> Scatter Plot for the scatter plot and Tasks -> ANOVA -> Linear Models for the ANCOVA model. [Screenshots omitted in the transcript.]