
This work was supported in part by the U.S. Army Research Office Durham under Contract No. DAHC04-71-C-0042.
A METHOD FOR SEPARATION OF RESIDUAL AND
INTERACTION EFFECTS IN CROSS-CLASSIFICATIONS
N. L. JOHNSON
Institute of Statistics Mimeo Series No. 842
Department of Statistics
University of North Carolina at Chapel Hill
September, 1972
1. Description of the Problem
Consider a two-way cross-classification with factors $R$ ("rows") and $C$ ("columns") at $r$, $c$ levels respectively, and suppose that the expected value of an observation at the $i$-th level of $R$ and the $j$-th level of $C$ is represented in the form
$$\mu + \rho_i + \kappa_j + (\rho\kappa)_{ij}.$$
The last term represents the interaction effect between $R$ and $C$, usually denoted in the abbreviated form $R \times C$.
Variability about this expected value is represented by addition of a random variable with zero expected value. It is usually assumed that these random variables are mutually independent and have a common variance, $\sigma^2$ say. If there is only one observation for each of the $rc$ combinations of levels of $R$ and $C$ (as, for example, in a randomized block experiment) it is not possible to estimate $\sigma^2$ separately from the interaction effects $(\rho\kappa)_{ij}$.
It would appear that this will also be the case if we have more than one individual in some cells, but observe only the total (or the arithmetic mean) of the values in each cell. There is then only one observation in each cell and the situation appears to be similar to that already described.
However, in certain cases of this kind, it is possible to distinguish between residual and interaction effects. We will consider here cases characterized by the following conditions:
(i) the interaction effects are represented by random variables, mutually independent and also independent of the random variables representing residual variation,
(ii) each of the interaction random variables has expected value zero and the same variance, $\sigma'^2$, say, and
(iii) the number of individuals is not the same in all cells.
2. Frequencies Constant Within Rows: Two Subsets
2.1 Estimation
In this paper, we will consider only cases in which the number of individuals per cell is the same for a given level of $R$, whatever be the level of $C$. We first suppose that
(a) for $r_1$ ($\geq 2$) levels of $R$ there are $n_1$ individuals for each of the $c$ levels of $C$, and
(b) for the other $r_2$ ($\geq 2$) levels of $R$ there are $n_2$ individuals for each of the $c$ levels of $C$.
Without loss of generality we suppose $n_1 > n_2$.
If the two sections of the data are analysed separately, we obtain analyses of variance of the form:

                 FIRST $r_1$ LEVELS OF $R$                    REMAINING $r_2$ LEVELS OF $R$
SOURCE           DEGREES OF FREEDOM     SUM OF SQUARES     DEGREES OF FREEDOM     SUM OF SQUARES
$R$              $r_1 - 1$              $S_{R1}$           $r_2 - 1$              $S_{R2}$
$C$              $c - 1$                $S_{C1}$           $c - 1$                $S_{C2}$
$R \times C$     $(r_1 - 1)(c - 1)$     $S_{RC1}$          $(r_2 - 1)(c - 1)$     $S_{RC2}$
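The sums of squares can all be expressed in terms of the arithmetic means $\bar X_{ij}$ of observations for the $i$-th level of $R$ combined with the $j$-th level of $C$ ($i = 1, 2, \ldots, r$; $j = 1, 2, \ldots, c$). In particular,
$$S_{RC1} = \sum_{i=1}^{r_1} \sum_{j=1}^{c} \left( \bar X_{ij} - \bar X_{i\cdot} - \bar X^{(1)}_{\cdot j} + \bar X^{(1)}_{\cdot\cdot} \right)^2$$
where
$$\bar X_{i\cdot} = c^{-1} \sum_{j=1}^{c} \bar X_{ij}; \quad \bar X^{(1)}_{\cdot j} = r_1^{-1} \sum_{i=1}^{r_1} \bar X_{ij}; \quad \bar X^{(2)}_{\cdot j} = r_2^{-1} \sum_{i=r_1+1}^{r} \bar X_{ij}; \quad \bar X^{(t)}_{\cdot\cdot} = c^{-1} \sum_{j=1}^{c} \bar X^{(t)}_{\cdot j} \quad (t = 1, 2).$$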
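The expected values of the interaction mean squares $M_{RCt}$ are
$$E[M_{RCt}] = \sigma'^2 + \sigma^2 / n_t \qquad (t = 1, 2).$$
Hence
$$V' = \frac{n_1 M_{RC1} - n_2 M_{RC2}}{n_1 - n_2} \qquad (1)$$
is an unbiased estimator of $\sigma'^2$, and
$$V = \frac{n_1 n_2 (M_{RC2} - M_{RC1})}{n_1 - n_2} \qquad (2)$$
is an unbiased estimator of $\sigma^2$.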
It is, of course, possible that these estimators may take negative values.
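The estimators can be illustrated with a minimal sketch in standard-library Python. It assumes the forms $V' = (n_1 M_{RC1} - n_2 M_{RC2})/(n_1 - n_2)$ and $V = n_1 n_2 (M_{RC2} - M_{RC1})/(n_1 - n_2)$, which follow from $E[n_t M_{RCt}] = \sigma^2 + n_t \sigma'^2$. The function names and the small data set are hypothetical and are not taken from the paper.

```python
def interaction_mean_square(cell_means):
    """Interaction mean square M_RC for a rows-by-columns table of cell means."""
    r, c = len(cell_means), len(cell_means[0])
    row_mean = [sum(row) / c for row in cell_means]
    col_mean = [sum(cell_means[i][j] for i in range(r)) / r for j in range(c)]
    grand = sum(row_mean) / r
    s_rc = sum(
        (cell_means[i][j] - row_mean[i] - col_mean[j] + grand) ** 2
        for i in range(r)
        for j in range(c)
    )
    return s_rc / ((r - 1) * (c - 1))


def variance_components(means_1, n1, means_2, n2):
    """Return (V_prime, V), the estimates of sigma'^2 and sigma^2."""
    m1 = interaction_mean_square(means_1)
    m2 = interaction_mean_square(means_2)
    v_prime = (n1 * m1 - n2 * m2) / (n1 - n2)
    v = n1 * n2 * (m2 - m1) / (n1 - n2)
    return v_prime, v


# Hypothetical cell means: two rows and two columns in each subset.
subset_1 = [[1.0, 0.0], [0.0, 1.0]]  # n1 = 4 individuals per cell
subset_2 = [[1.0, 2.0], [3.0, 4.0]]  # n2 = 2; purely additive, so M_RC2 = 0
v_prime, v = variance_components(subset_1, 4, subset_2, 2)
print(v_prime, v)
```

In this artificial example the estimate of $\sigma^2$ comes out negative, which is the situation the preceding sentence warns about.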
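It is interesting to note that the variance of $V'$ depends only on the ratio $n_1/n_2$ and not on the actual values of $n_1$ and $n_2$. In fact,
$$\operatorname{var}(V') = \left( \frac{n_1}{n_2} - 1 \right)^{-2} \left[ \left( \frac{n_1}{n_2} \right)^2 \operatorname{var}(M_{RC1}) + \operatorname{var}(M_{RC2}) \right] \qquad (3)$$
where $M_{RCt} = (r_t - 1)^{-1} (c - 1)^{-1} S_{RCt}$ $(t = 1, 2)$ are the interaction mean squares. The variance of $V$, on the other hand, for a fixed value of $n_1/n_2$, is proportional to the actual values of $n_1$ and $n_2$. In fact
$$\operatorname{var}(V) = n_1^2 \left( \frac{n_1}{n_2} - 1 \right)^{-2} \left[ \operatorname{var}(M_{RC1}) + \operatorname{var}(M_{RC2}) \right]. \qquad (4)$$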
2.2 Tests of Hypothesis of Zero Interaction
The ratio $n_1 M_{RC1} / (n_2 M_{RC2})$ may be used as a test of the hypothesis $\sigma' = 0$. If, in addition to the other assumptions, it be assumed that all the random variables have normal distributions, then $n_1 M_{RC1} / (n_2 M_{RC2})$ is distributed as
$$(\sigma^2 + n_1 \sigma'^2)(\sigma^2 + n_2 \sigma'^2)^{-1} F_{(r_1-1)(c-1),\,(r_2-1)(c-1)}.$$
Since $n_1 > n_2$, large values of the observed ratio are regarded as significant of departure from the hypothesis $\sigma' = 0$. For given values of $r_1$, $r_2$ and $c$ the power of the test is an increasing function of $(\sigma^2 + n_1 \sigma'^2)(\sigma^2 + n_2 \sigma'^2)^{-1}$, and so of $\sigma'/\sigma$.
Evidently the sensitivity of the test will be increased if
(a) the absolute magnitudes of $n_1$ and $n_2$ are increased (with $n_1/n_2$ constant) or
(b) the ratio of $n_1$ to $n_2$ is increased, either by increasing $n_1$, or by decreasing $n_2$.
Note that in case (a) the power cannot exceed
$$\Pr\left[ F_{(r_1-1)(c-1),\,(r_2-1)(c-1)} > (n_2/n_1)\, F_{(r_1-1)(c-1),\,(r_2-1)(c-1),\,1-\alpha} \right] \qquad (5)$$
however large are $n_1$ and $n_2$ (with $n_1/n_2$ fixed) or $\sigma'/\sigma$. (Here $F_{\nu_1, \nu_2, 1-\alpha}$ denotes the upper $100\alpha\%$ point of $F_{\nu_1, \nu_2}$.)
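The power of this test, and the limiting bound (5), can be evaluated numerically. The following is a minimal sketch in standard-library Python rather than the author's own computation. The function names are illustrative, `theta` stands for $\sigma'^2/\sigma^2$, and the $F$ tail probability is obtained from a finite binomial sum that is valid only when both degrees of freedom are even (as they are whenever $c - 1$ is even, for example $c = 5$).

```python
from math import comb


def f_sf(f, d1, d2):
    """Pr[F_{d1,d2} > f], using a binomial sum valid for even d1 and d2 only."""
    if d1 % 2 or d2 % 2:
        raise ValueError("this sketch handles even degrees of freedom only")
    a, b = d1 // 2, d2 // 2
    x = d1 * f / (d1 * f + d2)
    m = a + b - 1
    # Regularized incomplete beta I_x(a, b) for integer a, b.
    cdf = sum(comb(m, j) * x**j * (1.0 - x) ** (m - j) for j in range(a, m + 1))
    return 1.0 - cdf


def f_upper_point(alpha, d1, d2):
    """Upper 100*alpha% point of F_{d1,d2}, found by bisection."""
    lo, hi = 0.0, 1.0e4
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if f_sf(mid, d1, d2) > alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def power(n1, n2, theta, r1, r2, c, alpha=0.05):
    """Power of the test based on n1*M_RC1 / (n2*M_RC2); theta = sigma'^2/sigma^2."""
    d1, d2 = (r1 - 1) * (c - 1), (r2 - 1) * (c - 1)
    ratio = (1.0 + n1 * theta) / (1.0 + n2 * theta)
    return f_sf(f_upper_point(alpha, d1, d2) / ratio, d1, d2)


def power_bound(n1, n2, r1, r2, c, alpha=0.05):
    """Limiting power (5) as n1, n2 grow with n1/n2 held fixed."""
    d1, d2 = (r1 - 1) * (c - 1), (r2 - 1) * (c - 1)
    return f_sf((n2 / n1) * f_upper_point(alpha, d1, d2), d1, d2)


print(power(10, 5, 0.5, 2, 6, 5))
print(power_bound(10, 5, 2, 6, 5))
```

With $n_1 = 10$, $n_2 = 5$, $\sigma'^2/\sigma^2 = 0.5$, $r_1 = 2$, $r_2 = 6$ and $c = 5$ the first call gives a power of about 0.196, which appears to agree with the corresponding entry of Table 1.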
Table 1 gives a few illustrative values of the power of the test, with a significance level of 5% ($\alpha = 0.05$), and $c = 5$, $r_1 + r_2 = 8$ in each case.

Table 1: Power of test (Significance level 0.05)

$\sigma'^2/\sigma^2$   $n_1/n_2$   $n_1$      $n_2$      $r_1=2, r_2=6$   $r_1=4, r_2=4$   $r_1=6, r_2=2$
0.5                    2           10         5          0.1959           0.2239           0.1225
                       2           50         25         0.2433           0.2866           0.1468
                       2           $\infty$   $\infty$   0.2604           0.3080           0.1555
                       4           20         5          0.4761           0.6049           0.2930
                       4           100        25         0.5643           0.7180           0.3790
                       4           $\infty$   $\infty$   0.5912           0.7495           0.3922
1.0                    2           10         5          0.2227           0.2590           0.1361
                       2           50         25         0.2512           0.2972           0.1509
                       2           $\infty$   $\infty$   0.2604           0.3080           0.1555
                       4           20         5          0.5282           0.6729           0.3352
                       4           100        25         0.5772           0.7336           0.3859
                       4           $\infty$   $\infty$   0.5912           0.7495           0.3922

3. Frequencies Constant Within Rows - k Subsets
3.1 Estimation
We now consider the more general case when the $r$ levels of $R$ can be divided into $k$ subsets of $r_1, r_2, \ldots, r_k$ levels respectively, with each $r_t$ greater than 1, and $\sum_{t=1}^{k} r_t = r$. In the $t$-th set, containing $r_t$ levels of $R$, each of the $r_t c$ cells contains $n_t$ observations, and the numbers $n_t$ are all different. We will suppose the subsets so ordered that $n_1 > n_2 > \cdots > n_k$.
It is now possible to construct $k$ separate analyses of variance analogous to those in Table 1. With a natural extension of the notation in that Table, we have
$$E[n_t M_{RCt}] = \sigma^2 + n_t \sigma'^2 \qquad (t = 1, 2, \ldots, k)$$
where $M_{RCt} = (r_t - 1)^{-1} (c - 1)^{-1} S_{RCt}$ is the interaction mean square for the $t$-th subset of levels of $R$. The regression of $n_t M_{RCt}$ on $n_t$ is linear. Fitting this regression provides estimates of $\sigma^2$ and $\sigma'^2$. In the fitting process we should take account of the conditional variance of $n_t M_{RCt}$, given $n_t$.
If no assumption is made about the distribution of the random variables in the model, we do not have a usable formula for this conditional variance. If it is supposed that all cell means have distributions of the same shape then the conditional variance is proportional to $(r_t - 1)^{-1} (\sigma^2 + n_t \sigma'^2)^2$. It seems reasonable to proceed on this assumption - especially since each mean is the arithmetic mean of at least two observed values, and so should have a distribution closer to normality than that of the random variables representing the original observations.
This assumption would lead to seeking the values of $\sigma^2$ and $\sigma'^2$ ($\hat\sigma^2$, $\hat\sigma'^2$, say) minimizing
$$G = \sum_{t=1}^{k} (r_t - 1)(\sigma^2 + n_t \sigma'^2)^{-2} (n_t M_{RCt} - \sigma^2 - n_t \sigma'^2)^2. \qquad (6)$$
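If $\sigma'/\sigma = \omega$, say, were known we would obtain the explicit solution
$$\hat\sigma^2 = \frac{\sum_{t=1}^{k} (r_t - 1) \{ n_t M_{RCt} / (1 + n_t \omega^2) \}^2}{\sum_{t=1}^{k} (r_t - 1)\, n_t M_{RCt} / (1 + n_t \omega^2)}, \qquad \hat\sigma'^2 = \omega^2 \hat\sigma^2.$$
If, as will usually be the case, $\sigma'/\sigma$ is not known, $G$ may be minimized by an iterative process.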
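One simple way to carry out the minimization is sketched below in standard-library Python. It assumes $G = \sum_t (r_t - 1)(\sigma^2 + n_t \sigma'^2)^{-2}(n_t M_{RCt} - \sigma^2 - n_t \sigma'^2)^2$, uses the closed-form minimizer over $\sigma^2$ for a fixed $\omega^2 = \sigma'^2/\sigma^2$, and then scans a grid of $\omega^2$ values. The grid search is a stand-in chosen for brevity; the paper does not specify which iterative process was intended, and the function names, grid limits and data are hypothetical.

```python
def sigma2_given_omega2(n, m, r, omega2):
    """Minimizer of G over sigma^2 when omega^2 = sigma'^2 / sigma^2 is fixed."""
    y = [nt * mt / (1.0 + nt * omega2) for nt, mt in zip(n, m)]
    num = sum((rt - 1) * yt * yt for rt, yt in zip(r, y))
    den = sum((rt - 1) * yt for rt, yt in zip(r, y))
    return num / den


def g_value(n, m, r, sigma2, sigma_p2):
    """Weighted sum of squares G for given sigma^2 and sigma'^2."""
    total = 0.0
    for nt, mt, rt in zip(n, m, r):
        fitted = sigma2 + nt * sigma_p2
        total += (rt - 1) * ((nt * mt - fitted) / fitted) ** 2
    return total


def fit(n, m, r, omega2_max=10.0, steps=2000):
    """Grid search over omega^2 in [0, omega2_max]; returns (G, sigma2, sigma_p2)."""
    best = None
    for s in range(steps + 1):
        omega2 = omega2_max * s / steps
        sigma2 = sigma2_given_omega2(n, m, r, omega2)
        g = g_value(n, m, r, sigma2, omega2 * sigma2)
        if best is None or g < best[0]:
            best = (g, sigma2, omega2 * sigma2)
    return best


# Hypothetical, noise-free mean squares generated from sigma^2 = 2, sigma'^2 = 0.5.
n = [8, 4, 2]          # observations per cell in each subset
r = [3, 3, 3]          # number of levels of R in each subset
m = [0.75, 1.0, 1.5]   # interaction mean squares M_RCt
g_min, sigma2_hat, sigma_p2_hat = fit(n, m, r)
print(g_min, sigma2_hat, sigma_p2_hat)
```

A finer grid, or a one-dimensional optimizer applied to the profile of $G$ in $\omega^2$, could replace the scan when more precision is needed.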
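The following formulae are suggested for estimating the variances of the estimators of $\sigma^2$ and $\sigma'^2$.
$$\operatorname{var}(\hat\sigma^2) \doteq \frac{2}{c-1} \left[ \left\{ \sum_{t=1}^{k} \frac{r_t - 1}{(\hat\sigma^2 + n_t \hat\sigma'^2)^2} \right\}^{-1} + \bar n^2 \left\{ \sum_{t=1}^{k} \frac{(r_t - 1)(n_t - \bar n)^2}{(\hat\sigma^2 + n_t \hat\sigma'^2)^2} \right\}^{-1} \right] \qquad (7.1)$$
$$\operatorname{var}(\hat\sigma'^2) \doteq \frac{2}{c-1} \left\{ \sum_{t=1}^{k} \frac{(r_t - 1)(n_t - \bar n)^2}{(\hat\sigma^2 + n_t \hat\sigma'^2)^2} \right\}^{-1} \qquad (7.2)$$
where
$$\bar n = \left\{ \sum_{t=1}^{k} \frac{r_t - 1}{(\hat\sigma^2 + n_t \hat\sigma'^2)^2} \right\}^{-1} \sum_{t=1}^{k} \frac{(r_t - 1)\, n_t}{(\hat\sigma^2 + n_t \hat\sigma'^2)^2}.$$
(These formulae are based on the assumption that the cell means are normally distributed. For platykurtic (leptokurtic) variation the variances would be expected to be rather less (greater) than indicated by formulae (7.1) and (7.2).)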
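As a hedged illustration, the sketch below computes the usual weighted-least-squares variances of the intercept and slope of the regression of $n_t M_{RCt}$ on $n_t$, taking the variance of $n_t M_{RCt}$ to be $2 (r_t - 1)^{-1} (c - 1)^{-1} (\sigma^2 + n_t \sigma'^2)^2$ as under normal theory. Treating these as the forms of (7.1) and (7.2) is an assumption, since the printed formulae are not fully legible in this transcript; the function name and the numerical inputs are hypothetical.

```python
def wls_variances(n, r, c, sigma2, sigma_p2):
    """Approximate (var of sigma^2 estimate, var of sigma'^2 estimate).

    Weights are proportional to the reciprocal variances of n_t * M_RCt.
    """
    w = [(rt - 1) / (sigma2 + nt * sigma_p2) ** 2 for nt, rt in zip(n, r)]
    sw = sum(w)
    n_bar = sum(wt * nt for wt, nt in zip(w, n)) / sw
    sxx = sum(wt * (nt - n_bar) ** 2 for wt, nt in zip(w, n))
    scale = 2.0 / (c - 1)
    var_sigma2 = scale * (1.0 / sw + n_bar**2 / sxx)   # intercept
    var_sigma_p2 = scale / sxx                          # slope
    return var_sigma2, var_sigma_p2


# Hypothetical two-subset case: n = (3, 1), two levels of R in each subset, c = 3.
var_s2, var_sp2 = wls_variances([3, 1], [2, 2], 3, sigma2=1.0, sigma_p2=0.0)
print(var_s2, var_sp2)
```

For $k = 2$ the fitted line passes through both points, so these values coincide with the variances of the two-subset estimators of $\sigma^2$ and $\sigma'^2$ computed directly, which gives a quick consistency check.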
3.2 Test of Zero Interaction
If $\sigma'^2 = 0$, then the expected values of $n_1 M_{RC1}, n_2 M_{RC2}, \ldots, n_k M_{RCk}$ are all equal, and the slope of the regression of $n_t M_{RCt}$ on $n_t$ is zero. Assuming normal variation, the hypothesis that $\sigma' = 0$ could be tested by applying the Neyman-Pearson-Bartlett test of equality of variances to the statistics $n_1 M_{RC1}, \ldots, n_k M_{RCk}$. Apart from the well-known sensitivity of this test to non-normality, it has another drawback in this context (although this does not affect its validity). It neglects the fact that any specific alternative hypothesis ($\sigma'/\sigma \neq 0$) gives a specific pattern of values for the ratios
$$(\sigma^2 + n_1 \sigma'^2) : (\sigma^2 + n_2 \sigma'^2) : \cdots : (\sigma^2 + n_k \sigma'^2).$$
It seems likely, therefore, that a test of the significance of the slope of the fitted regression line (of $n_t M_{RCt}$ on $n_t$) will be more powerful. This is, indeed, quite natural, since this slope is, in fact, just the estimate $\hat\sigma'^2$.
An approximate test can be effected by comparing $\hat\sigma'^2 / [\operatorname{var}(\hat\sigma'^2)]^{1/2}$ with a unit normal scale. Formula (7.2) may be used to estimate $\operatorname{var}(\hat\sigma'^2)$.
3.3 Test of Constancy of $\sigma$ and/or $\sigma'$
There may be reason to suspect that the values of $\sigma$ and/or $\sigma'$ may change from one subset of levels of $R$ to another. There are numerous possibilities, and we will not attempt to explore them here. However, a quick and simple test which will detect some types of heterogeneity in $\sigma$ and/or $\sigma'$ is provided by testing the regression of $n_t M_{RCt}$ on $n_t$ for linearity. It is possible to construct such a test (on an approximate basis, and assuming normal variation) using the fact that the variance of $n_t M_{RCt}$ is $2 (r_t - 1)^{-1} (c - 1)^{-1} (\sigma^2 + n_t \sigma'^2)^2$. If there is no heterogeneity in $\sigma^2$ and/or $\sigma'^2$,
$$\tfrac{1}{2} (c - 1) \sum_{t=1}^{k} (r_t - 1)(\hat\sigma^2 + n_t \hat\sigma'^2)^{-2} (n_t M_{RCt} - \hat\sigma^2 - n_t \hat\sigma'^2)^2$$
should be approximately distributed as $\chi^2$ with $(k - 2)$ degrees of freedom. Large values of the statistic would be regarded as significant of heterogeneity.
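The linearity statistic is straightforward to compute once estimates of $\sigma^2$ and $\sigma'^2$ are available. The sketch below, in standard-library Python, assumes the statistic has the form $\tfrac{1}{2}(c-1)\sum_t (r_t-1)(\hat\sigma^2 + n_t\hat\sigma'^2)^{-2}(n_t M_{RCt} - \hat\sigma^2 - n_t\hat\sigma'^2)^2$; the function name and the inputs are hypothetical.

```python
def lack_of_fit_statistic(n, m, r, c, sigma2_hat, sigma_p2_hat):
    """Approximate chi-square statistic (k - 2 d.f.) for linearity of n_t*M_RCt on n_t."""
    total = 0.0
    for nt, mt, rt in zip(n, m, r):
        fitted = sigma2_hat + nt * sigma_p2_hat
        total += (rt - 1) * ((nt * mt - fitted) / fitted) ** 2
    return 0.5 * (c - 1) * total


# Hypothetical inputs: three subsets, c = 5 columns, estimates sigma^2 = 2, sigma'^2 = 0.5.
n = [4, 2, 1]
r = [3, 3, 3]
m = [1.25, 1.5, 2.5]   # only the first subset departs from the fitted line
stat = lack_of_fit_statistic(n, m, r, 5, 2.0, 0.5)
print(stat)            # to be referred to chi-square with k - 2 = 1 degree of freedom
```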
4. A More General Case
It is not essential that the subsets of levels of $R$ should each have the same frequency of observations in each cell. Retaining the assumption that any one level has the same cell-frequencies for all levels of $C$, let us denote by $n_{th}$ the number of observations per cell for the $h$-th level of the $t$-th subset of levels of $R$ ($h = 1, 2, \ldots, r_t$). We then use, in place of $n_t M_{RCt}$, the statistic
$$M'_{RCt} = \sum_{h=1}^{r_t} n_{th} \sum_{j=1}^{c} \left( \bar X^{(t)}_{hj} - \bar X^{(t)}_{h\cdot} - \bar X^{(t)}_{\cdot j} + \bar X^{(t)}_{\cdot\cdot} \right)^2 / \{ (r_t - 1)(c - 1) \}$$
where $\bar X^{(t)}_{hj}$ is the arithmetic mean of observations for the $h$-th level of the $t$-th subset of levels of $R$, combined with the $j$-th level of $C$, and
$$\bar X^{(t)}_{h\cdot} = c^{-1} \sum_{j=1}^{c} \bar X^{(t)}_{hj}; \quad \bar X^{(t)}_{\cdot j} = N_t^{-1} \sum_{h=1}^{r_t} n_{th} \bar X^{(t)}_{hj}; \quad \bar X^{(t)}_{\cdot\cdot} = (c N_t)^{-1} \sum_{h=1}^{r_t} n_{th} \sum_{j=1}^{c} \bar X^{(t)}_{hj}$$
with $N_t = \sum_{h=1}^{r_t} n_{th}$.
The expected value of $M'_{RCt}$ is $\sigma^2 + n'_t \sigma'^2$ with
$$n'_t = (r_t - 1)^{-1} \left( N_t - N_t^{-1} \sum_{h=1}^{r_t} n_{th}^2 \right). \qquad (8)$$
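A minimal sketch of these two quantities in standard-library Python follows. It assumes $n'_t = (r_t - 1)^{-1}(N_t - N_t^{-1}\sum_h n_{th}^2)$, which is the coefficient obtained by taking the expectation of $M'_{RCt}$ under the model and which reduces to $n_t$ when all $n_{th}$ are equal. The function names and data are hypothetical.

```python
def effective_n(n_th):
    """Coefficient n'_t of sigma'^2 in E[M'_RCt] for cell frequencies n_th."""
    r_t = len(n_th)
    big_n = sum(n_th)
    return (big_n - sum(x * x for x in n_th) / big_n) / (r_t - 1)


def weighted_interaction_mean_square(cell_means, n_th):
    """Statistic M'_RCt from a table of cell means and per-row cell frequencies."""
    r_t, c = len(cell_means), len(cell_means[0])
    big_n = sum(n_th)
    row_mean = [sum(row) / c for row in cell_means]
    # Column means and grand mean are weighted by the row frequencies n_th.
    col_mean = [
        sum(n_th[h] * cell_means[h][j] for h in range(r_t)) / big_n
        for j in range(c)
    ]
    grand = sum(n_th[h] * row_mean[h] for h in range(r_t)) / big_n
    total = sum(
        n_th[h] * (cell_means[h][j] - row_mean[h] - col_mean[j] + grand) ** 2
        for h in range(r_t)
        for j in range(c)
    )
    return total / ((r_t - 1) * (c - 1))


print(effective_n([3, 3, 3]))   # equal frequencies: reduces to the common n_t
print(effective_n([2, 4]))      # unequal frequencies
print(weighted_interaction_mean_square([[1.0, 0.0], [0.0, 1.0]], [2, 2]))
```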
Similar methods to those described in Section 3 can be used, with $n_t M_{RCt}$ replaced by $M'_{RCt}$, and $n_t$ by $n'_t$. It must, however, be realized that even assuming normal theory, $M'_{RCt}$ is no longer distributed as a multiple of
a