* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Johnson, N.L.; (1972)A method of separation of residual and interaction effects in cross-classification."
Survey
Document related concepts
Transcript
This work was supported in part by the U.S. Army Research Office Durham under Contract No. DAHC04-71-C-0042. -e A METHOD FOR SEPARATION OF RESIDUAL AND INTERACTION EFFECTS IN CROSS-CLASSIFICATIONS N. L. JOHNSON Institute of Statistics Mimeo Series No. 842 Department of Statistics University of North Carolina at ChapeZ HilZ September, 1972 -e A Method for Separation of Residual and Interaction Effects in Cross-Classifications N. L. JOHNSON University of North CaroUna at ChapeZ HiU 1. Description of the Problem Consider a two-way cross-classification with factors ("columns") at r, c R ("rows") and C levels respectively, and suppose that the expected value of an observation at the i-th level of R and the j-th level of C is represented in the form .~ The last term represents the interaction effect between denoted in the abbreviated form Rand C, usually RXC. Variability about this expected value is represented by addition of a random variable with zero expected value. It is usually assumed that these random variables are mutually independent and have a common variance, If there is only one observation for each of the levels of Rand C rc (J 2 say. combinations of (as, for example, in a randomized block experiment) it is not possible to estimate (J 2 separately from the interaction effects (pK) ij • It would appear that this will also be the case if we have more than one individual in some cells, but observe only the total (or the arithmetic mean) of the values in each cell. .e There is then only one observation in each cell and the situation appears to be similar to that already described • 2 However, in certain cases of this kind, it is possible to distinguish between residual and interaction effects. We will consider here cases characterized by the following conditions: (i) the interaction effects are represented by random variables, mutually independent and also independent of the random variables representing residual variation, (ii) each of the interaction random variables has expected value zero and the same variance, (iii) 0,2, say, and The number of individuals is not the same in all cells. 2. Frequencies Constant Within Rows: Two Subsets 2.1 Estimation In this paper, we will consider only cases in which the number of individuals per cell is the same for a given level of ·e be the value of (a) the c (b) for > r l levels of C, R there are n individuals for each of l and for the other c whatever We first suppose that (~2) levels of for each of the nl R. C, levels of levels of C. R there are individuals Without loss of generality we suppose n2 · If the two sections of the data are analysed separately, we obtain analyses of variance of the form: FIRST SOURCE r l DEGREES OF FREEDOM LEVELS OF R SUM OF SQUARES REMAINING DEGREES OF FREEDOM r Z LEVELS OF SUM OF SQUARES R r -1 1 SRI r 2-1 SR2 C c-l SCl c-l SC2 RxC (rl-l) (c-l) SRCI (rZ-l)(c-l) SRC2 R 3 The sums of squares can all be expressed in terms of the arithmetic means X ij level of of observations for the i-th level of C (is l,2, ••• ,r; S RCI j=1,2, ••• ,c). r l ~ = i=l I.. R combined with the j-th In particular, c \ (X l. j=l X ij - i + _ x(.l) J x(l) ••• )2 where Xi "" c -1 -(t) X =c -1 r ~ - -(1) .I.. X ij ; X • J"=l •J c L X(t) j=l ij = r1 -1 l L X . x(2)= r -1 i=l ij' .j 2 r L X ; i=r +1 ij 1 The expected values of the interaction mean squares are ·e Hence (1) is an unbiased estimator of CJ ,2 , and (2) is an unbiased estimator of 2 CJ • It is, of course, possible that these estimators may take negative values. It is interesting to note that the variance of .e ratio (3) n /n l 2 and not on the actual values of var (V') n l V' and depends only on the n • 2 In fact, 4 where MRCt = (rt-l) -1 The variance of (c-l) V, -1 SRCt (t=1,2) are the interaction mean squares. on the other hand, for a fixed value of is proportional to the actual values of n In fact and l nl/nZ' (4) Tests of Hypothesis of Zero Interaction 2.2 The ratio a' = O. n l MRcl/{n 2MRc2), may be used as a test of the hypothesis If, in addition to the other assumptions, it be assumed that all the random variables have normal distributions, then nlMRcl/(n2MRc2) is distributed as (a 2+nla ,2){ a2+n2a ,2)-1 F (r -l)(c-l),(r -l)(c-l) • 2 l .~ Since n l > n Z' large values of the observed ratio are regarded as signi- ficant of departure from the hypothesis and c t h e power and so of a'/a. 0 a' = O. For given values of f t he test i s an i ncreas i ng f unct i on 0 f r , r l 2 2+n ,2)( 2+n (a la a 2a ,2)-1 , Evidently the sensitivity of the test will be increased if (a) the absolute magnitudes of ~l are increased (with and n /n l 2 constant) or (b) the ratio of or by decreasing to is increased, either by increasing n , l n • 2 Note that in case (a) the power cannot exceed (5) Pr[F (rl-l) (c-l), (rz-l) (c-l) however large are n .e l and denotes the upper lOOa% point of > (n In )F (r -l)(c-l),(r -1)(c-l),1-a ] l 2 2 1 fixed) or a'/a. (Here 5 e Table 1 gives a few illustrative values of the power of the test, with a significance level of 5%, and Table 1: 0,2 ·:·2 a nl 0.5 of r l =4, r =4 2 r 1=6, r =2 2 10 5 0.1959 0.2239 0.1225 50 25 0.2433 0.2866 0.1468 t'2 ~"" 20 "" 0.2604 0.3080 0.1555 25 0.6049 0.7180 0.2930 100 0.4761 0.5643 0.3790 0.5912 0.7495 0.3922 5 00 10 5 0.2227 0.2590 0.1361 50 25 0.2512 0.2972 0.1509 t·2·"" 20 100 00 0.2604 0.3080 0.1555 5 25 0.5282 0.6729 0.3352 0.5772 0.7336 0.3859 4~"" 00 0.5912 0.7495 0.3922 Estimation We now consider the more general case when the R can be divided into r j levels of so ordered that r levels subsets of r ,r ,···,rk levels respectively, l 2 k greater than 1, and L r = r. In the t-th set, containing t=l t R, each of the rtc cells contain n t observations, and the numbers k are all different. We will suppose the subsets nl>n2>"'>~' It is now possible to construct .e Significance level 0.05) Frequencies Constant Within Rows - k Subsets with each rt CDS .• r l =2, r a 6 2 t 3.1 (aaO.05; in each case. 2 1.0 3. Power of test n 1-4."" .- c=5, r +r =8 l 2 analogous to those in Table 1. in that Table, we have k separate analyses of variance With a natural extension of the notation 6 where MRCt = (rt-l) -1 (c-l) t-th subset of levels of -1 SRCt is the interaction mean square for the R. The regression of on n t 0,2. provides estimates of account of the conditional variance of is linear. Fitting this regression In the fitting process we should take n M_ t-~Ct' given n • t If no assumption is made about the distribution of the random variables in the model, we do not have a usable formula for this conditional variance. If it is supposed that all cell means have distributions of the same shape then the conditional variance is proportional to It seems reasonable to proceed on this assumption - especially since each mean is the arithmetic mean of at least two observed values, and so should have a distribution ·e closer to normality then that of the random variables representing the original observations. This assumption would lead to seeking the values of ....2 .... ,2 (0,0 ,say) a 2 and 0,2 minimizing (6) If 0'/0 = w, say, were known we would obtain the explicit solution a 2 = If, as will usually be the case, by an iterative process • .e 0'/0 is not known, G may be minimized 7 The following formulae are suggested for estimating the variances of the . est1mators 0 f a2 and a ,2 • k L r -1 t k r -1 (7.1) var(~2) ~ __2__ [{ c-1 t=l (7.2) var (....a ,2).' i ' -2 c-1 n ={ - 2 -1 (nt-n) } t=l k where t {l r -1 t L t=l (These formulae are based on the assumption that the cell means are normally distributed. For p1atykurtic (leptokurtic) variation the variances would be expected to be rather less (greater) than indicated by formulae (7.1) and .~ (7.2).) 3.2 Test of Zero. Interaction If a ,2 = 0, then the expected values of n1~C1' n2~C2,···,nk~Ck are all equal, and the slope of the regression of Assuming normal variation, the hypothesis that nt~Ct on a' =0 n t is zero. could be tested by applying the Neyman-Pearson-Bart1ett test of equality of variances to the statistics nl~CI, ••• ,nk~Ck. Apart from the well-known sensitivity of this test to non-normality, it has another drawback in this context (although this does not affect its validity). native hypothesis (a'/a ~ 0) It neglects the fact that any specific altergives a specific pattern of values for the ratios ( a 2+n a ,2) : ( a 2+nZa ,2) : ••. : (Z a +nka ,2) . 1 .e It seems likely, therefore, that a test of the significance of the slope of the fitted regression line (of nt~Ct on nt ) will be more powerful. This is, indeed, quite natural, since this slope is, in fact, just a ,2 • 8 ..... 2 .... 2 ~ o i / [var (0 v )] An approximate test can be effected by comparing a unit normal scale. 3.3 ..... 2 var(oV ). Formula (7.2) may be used to estimate Test of Constancy of with OV There may be reason to suspect that the values of change from one subset of levels of R to another. 0 and/or may There are numerous pos- sibilities, and we will not attempt to explore them here. However, a quick and simple test which will detect some types of heterogeneity in is provided by testing the regression of 0' on 0 and/or for linearity. Oi It is possible to construct such a test (on an approximate basis, and assuming normal variation) using the fact that the variance of 2(r -I)-1(c-l)-1(02+n o,2). t t ~ ~(c-l) l t=l .4It n M_ CRCt If there is no heterogeneity in is 02 and/or 0,2, .....2 .... 2 -2 ....2 .... 2 2 (rt-l)(0 +ntO V ) (nt~Ct-(J -nto' ) should be approximately distributed as x2 with (k-2) degrees of freedom. Large values of the statistic would be regarded as significant of heterogeneity. 4. A Hare General Case It is not essential that the subsets of levels of same frequency of observations in each cell. R Retaining the assumption that any one level has the same cell-frequencies for all levels of nth let us denote by R( h=1,2, ••• ,r ). t We then use, in place of rt l\tCt .e C, the number of observations per cell for the h-th level of the t-th subset of levels of where Should each have the X(t) hj L = h=l }1tCt' the statistic c nth I j=l (x(t)_x(t)_x(t)+x(t»2/{(r -1) (c-l)} hj h· .j .• t is the arithmetic mean of observations for the h-th level of the t-th subset of levels of R, combined with the j-th level of C, and 9 r c t -(t) -1 -(t) Xho = c -1 I -(t) Xh.; x. = Nt I nthX-(t) hj ; j=l J J h=l r with N = t (t) X r 1 t c (t) =(cN)- I n Ix. t h=l thj=l hJ t L h=l nth • The expected value of MiCt is C1 2 + n't C1 ,2 with (8) Similar methods to those described in Section 3 can be used, with replaced by theory, .e MiCt n . t n t It must, however, be realized that even assuming normal is no longer distributed as a multiple of a