A SAS PROGRAM FOR CHOOSING RANDOM UNBLOCKED DESIGNS BALANCED FOR CORRELATION BETWEEN NEIGHBORS

I.J. Good, Virginia Polytechnic Institute and State University
E.P. Smith, Virginia Polytechnic Institute and State University
David D. Morris, Abbott Laboratories

ABSTRACT

In the past it must have been at least unusual for a theorem about linear graphs to be used for a problem in experimental design. We here make such an application for selecting a random balanced unblocked serial design or BUS design. The algorithm makes essential use of a computer program written as a SAS macro. The computer revealed the need to correct the original theory; the correction used some elementary theory of numbers.

In one example we obtained enough random designs to be convinced that we had the complete set; a somewhat unusual application of the Monte Carlo method.

INTRODUCTION

Correlation between experimental units can lead to problems in the analysis of data in agricultural, psychological and medical studies (Milliken and Johnson, 1984, p. 363). To alleviate the problem in agricultural studies, Williams (1952) suggested the use of a 'serial' design (the treatments being in a one-dimensional series) with blocking. A similar design involves the treatments in series but without blocking. This type of design is of potential importance in studies with correlation but, in common with the Williams design, it is difficult to generate a random design. In this paper we describe a SAS program that overcomes this difficulty for the unblocked designs. An unusual and possibly unique feature of our program is that it makes use of a theorem about the enumeration of linear graphs to solve a problem of experimental design. When the program was run it showed that it was necessary to correct the theory in Good (1987), which gave the algorithm on which the program was based. The correction, which required a tricky combinatorial argument involving some elementary theory of numbers, is given by Good (1990). The algorithm requires no correction, but its interpretation does.

Given n distinct treatments $t_1, t_2, \ldots, t_n$, a balanced unblocked serial (BUS) design (Good, 1987, 1990, parts of which we closely follow) is a circular sequence of cn(n-1) treatments $T_1, T_2, \ldots, T_q$, where q = cn(n-1) and c is a positive integer, such that (i) each $T_i$ is one of the treatments $t_1, t_2, \ldots, t_n$, and (ii) each pair of distinct treatments (A, B) occurs exactly c times in each order AB and BA (as in Finney & Outhwaite, 1956). There are no occurrences of AA for any A. An example of a BUS design in which c = 2 and n = 3 is $(t_1 t_2 t_1 t_2 t_3 t_1 t_3 t_1 t_3 t_2 t_3 t_2)$ (design 4 of Table 2), in which the first $t_1$ is regarded as following the last $t_2$ for circularization. In practice the q plots would usually be strung out on a straight line with one extra plot, with treatment $t_1$ as plot number q + 1. The design is then balanced for pairs, but there would be one extra treatment $t_1$. The purpose of the balancing is only to remove some of the effect of correlation between adjacent pairs, and the analysis of a completed experiment is to be carried out without reference to the ordering of the q + 1 plots.

Any rotation of a design will be regarded as the same design. The case n = 2 is trivial because in that case all designs are of the form $(t_1 t_2 t_1 t_2 \ldots t_1 t_2)$, and we shall always assume that n > 2. A small fraction of designs are periodic, for example $(t_1 t_2 t_1 t_3 t_2 t_3 t_1 t_2 t_1 t_3 t_2 t_3)$ (design 26 of Table 2), where the "period" is 6 instead of 12; when c = 2 and n = 3 the fraction of periodic designs is 1/14, while when c = 4 and n = 3 the fraction is 3/3296. Such designs seem too systematic, so we suggest that when the program produces a periodic design it should be ignored. This creates no problem because many designs can be produced in a single run of the program.

The program selects random BUS designs by making use of a theorem for enumerating unicursal routes in an oriented (directed) linear graph (or nodes-and-edges diagram with arrows on each edge). The theorem was applied by Dawson & Good (1957) to a different statistical problem. They called it the BEST theorem in honor of de Bruijn, Aardenne-Ehrenfest, Smith, and Tutte. It follows from the BEST theorem, combined with some further combinatorics and number theory, that the number of possible nonperiodic designs is $D_{c,1}(n)$ where

$$D_{c,1}(n) = \frac{1}{c} \sum_{d \mid c} \mu(c/d)\, R_d(n) \tag{1}$$

in which

$$R_d(n) = \frac{d^{\,n}\, n^{\,n-2}\, [(dn-d-1)!]^{\,n}}{(d!)^{\,n(n-1)}} \tag{2}$$

$d \mid c$ means that d is a divisor (factor) of c (d = 1 and d = c counting as divisors) and $\mu$ is the Möbius function (of integers) defined by $\mu(1) = 1$, $\mu(s) = 0$ if s has a squared divisor, and $\mu(p_1 p_2 \cdots p_k) = (-1)^k$ if all the primes $p_1, p_2, \ldots, p_k$ are distinct. (See, for example, Hardy and Wright, 1938, p. 234.) Some values of $D_{c,1}(n)$ are shown in Table 1 and these values can easily run into the billions or more. The question arises, how can we select a random design when $D_{c,1}(n)$ is large?

The program selects a random BUS design, all nonperiodic designs having equal probability of being selected, in the following manner. Each nonperiodic design can be shown to have d times the probability of each periodic design whose period is q/d (and which therefore consists of d abutting congruent segments). We can start by selecting one of the n treatments with probability 1/n and the next one with probability 1/(n-1), though these probabilities are unimportant because a circular sequence can be started anywhere. The program then becomes recursive. Suppose we have selected the first r treatments $A_1 A_2 \ldots A_r$ and are about to select the next one $A_{r+1}$, like a chess player contemplating making his next move. We want the conditional probability that $A_{r+1} = B$, say, to be proportional to the number N(B) of different ways in which the sequence $A_1 A_2 \ldots A_r B$ can be completed to become a "BEST circuit" of cn(n-1) terms, multiplied by the number of edges from $A_r$ to B. (This product will usually be much greater than the number of ways of completing a design.) At each stage all possible vertices B must be taken into consideration (like a 'brute force' chess program).

Consider the original oriented (directed) linear graph G whose n by n incidence matrix consists of zeros on the main diagonal and c everywhere else. Adjust G by removing one (oriented) edge $A_1 A_2$, one edge $A_2 A_3$, ..., one edge $A_{r-1} A_r$, because these have already been used, and also the edge $A_r B$ whose "more permanent" removal we are contemplating. We wish to select B ("make the move under consideration") with a probability proportional to the number N(B) of routes from B to $A_1$. Thus we now want to enumerate the routes that will use up all the remaining edges, while beginning at B and ending at $A_1$. Introduce an artificial node O into the graph together with one edge $A_1 O$ and one edge $OB$. Call the graph so produced G(B). Then N(B) is equal to the number of "oriented unicursal" routes through G(B). By an "oriented unicursal" route through an oriented graph we mean one that uses up all the edges in accordance with their multiplicities and directions. Such routes exist if and only if the graph is connected and "appropriate" in the sense of Appendix A. In fact it is appropriate because of the introduction of the artificial node O. The number of such routes can be obtained by the BEST theorem, the details being given, for example, by Dawson and Good (1957, p. 947) and repeated in our Appendix A. The program requires the computation of numerous determinants and the computations have to be done by machine because the determinants seldom have a nice pattern. Of course N(B) = 0 if G(B) is disconnected and in this case B will not be selected (to select it would be a "clearly losing move").

Table 1. Some values of $D_{c,1}(n)$.

n\c          1            2             3         4            5
 2           1            0             0         0            0
 3           3           39           999    32,136    1,200,225
 4         256    6,479,872    5.245(E11)
 5     972,000

The number of periodic designs having period (meaning shortest period) q/d is denoted by $D_{c,d}(n)$ and of course is equal to $D_{c/d,1}(n)$, so these numbers can be obtained at once from Table 1. For the case c = 2, n = 3, there are 39 nonperiodic designs and 3 periodic ones, and all 42 designs are shown in Table 2, the three periodic designs being marked with an asterisk. Each nonperiodic design has probability 2/81 of being produced by the program and each periodic design a probability 1/81. We ran the program to produce 810 designs (42 distinct ones) of which 784 were nonperiodic (780 being expected) and 26 had period 6. No other period is possible because a period shorter than 6 would imply c > 2.

The frequencies of the designs in the sample are also shown in Table 2. A Pearson chi-squared test for equiprobability of the nonperiodic designs gave 42.06 with 38 degrees of freedom. (A nearly equivalent test would use the theoretical expectations of 20 for each design.) The sample frequencies of the three periodic designs were 5, 13, and 8, close enough to the theoretical expectations of 10, 10, and 10 (or to 8.67 if conditioned on the total of 26). In Bayesian terminology the sample clearly supports the "null hypothesis", and in non-Bayesian terminology the null hypothesis is not rejected by any reasonable criterion.

Thus the SAS program confirmed the revised theory as well as its own accuracy, and it also produced the list of possible designs for c = 2, n = 3. (The general technique here is of interest: to produce designs at random until one is convinced that all distinct ones have been obtained. This is an example of the Monte Carlo method and of the coupon collector's problem. For the latter see, for example, the index of Feller, 1950/68.) For the case c = 2, n = 3 it is therefore no longer necessary to run the program, since a random nonperiodic design can be obtained from Table 2 by the simplest use of random numbers. When n > 2 and c + n > 5, the number of designs is too large to be conveniently listed on paper, except perhaps when c = n = 3, but the program produces random designs in any case, equally probable apart from the periodic designs.

REFERENCES

Ball, W.W. Rouse & Coxeter, H.M.S. (1940). Mathematical Recreations and Essays. London: Macmillan.

Dawson, Reed & Good, I.J. (1957). Exact Markov probabilities from oriented linear graphs. Annals of Mathematical Statistics 28, 946-956.

Feller, W. (1950/68). An Introduction to Probability Theory and its Applications. New York: Wiley.

Finney, D.J. & Outhwaite, Anne D. (1956). Serially balanced sequences in bioassay. Proc. Roy. Soc. B 145, 493-507.

Good, I.J. (1987). Serial unblocked designs balanced for correlation between neighbours. Communications in Statistics: Theory and Methods 16, 1153-1159.

Good, I.J. (1990). Serial unblocked designs balanced for correlation between neighbors, II. Communications in Statistics: Theory and Methods. In press.

Hardy, G.H. and Wright, E.M. (1938). Introduction to the Theory of Numbers. Oxford: Clarendon Press.

Milliken, G.A. and Johnson, Dallas E. (1984). Analysis of Messy Data, Vol. 1, Designed Experiments. Belmont, California: Lifetime Learning Publications.

Williams, R.M. (1952). Experimental designs for serially correlated observations. Biometrika 39, 151-167.

APPENDIX A. THE BEST THEOREM.

We here closely follow the wording of Dawson & Good (1957, Section 2). Suppose we have an oriented linear graph H containing u vertices (nodes) and such that the number of oriented edges from vertex i to vertex j is $m_{ij}$. A u x u matrix $M = (m_{ij})$ is called an incidence matrix of H if the rows have the same ordering as the columns. This matrix is unique to within the same rearrangement of rows as of columns. The graph is appropriate if, for each vertex, the number of edges leading in is equal to the number leading out. In terms of M this means that each row has the same sum as the corresponding column. A (complete) circuit in such a graph is defined as a "unicursal" path passing exactly once through each edge in the right direction. Unicursal paths were first studied by Euler in 1736 for nonoriented graphs (Rouse Ball & Coxeter, 1960, pp. 242-254).

The BEST theorem gives the number of distinct (complete) circuits for an appropriate graph H when all edges are regarded as distinguishable. Let

$$m_{i\cdot} = \sum_j m_{ij} = \sum_j m_{ji},$$

which is the sum of the ith row (and of the ith column). Let $M' = (m'_{ij})$ be the u' x u' matrix formed from M by deleting every row and column consisting entirely of zeros (that is, eliminating isolated vertices). Then $\sum_j m'_{ij} = \sum_j m'_{ji} = m'_{i\cdot}$, say, where $m'_{i\cdot} > 0$ (i = 1, 2, ..., u'). Let $m^*_{ij} = -m'_{ij}$ when $i \neq j$, let $m^*_{ii} = m'_{i\cdot} - m'_{ii}$, and let $M^* = (m^*_{ij})$, which is also $(m'_{i\cdot}\,\delta_{ij}) - M'$. Then $M^*$ is a square matrix with each row and column summing to zero, from which it can be shown that all its cofactors are equal; let $\|M^*\|$ denote the common value of these cofactors. Then the BEST theorem asserts that the number of circuits in H is

$$\|M^*\| \prod_{i=1}^{u'} (m'_{i\cdot} - 1)!$$

In our application it can be deduced that the number of "BEST" circuits, at the beginning of the algorithm, is $N_c(n)$ where

$$N_c(n) = c^{\,n-1}\, n^{\,n-2}\, [(cn-c-1)!]^{\,n} \tag{3}$$

as mentioned by Good (1987). Formula (1) for $D_{c,1}(n)$ is deducible from (3) by means of a combinatorial argument (Good, 1990).

Table 2. THE POSSIBLE DESIGNS WHEN c=2, n=3. The periodic designs are marked with an asterisk; they are numbers 26, 38, and 42.

No.   DESIGN          FREQUENCY
 1    121213132323       23
 2    121213231323       26
 3    121213232313       20
 4    121231313232       23
 5    121231321323       18
 6    121231323132       16
 7    121231323213       26
 8    121232131323       16
 9    121232132313       17
10    121232313132       15
11    121232313213       18
12    121232321313       16
13    121312132323       20
14    121312313232       17
15    121312321323       20
16    121312323132       17
17    121312323213       30
18    121313123232       19
19    121313212323       18
20    121313231232       23
21    121313232123       21
22    121321231323       15
23    121321232313       29
24    121321312323       30
25    121321323123       17
26    121323121323*       5
27    121323123132       17
28    121323123213       18
29    121323131232       16
30    121323132123       23
31    121323212313       11
32    121323213123       25
33    123123132132       21
34    123123213132       24
35    123123213213       20
36    123131232132       14
37    123131321232       16
38    123132123132*      13
39    123132123213       25
40    123132131232       16
41    123212321313       28
42    123213123213*       8

APPENDIX B. THE PROGRAM

The program to compute random serial unblocked designs is written as a SAS macro. The macro uses the IML procedure to do the necessary computations. There are four parameters in the call to the macro: NTRT is the number of treatments; NREP is the number of replications of pairs of treatments; NDES is the number of designs to be generated; and L2 = NREP x NTRT x (NTRT - 1). The designs in Table 2, for the case c=2, n=3, were generated by the program.

    TITLE 'SERIAL UNBLOCKED DESIGN GENERATOR';
    %MACRO SERIAL(NTRT,NREP,NDES,L2);
       /* <== NTRT IS THE NUMBER OF TREATMENTS */
       /* <== NREP IS THE NUMBER OF REP PAIRS  */
       /* <== NDES IS THE NUMBER OF DESIGNS    */
       /* <== L2 IS NTRT*NREP*(NTRT-1)         */
    %LET LEN = &NREP*&NTRT*(&NTRT-1);            /* LENGTH OF DESIGN */

    PROC IML;
    N = &NTRT;                                   /* <== SET NUMBER OF TREATMENTS */
    C = &NREP;                                   /* <== SET NUMBER OF PAIR REPS  */
    MAX = C*(N-1) + 1;
    FACTORS = J(MAX,1);                          /* ZERO FACTORIAL */

    START FAC;                                   /* COMPUTE NECESSARY FACTORIALS */
       DO I = 2 TO MAX;
          FACTORS[I,] = I*FACTORS[I-1,];
       END;
    FINISH;
    RUN FAC;

    L = &LEN;                                    /* LENGTH OF DESIGN */
    DESGN = J(&NDES,L,0);

    START GENER;
    DO X = 1 TO &NDES;
       INITIAL = C # J(N,N,1) - C # I(N);        /* INCIDENCE MATRIX */
       IDENT = I(N+1);
       NULL = J(1,N+1,0);                        /* TOP ROW OF COUNTING MATRIX */
       COL1 = J(N,1,0);                          /* FIRST COL OF COUNTING MATRIX */
       NULLVEC = J(L,1,0);
       NBASIC = J(N+1,1,0);                      /* STORAGE FOR DESIGN COUNTS */
       RANDOM = UNIFORM(NULLVEC);
       DESGN[X,1] = CEIL(N # RANDOM[1,]);        /* FIRST TRT OF DESIGN */
       COL1[DESGN[X,1],] = 1;
       BASIC = NULL // (COL1 || INITIAL);        /* COUNTING MATRIX */
       PLACE = J(N+1,1,0);                       /* GIVES TREATMENT NUMBER */
       ONE = J(N+1,1,1);
       ACCUM = J(N+1,N+1,0);                     /* ACCUMULATED SUM MATRIX */
       ACCUM[,1] = ONE;
       /* BUILD A MATRIX WHOSE LOWER TRIANGLE IS ALL ONE; IS USED TO GET CUML SUMS */
       DO W = 2 TO (N+1);
          PLACE[W,] = W-1;
          ONE[W-1,] = 0;
          ACCUM[,W] = ONE;
       END;

       * OBTAIN REMAINING TREATMENTS IN RANDOM ORDER ;
       DO Z = 2 TO L;
          LAST = DESGN[X,Z-1] + 1;               /* ROW OF COUNTING MATRIX */
          NPATH = BASIC[LAST,];                  /* # PATHS AT EACH NODE */

          * COUNT NUMBER OF POSSIBLE DESIGNS AT EACH NODE ;
          DO Y = 2 TO (N+1);
             IF (Y = LAST) THEN NBASIC[Y,] = 0;
             IF (Y ^= LAST) THEN DO;             /* TRANSFORM AS PER BEST THRM */
                TBASIC = BASIC;                  /* COPY OF COUNTING MATRIX */
                TBASIC[1,Y] = 1;                 /* PUT 1 IN TOP ROW */
                TBASIC[LAST,Y] = TBASIC[LAST,Y] - 1;
                CSUM = TBASIC[+,];               /* COLUMN TOTALS */
                LOCATE = LOC(CSUM > 0);          /* LOCATION OF NON-ZERO COLS */
                REDUCE = IDENT[,LOCATE];         /* GET SUBSET OF IDENTITY */
                BPRIME = REDUCE` * TBASIC * REDUCE;   /* ELIMINATE NULL ROWS, COLS */
                CPRIME = CSUM * REDUCE;          /* ELIMINATE ZERO COL TOTALS */
                NPRIME = NCOL(BPRIME);           /* NUMBER OF NON-ZERO COLS */
                BSTAR = DIAG(CPRIME) - BPRIME;
                CFACTOR = FACTORS[CPRIME+1,];    /* GET NEEDED FACTORIALS */
                NB = ABS(DET(BSTAR[2:NPRIME,2:NPRIME]));   /* COUNT THE DESIGNS */
                NBASIC[Y,] = NB # NPATH[,Y] # CFACTOR[#,];
             END;
          END;

          * CHOOSE NEXT STEP AT RANDOM BASED ON ABOVE COUNTS ;
          NSUM = (ACCUM * NBASIC) / NBASIC[+,];
          POSIT = (RANDOM[Z,] <= NSUM) # (NBASIC > 0) # PLACE;
          LOCATE = LOC(POSIT > 0);
          REDUCE = IDENT[,LOCATE];
          NPOSIT = REDUCE` * POSIT;
          NEW = NPOSIT[><,];
          BASIC[LAST,(NEW+1)] = BASIC[LAST,(NEW+1)] - 1;
          DESGN[X,Z] = NEW;
       END;
    END;
    FINISH;
    RUN GENER;

    * CREATE DATA SET CONTAINING DESIGN AND OUTPUT;
    CREATE RESULT FROM DESGN;
    APPEND FROM DESGN;
    QUIT;

    * FORMAT OUTPUT IN DATA SET AND PRINT OUT DESIGN;
    DATA _NULL_;
       SET RESULT;
       ARRAY VV COL1-COL&L2;
       FILE PRINT;
       I = 0;
       DO OVER VV;
          I = I + 1;
          PUT @I VV @;
       END;
       I = I + 1;
       PUT @I ' ';
    RUN;
    %MEND SERIAL;

    %SERIAL(3,2,50,12)
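The selection rule described in the text (choose the next treatment B with probability proportional to the number of unused edges from A_r to B times N(B), where N(B) comes from the BEST theorem applied to G(B)) can also be sketched outside SAS. The Python below is an illustration under stated assumptions, not a translation of the macro. The function names (`determinant`, `best_count`, `random_bus_design`, `is_bus_design`) are hypothetical, exact rational arithmetic stands in for the floating-point DET call, and node 0 plays the role of the artificial node O.

```python
import random
from fractions import Fraction
from math import factorial


def determinant(a):
    """Exact determinant of a square matrix (list of lists) by Gaussian elimination."""
    a = [[Fraction(x) for x in row] for row in a]
    size, det = len(a), Fraction(1)
    for k in range(size):
        pivot = next((r for r in range(k, size) if a[r][k] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != k:
            a[k], a[pivot] = a[pivot], a[k]
            det = -det
        det *= a[k][k]
        for r in range(k + 1, size):
            factor = a[r][k] / a[k][k]
            a[r] = [x - factor * y for x, y in zip(a[r], a[k])]
    return det


def best_count(m):
    """BEST theorem count for an appropriate graph; m[i][j] = number of edges i -> j."""
    live = [i for i in range(len(m)) if sum(m[i]) > 0]       # drop isolated vertices
    sub = [[m[i][j] for j in live] for i in live]
    degree = [sum(row) for row in sub]
    star = [[(degree[i] if i == j else 0) - sub[i][j] for j in range(len(live))]
            for i in range(len(live))]                        # M* = diag(degrees) - M'
    count = determinant([row[1:] for row in star[1:]])        # cofactor of entry (1, 1)
    for d in degree:
        count *= factorial(d - 1)
    return int(count)


def random_bus_design(n, c, rng=random):
    """Sample one circular design sequentially; an assumed sketch of the selection rule."""
    q = c * n * (n - 1)
    # remaining[i][j] = unused edges i -> j; row/column 0 is the artificial node O
    remaining = [[0] * (n + 1)] + [[0] + [c if i != j else 0 for j in range(1, n + 1)]
                                   for i in range(1, n + 1)]
    design = [rng.randint(1, n)]
    remaining[design[0]][0] = 1                  # edge A_1 -> O
    while len(design) < q:
        last = design[-1]
        weights = []
        for b in range(1, n + 1):
            edges = remaining[last][b]
            if edges == 0:
                weights.append(0)
                continue
            g = [row[:] for row in remaining]
            g[last][b] -= 1                      # tentatively use one edge A_r -> B
            g[0][b] = 1                          # edge O -> B
            weights.append(edges * best_count(g))
        total = sum(weights)
        probs = [float(Fraction(w, total)) for w in weights]
        nxt = rng.choices(range(1, n + 1), weights=probs)[0]
        remaining[last][nxt] -= 1
        design.append(nxt)
    return design


def is_bus_design(design, n, c):
    """True if every ordered pair of distinct treatments occurs c times circularly."""
    pairs = {}
    for a, b in zip(design, design[1:] + design[:1]):
        pairs[(a, b)] = pairs.get((a, b), 0) + 1
    return len(design) == c * n * (n - 1) and all(
        pairs.get((a, b), 0) == c
        for a in range(1, n + 1) for b in range(1, n + 1) if a != b)


if __name__ == "__main__":
    rng = random.Random(2024)
    for _ in range(5):
        d = random_bus_design(3, 2, rng)
        print("".join(map(str, d)), is_bus_design(d, 3, 2))
```

A disconnected G(B) gives a zero cofactor, so such a B receives weight zero and is never chosen, which matches the "clearly losing move" remark. The sketch checks only that each output is a valid BUS design; it does not verify the claimed sampling probabilities.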
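The counts quoted for c = 2, n = 3 (39 nonperiodic and 3 periodic designs) and several Table 1 entries can be checked by brute force in small cases. The following sketch assumes that formula (1) has the Möbius-inversion form $D_{c,1}(n) = \frac{1}{c}\sum_{d \mid c}\mu(c/d)\,R_d(n)$, with $R_d(n)$ as in (2). That form is an assumption of this example, adopted because it reproduces the tabulated values. The function names are hypothetical.

```python
from fractions import Fraction
from math import factorial


def mobius(s):
    """Mobius function: 0 if s has a squared prime factor, else (-1)^(number of primes)."""
    result, p = 1, 2
    while p * p <= s:
        if s % p == 0:
            s //= p
            if s % p == 0:
                return 0
            result = -result
        p += 1
    return -result if s > 1 else result


def R(d, n):
    """R_d(n) = d^n n^(n-2) [(dn-d-1)!]^n / (d!)^(n(n-1)), as in formula (2)."""
    return Fraction(d**n * n**(n - 2) * factorial(d * n - d - 1)**n,
                    factorial(d)**(n * (n - 1)))


def D(c, n):
    """Assumed form of formula (1): (1/c) * sum over divisors d of c of mu(c/d) R_d(n)."""
    total = sum(mobius(c // d) * R(d, n) for d in range(1, c + 1) if c % d == 0)
    value = Fraction(total) / c
    assert value.denominator == 1
    return int(value)


def enumerate_designs(c, n):
    """All circular BUS designs, one canonical representative per rotation class."""
    q = c * n * (n - 1)
    remaining = {(a, b): c for a in range(1, n + 1) for b in range(1, n + 1) if a != b}
    found = set()

    def extend(seq):
        last = seq[-1]
        if len(seq) == q:
            # the single unused edge must close the circle back to the first treatment
            if remaining.get((last, seq[0]), 0) == 1:
                found.add(min(tuple(seq[i:] + seq[:i]) for i in range(q)))
            return
        for b in range(1, n + 1):
            if b != last and remaining[(last, b)] > 0:
                remaining[(last, b)] -= 1
                seq.append(b)
                extend(seq)
                seq.pop()
                remaining[(last, b)] += 1

    extend([1])          # every design contains treatment 1, so start there
    return found


def period(design):
    """Shortest rotation that maps the circular sequence onto itself."""
    q = len(design)
    return next(p for p in range(1, q + 1)
                if q % p == 0 and design[p:] + design[:p] == design)


if __name__ == "__main__":
    designs = enumerate_designs(2, 3)
    periodic = [d for d in designs if period(d) < len(d)]
    print(len(designs), len(periodic), D(2, 3), R(2, 3))   # prints: 42 3 39 81
```

With these numbers, 39 x 2 + 3 x 1 = 81 = R_2(3), which agrees with the stated probabilities of 2/81 for each nonperiodic design and 1/81 for each periodic one.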
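The Pearson chi-squared value quoted for the 39 nonperiodic designs can be recomputed from the sample frequencies in Table 2. The short sketch below uses the sample mean 784/39 as the expected count for each nonperiodic design, which is an assumption about how the reported statistic was formed. The variable and function names are illustrative only.

```python
# Sample frequencies of designs 1..42 in the run of 810 designs (Table 2).
frequency = [23, 26, 20, 23, 18, 16, 26, 16, 17, 15, 18, 16, 20, 17,
             20, 17, 30, 19, 18, 23, 21, 15, 29, 30, 17, 5, 17, 18,
             16, 23, 11, 25, 21, 24, 20, 14, 16, 13, 25, 16, 28, 8]
periodic_numbers = {26, 38, 42}      # designs marked with an asterisk


def pearson_chi_squared(observed):
    """Pearson statistic against equal expected counts (the sample mean)."""
    expected = sum(observed) / len(observed)
    return sum((o - expected) ** 2 / expected for o in observed)


nonperiodic = [f for i, f in enumerate(frequency, start=1)
               if i not in periodic_numbers]
periodic = [f for i, f in enumerate(frequency, start=1)
            if i in periodic_numbers]

chi2 = pearson_chi_squared(nonperiodic)
degrees_of_freedom = len(nonperiodic) - 1

if __name__ == "__main__":
    print(sum(frequency), sum(nonperiodic), periodic)
    print(round(chi2, 2), degrees_of_freedom)
```

Under this assumption the statistic rounds to 42.06 on 38 degrees of freedom, matching the value given in the text. The three periodic frequencies come out as 5, 13, and 8, with conditional expectation 26/3, or about 8.67, each.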