A SAS PROGRAM FOR CHOOSING RANDOM UNBLOCKED DESIGNS
BALANCED FOR CORRELATION BETWEEN NEIGHBORS
I.J. Good, Virginia Polytechnic Institute and State University
E.P. Smith, Virginia Polytechnic Institute and State University
David D. Morris, Abbott Laboratories
ABSTRACT

In the past it must have been at least unusual for a theorem about linear graphs to be used for a problem in experimental design. We here make such an application for selecting a random balanced unblocked serial design or BUS design. The algorithm makes essential use of a computer program written as a SAS macro. The computer revealed the need to correct the original theory; the correction used some elementary theory of numbers.

In one example we obtained enough random designs to be convinced that we had the complete set; a somewhat unusual application of the Monte Carlo method.

INTRODUCTION

Correlation between experimental units can lead to problems in the analysis of data in agricultural, psychological and medical studies (Milliken and Johnson, 1984, p. 363). To alleviate the problem in agricultural studies, Williams (1952) suggested the use of a 'serial' design (the treatment being in a one-dimensional series) with blocking. A similar design involves the treatments in series but without blocking. This type of design is of potential importance in studies with correlation but, in common with the Williams design, it is difficult to generate a random design. In this paper we describe a SAS program that overcomes this difficulty for the unblocked designs. An unusual and possibly unique feature of our program is that it makes use of a theorem about the enumeration of linear graphs to solve a problem of experimental design. When the program was run it showed that it was necessary to correct the theory in Good (1987), which gave the algorithm on which the program was based. The correction, which required a tricky combinatorial argument, involving some elementary theory of numbers, is given by Good (1990). The algorithm requires no correction, but its interpretation does.

Given n distinct treatments $t_1, t_2, \ldots, t_n$, a balanced unblocked serial (BUS) design (Good, 1987, 1990, parts of which we closely follow) is a circular sequence of cn(n-1) treatments $T_1, T_2, \ldots, T_q$, where q = cn(n-1) and c is a positive integer, such that

(i) each $T_i$ is one of the treatments $t_1, t_2, \ldots, t_n$, and

(ii) each pair of distinct treatments (A, B) occurs exactly c times in each order AB and BA (as in Finney & Outhwaite, 1956). There are no occurrences of AA for any A.

An example of a BUS design in which c = 2 and n = 3 is (t,t,t.t,t.t,t,t,t.t,t,t,) in which the first $t_1$ is regarded as following the last $t_2$ for circularization. In practice the q plots would usually be strung out on a straight line with one extra plot, with treatment $t_1$ as plot number q + 1. The design is then balanced for pairs, but there would be one extra treatment $t_1$. The purpose of the balancing is only to remove some of the effect of correlation between adjacent pairs, and the analysis of a completed experiment is to be carried out without reference to the ordering of the q + 1 plots.

Any rotation of a design will be regarded as the same design. The case n=2 is trivial because in that case all designs are of the form $(t_1 t_2 t_1 t_2 \ldots t_1 t_2)$, and we shall always assume that n > 2. A small fraction of designs are periodic, for example, (t.t,t.t,t,t,t.t,t.t,t,t,), where the "period" is 6 instead of 12; when c=2 and n=3 the fraction of periodic designs is 1/14, while when c=4 and n=3 the fraction is 3/3296. Such designs seem too systematic, so we suggest that when the program produces a periodic design it should be ignored. This creates no problem because many designs can be produced in a single run of the program.

The program selects random BUS designs by making use of a theorem for enumerating unicursal routes in an oriented (directed) linear graph (or nodes-and-edges diagram with arrows on each edge). The theorem was applied by Dawson & Good (1957) to a different statistical problem. They called it the BEST theorem in honor of de Bruijn, Aardenne-Ehrenfest, Smith, and Tutte. It follows from the BEST theorem, combined with some further combinatorics and number theory, that the number of possible nonperiodic designs is $D_{c,1}(n)$ where

$$D_{c,1}(n) = \frac{1}{cn} \sum_{d \mid c} \mu(c/d)\, R_d(n) \qquad (1)$$

in which

$$R_d(n) = \frac{d^{\,n}\, n^{\,n-1}\, [(dn-d-1)!]^{\,n}}{(d!)^{\,n(n-1)}} \qquad (2)$$

Here $d \mid c$ means that d is a divisor (factor) of c (d=1 and d=c counting as divisors) and $\mu$ is the Möbius function (of integers) defined by

$\mu(1) = 1$, $\mu(s) = 0$ if s has a squared divisor, $\mu(p_1 p_2 \cdots p_k) = (-1)^k$ if all the primes $p_1, p_2, \ldots, p_k$ are distinct.

(See, for example, Hardy and Wright, 1938, p. 234.) Some values of $D_{c,1}(n)$ are shown in Table 1 and these values can easily run into the billions or more. The question arises, how can we select a random design when $D_{c,1}(n)$ is large?
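The counting formula can be checked numerically on small cases. The following Python sketch is not part of the authors' SAS program. It assumes that formula (1) reads $D_{c,1}(n) = \frac{1}{cn}\sum_{d \mid c}\mu(c/d)R_d(n)$ with $R_d(n)$ as in (2), and it compares the result with a brute-force enumeration of circular sequences. The function names (`mobius`, `R`, `nonperiodic_count`, `enumerate_designs`, `period`) are illustrative choices, not names from the paper.

```python
from fractions import Fraction
from math import factorial


def mobius(s):
    """Mobius function: 0 if s has a squared prime factor, else (-1)^(number of primes)."""
    result, p = 1, 2
    while p * p <= s:
        if s % p == 0:
            s //= p
            if s % p == 0:
                return 0
            result = -result
        p += 1
    return -result if s > 1 else result


def R(d, n):
    """R_d(n) as assumed in formula (2), kept exact as a Fraction."""
    return Fraction(d**n * n**(n - 1) * factorial(d * n - d - 1)**n,
                    factorial(d)**(n * (n - 1)))


def nonperiodic_count(c, n):
    """Number of nonperiodic BUS designs under the assumed form of formula (1)."""
    total = sum(mobius(c // d) * R(d, n) for d in range(1, c + 1) if c % d == 0)
    value = total / (c * n)
    if value.denominator != 1:
        raise ValueError("formula did not give an integer")
    return value.numerator


def enumerate_designs(c, n):
    """Brute force: all BUS designs up to rotation, each as its smallest rotation."""
    q = c * n * (n - 1)
    left = {(a, b): c for a in range(1, n + 1) for b in range(1, n + 1) if a != b}
    found = set()

    def extend(seq):
        if len(seq) == q:
            # q - 1 edges are used; the single remaining edge must close the circle
            if left.get((seq[-1], seq[0]), 0) == 1:
                found.add(min(tuple(seq[i:] + seq[:i]) for i in range(q)))
            return
        a = seq[-1]
        for b in range(1, n + 1):
            if b != a and left[(a, b)] > 0:
                left[(a, b)] -= 1
                seq.append(b)
                extend(seq)
                seq.pop()
                left[(a, b)] += 1

    extend([1])  # every design contains treatment 1, so start there
    return found


def period(design):
    """Shortest period of a circular sequence given as a tuple."""
    q = len(design)
    return next(p for p in range(1, q + 1)
                if q % p == 0 and design[p:] + design[:p] == design)


if __name__ == "__main__":
    designs = enumerate_designs(2, 3)
    nonperiodic = [d for d in designs if period(d) == len(d)]
    print(len(designs), len(nonperiodic), nonperiodic_count(2, 3))
```

For c = 2, n = 3 the enumeration and the formula should agree on the count of nonperiodic designs. The brute-force search is only practical for very small c and n, which is the reason the paper turns to a sequential random selection instead.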
The program selects a random BUS design, all nonperiodic designs having equal probability of being selected, in the following manner. Each nonperiodic design can be shown to have d times the probability of each periodic design whose period is q/d (and which therefore consists of d abutting congruent segments).

We can start by selecting one of the n treatments with probability 1/n and the next one with probability 1/(n-1), though these probabilities are unimportant because a circular sequence can be started anywhere. The program then becomes recursive.

Suppose we have selected the first r treatments $A_1 A_2 \ldots A_r$ and are about to select the next one $A_{r+1}$, like a chess player contemplating making his next move. We want the conditional probability that $A_{r+1} = B$, say, to be proportional to the number N(B) of different ways in which the sequence $A_1 A_2 \ldots A_r B$ can be completed to become a "BEST circuit" of cn(n-1) terms, multiplied by the number of edges from $A_r$ to B. (This product will usually be much greater than the number of ways of completing a design.) At each stage all possible vertices B must be taken into consideration (like a 'brute force' chess program).

Consider the original oriented (directed) linear graph G whose n by n incidence matrix consists of zeros on the main diagonal and c everywhere else. Adjust G by removing one (oriented) edge $A_1A_2$, one edge $A_2A_3$, ..., one edge $A_{r-1}A_r$, because these have already been used, and also the edge $A_rB$ whose "more permanent" removal we are contemplating. We wish to select B ("make the move under consideration") with a probability proportional to the number N(B) of routes from B to $A_1$. Thus we now want to enumerate the routes that will use up all the remaining edges, while beginning at B and ending at $A_1$. Introduce an artificial node O into the graph together with one edge $A_1O$ and one edge OB. Call the graph so produced G(B). Then N(B) is equal to the number of "oriented unicursal" routes through G(B). By an "oriented unicursal" route through an oriented graph we mean one that uses up all the edges in accordance with their multiplicities and directions. Such routes exist if and only if the graph is connected and "appropriate" in the sense of Appendix A. In fact it is appropriate because of the introduction of the artificial node O. The number of such routes can be obtained by the BEST theorem, the details being given, for example, by Dawson and Good (1957, p. 947) and are repeated in our Appendix A. The program requires the computation of numerous determinants and the computations have to be done by machine because the determinants seldom have a nice pattern. Of course N(B) = 0 if G(B) is disconnected and in this case B will not be selected (to select it would be a "clearly losing move").

Table 1. Some values of $D_{c,1}(n)$.

n\c   1          2            3             4         5
2     1          0            0             0         0
3     3          39           999           32,136    1,200,225
4     256        6,479,872    5.245(E11)
5     972,000

The number of periodic designs, having period (meaning shortest period) q/d, is denoted by $D_{c,d}(n)$ and of course is equal to $D_{c/d,1}(n)$, so these numbers can be obtained at once from Table 1.

For the case c=2, n=3, there are 39 nonperiodic designs and 3 periodic ones, and all 42 designs are shown in Table 2, the three periodic designs being marked with an asterisk. Each nonperiodic design has probability 2/81 of being produced by the program and each periodic design a probability 1/81. We ran the program to produce 810 designs (42 distinct ones) of which 784 were nonperiodic (780 being expected) and 26 had period 6. No other period is possible because a period shorter than 6 would imply c > 2. The frequencies of the designs in the sample are also shown in Table 2. A Pearson chi-squared test for equiprobability of the nonperiodic designs gave 42.06 with 38 degrees of freedom. (A nearly equivalent test would use the theoretical expectations of 20 for each design.) The sample frequencies of the three periodic designs were 5, 13, and 8, close enough to the theoretical expectations of 10, 10, and 10 (or to 8.67 if conditioned on the total of 26). In Bayesian terminology the sample clearly supports the "null hypothesis", and in non-Bayesian terminology the null hypothesis is not rejected by any reasonable criterion.

Thus the SAS program confirmed the revised theory as well as its own accuracy, and it also produced the list of possible designs for c = 2, n = 3. (The general technique here is of interest: to produce designs at random until one is convinced that all distinct ones have been obtained. This is an example of the Monte Carlo method and of the coupon collector's problem. For the latter see, for example, the index of Feller, 1950/68.) For the case c=2, n=3 it is therefore no longer necessary to run the program since a random nonperiodic design can be obtained from Table 2 by the simplest use of random numbers. When n>2 and c+n>5, the number of designs is too large to be conveniently listed on paper, except perhaps when c=n=3, but the program produces random designs in any case, equally probable apart from the periodic designs.
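The sequential selection rule (choose the next treatment B with probability proportional to N(B) times the number of remaining edges from the current treatment to B) can be sketched in Python as follows. This is an illustrative re-expression under stated assumptions, not the authors' SAS macro: the names `det`, `best_count`, `pair_counts` and `random_design` are hypothetical, node 0 plays the role of the artificial node O, and exact fractions replace the floating-point determinant used by a matrix language.

```python
import random
from fractions import Fraction
from math import factorial


def det(matrix):
    """Exact determinant by Gaussian elimination over fractions."""
    m = [[Fraction(x) for x in row] for row in matrix]
    size, sign, result = len(m), 1, Fraction(1)
    for col in range(size):
        pivot = next((r for r in range(col, size) if m[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            sign = -sign
        result *= m[col][col]
        for r in range(col + 1, size):
            factor = m[r][col] / m[col][col]
            for k in range(col, size):
                m[r][k] -= factor * m[col][k]
    return sign * result


def best_count(m):
    """BEST theorem count of circuits for an appropriate graph with incidence matrix m."""
    keep = [i for i in range(len(m)) if sum(m[i]) > 0]      # drop isolated vertices
    sub = [[m[i][j] for j in keep] for i in keep]
    degrees = [sum(row) for row in sub]
    star = [[(degrees[i] if i == j else 0) - sub[i][j] for j in range(len(sub))]
            for i in range(len(sub))]
    count = det([row[1:] for row in star[1:]])               # one cofactor of M*
    for deg in degrees:
        count *= factorial(deg - 1)
    return int(count)


def pair_counts(design):
    """Counts of ordered adjacent pairs, treating the design as circular."""
    counts = {}
    for a, b in zip(design, design[1:] + design[:1]):
        counts[(a, b)] = counts.get((a, b), 0) + 1
    return counts


def random_design(n, c, rng=random):
    """Draw one BUS design by the sequential rule described in the text."""
    q = c * n * (n - 1)
    g = [[0] * (n + 1) for _ in range(n + 1)]                # node 0 is the artificial node O
    for a in range(1, n + 1):
        for b in range(1, n + 1):
            if a != b:
                g[a][b] = c
    first = rng.randint(1, n)
    g[first][0] = 1                                          # artificial edge A_1 -> O
    design = [first]
    while len(design) < q:
        last = design[-1]
        weights = []
        for b in range(1, n + 1):
            if g[last][b] == 0:
                weights.append(0)
                continue
            t = [row[:] for row in g]
            t[last][b] -= 1                                  # tentatively use one edge A_r -> B
            t[0][b] = 1                                      # artificial edge O -> B
            weights.append(best_count(t) * g[last][b])
        nxt = rng.choices(range(1, n + 1), weights=weights)[0]
        g[last][nxt] -= 1
        design.append(nxt)
    return design


if __name__ == "__main__":
    d = random_design(3, 2, random.Random(0))
    print(d, pair_counts(d))
```

Because the weights are converted to floating point inside `random.choices`, this sketch is only suitable for small c and n; for larger cases the weights would need to be normalized first.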
REFERENCES

Ball, W.W. Rouse & Coxeter, H.M.S. (1940). Mathematical Recreations and Essays. London: Macmillan.

Dawson, Reed & Good, I.J. (1957). Exact Markov probabilities from oriented linear graphs. Annals of Mathematical Statistics 28, 946-956.

Feller, W. (1950/68). An Introduction to Probability Theory and its Applications. New York: Wiley.

Finney, D.J. & Outhwaite, Anne D. (1956). Serially balanced sequences in bioassay. Proc. Roy. Soc. B 145, 493-507.

Good, I.J. (1987). Serial unblocked designs balanced for correlation between neighbours. Communications in Statistics: Theory and Methods 16, 1153-1159.

Good, I.J. (1990). Serial unblocked designs balanced for correlation between neighbors, II. Communications in Statistics: Theory and Methods. In press.

Hardy, G.H. and Wright, E.M. (1938). Introduction to the Theory of Numbers. Oxford: Clarendon Press.

Milliken, G.A. and Johnson, Dallas E. (1984). Analysis of Messy Data, Vol. 1, Designed Experiments. Belmont, California: Lifetime Learning Publications.

Williams, R.M. (1952). Experimental designs for serially correlated observations. Biometrika 39, 151-167.

APPENDIX A. THE BEST THEOREM.

We here closely follow the wording of Dawson & Good (1957, Section 2).

Suppose we have an oriented linear graph H containing u vertices (nodes) and such that the number of oriented edges from vertex i to vertex j is $m_{ij}$. A u x u matrix $M = (m_{ij})$ is called an incidence matrix of H if the rows have the same ordering as the columns. This matrix is unique to within the same rearrangement of rows as of columns. The graph is appropriate if, for each vertex, the number of edges leading in is equal to the number leading out. In terms of M this means that each row has the same sum as the corresponding column. A (complete) circuit in such a graph is defined as a "unicursal" path passing exactly once through each edge in the right direction. Unicursal paths were first studied by Euler in 1736 for non-oriented graphs (Rouse Ball & Coxeter, 1960, pp. 242-254).

The BEST theorem gives the number of distinct (complete) circuits for an appropriate graph H when all edges are regarded as distinguishable. Let

$$m_{i\cdot} = \sum_j m_{ij} = \sum_j m_{ji},$$

which is the sum of the ith row (and of the ith column). Let $M' = (m'_{ij})$ be the u' x u' matrix formed from M by deleting every row and column consisting entirely of zeros (that is, eliminating isolated vertices). Then $\sum_j m'_{ij} = \sum_j m'_{ji} = m'_{i\cdot}$, say, where $m'_{i\cdot} > 0$ (i = 1, 2, ..., u'). Let $m^*_{ij} = -m'_{ij}$ when $i \ne j$, let $m^*_{ii} = m'_{i\cdot} - m'_{ii}$, and let $M^* = (m^*_{ij})$, which is also $(m'_{i\cdot}\,\delta_{ij}) - M'$. Then $M^*$ is a square matrix with each row and column summing to zero, from which it can be shown that all its cofactors are equal; let $\|M^*\|$ denote the common value of these cofactors. Then the BEST theorem asserts that the number of circuits in H is

$$\|M^*\| \prod_{i=1}^{u'} (m'_{i\cdot} - 1)!$$

In our application it can be deduced that the number of "BEST" circuits, at the beginning of the algorithm, is $N_c(n)$ where

$$N_c(n) = c^{\,n-1}\, n^{\,n-2}\, [(cn-c-1)!]^{\,n} \qquad (3)$$

as mentioned by Good (1987). Formula (1) for $D_{c,1}(n)$ is deducible from (3) by means of a combinatorial argument (Good, 1990).

Table 2. THE POSSIBLE DESIGNS WHEN c=2, n=3.

The periodic designs are marked with an asterisk; they are numbers 26, 38, and 42.

No.   DESIGN          FREQUENCY
 1    121213132323    23
 2    121213231323    26
 3    121213232313    20
 4    121231313232    23
 5    121231321323    18
 6    121231323132    16
 7    121231323213    26
 8    121232131323    16
 9    121232132313    17
10    121232313132    15
11    121232313213    18
12    121232321313    16
13    121312132323    20
14    121312313232    17
15    121312321323    20
16    121312323132    17
17    121312323213    30
18    121313123232    19
19    121313212323    18
20    121313231232    23
21    121313232123    21
22    121321231323    15
23    121321232313    29
24    121321312323    30
25    121321323123    17
26    121323121323*    5
27    121323123132    17
28    121323123213    18
29    121323131232    16
30    121323132123    23
31    121323212313    11
32    121323213123    25
33    123123132132    21
34    123123213132    24
35    123123213213    20
36    123131232132    14
37    123131321232    16
38    123132123132*   13
39    123132123213    25
40    123132131232    16
41    123212321313    28
42    123213123213*    8
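The Pearson statistic quoted in the text (42.06 with 38 degrees of freedom) can be recomputed from the Table 2 frequencies. The short Python sketch below is an illustration only, not part of the original program. It assumes the 42 frequencies are read in the order of the design numbers and that designs 26, 38 and 42 are the periodic ones, as the table note states.

```python
# Sample frequencies of the 42 designs, in the order of Table 2.
frequencies = [23, 26, 20, 23, 18, 16, 26, 16, 17, 15, 18, 16, 20, 17,
               20, 17, 30, 19, 18, 23, 21, 15, 29, 30, 17, 5, 17, 18,
               16, 23, 11, 25, 21, 24, 20, 14, 16, 13, 25, 16, 28, 8]
periodic = {26, 38, 42}  # design numbers marked with an asterisk


def pearson_chi_squared(observed):
    """Pearson statistic against equal expected counts (sample total / number of cells)."""
    expected = sum(observed) / len(observed)
    return sum((o - expected) ** 2 / expected for o in observed)


nonperiodic = [f for number, f in enumerate(frequencies, start=1)
               if number not in periodic]
statistic = pearson_chi_squared(nonperiodic)

# 39 nonperiodic designs, 784 draws, statistic on 39 - 1 = 38 degrees of freedom
print(len(nonperiodic), sum(nonperiodic), round(statistic, 2))  # prints: 39 784 42.06
```

The expected count used here is 784/39 for each nonperiodic design, which corresponds to the test reported in the text rather than the "nearly equivalent" one based on theoretical expectations of 20.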
APPENDIX B. THE PROGRAM

The program to compute random serial unblocked designs is written as a SAS macro. The macro uses the IML procedure to do the necessary computations. There are four parameters in the call to the macro: NTRT is the number of treatments; NREP is the number of replications of pairs of treatments; NDES is the number of designs to be generated; and L2 = NREP x NTRT x (NTRT - 1). The designs in Table 2, for the case c=2, n=3, were generated by the program.

TITLE 'SERIAL UNBLOCKED DESIGN GENERATOR';
%MACRO SERIAL(NTRT,NREP,NDES,L2);
                                        /* <== NTRT IS THE NUMBER OF TREATMENTS */
                                        /* <== NREP IS THE NUMBER OF REP PAIRS  */
                                        /* <== NDES IS THE NUMBER OF DESIGNS    */
                                        /* <== L2 IS NTRT*NREP*(NTRT-1)         */
%LET LEN = &NREP*&NTRT*(&NTRT-1);       /* LENGTH OF DESIGN                     */
PROC IML;
N = &NTRT;                              /* <== SET NUMBER OF TREATMENTS         */
C = &NREP;                              /* <== SET NUMBER OF PAIR REPS          */
MAX = C*(N-1) + 1;
FACTORS = J(MAX,1);                     /* ZERO FACTORIAL                       */
START FAC;
  DO I = 2 TO MAX;
    FACTORS[I,] = I*FACTORS[I-1,];
  END;
FINISH; RUN FAC;                        /* COMPUTE NECESSARY FACTORIALS         */
L = &LEN;                               /* LENGTH OF DESIGN                     */
DESGN = J(&NDES,L,0);
START GENER;
DO X = 1 TO &NDES;
  INITIAL = C # J(N,N,1) - C # I(N);    /* INCIDENCE MATRIX                     */
  IDENT = I(N+1);
  NULL = J(1,N+1,0);                    /* TOP ROW OF COUNTING MATRIX           */
  COL1 = J(N,1,0);                      /* FIRST COL OF COUNTING MATRIX         */
  NULLVEC = J(L,1,0);
  NBASIC = J(N+1,1,0);                  /* STORAGE FOR DESIGN COUNTS            */
  RANDOM = UNIFORM(NULLVEC);
  DESGN[X,1] = CEIL(N # RANDOM[1,]);    /* FIRST TRT OF DESIGN                  */
  COL1[DESGN[X,1],] = 1;
  BASIC = NULL // (COL1 || INITIAL);    /* COUNTING MATRIX                      */
  PLACE = J(N+1,1,0);                   /* GIVES TREATMENT NUMBER               */
  ONE = J(N+1,1,1);
  ACCUM = J(N+1,N+1,0);                 /* ACCUMULATED SUM MATRIX               */
  ACCUM[,1] = ONE;
  DO W = 2 TO (N+1);                    /* BUILD A MATRIX WHOSE LOWER TRIANGLE  */
    PLACE[W,] = W-1;                    /* IS ALL ONE; IS USED TO GET CUML SUMS */
    ONE[W-1,] = 0;
    ACCUM[,W] = ONE;
  END;
  * OBTAIN REMAINING TREATMENTS IN RANDOM ORDER ;
  DO Z = 2 TO L;
    LAST = DESGN[X,Z-1] + 1;            /* ROW OF COUNTING MATRIX               */
    NPATH = BASIC[LAST,];               /* # PATHS AT EACH NODE                 */
    * COUNT NUMBER OF POSSIBLE DESIGNS AT EACH NODE ;
    DO Y = 2 TO (N+1);
      IF(Y = LAST) THEN NBASIC[Y,] = 0;
      IF(Y ^= LAST) THEN DO;            /* TRANSFORM AS PER BEST THRM           */
        TBASIC = BASIC;                 /* COPY OF COUNTING MATRIX              */
        TBASIC[1,Y] = 1;                /* PUT 1 IN TOP ROW                     */
        TBASIC[LAST,Y] = TBASIC[LAST,Y] - 1;
        CSUM = TBASIC[+,];              /* COLUMN TOTALS                        */
        LOCATE = LOC(CSUM > 0);         /* LOCATION OF NON-ZERO COLS            */
        REDUCE = IDENT[,LOCATE];        /* GET SUBSET OF IDENTITY               */
        BPRIME = REDUCE` * TBASIC * REDUCE;  /* ELIMINATE NULL ROWS, COLS       */
        CPRIME = CSUM * REDUCE;         /* ELIMINATE ZERO COL TOTALS            */
        NPRIME = NCOL(BPRIME);          /* NUMBER OF NON-ZERO COLS              */
        BSTAR = DIAG(CPRIME) - BPRIME;
        CFACTOR = FACTORS[CPRIME+1,];   /* GET NEEDED FACTORIALS                */
        NB = ABS(DET(BSTAR[2:NPRIME,2:NPRIME])) # CFACTOR[#,];  /* COUNT THE DESIGNS */
        NBASIC[Y,] = NB # NPATH[,Y];
      END;
    END;
    * CHOOSE NEXT STEP AT RANDOM BASED ON ABOVE COUNTS ;
    NSUM = (ACCUM * NBASIC) / NBASIC[+,];
    POSIT = (RANDOM[Z,] <= NSUM) # (NBASIC > 0) # PLACE;
    LOCATE = LOC(POSIT > 0);
    REDUCE = IDENT[,LOCATE];
    NPOSIT = REDUCE` * POSIT;
    NEW = NPOSIT[><,];
    BASIC[LAST,(NEW + 1)] = BASIC[LAST,(NEW+1)] - 1;
    DESGN[X,Z] = NEW;
  END;
END;
FINISH;
RUN GENER;
* CREATE DATA SET CONTAINING DESIGN AND OUTPUT;
CREATE RESULT FROM DESGN;
APPEND FROM DESGN;
* FORMAT OUTPUT IN DATA SET AND PRINT OUT DESIGN;
DATA _NULL_;
  SET RESULT;
  ARRAY W COL1-COL&L2;
  FILE PRINT;
  I = 0;
  DO OVER W;
    I = I + 1;
    PUT @I W @;
  END;
  I = I + 1;
  PUT @I ' ';
%MEND SERIAL;
%SERIAL(3,2,50,12)