EXACT DISTRIBUTION FOR 2-SAMPLE PERMUTATION AND RANK TESTS USING THE STREITBERG-RÖHMEL ALGORITHM AND SAS/IML SOFTWARE

Heide Voß, Boehringer Ingelheim KG

INTRODUCTION

Well-known 2-sample permutation and rank tests are:
a) for two dependent samples: Fisher's Permutation Test, Sign Test, Wilcoxon Signed Rank Test, Pratt's Test;
b) for two independent samples: Fisher-Pitman Test, Wilcoxon Rank Sum Test.

In textbooks on nonparametric statistics, exact distributions of the respective test statistics are tabulated only for small sample sizes, and only provided no ties occurred in the observations. Otherwise one has to rely on approximations and asymptotic procedures.

The Streitberg-Röhmel Algorithm [1,2] makes it possible to compute the exact distributions of the test statistics of the above-mentioned tests (and many more) for any sample size (limited only by computer resources) and guarantees an exact handling of ties. This algorithm is best represented using sums of appropriate matrices. While Streitberg and Röhmel used APL, here it was implemented in SAS/IML. The idea behind the algorithm is explained below, and the basic programming tools are given in the appendix.

THE ALGORITHM

The exact distribution of a given test statistic always depends on the underlying data. Therefore I will explain the algorithm using examples.

I) Comparison of two DEPENDENT samples

A typical example is the comparison of a new treatment with a standard. Having applied each of them to n subjects (e.g. in a cross-over trial), we obtain 2n observations: $x_i$ under the standard and $y_i$ under the new treatment, with differences $d_i = x_i - y_i$. For example, with n = 8 subjects the differences are:

subj. i              1    2    3    4    5    6    7    8
d_i = x_i - y_i      2    3   -2    6   -1   -2    5    4

D+obs = 20 (sum of the positive differences)
D-obs = 5 (sum of the absolute values of the negative differences)

model:

$$D_i = X_i - Y_i = \mu + \varepsilon_i$$

$\mu$ = constant treatment effect
$\varepsilon_i$ = independent random variables, $E(\varepsilon_i) = 0$, symmetric distribution in case of rank tests

$H_0$: $\mu = 0$ (no treatment effect)

Different nonparametric methods are offered to test this hypothesis, e.g.:
> Wilcoxon matched-pairs signed-rank test
> Fisher's permutation test
> sign test

The algorithm will be explained for Fisher's permutation test. For the other tests only a modification of it is needed.

D+, the sum of positive differences, is chosen as test statistic. [It could as well be D-; the IML modules in the appendix use D-.] It follows from the hypothesis of no treatment effect ($H_0$) that each observed pair $(x_i, y_i)$ could equally well have occurred interchanged, so each difference $d_i$ could have been positive or negative with the same probability 1/2.

=> D+ is distributed as the weighted sum of n independent Bernoulli variates $B_i$ with $P\{B_i = 1\} = P\{B_i = 0\} = 1/2$, i.e.

$$(*)\qquad D^+ = \sum_{i=1}^{n} |d_i| \, B_i$$

Knowing the conditional distribution of D+ (under $H_0$), given the observed $|d_i|$, the probabilities

$$p_{ge} = P_0(D^+ \ge D^+_{obs}), \qquad p_{le} = P_0(D^+ \le D^+_{obs})$$

can be determined. If one of these probabilities lies below the given significance level, $H_0$ is rejected. (That means: the probability of finding a sum of positive differences above or equal to (resp. below or equal to) the observed D+obs is so small under $H_0$ that the hypothesis of no treatment effect is to be rejected.)

I will show how we can easily get the exact distribution of D+ for the above example. It is seen from (*) that

$$0 \le D^+ \le \sum_{i=1}^{n} |d_i| \quad (= 25 \text{ in the example}).$$

Asking for the exact distribution of D+ on this range means:
> Form all $2^n$ possible subsets of $D_n = \{|d_1|, |d_2|, \ldots, |d_n|\}$.
> Add up all elements in each resulting subset.
> Count the number of occurrences of each sum.
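For a data set as small as the example, the three enumeration steps can be carried out literally. The following is a minimal Python sketch, not the SAS/IML code of the appendix; the function name `enumerate_sums` and all variable names are chosen here for illustration only. It forms every subset of the $|d_i|$, adds up its elements and counts how often each sum occurs.

```python
from collections import Counter
from itertools import combinations


def enumerate_sums(abs_diffs):
    """Brute force: count how many subsets of abs_diffs give each possible sum."""
    counts = Counter()
    n = len(abs_diffs)
    for size in range(n + 1):
        # combinations over index positions, so tied values stay distinct elements
        for subset in combinations(range(n), size):
            counts[sum(abs_diffs[i] for i in subset)] += 1
    return counts


# absolute differences |d_i| of the dependent-samples example
abs_diffs = [2, 3, 2, 6, 1, 2, 5, 4]
counts = enumerate_sums(abs_diffs)

total = 2 ** len(abs_diffs)          # 256 equally likely sign patterns under H0
d_plus_obs = 20
n_ge = sum(c for s, c in counts.items() if s >= d_plus_obs)
p_ge = n_ge / total                  # P0(D+ >= D+obs) = 22/256

print(n_ge, total, round(p_ge, 5))
```

With eight differences only 256 subsets are visited, but the number of subsets doubles with every additional subject, which is why a direct enumeration is not practical for larger n.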
As carrying out this enumeration would be too laborious, we use the algorithm of Streitberg and Röhmel. This algorithm is based on a recursion: a recursion on the sequence of distributions obtained while increasing n.

In the above example $D_n = \{2, 3, 2, 6, 1, 2, 5, 4\}$. If there had been only the first two subjects in the trial, we would have had $D_2 = \{2, 3\}$ and $0 \le D^+ \le 5$. The distribution would be:

possible D+ range:      0   1   2   3   4   5
no. of occurrences:     1   0   1   1   0   1

(the sum 0 belongs to the empty set)

Enlarging $D_2$ by $|d_3| = 2$ we get $D_3 = \{2, 3, 2\}$ and now $0 \le D^+ \le 7$. Gathering all subsets of $D_3$, we get
I: all subsets of $D_2$
II: all unions $\{2\} \cup A$, A being a subset of $D_2$ (i.e. every subset of $D_2$ enlarged by the element $|d_3| = 2$, so the sums grow by 2).

That leads to the following distribution of D+:

possible D+ range:      0   1   2   3   4   5   6   7
no. of occurrences
  referring to I:       1   0   1   1   0   1   0   0
  referring to II:      0   0   1   0   1   1   0   1
  sum:                  1   0   2   1   1   2   0   1

Repeating this procedure until $D_n$ is complete yields the wanted distribution of the $2^n$ possible sums.

USING SAS-IML

Take the matrix $M_0 = \{1\}$ and concatenate the zero matrix $Z_1 = \{0 \ldots 0\}$ (consisting of $|d_1|$ zeros) to it once on each side: on the right, $M_0 \,||\, Z_1$ (the sums of case I remain unchanged), and on the left, $Z_1 \,||\, M_0$ (the sums of case II are shifted by $|d_1|$). Then add the results, i.e.

$$M_1 = M_0 \,||\, Z_1 + Z_1 \,||\, M_0$$

Repeat the procedure for $M_1$ and $Z_2 = \{0 \ldots 0\}$ (consisting of $|d_2|$ zeros) to get $M_2$, and so on:

$$M_2 = M_1 \,||\, Z_2 + Z_2 \,||\, M_1$$
$$\vdots$$
$$M_n = M_{n-1} \,||\, Z_n + Z_n \,||\, M_{n-1}$$

$M_n$ is a vector having $1 + \sum_{i=1}^{n} |d_i|$ elements (depending on the given example), containing the wanted exact distribution of $D^+ = \sum_{i=1}^{n} |d_i| B_i$. Divide every vector element by $2^n$ to get the probability distribution.

For the above example the distribution and the probabilities which are important for the test are given below.

Fisher's Randomization Test: new <-> standard

The number of possible sums is 256. They lie in the interval [0, 25].

interval of possible sums     number of possible sums     probability
0 ... 4    (below D-obs)               14                  pl  = .05469
5          (= D-obs)                    8                  pe  = .03125
6 ... 19                              212
20         (= D+obs)                    8                  pe  = .03125
21 ... 25  (above D+obs)               14                  pg  = .05469

pe       = prob. of finding a sum of 5 (D-) = prob. of finding a sum of 20 (D+) = 8 / 256 = .03125
pl, pg   = prob. of finding a sum less than 5 (D-), resp. greater than 20 (D+) = 14 / 256 = .05469
ple, pge = prob. of finding a sum less or equal 5 (D-), resp. greater or equal 20 (D+) = 22 / 256 = .08594

This is the exact distribution for the Fisher permutation test. If the $|d_i|$ are replaced by RANKS $r_i$, we get the distribution for a rank test. It is at the user's decision which ranks should be used. Using midranks, the procedure results in the Wilcoxon matched-pairs signed-rank test. In case of ties one might get non-integer ranks; to use the above algorithm they must be multiplied by 2. Choosing $|d_i| = 1$ for all i, we get the exact distribution for the sign test.

II) Comparison of two INDEPENDENT samples

Consider again a comparison of a new treatment with a standard method. Having applied the new treatment to a group of m subjects and the standard to a second group of n subjects, we obtain N = m + n observations. For example:

new treatment:   subject    1   2   3   4   5   6
                 x_i        3   4   2   3   1   3        Sm obs = 16

standard:        subject    7   8   9  10
                 x_i        1   3   1   2                Sn obs = 7

model:

$$X_i = \theta + \mu + \varepsilon_i, \quad i = 1, \ldots, m \qquad\qquad X_j = \mu + \varepsilon_j, \quad j = 1, \ldots, n$$

$\theta$ = constant treatment effect
$\varepsilon_i$ = i.i.d. random variables, $E(\varepsilon_i) = 0$, symmetric distribution in case of rank tests

$H_0$: no treatment effect

Under $H_0$ (no treatment effect) all N = m + n values are exchangeable. Choosing Sm = sum of all m new-treatment observations as the test statistic, it is distributed as the weighted sum

$$\sum_{i=1}^{N} x_i B_i \quad \text{under the side condition} \quad \sum_{i=1}^{N} B_i = m,$$

where the $B_i$ are Bernoulli variates, i.e. $P\{B_i = 1\} = P\{B_i = 0\} = 1/2$.

Compared to the dependent-samples case, this means taking only the subsets with m elements out of the set of possible values, which is $\{x_1, \ldots, x_N\}$. So if we consider the joint distribution of

$$S = \sum_{i=1}^{N} x_i B_i \quad \text{and} \quad M = \sum_{i=1}^{N} B_i$$

(i.e. the joint distribution of the one-sample permutation test statistic and the sign test statistic), we can apply the same recursion formula, but expanded to two dimensions.

The result will be a matrix having $1 + \sum_{i=1}^{N} x_i$ columns (all possible values of S) and N + 1 rows (all possible values of M). The wanted distribution of all $\binom{N}{m}$ possible sums is found in row m + 1.

Here the weights assigned to the Bernoulli variates are identical to the observed values, according to the Fisher-Pitman test (this is only possible if the $x_i$ are integers). For a rank test the $x_i$ are to be replaced by ranks $r_i$ (or by double ranks in case of non-integer midranks).

USING SAS-IML

After starting with $M_0 = [1]$, we get $M_{i+1}$ out of $M_i$ in the following way, where $Z_{i+1} = (0 \ldots 0)$ is a row of $x_{i+1}$ zeros:

$$M_{i+1} = \begin{pmatrix} M_i & 0 \\ 0 & Z_{i+1} \end{pmatrix} + \begin{pmatrix} Z_{i+1} & 0 \\ 0 & M_i \end{pmatrix}$$

(This is very easy using the IML BLOCK function.)

At the end of the loop the matrix $M_{m+n}$ results. The distribution in question is found in row m + 1. For the above example (m = 6, so row 7) the matrix is:

column no.  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22  23

row  1      1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
row  2      0   3   2   4   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
row  3      0   0   3   6  13  11   8   4   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
row  4      0   0   0   1   6  15  27  28  25  12   6   0   0   0   0   0   0   0   0   0   0   0   0   0
row  5      0   0   0   0   0   2   7  25  36  51  42  30  13   4   0   0   0   0   0   0   0   0   0   0
row  6      0   0   0   0   0   0   0   1   8  20  43  54  54  43  20   8   1   0   0   0   0   0   0   0
row  7      0   0   0   0   0   0   0   0   0   0   4  13  30  42  51  36  25   7   2   0   0   0   0   0
row  8      0   0   0   0   0   0   0   0   0   0   0   0   0   6  12  25  28  27  15   6   1   0   0   0
row  9      0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   4   8  11  13   6   3   0   0
row 10      0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   4   2   3   0
row 11      0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1

After division by $\binom{10}{6} = 210$, we get the probability distribution.
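Both recursions described above can be sketched compactly in Python. This is an illustrative sketch under the assumption of non-negative integer weights, not a translation of the appendix modules; the function names `sum_counts` and `count_sum_matrix` are hypothetical. The first function mirrors the concatenation step M || Z + Z || M for dependent samples, the second the two BLOCK terms for independent samples.

```python
def sum_counts(weights):
    """1-D recursion: h[s] = number of subsets of the weights whose sum is s."""
    h = [1]                                  # M0: only the empty subset, sum 0
    for w in weights:
        keep = h + [0] * w                   # M || Z: element left out, sums unchanged
        take = [0] * w + h                   # Z || M: element included, sums shifted by w
        h = [a + b for a, b in zip(keep, take)]
    return h


def count_sum_matrix(values):
    """2-D recursion: m[k][s] = number of k-element subsets whose sum is s."""
    m = [[1]]                                # M0 = [1]
    for x in values:
        rows, cols = len(m), len(m[0])
        new = [[0] * (cols + x) for _ in range(rows + 1)]
        for k in range(rows):
            for s in range(cols):
                new[k][s] += m[k][s]          # x left out: same row, same column
                new[k + 1][s + x] += m[k][s]  # x included: one row down, x columns right
        m = new
    return m


# dependent samples: |d_i| of the example, observed D- = 5
h = sum_counts([2, 3, 2, 6, 1, 2, 5, 4])
print(h[5], sum(h[:5]), sum(h))              # 8 14 256

# independent samples: 6 new-treatment values followed by 4 standard values
values = [3, 4, 2, 3, 1, 3, 1, 3, 1, 2]
m = count_sum_matrix(values)
row = m[6]                                   # "row m+1" in the text (0-based index 6)
print(sum(row), row[16], sum(row[17:]))      # 210 25 9
```

The sketch keeps the complete matrix for clarity. Since only one row is needed in the end, rows and columns that can no longer contribute to it may be dropped during the loop to save storage.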
DISTRIBUTION

Given the data in the example, there are 210 subsets containing 6 elements. Compared to the sum 16 of "new treatment" there are the following:

no. of subsets     with sum             relative frequency
      9            greater                    0.0429
     34            greater or equal           0.1619
    176            less                       0.8381
    201            less or equal              0.9571

APPENDIX

proc iml worksize=200;
reset storage = "sasuser.exact_np";

* rm contains midranks for x (doubled if any midrank is non-integer) *;
START MIDRANKS(x,rm);
  rm = ranktie(x);
  if any(rm-int(rm)) then rm=rm#2;
FINISH;
store module=midranks;

* rn contains natural ranks for x *;
START NATRANK(x,ncx,rn);
  xunique = unique(x);
  ncxu = ncol(xunique);
  rn = 1:ncx;
  do col=1 to ncxu;
    rn[,loc(x=xunique[,col])] = col;
  end;
FINISH;
store module=natrank;

* r contains wanted scores for x *;
START SCORES(sctype,x,ncx,r);
  if sctype = 0.5 then do;
    load module=midranks;
    run midranks(x,r);
  end;
  if sctype = 1 then do;
    load module=natrank;
    run natrank(x,ncx,r);
  end;
  if sctype = 2 then r=repeat(1,1,ncx);
FINISH;
store module=scores;

* negrsum = sum of ranks for neg. differences *;
START NEGSUM(diff,r,negrsum);
  rneg = r[,loc(diff<0)];
  negrsum = rneg[,+];
  free rneg;
FINISH;
store module=negsum;

* algorithm for two dependent samples *;
START ALGOR1(r,negrsum,possible,ne,nle,ng);
  ncr = ncol(r);
  h = 1;
  do rn=1 to ncr;
    hn = j(1,r[,rn],0);
    h1 = h || hn;
    h2 = hn || h;
    h = h1+h2;
  end;
  possible = 2 ## ncr;
  ne = h[,negrsum+1];
  nle = sum(h[,1:negrsum+1]);
  ng = sum(h[,negrsum+2:1+r[,+]]);
FINISH;
store module=algor1;

* bin = n! / (k! (n-k)!) *;
START BINOMIAL(n,k,bin);
  kdiff = n-k;
  Bmin = min(k,kdiff);
  Bmax = max(k,kdiff);
  Bin1 = 1;
  Bin2 = 1;
  do m = (Bmax+1) to n;
    Bin1 = m # Bin1;
  end;
  do m = 1 to Bmin;
    Bin2 = m # Bin2;
  end;
  Bin = Bin1/Bin2;
FINISH;
store module=binomial;

* rsortpos = pos / gcd(pos) *;
START GCD(pos,rsortpos);
  rsortpos = pos;
  do t = 2 to pos[,1];
    rhelp = 0;
    do k = 1 to ncol(pos);
      rhelp = rhelp + mod(pos[,k],t);
    end;
    if rhelp = 0 then rsortpos = pos / t;
  end;
FINISH;
store module=GCD;

* algorithm for two independent samples *;
START ALGOR2(r,ncx1,ncx2,r1sum,r2sum,ne,nl,ng,indexmin);
  rsort = r;
  rsort[,rank(r)] = r;
  sub = rsort[,1];
  rsort = rsort - sub;
  r1sumred = r1sum - ncx1#sub;
  r2sumred = r2sum - ncx2#sub;
  nr0 = sum((rsort=0));
  nr0plus1 = nr0+1;
  ncr = ncx1+ncx2;
  if rsort[,nr0plus1] >= 2 then do;
    pos = rsort[,nr0plus1:ncr];
    load module=GCD;
    run GCD(pos,rsortpos);
    div = rsort[,nr0plus1] / rsortpos[,1];
    r1sumred = r1sumred / div;
    r2sumred = r2sumred / div;
    rsort = repeat(0,1,nr0) || rsortpos;
  end;
  RS = r1sumred || r2sumred;
  NC = ncx1 || ncx2;
  indexmin = NC[,>:<];
  ncmin = NC[,indexmin];
  relevrow = ncmin+1;
  lcindex = ncr-ncmin+1;
  lastcol = 1 + sum(rsort[,lcindex:ncr]);
  h = j(nr0plus1,1,0);
  load module=binomial;
  do k = 0 to nr0;
    run binomial(nr0,k,bin0);
    h[k+1,1] = bin0;
  end;
  do rn = nr0plus1 to ncmin;
    hn = j(1,rsort[,rn],0);
    h1 = block(h,hn);
    h2 = block(hn,h);
    h = h1+h2;
  end;
  do rn = max(relevrow,nr0plus1) to ncr;
    hn = j(1,rsort[,rn],0);
    h1 = block(h,hn);
    h2 = block(hn,h);
    h = h1+h2;
    h = h[1:relevrow,];
    if ncol(h) > lastcol then h = h[,1:lastcol];
    if ncmin+rn-ncr > 1 then do;
      h = h[2:relevrow,];
      relevrow = relevrow-1;
    end;
  end;
  h = h[relevrow,];
  rs = RS[,indexmin];
  ne = h[,1+rs];
  if rs = 0 then nl = 0;
  else nl = sum(h[,1:rs]);
  ng = h[,+] - ne - nl;
FINISH;
store module=ALGOR2;

quit;

REFERENCES

[1] Streitberg, B., Röhmel, J. Exact Distributions for Permutation and Rank Tests: An Introduction to some recently published Algorithms. Statistical Software Newsletter, Vol 12 (1), 1986, 10-17.

[2] Streitberg, B., Röhmel, J. Exakte Verteilungen für Rang- und Randomisierungstests im allgemeinen c-Stichprobenproblem. EDV in Medizin und Biologie 18(1), 1987, 12-19.