EXACT DISTRIBUTION FOR 2-SAMPLE PERMUTATION AND RANK TESTS
USING THE STREITBERG-RÖHMEL ALGORITHM AND SAS/IML SOFTWARE
Heide Voß, Boehringer Ingelheim KG
INTRODUCTION
Well-known two-sample permutation and rank tests are:
a) for two dependent samples: Fisher's permutation test, the sign test,
the Wilcoxon signed-rank test, Pratt's test;
b) for two independent samples: the Fisher-Pitman test, the Wilcoxon rank-sum test.
In textbooks on nonparametric statistics, exact distributions of the
respective test statistics are tabulated only for small sample sizes, and only
if no ties occurred in the observations. Otherwise one has to rely on
approximations and asymptotic procedures.
The Streitberg-Röhmel algorithm [1,2] makes it possible to compute
the exact distributions of the test statistics of the above-mentioned tests
(and many more) for any sample size (limited only by computer resources) and
guarantees an exact handling of ties.
This algorithm is best represented using sums of appropriate matrices.
While Streitberg and Röhmel used APL, here it was implemented in SAS/IML.
The idea behind the algorithm is explained below, and the basic programming
tools are given in the appendix.
THE ALGORITHM
The exact distribution of a given test statistic always depends on the
underlying data. Therefore I will explain the algorithm using
examples.
I) Comparison of two DEPENDENT samples
A typical example is the comparison of a new treatment with a standard.
Having applied each of them to n subjects (e.g. in a cross-over trial),
we obtain 2n observations. For example:

subj.   standard   new     difference
  i       x_i      y_i     d_i = x_i - y_i
--------------------------------------------
  1        8        6          2
  2        7        4          3
  3        5        7         -2
  4       10        4          6
  5        6        7         -1
  6        5        7         -2
  7       11        6          5
  8        9        5          4
--------------------------------------------
sum of the positive differences:                   D+obs = 20
sum of the negative differences (absolute value):  D-obs = 5

model:  $D_i = X_i - Y_i = \mu + \varepsilon_i$
        $\mu$ = constant treatment effect
        $\varepsilon_i$ = independent random variables, $E(\varepsilon_i) = 0$,
        symmetric distribution in case of rank tests
H0:     $\mu = 0$ (no treatment effect)

Different nonparametric methods are available to test this hypothesis, e.g.:
   the Wilcoxon matched-pairs signed-rank test,
   Fisher's permutation test,
   the sign test.
The algorithm will be explained for Fisher's permutation test.
The other tests need only a modification of it.
D+, the sum of the positive differences, is chosen as the test statistic.
[It could as well be D-; the IML modules in the appendix use D-.]
It follows from the hypothesis of no treatment effect (H0) that each
observed pair (x_i, y_i) could equally well have occurred interchanged, so each
difference d_i could have been positive or negative with the same probability of 1/2.
=> D+ is distributed as the weighted sum of n independent
Bernoulli variates, i.e.

(*)   $D^+ = \sum_{i=1}^{n} |d_i| B_i$,   $P\{B_i = 1\} = P\{B_i = 0\} = 1/2$

Knowing the conditional distribution of D+ (under H0), which depends on the
observed |d_i|, the probabilities

   $P_{ge} = P_0(D^+ \ge D^+_{obs})$
   $P_{le} = P_0(D^+ \le D^+_{obs})$

can be determined.
If one of these probabilities lies below the given significance level,
H0 is rejected. (That means: the probability of finding
a sum of positive differences greater than or equal to (resp. less than or equal to) the
observed D+obs is so small under H0 that the hypothesis of no treatment
effect is to be rejected.)
I will show how we can easily get the exact distribution of D+ for the above
example.
It is seen from (*) that

   $0 \le D^+ \le \sum_{i=1}^{n} |d_i|$.

Asking for the exact distribution of D+ on this range means:
> Form all $2^n$ possible subsets of $D_n = \{|d_1|, |d_2|, \ldots, |d_n|\}$.
> Add up all elements in each resulting subset.
> Count the number of occurrences of each sum.
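
The three steps above can be written down directly for a small set. The following is a minimal sketch in Python (used here only for illustration, it is not part of the SAS/IML programs of this paper); the function name `brute_force_counts` and the three-element input are assumptions made for the example. It enumerates every subset, so it is only practical for small n.

```python
from itertools import combinations
from collections import Counter

def brute_force_counts(abs_diffs):
    """Count how often each subset sum occurs over all 2**n subsets."""
    counts = Counter()
    for size in range(len(abs_diffs) + 1):
        # combinations() works on positions, so tied values stay distinct elements
        for subset in combinations(abs_diffs, size):
            counts[sum(subset)] += 1
    # one entry per possible sum 0 .. sum(abs_diffs)
    return [counts[s] for s in range(sum(abs_diffs) + 1)]

print(brute_force_counts([2, 3, 2]))  # -> [1, 0, 2, 1, 1, 2, 0, 1]
```

The work doubles with every additional observation, which is the reason for looking for a recursion instead.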
As this would be too laborious, we use the algorithm of Streitberg and Röhmel.
This algorithm is based on a recursion over the sequence of
distributions obtained while raising n.
In the above example
$D_n = \{2, 3, 2, 6, 1, 2, 5, 4\}$.
If there had been only the first two subjects in the trial, we would
have had $D_2 = \{2, 3\}$ and $0 \le D^+ \le 5$. The distribution would be:

possible D+ range:     0   1   2   3   4   5
no. of occurrences:    1   0   1   1   0   1

(The sum 0 belongs to the empty set.)
Enlarging $D_2$ by $|d_3| = 2$ we get $D_3 = \{2, 3, 2\}$ and now $0 \le D^+ \le 7$.
Gathering all subsets of $D_3$, we get
I:  all subsets of $D_2$
II: all unions {2} ∪ A, A being a subset of $D_2$
    (i.e. every subset of $D_2$ enlarged by the element $|d_3| = 2$,
    so the sums grow by 2).
That leads to the following distribution of D+:

possible D+ range:     0   1   2   3   4   5   6   7
no. of occurrences
   referring to I:     1   0   1   1   0   1   0   0
   referring to II:    0   0   1   0   1   1   0   1
   sum:                1   0   2   1   1   2   0   1

Repeating this procedure until $D_n$ is complete yields the wanted distribution
of the $2^n$ possible sums.
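
Written as a formula, the step from one set to the next reads as follows. The symbol $c_k(s)$, the number of subsets of the first k absolute differences that add up to s, is notation introduced here for illustration only; terms with a negative argument are taken as zero. The two summands correspond to the groups I and II above.

```latex
c_0(s) = \begin{cases} 1 & s = 0 \\ 0 & \text{otherwise} \end{cases},
\qquad
c_k(s) = \underbrace{c_{k-1}(s)}_{\text{I}} + \underbrace{c_{k-1}(s - |d_k|)}_{\text{II}},
\qquad
P_0(D^+ = s) = \frac{c_n(s)}{2^n}
```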
USING SAS-IML
Take the matrix M0 = {1} and concatenate the zero matrix Z1 = {0 ... 0}
(a row of |d1| zeros) to it, once from the right and once from the left,
and add the results:

   M1 = (M0 || Z1) + (Z1 || M0)

Repeat the procedure for M1 and Z2 = {0 ... 0} (a row of |d2| zeros) to get M2,
and so on:

   M2 = (M1 || Z2) + (Z2 || M1)
   ...
   Mn = (Mn-1 || Zn) + (Zn || Mn-1)

Mn is a vector having $1 + \sum_{i=1}^{n} |d_i|$ elements containing the wanted
exact distribution of $D^+ = \sum_{i=1}^{n} |d_i| B_i$ (depending on the given example).
Divide every vector element by $2^n$ to get the probability distribution.
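
The concatenate-and-add step can be sketched in a few lines. The following Python fragment is an illustration only, not the IML module from the appendix; the name `exact_counts` is an assumption, and Python list concatenation stands in for the IML operator `||`. The input is the set of absolute differences of the example.

```python
def exact_counts(weights):
    """Shift-and-add recursion: M_k = (M_{k-1} || Z_k) + (Z_k || M_{k-1})."""
    m = [1]                    # M_0: only the empty set, with sum 0
    for w in weights:
        z = [0] * w            # Z_k: a row of w zeros
        right = m + z          # subsets without the new element
        left = z + m           # subsets with it: every sum shifted by w
        m = [a + b for a, b in zip(right, left)]
    return m

d_abs = [2, 3, 2, 6, 1, 2, 5, 4]
counts = exact_counts(d_abs)                     # counts[s] = no. of subsets with sum s
probs = [c / 2 ** len(d_abs) for c in counts]    # divide by 2**n
print(len(counts), sum(counts))                  # -> 26 256
```

Each step costs time proportional to the current vector length, so the total effort grows with n times the sum of the weights rather than with $2^n$.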
For the above example, the distribution and the probabilities that are relevant
for the test are given below.

Fisher's Randomization Test: new <-> standard
The number of possible sums is 256. They lie in the interval [0, 25].

interval of
possible sums:        0 ... 4  |  5 (D-)  |  6 ... 19  |  20 (D+)  |  21 ... 25
respective number
of possible sums:       14     |    8     |    212     |     8     |     14

pe  = prob. of finding a sum of 5 (D-)
    = prob. of finding a sum of 20 (D+)                           =  8 / 256 = .03125
pl  = prob. of finding a sum less than 5 (D-)
pg  = prob. of finding a sum greater than 20 (D+)         pl  = pg  = 14 / 256 = .05469
ple = prob. of finding a sum less or equal 5 (D-)
pge = prob. of finding a sum greater or equal 20 (D+)     ple = pge = 22 / 256 = .08594

This is the exact distribution for Fisher's permutation test.
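
The tail probabilities of the example can be recomputed from the set $D_n = \{2, 3, 2, 6, 1, 2, 5, 4\}$ and the observed value D+obs = 20. The following Python sketch is an illustration under these inputs; the function names `exact_counts` and `tail_probs` are assumptions and do not correspond to modules in the appendix.

```python
def exact_counts(weights):
    """counts[s] = number of subsets of `weights` whose elements add up to s."""
    m = [1]
    for w in weights:
        z = [0] * w
        m = [a + b for a, b in zip(m + z, z + m)]
    return m

def tail_probs(weights, observed):
    """Exact probabilities of a sum equal to, below and above `observed`."""
    counts = exact_counts(weights)
    total = 2 ** len(weights)
    pe = counts[observed] / total
    pl = sum(counts[:observed]) / total
    pg = sum(counts[observed + 1:]) / total
    return {"pe": pe, "pl": pl, "ple": pl + pe, "pg": pg, "pge": pg + pe}

result = tail_probs([2, 3, 2, 6, 1, 2, 5, 4], observed=20)
print(result["pe"], result["pg"], result["pge"])  # -> 0.03125 0.0546875 0.0859375
```

Calling `tail_probs` with `observed=5` gives the mirror-image values for D-, because the distribution is symmetric about half the total sum.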
If the |d_i| are replaced by ranks r_i, we get the distribution for a rank test.
Which ranks are used is the user's decision.
Using midranks, the procedure results in the Wilcoxon matched-pairs signed-rank
test. In case of ties one might get non-integer ranks. To use the above
algorithm they must be multiplied by 2.
Choosing |d_i| = 1 for all i, we get the exact distribution for the sign test.
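
The choice of weights can be sketched as follows. This Python fragment is an illustration only; the names `midranks`, `scores` and `exact_counts` and the keyword values `"fisher"`, `"midrank"` and `"sign"` are assumptions made for the example and differ from the numeric score types used by the SCORES module in the appendix.

```python
def midranks(values):
    """Ranks 1..n, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1   # average of positions i..j (1-based)
        i = j + 1
    return ranks

def scores(abs_diffs, kind):
    """Integer weights for the recursion, depending on the chosen test."""
    if kind == "sign":
        return [1] * len(abs_diffs)
    if kind == "midrank":
        r = midranks(abs_diffs)
        if any(x != int(x) for x in r):   # non-integer midranks: double all ranks
            r = [2 * x for x in r]
        return [int(x) for x in r]
    return list(abs_diffs)                # "fisher": the observed |d_i| themselves

def exact_counts(weights):
    m = [1]
    for w in weights:
        z = [0] * w
        m = [a + b for a, b in zip(m + z, z + m)]
    return m

print(scores([2, 3, 2, 6, 1, 2, 5, 4], "midrank"))  # -> [3, 5, 3, 8, 1, 3, 7, 6]
print(scores([1, 1, 2], "midrank"))                 # -> [3, 3, 6]
print(exact_counts(scores([5, 7, 9, 4], "sign")))   # -> [1, 4, 6, 4, 1]
```

When the ranks are doubled, the observed rank sum has to be doubled as well before it is looked up in the resulting vector.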
II) Comparison of two INDEPENDENT samples
Consider again a comparison of a new treatment with a standard method.
Having applied the new treatment to a group of m subjects and the standard
to a second group of n subjects, we obtain N=m+n observations. For example:

subject   new            subject   standard
---------------------------------------------
   1      x1 = 3            7      x7  = 1
   2      x2 = 4            8      x8  = 3
   3      x3 = 2            9      x9  = 1
   4      x4 = 3           10      x10 = 2
   5      x5 = 1
   6      x6 = 3
---------------------------------------------
          Sm obs = 16              Sn obs = 7

model:  new:       $X_i = \Delta + \mu + \varepsilon_i$,      $i = 1, \ldots, m$
        standard:  $X_{m+j} = \mu + \varepsilon_{m+j}$,  $j = 1, \ldots, n$
        $\Delta$ = constant treatment effect
        $\varepsilon$ = i.i.d. random variables, $E(\varepsilon) = 0$,
        symmetric distribution in case of rank tests
H0:     $\Delta = 0$ (no treatment effect)

Under H0 (no treatment effect) all N = m+n values are exchangeable.
If Sm, the sum of all m new-treatment observations, is chosen as the test statistic,
it is distributed as the weighted sum

   $S_m = \sum_{i=1}^{N} x_i B_i$

under the side condition

   $m = \sum_{i=1}^{N} B_i$,

where the $B_i$ are Bernoulli variates, i.e. $P\{B_i = 1\} = P\{B_i = 0\} = 1/2$.
Compared to the dependent-samples case, this means taking only subsets
with m elements out of the set of possible values, which is {x1, ..., xN}.
So if we consider the joint distribution of

   $S = \sum_{i=1}^{N} x_i B_i$   and   $M = \sum_{i=1}^{N} B_i$

(i.e. the joint distribution of the one-sample permutation test statistic and
the sign test statistic),
we can apply the same recursion formula, but expanded to two dimensions.
The result will be a matrix having
$1 + \sum_{i=1}^{N} x_i$ columns (all possible values of S)
and N+1 rows (all possible values of M).
The wanted distribution of all $\binom{N}{m}$ possible sums is found in row m+1.
Here the weights assigned to the Bernoulli variates are identical to the
observed values, according to the Fisher-Pitman test (this is only possible
if the x_i are integers).
For a rank test the x_i are to be replaced by ranks r_i (or by double ranks in
case of non-integer midranks).
USING SAS-IML
After starting with M0 = [1], we get Mi+1 out of Mi in the following way,
where Zi+1 = (0 0 ... 0) is a row of x_{i+1} zeros:

            | Mi     0    |     | Zi+1   0  |
   Mi+1  =  |             |  +  |           |
            | 0     Zi+1  |     | 0      Mi |

(This is very easy using the IML BLOCK function.)
At the end of the loop the matrix Mm+n results.
The distribution in question is found in row m+1.
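
The two-dimensional recursion can be sketched as follows. This Python fragment is an illustration only; the name `joint_counts` is an assumption, and nested lists stand in for the IML matrices. The input is the ten observations of the example (3, 4, 2, 3, 1, 3 for the new treatment and 1, 3, 1, 2 for the standard).

```python
def joint_counts(values):
    """table[k][s] = number of k-element subsets of `values` with sum s."""
    table = [[1]]                                  # M_0
    for x in values:
        rows, cols = len(table), len(table[0])
        new = [[0] * (cols + x) for _ in range(rows + 1)]
        for k in range(rows):
            for s in range(cols):
                new[k][s] += table[k][s]           # element left out (M in the upper left)
                new[k + 1][s + x] += table[k][s]   # element taken in (M in the lower right)
        table = new
    return table

values = [3, 4, 2, 3, 1, 3, 1, 3, 1, 2]
table = joint_counts(values)
m = 6
row = table[m]                    # row m+1 in 1-based IML numbering
print(len(table), len(table[0]))  # -> 11 24
print(row[10:19])                 # -> [4, 13, 30, 42, 51, 36, 25, 7, 2]
```

The index shift is the only difference from the one-dimensional case: taking an element moves the count one row down and x columns to the right.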
For the above example it is:

        column no.
 row     0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
  1      1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  2      0  3  2  4  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  3      0  0  3  6 13 11  8  4  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  4      0  0  0  1  6 15 27 28 25 12  6  0  0  0  0  0  0  0  0  0  0  0  0  0
  5      0  0  0  0  0  2  7 25 36 51 42 30 13  4  0  0  0  0  0  0  0  0  0  0
  6      0  0  0  0  0  0  0  1  8 20 43 54 54 43 20  8  1  0  0  0  0  0  0  0
  7      0  0  0  0  0  0  0  0  0  0  4 13 30 42 51 36 25  7  2  0  0  0  0  0
  8      0  0  0  0  0  0  0  0  0  0  0  0  0  6 12 25 28 27 15  6  1  0  0  0
  9      0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  4  8 11 13  6  3  0  0
 10      0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  4  2  3  0
 11      0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1

(Row 7 = row m+1 holds the distribution of the sums of all subsets with m = 6 elements.)

After division by $\binom{N}{m} = \binom{10}{6} = 210$, we get the probability distribution.

DISTRIBUTION
Given the data in the example, there are 210 subsets containing 6 elements.
Compared to the sum 16 of "new treatment" there are the following:

no. of subsets    with sum            relative frequency
       9          greater             0.0429
      34          greater or equal    0.1619
     176          less                0.8381
     201          less or equal       0.9571
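
These counts can be recomputed from the ten observations and the observed sum 16. The following Python sketch is an illustration only; it uses a space-saving variant of the recursion that keeps just the rows up to m, and the names `size_m_counts` and `tail_counts` are assumptions, not modules from the appendix.

```python
from math import comb

def size_m_counts(values, m):
    """counts[s] = number of m-element subsets of `values` with sum s."""
    total = sum(values)
    rows = [[0] * (total + 1) for _ in range(m + 1)]
    rows[0][0] = 1
    for x in values:
        # larger sizes first, so rows[k - 1] still excludes the current value
        for k in range(m, 0, -1):
            for s in range(x, total + 1):
                rows[k][s] += rows[k - 1][s - x]
    return rows[m]

def tail_counts(values, m, observed):
    counts = size_m_counts(values, m)
    greater = sum(counts[observed + 1:])
    less = sum(counts[:observed])
    return {"greater": greater, "greater_or_equal": greater + counts[observed],
            "less": less, "less_or_equal": less + counts[observed]}

values = [3, 4, 2, 3, 1, 3, 1, 3, 1, 2]
tails = tail_counts(values, m=6, observed=16)
n_subsets = comb(len(values), 6)
print(tails)      # -> {'greater': 9, 'greater_or_equal': 34, 'less': 176, 'less_or_equal': 201}
print(n_subsets)  # -> 210
```

Dividing each count by `n_subsets` gives the relative frequencies of the table.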
APPENDIX
proc iml worksize=200;
reset storage = "sasuser.exact_np";
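START MIDRANKS(x,rm);
/* rm contains midranks for x; if any midrank is not an integer, */
/* all ranks are doubled so that they can serve as weights       */
rm = ranktie(x);
if any(rm-int(rm)) then rm=rm#2;
FINISH;
store module=midranks;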
START NATRANK(x,ncx,rn);
/* rn contains natural ranks for x: tied values share the rank */
/* given by their position among the distinct sorted values    */
xunique = unique(x);
ncxu = ncol(xunique);
rn = 1:ncx;
do col=1 to ncxu;
   rn[,loc(x=xunique[,col])] = col;
end;
FINISH;
store module=natrank;
START SCORES(sctype,x,ncx,r);
/* r contains wanted scores for x */
if sctype = 0.5 then do;          /* midranks */
   load module=midranks;
   run midranks(x,r);
end;
if sctype = 1 then do;            /* natural ranks */
   load module=natrank;
   run natrank(x,ncx,r);
end;
if sctype = 2 then r=repeat(1,1,ncx);   /* all scores 1: sign test */
FINISH;
store module=scores;
START NEGSUM(diff,r,negrsum);
/* negrsum = sum of ranks for neg. differences */
idx = loc(diff<0);
if ncol(idx) = 0 then negrsum = 0;      /* no negative differences */
else do;
   rneg = r[,idx];
   negrsum = rneg[,+];
   free rneg;
end;
FINISH;
store module=negsum;
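/* algorithm for two dependent samples */
START ALGOR1(r,negrsum,possible,ne,nle,ng);
ncr = ncol(r);
h = 1;                                  /* M0 */
do rn=1 to ncr;
   hn = j(1,r[,rn],0);                  /* row of r[rn] zeros */
   h1 = h || hn;
   h2 = hn || h;
   h = h1+h2;
end;
possible = 2 ## ncr;                    /* number of possible sums */
ne = h[,negrsum+1];                     /* sums equal to negrsum */
nle = sum(h[,1:negrsum+1]);             /* sums less or equal negrsum */
ng = possible - nle;                    /* sums greater than negrsum; also */
                                        /* valid if negrsum is the maximum */
FINISH;
store module=algor1;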
START BINOMIAL(n,k,bin);
/* bin = n! / (k! (n-k)!) */
kdiff = n-k;
Bmin = min(k,kdiff);
Bmax = max(k,kdiff);
Bin1 = 1;
Bin2 = 1;
do m = (Bmax+1) to n;
   Bin1 = m # Bin1;
end;
do m = 1 to Bmin;
   Bin2 = m # Bin2;
end;
Bin = Bin1/Bin2;
FINISH;
store module=binomial;
START GCD(pos,rsortpos);
/* rsortpos = pos / gcd(pos); pos is sorted, so no common */
/* divisor can exceed its first element                   */
rsortpos = pos;
do t = 2 to pos[,1];
   rhelp = 0;
   do k = 1 to ncol(pos);
      rhelp = rhelp + mod(pos[,k],t);
   end;
   if rhelp = 0 then rsortpos = pos / t;
end;
FINISH;
store module=GCD;
/* algorithm for two independent samples */
START ALGOR2(r,ncx1,ncx2,r1sum,r2sum,ne,nl,ng,indexmin);
rsort = r;
rsort[,rank(r)] = r;                    /* scores in ascending order */
sub = rsort[,1];
rsort = rsort - sub;                    /* subtract the smallest score */
r1sumred = r1sum - ncx1#sub;
r2sumred = r2sum - ncx2#sub;
nr0 = sum((rsort=0));                   /* number of zero scores */
nr0plus1 = nr0+1;
ncr = ncx1+ncx2;
if rsort[,nr0plus1] >= 2 then do;       /* divide by the common divisor */
   pos = rsort[,nr0plus1:ncr];
   load module = GCD;
   run GCD(pos,rsortpos);
   div = rsort[,nr0plus1] / rsortpos[,1];
   r1sumred = r1sumred / div;
   r2sumred = r2sumred / div;
   rsort = repeat(0,1,nr0) || rsortpos;
end;
RS = r1sumred || r2sumred;
NC = ncx1 || ncx2;
indexmin = NC[,>:<];                    /* index of the smaller sample */
ncmin = NC[,indexmin];
relevrow = ncmin+1;
lcindex = ncr-ncmin+1;
lastcol = 1 + sum(rsort[,lcindex:ncr]); /* largest possible sum + 1 */
h = j(nr0plus1,1,0);
load module=binomial;
do k = 0 to nr0;                        /* start matrix for the zero scores */
   run binomial(nr0,k,bin0);
   h[k+1,1] = bin0;
end;
do rn = nr0plus1 to ncmin;
   hn = j(1,rsort[,rn],0);
   h1 = block(h,hn);
   h2 = block(hn,h);
   h = h1+h2;
end;
do rn = max(relevrow,nr0plus1) to ncr;
   hn = j(1,rsort[,rn],0);
   h1 = block(h,hn);
   h2 = block(hn,h);
   h = h1+h2;
   h = h[1:relevrow,];                  /* drop rows that are not needed */
   if ncol(h) > lastcol then h=h[,1:lastcol];
   if ncmin+rn-ncr > 1 then do;
      h = h[2:relevrow,];
      relevrow = relevrow-1;
   end;
end;
h = h[relevrow,];
rs = RS[,indexmin];
ne = h[,1+rs];
if rs = 0 then nl=0;
else nl = sum(h[,1:rs]);
ng = h[,+] - ne - nl;
FINISH;
store module=ALGOR2;
quit;
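
The first part of ALGOR2 appears to shrink the problem before the recursion is run: the smallest score is subtracted from all scores, and the remaining scores are divided by their common divisor, so the matrix needs fewer columns. The following Python sketch illustrates this reading of the module; the name `reduce_scores` and the four-element input are assumptions made for the example.

```python
from functools import reduce
from math import gcd

def reduce_scores(scores, m, observed_sum):
    """Shift the scores to start at 0, divide by their gcd, and map the observed sum."""
    lo = min(scores)
    shifted = sorted(s - lo for s in scores)
    g = reduce(gcd, shifted) or 1            # all scores equal -> gcd 0 -> use 1
    reduced = [s // g for s in shifted]
    # every m-element sum loses m*lo by the shift and is then divided by g
    reduced_sum = (observed_sum - m * lo) // g
    return reduced, reduced_sum

# scores 2, 4, 6, 8; observed sum 4 + 6 = 10 of a sample with m = 2 elements
print(reduce_scores([2, 4, 6, 8], m=2, observed_sum=10))  # -> ([0, 1, 2, 3], 3)
```

Counts of sums below, at and above the observed value are unchanged by this transformation, because it maps all m-element sums by the same increasing linear function.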
REFERENCES
[1] Streitberg, B., Röhmel, J.
Exact Distributions for Permutation and Rank Tests:
An Introduction to Some Recently Published Algorithms.
Statistical Software Newsletter 12(1), 1986, 10-17.
[2] Streitberg, B., Röhmel, J.
Exakte Verteilungen für Rang- und Randomisierungstests
im allgemeinen c-Stichprobenproblem.
EDV in Medizin und Biologie 18(1), 1987, 12-19.