An Algorithm of a Longest of Runs Test for Very Long
Sequences of Bernoulli Trials
Alexander I. KOZYNCHENKO
Faculty of Science, Technology, and Media, Mid Sweden University,
SE-85170, Sundsvall, Sweden
[email protected]
Abstract
A new algorithm for computing the statistics of a longest of runs test is proposed for the
case of equal-probability Bernoulli trials processes. The algorithm is founded on the analysis
of the event tree diagram, which has shown the role of Fibonacci numbers of higher
orders in counting the number of outcomes of interest in the sample space. The proof by
induction is given. Compared to the classical combinatorial formulas, the proposed
algorithm provides error-free exact probabilities and makes it possible to process
very long binomial data sets (up to $n = 10^3$) on contemporary computers.
Keywords: runs tests, longest run, Bernoulli trials, Fibonacci numbers, computing
algorithms
2000 Mathematics Subject Classifications: 62G10; 62-04; 11B39
1. Introduction
Distribution-free tests for randomness of sample data play an important role in
statistical inference and are relevant to many applications in sociology, biology,
psychology, engineering, etc., including such particular problems as regression and
curve fitting. There is a great body of literature on the subject; the books by
Siegel and Castellan [1] and Sprent [2] are worthy of mention.
This paper is concerned with the computational aspects of an important
distribution-free runs test, namely, the longest of runs test of randomness applied to
long patterns of binomial trials. Among the publications on runs tests, it is
worth mentioning the paper by Mood [3] and the monograph by Bradley [4], who
gave a detailed treatment of runs tests as well as a survey of the work
done in the 1940s-60s by Olmstead, Mosteller, Grant, Burr and Cane, et al.
The existing runs tests are based on either numbers of runs or lengths of runs. The
total-number-of-runs test provides both exact combinatorial formulas and asymptotic
ones under the assumption of normally distributed statistics for large samples (see,
e.g., [4], p. 262). However, for the longest of runs tests there is no such asymptotic
theory, and we have to use the exact combinatorial formulas. So the question arises as to
whether those formulas are applicable for computing the statistics on contemporary
computers in the case of large samples, or whether it is necessary to derive a more
adequate theory applicable to processing long samples.
2. Analysis of the classical combinatorial formulas for the longest of
runs test
The conventional approach to deriving a general formula for the probability
$P(r_{?,(\ge S)} \ge 1)$ of obtaining at least one run of length S or greater among either
the 1’s or the 2’s had been described in [3]. It is to be noted that “either” includes the
possibility of “both”. The approach is based on the formula of calculating the probability
of a sum of random compatible events:

$$P(r_{?,(\ge S)} \ge 1) = P(r_{1,(\ge S)} \ge 1) + P(r_{2,(\ge S)} \ge 1) - P(r_{1,(\ge S)} \ge 1 \text{ and } r_{2,(\ge S)} \ge 1), \tag{1}$$

where $r_1$, $r_2$, $r_?$ are numbers of runs of 1’s, 2’s, and of unspecified type of element
containing the run, respectively;
$P(r_{1,(\ge S)} \ge 1)$ is the probability of obtaining at least one run of length ≥ S
among the 1’s, whether or not such a run also occurs among the 2’s;
$P(r_{2,(\ge S)} \ge 1)$ is the probability of obtaining at least one such run among the 2’s,
whether or not one also occurs among the 1’s;
$P(r_{1,(\ge S)} \ge 1 \text{ and } r_{2,(\ge S)} \ge 1)$ is the probability of obtaining at least one such
run among both the 1’s and the 2’s.
Suppose that a sequence of n trials contains $n_1$ 1’s and $n_2 = n - n_1$ 2’s. In this case, the
probabilities in (1) can be computed by the following combinatorial formulas:
$$P(r_{1,(\ge S)} \ge 1) = \frac{1}{\binom{n}{n_1}} \sum_{i=1}^{\lfloor n_1/S \rfloor} (-1)^{i+1} \binom{n_2+1}{i} \binom{n-iS}{n_2}, \tag{2}$$

$$P(r_{2,(\ge S)} \ge 1) = \frac{1}{\binom{n}{n_1}} \sum_{i=1}^{\lfloor n_2/S \rfloor} (-1)^{i+1} \binom{n_1+1}{i} \binom{n-iS}{n_1}, \tag{3}$$

$$\begin{aligned}
P(r_{1,(\ge S)} \ge 1 \text{ and } r_{2,(\ge S)} \ge 1)
= \frac{1}{\binom{n}{n_1}} \sum_{r_1=1}^{n_1-S+1}
& \Biggl[ \sum_{i=1}^{\lfloor (n_1-r_1)/(S-1) \rfloor} (-1)^{i+1} \binom{r_1}{i} \binom{(n_1-1)-i(S-1)}{r_1-1} \Biggr] \\
\times \Biggl[ & \sum_{i=1}^{\lfloor (n_2-r_1+1)/(S-1) \rfloor} (-1)^{i+1} \binom{r_1-1}{i} \binom{(n_2-1)-i(S-1)}{r_1-2} \\
& + 2 \sum_{i=1}^{\lfloor (n_2-r_1)/(S-1) \rfloor} (-1)^{i+1} \binom{r_1}{i} \binom{(n_2-1)-i(S-1)}{r_1-1} \\
& + \sum_{i=1}^{\lfloor (n_2-r_1-1)/(S-1) \rfloor} (-1)^{i+1} \binom{r_1+1}{i} \binom{(n_2-1)-i(S-1)}{r_1} \Biggr]
\end{aligned} \tag{4}$$
These run formulas take $n_1$ and $n_2$ as given. But in the case of Bernoulli trials, when 1
and 2 are mutually exclusive outcomes with probabilities p and q respectively of
occurrence on a single trial, it would be convenient to eliminate parameters $n_1$ and $n_2$.
The extension where $n_1$ and $n_2$ are not fixed, so that the probability (1) is completely
irrespective of $n_1$ and depends only on n and p, is described in [3]. The compound
probability is obtained by taking the product of the binomial probability
$\binom{n}{n_1} p^{n_1} q^{n_2}$ and the probability being computed on (1)-(4). The sum of that
product over all possible values of $n_1$ gives the sought-for probability:

$$P(r_{?,(\ge S)} \ge 1 \mid n, p) = \sum_{n_1=0}^{n} P(r_{?,(\ge S)} \ge 1 \mid n_1, n) \cdot \binom{n}{n_1} \cdot p^{n_1} (1-p)^{n-n_1} \tag{5}$$
Evidently, it is worthwhile to develop a computing algorithm in order to check the
correctness and to evaluate the performance of this formula. The author has created a
C++ program that computes the probability $P(r_{?,(\ge S)} \ge 1 \mid n, p)$ using the formulas
(1)-(5). The code is placed in Appendix A. A number of computing tests have been
carried out, and the analysis of the results has revealed two drawbacks of the formula
(5).
First of all, it gives a systematic error that manifests itself in the expression (4). Let
us consider, for instance, the case of n = 8, S = 4, p = 0.5. The computations on the
formula (5) give the probability $P(r_{?,(\ge S)} \ge 1 \mid n, p) = 0.375$ and the number of
outcomes of interest N = 96 (the number of all possible outcomes equals $2^8 = 256$).
However, the correct values are 0.3671875 and 94, respectively. The reason is in the
formula (4) that gives the erroneous zero value for the probability
$P(r_{1,(\ge S)} \ge 1 \text{ and } r_{2,(\ge S)} \ge 1)$, whereas the correct value is $2^{-7}$, which corresponds to
the two outcomes of interest available: 11112222 and 22221111. In order to check the
classical formulas further, the author developed a brute-force algorithm based
on the breadth-first search technique. It gives the exact solutions, but has the
exponential computation time $t(n) = O(2^n)$ and therefore cannot be applied to samples
of length n > 40. The comparison of the results obtained by this algorithm with those of
the classical formulas discloses that the classical formulas give a regular positive error
when $S \le 0.5 \cdot n$. This evidently confirms the fallacy of the formula (4), since the error
arises for the paths with two or more runs of length S, which is exactly where the
formula (4) is applied.
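The correct count for n = 8, S = 4 can be reproduced by direct enumeration. The following is a minimal sketch and not the author's breadth-first search program (which is not listed in this paper): it simply loops over all $2^n$ bit patterns, reading bit value 0 as outcome 1 and bit value 1 as outcome 2. The function names hasRunAtLeast and countByEnumeration are hypothetical, and n ≤ 30 is assumed so that a pattern fits in 32 bits.

```cpp
#include <cstdint>

// True if the n lowest bits of seq contain a run of at least s equal bits.
bool hasRunAtLeast(std::uint32_t seq, int n, int s) {
    if (s <= 1) return n >= 1;            // any single trial is a run of length 1
    int run = 1;                          // length of the run ending at the current bit
    for (int j = 1; j < n; ++j) {
        bool same = ((seq >> j) & 1u) == ((seq >> (j - 1)) & 1u);
        run = same ? run + 1 : 1;
        if (run >= s) return true;
    }
    return false;
}

// Number of outcomes of n trials (assumed n <= 30) that contain a run of
// length >= s among either symbol; the work grows as 2^n.
std::uint64_t countByEnumeration(int n, int s) {
    std::uint64_t count = 0;
    for (std::uint32_t seq = 0; seq < (1u << n); ++seq)
        if (hasRunAtLeast(seq, n, s)) ++count;
    return count;
}
```

For n = 8 and S = 4 the call countByEnumeration(8, 4) returns 94, that is 94/256 = 0.3671875, in agreement with the correct values quoted above.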
Secondly, the computing tests have revealed an upper limit on the length of binomial
sequences that can be processed on contemporary PCs equipped with, e.g., the AMD
Athlon™ 64 X2 Dual Core 4600+ processor. This limit amounts to n = 180 for S < 10 and
p = 0.5. Such a restriction does not actually allow processing the very long binomial
sequences for which adequate power of runs tests could be attained.
3. Description of the proposed algorithm using the Fibonacci numbers,
its proof, and performance
The longest of runs test of a Bernoulli trials process can be analysed using a binary
tree diagram, the standard technique of representing the sample space and computing
probabilities [5]. The paths with outcomes of interest contain at least one run of length
S or greater. They are all indicated in Fig. 1, where solid and dotted lines mean success
(or, say, 1) and failure (or 2) of a chance experiment, respectively.
[Fig. 1. The half of an event tree diagram showing the number of the runs of length ≥ S
in the sequence of Bernoulli trials of length n = 8 (S = 4). Axis: number of a trial,
j = 0, …, n. Labelled nodes: A, B, C, D, E, F.]
Some paths containing a run of length S at their ends are shown completely from the
root to a leaf (e.g., ABDF), whereas the others are pooled into clusters of paths having
both an initial common explicit part that ends with a run of length S and a subsequent
arbitrary sub-tree (see, e.g., clusters ABC or ABDE).
[Fig. 1, annotation: Fibonacci numbers $F_{i+S-2,\,S-1}$ of order $S-1$, $i = j - 4$.]
We will consider the particular (and most important) case of a Bernoulli trials
process with equal probabilities of success and failure on a chance experiment,
p = q = 0.5. Here, the probabilities of all outcomes are the same, being equal to $0.5^n$ for n
Bernoulli trials. Hence, in order to compute the probability $P(r_{?,(\ge S)} \ge 1)$ of obtaining at
least one run of length S or greater among either the 1’s or the 2’s we need to calculate
the number of outcomes of interest. In the case (n = 8, S = 4) depicted in Fig. 1, this
number can be calculated as follows:

$$N(r_{?,(\ge S=4)} \ge 1 \mid n = 8, p = 0.5) = 2 \cdot (1 \cdot 2^4 + 1 \cdot 2^3 + 2 \cdot 2^2 + 4 \cdot 2^1 + 7 \cdot 2^0) = 94, \tag{6}$$

where the first item, $1 \cdot 2^4$, corresponds to the cluster ABC, the second one, $1 \cdot 2^3$,
corresponds to ABDE, and so on until the item $7 \cdot 2^0$ that relates to the individual paths
not including sub-trees, such as ABDF. The leading factor 2 accounts for the other half of
the tree, which is not shown in Fig. 1. As we can see, the factors 1, 1, 2, 4, 7 form a part
(without zeros) of the sequence of Fibonacci numbers of 3rd order. Having analysed the
tree diagrams for other (n, S) cases, the general formula for arbitrary n, S < n is derived:
$$\begin{aligned}
N(r_{?,(\ge S)} \ge 1 \mid n, p = 0.5)
&= 2 \cdot \bigl( F_{2,S-1} \cdot 2^{n-S} + F_{3,S-1} \cdot 2^{n-S-1} + \dots + F_{2+i,S-1} \cdot 2^{n-S-i} + \dots + F_{n-S+2,S-1} \cdot 2^0 \bigr) \\
&= \sum_{i=0}^{n-S} F_{i+2,S-1} \cdot 2^{n-S+1-i},
\end{aligned} \tag{7}$$

where $F_{i+2,S-1}$ is the (i + 2)th Fibonacci number of order S − 1.
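Formula (7) can be evaluated exactly in integer arithmetic for moderate n. The sketch below is an illustration under stated assumptions, not the program of Appendix B: it assumes 2 ≤ S ≤ n and n ≤ 62 so that every term fits in 64 bits, it stores $F_{i+2,S-1}$ in f[i], and the name countByFormula7 is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Number of outcomes of n fair trials containing a run of length >= s,
// computed by formula (7). Assumes 2 <= s <= n and n <= 62.
std::uint64_t countByFormula7(int n, int s) {
    const int order = s - 1;              // order of the Fibonacci numbers
    const int last = n - s;               // upper summation index in (7)
    std::vector<std::uint64_t> f(static_cast<std::size_t>(last + 1), 0);
    f[0] = 1;                             // F(2, s-1) = 1; earlier numbers are zero
    for (int i = 1; i <= last; ++i)
        for (int j = 1; j <= order && j <= i; ++j)
            f[i] += f[i - j];             // sum of the preceding `order` numbers
    std::uint64_t total = 0;
    for (int i = 0; i <= last; ++i)
        total += f[i] << (last + 1 - i);  // F(i+2, s-1) * 2^(n-s+1-i)
    return total;
}
```

With n = 8 and S = 4 the array f holds 1, 1, 2, 4, 7 and countByFormula7(8, 4) returns 94, the value given by (6).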
So, the formula for the sought-for probability is derived from (7) by dividing it by the
number of all possible outcomes $2^n$:

$$P(r_{?,(\ge S)} \ge 1 \mid n, p = 0.5) = \sum_{i=0}^{n-S} \frac{F_{i+2,S-1}}{2^{i+S-1}} \tag{8}$$

The formula (7) can be proved by mathematical induction:
1. The basis: the formula (7) holds when n = S. Indeed, in this case (7) gives us the
correct number (two) of paths of length n that contain a run of length n = S:

$$N(r_{?,(\ge S)} \ge 1 \mid n = S, p = 0.5) = \sum_{i=0}^{n-S} F_{i+2,S-1} \cdot 2^{n-S+1-i} = 2 \cdot \bigl( F_{2,S-1} \cdot 2^0 \bigr) = 2.$$

2. The inductive step: suppose that the formula (7) holds for some n. We need to
prove that the formula (7) also holds when n + 1 is substituted for n.
Let us write down the formula (7) for n + 1:

$$N(r_{?,(\ge S)} \ge 1 \mid n+1, p = 0.5) = \sum_{i=0}^{n+1-S} F_{i+2,S-1} \cdot 2^{n+1-S+1-i} = 2 \cdot \sum_{i=0}^{n-S} F_{i+2,S-1} \cdot 2^{n-S+1-i} + 2 \cdot F_{n-S+3,S-1}$$

The first summand of this expression gives the number of those outcomes of
interest for n + 1 binary trials which are generated from the ends of all paths
existing at the nth level of a tree diagram. These outcomes are represented by
clusters of paths at the (n + 1)st level (see Fig. 1). The second summand gives the
number of the single outcomes of interest appearing at the (n + 1)st level. These
new outcomes belong to the paths having only one run of length S, which is
situated at the end of the path. These terminating runs originate at the paths
having the same feature (only one run of length S at the end of the path at
levels n, n − 1, …). The total number of these generating paths equals the sum
of their numbers at levels n, n − 1, …, n − S + 2. As we can see from the formula
(7) for the nth level written using the Horner scheme

$$N(r_{?,(\ge S)} \ge 1 \mid n, p = 0.5) = \sum_{i=0}^{n-S} F_{i+2,S-1} \cdot 2^{n-S+1-i} = 2 \cdot \Bigl( F_{n-S+2,S-1} + 2 \cdot \bigl( F_{n-S+1,S-1} + \dots + 2 \cdot ( F_{2+i,S-1} + \dots + 2 \cdot ( F_{3,S-1} + 2 \cdot F_{2,S-1} ) \dots ) \dots \bigr) \Bigr),$$

the abovementioned numbers equal the Fibonacci numbers
$F_{n-S+2-j,S-1}$, $j = 0, \dots, S-2$. This means that the factor $F_{n-S+3,S-1}$ of the second
summand equals the sum of Fibonacci numbers $\sum_{j=0}^{S-2} F_{n-S+2-j,S-1}$ and, therefore,
is indeed a Fibonacci number of order S − 1 by definition. That is, the inductive
step is proven ■
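As a numerical illustration of the inductive step (an added check, not part of the original proof), take S = 4 and pass from n = 8 to n = 9. The Fibonacci numbers of order 3 start 1, 1, 2, 4, 7, so the next one is $F_{7,3} = 2 + 4 + 7 = 13$, and the count for n = 8 is 94:

```latex
\begin{aligned}
N(r_{?,(\ge 4)} \ge 1 \mid n = 9,\ p = 0.5)
  &= 2 \cdot N(r_{?,(\ge 4)} \ge 1 \mid n = 8,\ p = 0.5) + 2 \cdot F_{7,3} \\
  &= 2 \cdot 94 + 2 \cdot 13 = 214, \\
\sum_{i=0}^{5} F_{i+2,3} \cdot 2^{6-i}
  &= 1 \cdot 64 + 1 \cdot 32 + 2 \cdot 16 + 4 \cdot 8 + 7 \cdot 4 + 13 \cdot 2 = 214 .
\end{aligned}
```

Both routes give the same number, as the inductive step requires.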
The analysis of computation performance of the formula (8) has been carried out using
the C++ code given in Appendix B. First of all, the program calculates the related
Fibonacci numbers placed into array fib. In so doing the computing algorithm takes into
account the inner structure of a Fibonacci numbers sequence, which contains an initial
sub-sequence of numbers being a power of 2. Several Fibonacci numbers sequences are
listed in Table A, the initial sub-sequences being selected by the grey background
colour.
Table A. Fibonacci numbers $F_{i+2,S-1}$ of order $S-1$, $i = 0, \dots, n$

S \ i    0   1   2   3   4   5   6   7    8    9   10
3        1   1   2   3   5   8  13  21   34   55   89
4        1   1   2   4   7  13  24  44   81  149  274
5        1   1   2   4   8  15  29  56  108  208  401
6        1   1   2   4   8  16  31  61  120  236  464
7        1   1   2   4   8  16  32  63  125  248  492
8        1   1   2   4   8  16  32  64  127  253  504
9        1   1   2   4   8  16  32  64  128  255  509
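The rows of Table A can be regenerated with a short routine. The sketch below is an illustration rather than the code of Appendix B. It relies on the observation that, from the third listed number onwards, each number equals twice its predecessor minus the number that drops out of the summation window (taken as zero when the window has not yet filled); this follows from subtracting the defining sums of two consecutive numbers and also explains the initial powers of 2 in every row. The name fibonacciRow is hypothetical, and order ≥ 1 and count ≥ 0 are assumed.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns F(i+2, order) for i = 0, ..., count-1 in the paper's indexing;
// the Table A row for run length S uses order = S - 1.
std::vector<std::uint64_t> fibonacciRow(int order, int count) {
    std::vector<std::uint64_t> f(static_cast<std::size_t>(count), 0);
    for (int i = 0; i < count; ++i) {
        if (i < 2) {
            f[i] = 1;                     // the two leading ones of every row
        } else {
            // number leaving the window of `order` summands, zero if none yet
            std::uint64_t leaving = (i - 1 - order >= 0) ? f[i - 1 - order] : 0;
            f[i] = 2 * f[i - 1] - leaving;
        }
    }
    return f;
}
```

For example, fibonacciRow(3, 11) reproduces the row S = 4 of Table A: 1, 1, 2, 4, 7, 13, 24, 44, 81, 149, 274.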
The second part of the program computes the probability $P(r_{?,(\ge S)} \ge 1 \mid n, p = 0.5)$ by the
formula (8). The algorithm and code are able to make calculations on contemporary PCs
for very long Bernoulli trials sequences, up to $n = 10^3$. The results obtained are depicted
in Fig. 2 and can be used to test for randomness of a pattern of Bernoulli trials
with p = q = 0.5 (null hypothesis).

[Fig. 2. Vertical axis: probability of obtaining at least one run of length S or greater
among either the 1’s or the 2’s, $P(r_{?,(\ge S)} \ge 1 \mid n, p = 0.5)$. Horizontal axis: size of the
sequence of Bernoulli trials, n, from 0 to 1000. One curve for each of S = 9, 10, …, 15.]
For example, let us consider a sequence of 1s and 2s of length n = 600 containing a
run of length S = 14, and test the null hypothesis under the significance level α = 0.05.
Consulting Fig. 2, one can find that the corresponding probability is below α, so the
null hypothesis should be rejected. If we increase the number of trials up to n = 1000,
the chance probability that a sequence of n Bernoulli trials with p = q = 0.5 would
contain a run of 14 or more consecutive 1’s or 2’s is about 0.06, so the null hypothesis
cannot be rejected at the given significance level.
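The probabilities used in such a test can also be computed directly from formula (8) instead of being read off the figure. The sketch below is a variation on the Appendix B program, not a copy of it: it stores the scaled quantities $g_i = F_{i+2,S-1} / 2^i$, which never exceed 1, so that no intermediate value can overflow for long sequences. The function name longestRunProbability and the return value -1 for arguments outside 2 ≤ S ≤ n are assumptions of this sketch.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// P(at least one run of length >= s among n fair Bernoulli trials), formula (8).
// g[i] holds F(i+2, s-1) / 2^i; returns -1 if the arguments are out of range.
double longestRunProbability(int n, int s) {
    if (s < 2 || s > n) return -1.0;
    const int order = s - 1;
    const int terms = n - s + 1;          // number of summands in (8)
    std::vector<double> g(static_cast<std::size_t>(terms), 0.0);
    g[0] = 1.0;
    double sum = g[0];
    for (int i = 1; i < terms; ++i) {
        // scaled recurrence: g[i] = sum over j of g[i-j] / 2^j
        for (int j = 1; j <= order && j <= i; ++j)
            g[i] += std::ldexp(g[i - j], -j);
        sum += g[i];
    }
    return std::ldexp(sum, 1 - s);        // divide the sum by 2^(s-1)
}
```

For n = 8 and S = 4 the function returns 0.3671875. Calling it with (600, 14) and (1000, 14) gives the two probabilities that are compared with α = 0.05 in the example above.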
The Fibonacci numbers of higher orders are used in other Bernoulli trials related
problems, such as coin tossing (see, e.g., [6]), where the probability that no runs of
k consecutive tails will occur in n coin tosses is given by $F^{(k)}_{n+2} / 2^n$, where $F^{(k)}_l$ is a
Fibonacci k-step number (kth order).
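A small case makes the coin-tossing formula concrete (an added check). For k = 2 the k-step numbers are the ordinary Fibonacci numbers 1, 1, 2, 3, 5, …, indexed from $F^{(2)}_1 = F^{(2)}_2 = 1$. With n = 3 tosses:

```latex
\frac{F^{(2)}_{5}}{2^{3}} = \frac{5}{8},
\qquad
\text{sequences without two consecutive tails: }
\mathrm{HHH},\ \mathrm{HHT},\ \mathrm{HTH},\ \mathrm{THH},\ \mathrm{THT}.
```

Five of the eight possible sequences qualify, which matches the formula.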
4. Summary
In the paper, a new powerful approach to the longest of runs test is described, which
can effectively replace the classical combinatorial formulas in the particular, but
important, case of equal-probability Bernoulli trials processes. This approach is based
on a thorough analysis of the event tree diagram, which suggested a concise
formula for the probability of obtaining at least one run of length S or greater among
either the 1’s or the 2’s. The derived formula makes extensive use of the Fibonacci
numbers of higher orders. The formula proves to be capable of processing very long
dichotomous sequences, up to $n = 10^3$ as compared to n ≈ 180 for the classical
combinatorial approach. The correctness of the results obtained was checked by a
breadth-first search algorithm, and complete agreement has been shown. A side result
of the paper is the discovery of a regular error inherent in the classical combinatorial
algorithm in some cases.
5. Acknowledgements
The author would like to thank Prof. Wej-Min Huang for his comments and
suggestions that led to improvements in the paper.
Appendix A
//The C++ code developed for computing the statistics of the
//classical longest of runs test:
#include<iostream>
#include<cmath>
#include<iomanip>
using namespace std;

// n! computed in floating point
double Factorial(int n)
{
    double t; int i;
    for(t = 1, i = 1; i <= n; ++i)
        t *= i;
    return t;
}

// Binomial coefficient "n choose r"; returns 0 whenever n <= 0, including C(0, 0)
double C(int n, int r)
{
    if(n <= 0) return 0;
    double t; int i;
    for(t = 1, i = n; i >= n-r+1; --i)
        t *= i;
    return t/Factorial(r);
}

// Formula (5); the factor C(n, n1) of (5) cancels the denominators of (2)-(4)
double Prob(int n, int s, double p = 0.5, double q = 0.5)
{
    double prob = 0; int i;
    for(int n1 = 0; n1 <= n; ++n1)
    {
        double prob1 = 0, prob2 = 0, prob12 = 0;
        // formula (2)
        for(i = 1; i <= n1/s; ++i)
            prob1 += pow(-1, i+1)*C(n-n1+1, i)*C(n-i*s, n-n1);
        // formula (3)
        for(i = 1; i <= (n-n1)/s; ++i)
            prob2 += pow(-1, i+1)*C(n1+1, i)*C(n-i*s, n1);
        // formula (4)
        for(int r1 = 1; r1 <= n1-s+1; ++r1)
        {
            double a1 = 0, a2 = 0, a3 = 0, a4 = 0;
            for(i = 1; i <= (n1-r1)/(s-1); ++i)
                a1 += pow(-1, i+1)*C(r1, i)*C((n1-1)-i*(s-1), r1-1);
            for(i = 1; i <= (n-n1-r1+1)/(s-1); ++i)
                a2 += pow(-1, i+1)*C(r1-1, i)*C((n-n1-1)-i*(s-1), r1-2);
            for(i = 1; i <= (n-n1-r1)/(s-1); ++i)
                a3 += pow(-1, i+1)*C(r1, i)*C((n-n1-1)-i*(s-1), r1-1);
            for(i = 1; i <= (n-n1-r1-1)/(s-1); ++i)
                a4 += pow(-1, i+1)*C(r1+1, i)*C((n-n1-1)-i*(s-1), r1);
            prob12 += a1*(a2 + 2*a3 + a4);
        }
        // formula (1) weighted by the probability of n1 ones and n-n1 twos
        prob += (prob1 + prob2 - prob12)*pow(p, n1)*pow(q, n-n1);
    }
    return prob;
}

int main()
{
    int n = 8, s = 4;
    double prob = Prob(n, s);
    cout.setf(ios::fixed);
    cout << "n = " << n << " " << "s = " << s << endl << "count = "
         << setw(16) << setprecision(0) << prob*pow(2,n) << endl
         << "prob24 = " << setw(24) << setprecision(24) << prob << endl;
    return 0;
}
Appendix B
//The C++ code for the proposed longest of runs test algorithm
//using the Fibonacci numbers:
#include<iostream>
#include<cmath>
#include<iomanip>
#include<vector>
#include<algorithm>
using namespace std;

double RunsFib(int n, int s)
{
    // formula (8) needs Fibonacci numbers of order s-1 >= 1 and a run that fits into n trials
    if(s < 2 || s > n)
    {
        cout << "error" << endl;
        return -1;
    }
    int i;
    // fib[i] holds F(i+2, s-1), i = 0, ..., n-s
    vector<double> fib(n-s+1, 0.0);
    double p = 0;
    fib[0] = 1;
    if(n-s >= 1)
        fib[1] = 1;
    // initial sub-sequence: powers of 2
    for(i = 2; i < s-1 && i < n-s+1; ++i)
        fib[i] = pow(2,i-1);
    // remaining numbers: sum of the s-1 preceding ones
    for(i = max(s-1, 2); i <= n-s; ++i)
        for(int j = 0; j < s-1; ++j)
            fib[i] += fib[i-j-1];
    // formula (8); the common factor 2 is applied after the loop
    for(i = 0; i <= n-s; ++i)
        p += fib[i]*pow(0.5, s+i);
    p *= 2;
    cout.setf(ios::fixed);
    cout << "n = " << n << " s = " << s << endl;
    cout << " prob24 = " << setprecision(24) << p << endl;
    return p;
}

int main()
{
    int n = 8, s = 4;
    RunsFib(n,s);
    return 0;
}
References
[1] Siegel, S., Castellan, N.J., Jr., 1988, Nonparametric Statistics for the Behavioral
Sciences, 2nd ed. (New York: McGraw-Hill).
[2] Sprent, P., 1993, Applied Nonparametric Statistical Methods, 2nd ed. (London:
Chapman & Hall).
[3] Mood, A.M., 1940, “The Distribution Theory of Runs,” Annals of Mathematical
Statistics, 11, 367-392.
[4] Bradley, J.V., 1968, Distribution-Free Statistical Tests (Englewood Cliffs, New
Jersey: Prentice Hall).
[5] Grinstead, C.M., Snell, J.L., 1997, Introduction to Probability, 2nd rev. ed.
(American Mathematical Society).
[6] http://mathworld.wolfram.com/Fibonaccin-StepNumber.html