
An Algorithm of a Longest of Runs Test for Very Long Sequences of Bernoulli Trials

Alexander I. KOZYNCHENKO
Faculty of Science, Technology, and Media, Mid Sweden University, SE-85170, Sundsvall, Sweden
[email protected]

Abstract

A new algorithm for computing the statistics of the longest of runs test is proposed for Bernoulli trials processes with equal probabilities. The algorithm is founded on an analysis of the event tree diagram, which reveals the role of Fibonacci numbers of higher orders in counting the outcomes of interest in the sample space. A proof by induction is given. Compared to the classical combinatorial formulas, the proposed algorithm provides exact, error-free probabilities and makes it possible to process very long binomial data sets (up to $10^3$ trials) on contemporary computers.

Keywords: runs tests, longest run, Bernoulli trials, Fibonacci numbers, computing algorithms

2000 Mathematics Subject Classifications: 62G10; 62-04; 11B39

1. Introduction

Distribution-free tests for randomness of sample data play an important role in statistical inference and are relevant to many applications in sociology, biology, psychology, engineering, etc., including such particular problems as regression and curve fitting. There is a great body of literature on the subject; worthy of mention are the books by Siegel and Castellan [1] and Sprent [2].

This paper is concerned with the computational aspects of an important distribution-free runs test, namely the longest of runs test of randomness applied to long patterns of binomial trials. Among the publications on runs tests, it is worth mentioning the paper by Mood [3] and the monograph by Bradley [4], who gave a detailed treatment of runs tests as well as a survey of the work done in the 1940s-60s by Olmstead, Mosteller, Grant, Burr and Cane, et al. The existing runs tests are based on either numbers of runs or lengths of runs.
The total-number-of-runs test provides both exact combinatorial formulas and asymptotic ones, the latter under the assumption of normally distributed statistics for large samples (see, e.g., [4], p. 262). However, for the longest of runs test there is no such asymptotic theory, and we have to use the exact combinatorial formulas. So the question arises as to whether those formulas are applicable for computing the statistics on contemporary computers in the case of large samples, or whether it is necessary to derive a more adequate theory applicable to processing long samples.

2. Analysis of the classical combinatorial formulas for the longest of runs test

The conventional approach to deriving a general formula for the probability $P(r_{?,(\ge S)} \ge 1)$ of obtaining at least one run of length $S$ or greater among either the 1's or the 2's was described in [3]. Note that "either" includes the possibility of "both". The approach is based on the formula for the probability of a sum of compatible random events:

$$P(r_{?,(\ge S)} \ge 1) = P(r_{1,(\ge S)} \ge 1) + P(r_{2,(\ge S)} \ge 1) - P(r_{1,(\ge S)} \ge 1 \text{ and } r_{2,(\ge S)} \ge 1), \tag{1}$$

where $r_1$, $r_2$, $r_?$ are the numbers of runs of 1's, of 2's, and of an unspecified type of element containing the run, respectively; $P(r_{1,(\ge S)} \ge 1)$ is the probability of obtaining at least one run of length $\ge S$ among the 1's, irrespective of the 2's; $P(r_{2,(\ge S)} \ge 1)$ is the probability of obtaining at least one such run among the 2's, irrespective of the 1's; and $P(r_{1,(\ge S)} \ge 1 \text{ and } r_{2,(\ge S)} \ge 1)$ is the probability of obtaining at least one such run among both the 1's and the 2's.

Suppose that a sequence of $n$ trials contains $n_1$ 1's and $n_2 = n - n_1$ 2's.
In this case, the probabilities in (1) can be computed by the following combinatorial formulas, in which the upper summation limits are rounded down to integers:

$$P(r_{1,(\ge S)} \ge 1) = \frac{1}{\binom{n}{n_1}} \sum_{i=1}^{\lfloor n_1/S \rfloor} (-1)^{i+1} \binom{n_2+1}{i} \binom{n-iS}{n_2}, \tag{2}$$

$$P(r_{2,(\ge S)} \ge 1) = \frac{1}{\binom{n}{n_1}} \sum_{i=1}^{\lfloor n_2/S \rfloor} (-1)^{i+1} \binom{n_1+1}{i} \binom{n-iS}{n_1}, \tag{3}$$

$$\begin{aligned}
P(r_{1,(\ge S)} \ge 1 \text{ and } r_{2,(\ge S)} \ge 1)
= \frac{1}{\binom{n}{n_1}} \sum_{r_1=1}^{n_1-S+1}
&\left[ \sum_{i=1}^{\lfloor (n_1-r_1)/(S-1) \rfloor} (-1)^{i+1} \binom{r_1}{i} \binom{(n_1-1)-i(S-1)}{r_1-1} \right] \\
\times \Bigg[
&\sum_{i=1}^{\lfloor (n_2-r_1+1)/(S-1) \rfloor} (-1)^{i+1} \binom{r_1-1}{i} \binom{(n_2-1)-i(S-1)}{r_1-2} \\
&+ 2 \sum_{i=1}^{\lfloor (n_2-r_1)/(S-1) \rfloor} (-1)^{i+1} \binom{r_1}{i} \binom{(n_2-1)-i(S-1)}{r_1-1} \\
&+ \sum_{i=1}^{\lfloor (n_2-r_1-1)/(S-1) \rfloor} (-1)^{i+1} \binom{r_1+1}{i} \binom{(n_2-1)-i(S-1)}{r_1}
\Bigg]. \tag{4}
\end{aligned}$$

These run formulas take $n_1$ and $n_2$ as given. But in the case of Bernoulli trials, when 1 and 2 are mutually exclusive outcomes with probabilities $p$ and $q$, respectively, of occurrence on a single trial, it is convenient to eliminate the parameters $n_1$ and $n_2$. The extension in which $n_1$ and $n_2$ are not fixed, so that the probability (1) no longer depends on $n_1$ but only on $n$ and $p$, is described in [3]. The compound probability is obtained by taking the product of the binomial probability $\binom{n}{n_1} p^{n_1} q^{n_2}$ and the probability computed by (1)-(4). The sum of that product over all possible values of $n_1$ gives the sought-for probability:

$$P(r_{?,(\ge S)} \ge 1 \mid n, p) = \sum_{n_1=0}^{n} P(r_{?,(\ge S)} \ge 1 \mid n_1, n) \cdot \binom{n}{n_1} \cdot p^{n_1} (1-p)^{n-n_1}. \tag{5}$$

Evidently, it is worthwhile to develop a computing algorithm in order to check the correctness and to evaluate the performance of this formula. The author has written a C++ program that computes the probability $P(r_{?,(\ge S)} \ge 1 \mid n, p)$ using formulas (1)-(5). The code is given in Appendix A.
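For small $n$, a value of the probability in (5) can also be cross-checked without any combinatorial formula, by summing $p^{n_1}(1-p)^{n-n_1}$ over every one of the $2^n$ sequences that contains a qualifying run. The following is only a minimal sketch of such a check, not the author's program: the function name `probByEnumeration`, the encoding of a sequence as the bits of an integer (bit 1 standing for the symbol "1"), and the limit $n \le 24$ are assumptions made here for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical helper (not from the paper): probability of at least one run
// of length >= s of either symbol in n Bernoulli trials, where the symbol
// "1" (a set bit) has probability p. Every one of the 2^n sequences is
// visited, so the cost grows exponentially with n.
double probByEnumeration(int n, int s, double p) {
    assert(n >= 1 && n <= 24 && s >= 1);
    double total = 0.0;
    const std::uint32_t limit = std::uint32_t{1} << n;
    for (std::uint32_t bits = 0; bits < limit; ++bits) {
        int ones = static_cast<int>(bits & 1u);  // number of 1's seen so far
        int run = 1, longest = 1;                // current and longest run
        for (int j = 1; j < n; ++j) {
            const unsigned cur = (bits >> j) & 1u;
            const unsigned prev = (bits >> (j - 1)) & 1u;
            ones += static_cast<int>(cur);
            run = (cur == prev) ? run + 1 : 1;
            if (run > longest) longest = run;
        }
        // Weight of this particular sequence: p^{n1} (1-p)^{n-n1}.
        if (longest >= s)
            total += std::pow(p, ones) * std::pow(1.0 - p, n - ones);
    }
    return total;
}
```

With $p = 0.5$ every sequence has the same weight $2^{-n}$, so the returned value multiplied by $2^n$ is simply the number of sequences containing a run of length $S$ or greater.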
A number of computing tests have been carried out, and the analysis of the results has revealed two drawbacks of formula (5).

First, it gives a systematic error that originates in expression (4). Consider, for instance, the case $n = 8$, $S = 4$, $p = 0.5$. Computation by formula (5) gives the probability $P(r_{?,(\ge S)} \ge 1 \mid n, p) = 0.375$ and the number of outcomes of interest $N = 96$ (the number of all possible outcomes equals $2^8 = 256$). However, the correct values are 0.3671875 and 94, respectively. The reason lies in formula (4), which gives an erroneous zero value for the probability $P(r_{1,(\ge S)} \ge 1 \text{ and } r_{2,(\ge S)} \ge 1)$, whereas the correct value is $2^{-7}$, corresponding to the two outcomes of interest 11112222 and 22221111. To check the classical formulas further, the author developed a brute-force algorithm based on the breadth-first search technique. It gives exact solutions but has exponential computation time $t(n) = O(2^n)$ and therefore cannot be applied to samples of length $n > 40$. Comparison of the results obtained by this algorithm with those of the classical formulas discloses that the classical formulas give a regular positive error when $S \le 0.5 \cdot n$. This confirms the fallacy of formula (4), since $S \le 0.5 \cdot n$ is precisely the condition under which a path can contain two or more runs of length $S$, which is the case that formula (4) is meant to cover.

Second, the computing tests have revealed an upper limit on the length of binomial sequences that can be processed on contemporary PCs equipped with, e.g., the AMD Athlon™ 64 X2 Dual Core Processor 4600+. This limit amounts to $n = 180$ for $S < 10$ and $p = 0.5$. Such a restriction does not allow processing the very long binomial sequences for which adequate power of runs tests could be attained.

3.
Description of the proposed algorithm using the Fibonacci numbers, its proof, and performance

The longest of runs test of a Bernoulli trials process can be analysed using a binary tree diagram, the standard technique for representing the sample space and computing probabilities [5]. The paths with outcomes of interest contain at least one run of length $S$ or greater. They are all indicated in Fig. 1, where solid and dotted lines denote success (or, say, 1) and failure (or 2) of a chance experiment, respectively.

[Fig. 1. The half of an event tree diagram showing the number of the runs of length $\ge S$ in the sequence of Bernoulli trials of length $n = 8$ ($S = 4$). Axis: number of a trial, $j = \{0, \ldots, n\}$; labelled nodes A to F; the counts 1, 1, 2, 4, 7 at the successive levels are marked as Fibonacci numbers $F_{i+S-2,\,S-1}$ of order $S-1$, $i = j - 4$.]

Some paths containing a run of length $S$ at their ends are shown completely from the root to a leaf (e.g., ABDF), whereas the others are pooled into clusters of paths having both an initial common explicit part that ends with a run of length $S$ and a subsequent arbitrary sub-tree (see, e.g., clusters ABC or ABDE).

We will consider the particular (and most important) case of a Bernoulli trials process with equal probabilities of success and failure on a chance experiment, $p = q = 0.5$. Here, the probabilities of all outcomes are the same, being equal to $0.5^n$ for $n$ Bernoulli trials. Hence, in order to compute the probability $P(r_{?,(\ge S)} \ge 1)$ of obtaining at least one run of length $S$ or greater among either the 1's or the 2's, we need to count the outcomes of interest. In the case ($n = 8$, $S = 4$) depicted in Fig. 1, this number can be calculated as follows:

$$N(r_{?,(\ge S=4)} \ge 1 \mid n = 8,\ p = 0.5) = 2 \cdot \left( 1 \cdot 2^4 + 1 \cdot 2^3 + 2 \cdot 2^2 + 4 \cdot 2^1 + 7 \cdot 2^0 \right), \tag{6}$$

where the first term, $1 \cdot 2^4$, corresponds to the cluster ABC, the second, $1 \cdot 2^3$, corresponds to ABDE, and so on up to the term $7 \cdot 2^0$, which relates to the individual paths not including sub-trees, such as ABDF.
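Evaluating (6) term by term reproduces the values quoted in Section 2 as the correct ones for $n = 8$, $S = 4$:

```latex
\begin{aligned}
N &= 2 \cdot (1 \cdot 16 + 1 \cdot 8 + 2 \cdot 4 + 4 \cdot 2 + 7 \cdot 1)
   = 2 \cdot 47 = 94, \\
P &= \frac{N}{2^{8}} = \frac{94}{256} = 0.3671875 .
\end{aligned}
```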
As we can see, the factors 1, 1, 2, 4, 7 form a part (without the zeros) of the sequence of Fibonacci numbers of the 3rd order. Having analysed the tree diagrams for other $(n, S)$ cases, we derive the general formula for arbitrary $n$ and $S \le n$:

$$\begin{aligned}
N(r_{?,(\ge S)} \ge 1 \mid n,\ p = 0.5)
&= 2 \cdot \left( F_{2,S-1} \cdot 2^{n-S} + F_{3,S-1} \cdot 2^{n-S-1} + \ldots + F_{2+i,S-1} \cdot 2^{n-S-i} + \ldots + F_{n-S+2,S-1} \cdot 2^{0} \right) \\
&= \sum_{i=0}^{n-S} F_{i+2,S-1} \cdot 2^{n-S+1-i},
\end{aligned} \tag{7}$$

where $F_{i+2,S-1}$ is the $(i+2)$th Fibonacci number of order $S-1$. The formula for the sought-for probability is then obtained from (7) by dividing it by the number of all possible outcomes, $2^n$:

$$P(r_{?,(\ge S)} \ge 1 \mid n,\ p = 0.5) = \sum_{i=0}^{n-S} \frac{F_{i+2,S-1}}{2^{\,i+S-1}}. \tag{8}$$

Formula (7) can be proved by mathematical induction.

1. The basis: formula (7) holds when $n = S$. Indeed, in this case (7) gives the correct number (two) of paths of length $n$ that contain a run of length $n = S$:

$$N(r_{?,(\ge S)} \ge 1 \mid n = S,\ p = 0.5) = \sum_{i=0}^{n-S} F_{i+2,S-1} \cdot 2^{n-S+1-i} = 2 \cdot \left( F_{2,S-1} \cdot 2^{0} \right) = 2 .$$

2. The inductive step: suppose that formula (7) holds for some $n$. We need to prove that it also holds when $n+1$ is substituted for $n$. Let us write down formula (7) for $n+1$:

$$\begin{aligned}
N(r_{?,(\ge S)} \ge 1 \mid n+1,\ p = 0.5)
&= \sum_{i=0}^{n+1-S} F_{i+2,S-1} \cdot 2^{n+1-S+1-i} \\
&= 2 \cdot \sum_{i=0}^{n-S} F_{i+2,S-1} \cdot 2^{n-S+1-i} + 2 \cdot F_{n-S+3,S-1}.
\end{aligned}$$

The first summand of this expression gives the number of those outcomes of interest for $n+1$ binary trials that are generated from the ends of all paths existing at the $n$th level of the tree diagram. These outcomes are represented by clusters of paths at the $(n+1)$st level (see Fig. 1). The second summand gives the number of single outcomes of interest appearing at the $(n+1)$st level. These new outcomes belong to the paths having only one run of length $S$, which is situated at the end of the path.
These terminating runs originate from paths having the same feature (only one run of length $S$, at the end of the path) at levels $n, n-1, \ldots$. The total number of these generating paths equals the sum of their numbers at levels $n, n-1, \ldots, n-S+2$. As we can see from formula (7) for the $n$th level written using the Horner scheme,

$$\begin{aligned}
N(r_{?,(\ge S)} \ge 1 \mid n,\ p = 0.5) &= \sum_{i=0}^{n-S} F_{i+2,S-1} \cdot 2^{n-S+1-i} \\
&= 2 \cdot \Big( F_{n-S+2,S-1} + 2 \cdot \big( F_{n-S+1,S-1} + \ldots + 2 \cdot \big( F_{2+i,S-1} + \ldots + 2 \cdot ( F_{3,S-1} + 2 \cdot F_{2,S-1} ) \ldots \big) \ldots \big) \Big),
\end{aligned}$$

the abovementioned numbers equal the Fibonacci numbers $F_{n-S+2-j,\,S-1}$, $j = 0, \ldots, S-2$. This means that the factor $F_{n-S+3,S-1}$ of the second summand equals the sum of Fibonacci numbers $\sum_{j=0}^{S-2} F_{n-S+2-j,\,S-1}$ and, therefore, is indeed a Fibonacci number of order $S-1$ by definition. That is, the inductive step is proven. ■

The computational performance of formula (8) has been analysed using the C++ code given in Appendix B. First, the program calculates the required Fibonacci numbers and places them into the array fib. In doing so, the algorithm takes into account the inner structure of a Fibonacci number sequence, which contains an initial sub-sequence of powers of 2. Several Fibonacci number sequences are listed in Table A; each row starts with such an initial sub-sequence.

Table A. Fibonacci numbers $F_{i+2,\,S-1}$ of order $S-1$, $i = 0, \ldots, n$

S \ i    0    1    2    3    4    5    6    7    8    9   10
3        1    1    2    3    5    8   13   21   34   55   89
4        1    1    2    4    7   13   24   44   81  149  274
5        1    1    2    4    8   15   29   56  108  208  401
6        1    1    2    4    8   16   31   61  120  236  464
7        1    1    2    4    8   16   32   63  125  248  492
8        1    1    2    4    8   16   32   64  127  253  504
9        1    1    2    4    8   16   32   64  128  255  509

The second part of the program computes the probability $P(r_{?,(\ge S)} \ge 1 \mid n,\ p = 0.5)$ by formula (8). The algorithm and code are able to make calculations on contemporary PCs for very long sequences of Bernoulli trials, up to $n = 10^3$.
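The two steps just described (generating the Fibonacci numbers of order $S-1$, then summing according to formula (8)) can be sketched compactly as follows. This is an illustrative variant and not the Appendix B code: the names `fibonacciOfOrder` and `longestRunProbability` are hypothetical, the terms preceding $F_{2,S-1}$ are taken to be zero, and double precision is assumed to be adequate for $n$ up to about $10^3$.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical helper: returns F_{i+2,k} for i = 0 .. count-1, where k is
// the order. Each term is the sum of the k preceding terms, and the terms
// before F_{2,k} = 1 are assumed to be zero.
std::vector<double> fibonacciOfOrder(int k, int count) {
    assert(k >= 1 && count >= 1);
    std::vector<double> f(count, 0.0);
    f[0] = 1.0;
    for (int i = 1; i < count; ++i)
        for (int j = 1; j <= k && j <= i; ++j)
            f[i] += f[i - j];
    return f;
}

// Formula (8): sum over i = 0 .. n-s of F_{i+2,s-1} / 2^(i+s-1).
double longestRunProbability(int n, int s) {
    assert(s >= 2 && s <= n);
    const std::vector<double> f = fibonacciOfOrder(s - 1, n - s + 1);
    double p = 0.0;
    for (int i = 0; i <= n - s; ++i)
        p += std::ldexp(f[i], -(i + s - 1));  // f[i] * 2^-(i+s-1)
    return p;
}
```

For instance, `fibonacciOfOrder(3, 11)` reproduces the $S = 4$ row of Table A, and `longestRunProbability(8, 4)` returns 0.3671875. Scaling each term with `std::ldexp` instead of calling `pow` keeps the division by a power of two exact; this is a design choice of the sketch rather than something taken from the paper.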
The results obtained are depicted in Fig. 2 and can be used to test for randomness of a pattern of Bernoulli trials with $p = q = 0.5$ (null hypothesis).

[Fig. 2. Probability $P(r_{?,(\ge S)} \ge 1 \mid n,\ p = 0.5)$ of obtaining at least one run of length $S$ or greater among either the 1's or the 2's, plotted against the size $n$ of the sequence of Bernoulli trials (0 to 1000), with one curve for each of $S = 9, 10, \ldots, 15$.]

For example, let us consider a sequence of 1's and 2's of length $n = 600$ containing a run of length $S = 14$, and test the null hypothesis at the significance level $\alpha = 0.05$. Consulting Fig. 2, one finds that the null hypothesis should be rejected. If we increase the number of trials to $n = 1000$, the probability that a sequence of $n$ Bernoulli trials with $p = q = 0.5$ would contain, by chance, a run of 14 or more consecutive 1's or 2's is about 0.06, so the null hypothesis cannot be rejected at the given significance level.

The Fibonacci numbers of higher orders are used in other problems related to Bernoulli trials, such as coin tossing (see, e.g., [6]), where the probability that no run of $k$ consecutive tails will occur in $n$ coin tosses is given by $F^{(k)}_{n+2} / 2^n$, where $F^{(k)}_{l}$ is a Fibonacci $k$-step number (of $k$th order).

4. Summary

In the paper, a new powerful approach to the longest of runs test is described, which can effectively replace the classical combinatorial formulas in the particular, but important, case of Bernoulli trials processes with equal probabilities. This approach is based on a thorough analysis of the event tree diagram, which suggested a concise formula for the probability of obtaining at least one run of length $S$ or greater among either the 1's or the 2's. The derived formula makes extensive use of the Fibonacci numbers of higher orders. The formula proves to be capable of processing very long dichotomous sequences, up to $n = 10^3$ as compared to $n \approx 180$ for the classical combinatorial approach.
The correctness of the results obtained was checked by a breadth-first search algorithm, and complete agreement was found. A side result of the paper is the discovery of a regular error inherent in the classical combinatorial algorithm in some cases.

5. Acknowledgements

The author would like to thank Prof. Wej-Min Huang for his comments and suggestions that led to improvements in the paper.

Appendix A

//The C++ code developed for computing the statistics of the
//classical longest of runs test:
#include <iostream>
#include <cmath>
#include <iomanip>
using namespace std;

// n! computed in floating point
double Factorial(int n)
{
    double t;
    int i;
    for(t = 1, i = 1; i <= n; ++i)
        t *= i;
    return t;
}

// Binomial coefficient C(n, r); returns 0 when n <= 0
// (note that this convention also yields C(0, 0) = 0)
double C(int n, int r)
{
    if(n <= 0) return 0;
    double t;
    int i;
    for(t = 1, i = n; i >= n-r+1; --i)
        t *= i;
    return t/Factorial(r);
}

// Formula (5): formulas (1)-(4) evaluated for each n1 and weighted by p^n1 * q^(n-n1)
double Prob(int n, int s, double p = 0.5, double q = 0.5)
{
    double prob = 0;
    int i;
    for(int n1 = 0; n1 <= n; ++n1)
    {
        double prob1 = 0, prob2 = 0, prob12 = 0;
        // formula (2), without the common factor 1/C(n, n1)
        for(i = 1; i <= n1/s; ++i)
            prob1 += pow(-1, i+1)*C(n-n1+1, i)*C(n-i*s, n-n1);
        // formula (3)
        for(i = 1; i <= (n-n1)/s; ++i)
            prob2 += pow(-1, i+1)*C(n1+1, i)*C(n-i*s, n1);
        // formula (4)
        for(int r1 = 1; r1 <= n1-s+1; ++r1)
        {
            double a1 = 0, a2 = 0, a3 = 0, a4 = 0;
            for(i = 1; i <= (n1-r1)/(s-1); ++i)
                a1 += pow(-1, i+1)*C(r1, i)*C((n1-1)-i*(s-1), r1-1);
            for(i = 1; i <= (n-n1-r1+1)/(s-1); ++i)
                a2 += pow(-1, i+1)*C(r1-1, i)*C((n-n1-1)-i*(s-1), r1-2);
            for(i = 1; i <= (n-n1-r1)/(s-1); ++i)
                a3 += pow(-1, i+1)*C(r1, i)*C((n-n1-1)-i*(s-1), r1-1);
            for(i = 1; i <= (n-n1-r1-1)/(s-1); ++i)
                a4 += pow(-1, i+1)*C(r1+1, i)*C((n-n1-1)-i*(s-1), r1);
            prob12 += a1*(a2 + 2*a3 + a4);
        }
        // formula (1), weighted as in formula (5)
        prob += (prob1 + prob2 - prob12)*pow(p, n1)*pow(q, n-n1);
    }
    return prob;
}

int main()
{
    int n = 8, s = 4;
    double prob = Prob(n, s);
    cout.setf(ios::fixed);
    cout << "n = " << n << " " << "s = " << s << endl
         << "count = " << setw(16) << setprecision(0) << prob*pow(2,n) << endl
         << "prob24 = " << setw(24) << setprecision(24) << prob << endl;
    return 0;
}

Appendix B

//The C++ code for the proposed
//longest of runs test algorithm
//using the Fibonacci numbers:
#include <iostream>
#include <cmath>
#include <iomanip>
using namespace std;

double RunsFib(int n, int s)
{
    // validate before allocating: the array size n-s+1 must be positive,
    // and the recurrence below needs s >= 2
    if(s < 2 || s > n)
    {
        cout << "error" << endl;
        return -1;
    }
    int i;
    double* fib = new double[n-s+1];
    for(i = 0; i <= n-s; ++i)
        fib[i] = 0;
    double p = 0;
    // initial sub-sequence 1, 1, 2, 4, ... (powers of 2)
    fib[0] = 1;
    if(n-s >= 1)
        fib[1] = 1;
    for(i = 2; i < s-1 && i < n-s+1; ++i)
        fib[i] = pow(2.0, i-1);
    // remaining terms: each is the sum of the s-1 preceding terms
    for(i = (s-1 > 2 ? s-1 : 2); i <= n-s; ++i)
        for(int j = 0; j < s-1; ++j)
            fib[i] += fib[i-j-1];
    // formula (8); the common factor 2 is applied after the loop
    for(i = 0; i <= n-s; ++i)
        p += fib[i]*pow(0.5, s+i);
    p *= 2;
    cout.setf(ios::fixed);
    cout << "n = " << n << " s = " << s << endl;
    cout << " prob24 = " << setprecision(24) << p << endl;
    delete [] fib;
    return p;
}

int main()
{
    int n = 8, s = 4;
    RunsFib(n,s);
    return 0;
}

References

[1] Siegel, S., Castellan, N.J., Jr., 1988, Nonparametric Statistics for the Behavioral Sciences, 2nd ed. (New York: McGraw-Hill).
[2] Sprent, P., 1993, Applied Nonparametric Statistical Methods, 2nd ed. (London: Chapman & Hall).
[3] Mood, A.M., 1940, "The Distribution Theory of Runs," Annals of Mathematical Statistics, 11, 367-392.
[4] Bradley, J.V., 1968, Distribution-Free Statistical Tests (Englewood Cliffs, New Jersey: Prentice Hall).
[5] Grinstead, C.M., Snell, J.L., 1997, Introduction to Probability, 2nd rev. ed. (American Mathematical Society).
[6] http://mathworld.wolfram.com/Fibonaccin-StepNumber.html
