On Words with Maximal Number of Distinct Subwords (preliminary draft)

Wojciech Rytter
Warsaw University, Warsaw, Poland

Abstract. We say that a string is factor-maximal iff it contains the largest number of distinct factors among strings of the same length over the same alphabet. We show a family of binary strings which are factor-maximal and which are closely related to de Bruijn words. One de Bruijn word of rank $k$ represents, in a compact way, exponentially many factor-maximal words. As a by-product we give a simplified linear-time construction of a factor-maximal string of a given length $n$.

We assume in this paper that the alphabet of the considered words is binary; this simplifies the presentation, and the extension to general finite alphabets is straightforward. Factor-maximal words are closely related to de Bruijn words and de Bruijn graphs. A de Bruijn word of rank $k$ is any word of length $2^k$ containing cyclically each word of length $k$ exactly once. A linear de Bruijn word of rank $k$ is a cyclic de Bruijn word concatenated with its prefix of size $k - 1$. Denote $\gamma_k = 2^k + k - 1$. The number $\gamma_k$ is the size of the linear version of a de Bruijn word of rank $k$. Each linear de Bruijn word is factor-maximal. Assume we have an integer $n$ such that $\gamma_k < n < \gamma_{k+1}$; our goal is to construct a factor-maximal binary word of size $n$.

Observation 1. A word $w$ of length $n$, where $\gamma_k < n < \gamma_{k+1}$, is factor-maximal iff it contains each word of length $k$ and each subword of length $k + 1$ occurs in $w$ at most once.

Example 1. The following word is a de Bruijn word of rank 4:

0000100110101111

Its linear version is a linear de Bruijn word of size $\gamma_4 = 19$:

0000100110101111000

Both these words are factor-maximal, since they contain all words of size 3 as factors and all their factors of size 4 are distinct.

A construction of a factor-maximal word of size $\gamma_k < n < \gamma_{k+1}$ has been given in [?]
using a rather complicated construction of simple cycles of size $r$ in the de Bruijn graph of rank $k$ for any $1 \le r \le 2^k$; this construction was given earlier in [?]. In this paper we only need to find a Hamiltonian cycle, which is much easier.

The de Bruijn graph of rank $k$ is $G_k = (V_k, E_k)$, where $V_k = \{0, 1\}^k$. The edges are of the form

$$c \cdot \alpha \xrightarrow{\;d\;} \alpha \cdot d, \qquad c, d \in \{0, 1\},\ \alpha \in \{0, 1\}^{k-1}.$$

The label of each such edge is the symbol $d$ appended to $\alpha$. An example of a de Bruijn graph of rank 4 is shown in Figure 2, where the binary words corresponding to nodes are converted to numbers. When interpreting nodes as numbers we have the edges

$$i \xrightarrow{\;0\;} 2i \bmod 2^k, \qquad i \xrightarrow{\;1\;} (2i + 1) \bmod 2^k.$$

A path (not necessarily simple) is a chain if all its edges are distinct. It is a cyclic chain if its first and last vertices are the same. Denote by $val(\pi)$ the sequence of labels of the edges of the chain $\pi$. Then each cyclic de Bruijn word of rank $k + 1$ equals $val(\pi)$ for some Eulerian cycle $\pi$ of the graph $G_k$, and each cyclic de Bruijn word of rank $k$ equals $val(C)$ for some Hamiltonian cycle $C$ of $G_k$.

Lemma 1. (a) If each vertex of $G_k$ has an occurrence on the chain $\pi$ at distance at least $k$ from the starting vertex, then $val(\pi)$ is factor-maximal. (b) If additionally $\pi$ is cyclic, then after appending to the end of $\pi$ its prefix of length at most $k$ we also obtain a factor-maximal word.

Definition of $deBruijn(C, v)$, where $C$ is a Hamiltonian cycle of $G_k$.

Figure 1: The cyclic structure of $G_k$. The big cycle is a Hamiltonian cycle $C = (y_1 \cdot y_2 \cdot y_3 \cdot y_4 \cdot y_5)$. The other (outer) cycles result from removing $C$ from the graph. The $x_i$'s are the values (chain labels) of the outer cycles.

We start in the starting node $v$ of $C$ and traverse the graph; the edges which are not in $C$ have priority. We receive the word $deBruijn(C, v)$. The word $x_1$ corresponds to the largest outer cycle. We obtain an Euler cycle by starting with the largest cycle and then traversing $C$ and the consecutive outer cycles. The edge labels of such an Euler cycle form the word

$$x_1\, y_1\, x_2\, y_2\, x_3\, y_3\, x_4\, y_4\, x_5\, y_5.$$

Example.
Take the de Bruijn graph $G_4$, see Figure 2; it has 16 nodes and 32 edges.

Figure 2: The graph $G_4$ can be decomposed into one Hamiltonian cycle and 5 edge-disjoint cycles $C_1, C_2, C_3, C_4, C_5$; the sizes of their labels are $|x_1| = 8$, $|x_2| = 1$, $|x_3| = 4$, $|x_4| = 1$, $|x_5| = 2$.

There is a Hamiltonian cycle

C = (8, 0, 1, 3, 7, 15, 14, 12, 9, 2, 5, 11, 6, 13, 10, 4, 8).

After removing this cycle we have 5 disjoint cycles:

[8, 1, 2, 4, 9, 3, 6, 12, 8] [0, 0] [7, 14, 13, 11, 7] [15, 15] [5, 10, 5].

The total structure of the Euler cycle induced by $C$ looks as follows:

[8, 1, 2, 4, 9, 3, 6, 12, 8] [0, 0] 1, 3, [7, 14, 13, 11, 7] [15, 15] 14, 12, 9, 2, [5, 10, 5] 11, 6, 13, 10, 4, 8.

Taking the labels of consecutive edges, it yields a de Bruijn sequence

[10011000] 0 [0] 111 [0111] 1 [1] 00101 [01] 101000

It can be written as $DB_k = x_1 y_1 x_2 y_2 x_3 y_3 x_4 y_4 x_5 y_5$, where:

x1 = 10011000, y1 = 0
x2 = 0, y2 = 111
x3 = 0111, y3 = 1
x4 = 1, y4 = 00101
x5 = 01, y5 = 101000

The sequence $DB_{k-1} = y_1 y_2 y_3 y_4 y_5$ is a de Bruijn sequence of smaller rank. We append the first four symbols of $DB_k$ (1001) to the end and create the sequence

$\alpha_k$ = [10011000] 0 [0] 111 [0111] 1 [1] 00101 [01] 101000 1001

Then for each $2^k < n \le 2^{k+1} + k$ we can create a factor-maximal word of size $n$ by concatenating a subword of $DB_k$ with a suffix of $DB_{k-1}$ and a prefix of $DB_k$ of length at most $k$.

Theorem 1. For each $k$ there is a pair of twin de Bruijn words $w_k$ of length $2^k$ and $w_{k+1}$ of length $2^{k+1}$ such that for any $\gamma_k \le n < \gamma_{k+1}$ there is a factor-maximal subword of $w_k w_{k+1}$ of length $n$. Specifically, $\mathrm{Suf}(n - p, w_k) \cdot \mathrm{Pref}(p, w_{k+1})$ is factor-maximal for a parameter $p = p(n)$. The words $w_k$, $w_{k+1}$ and the parameter $p$ can be computed in linear time.

Proof. Let us consider the decomposition of the corresponding de Bruijn word

$$w = x_1 \cdot y_1 \cdot x_2 \cdot y_2 \cdots x_r \cdot y_r. \qquad (1)$$

Let us distinguish the positions corresponding to elements of the $x_i$.
We compute the size $p$ of the shortest prefix of $w_{k+1}$ containing $n - \gamma_k$ distinguished positions. Then we obtain a factor-maximal word of size $n$ as

$$\mathrm{MaxFactor}(n) = \mathrm{Suf}(n - p, w_k) \cdot \mathrm{Pref}(p, w_{k+1}).$$

Observation 2. The word $w_k$ and its decomposition are the same for all $\gamma_k < n < \gamma_{k+1}$ (they represent exponentially many factor-maximal words).
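
The traversal used in the $G_4$ example (edges outside $C$ have priority) can be replayed mechanically. The Python sketch below is illustrative only: the names `euler_word` and `is_cyclic_de_bruijn` are hypothetical and do not come from the paper, and the code assumes the numeric edge convention $i \to 2i \bmod 2^k$, $i \to (2i+1) \bmod 2^k$, with the label of an edge taken to be the last bit of its target. Starting at the first vertex of $C$, it leaves each vertex by its non-$C$ edge on the first visit and by its $C$ edge on the second.

```python
def is_cyclic_de_bruijn(word, k):
    """True if every binary word of length k occurs exactly once cyclically."""
    if len(word) != 2 ** k:
        return False
    doubled = word + word[:k - 1]          # unroll the cyclic word
    windows = {doubled[i:i + k] for i in range(len(word))}
    return len(windows) == 2 ** k


def euler_word(cycle, k):
    """Edge labels of the Euler cycle of G_k induced by a Hamiltonian cycle.

    `cycle` is a closed vertex list (first vertex repeated at the end).
    Assumption of this sketch: edges are i -> 2i mod 2^k and
    i -> (2i + 1) mod 2^k, labelled by the last bit of the target.
    """
    size = 2 ** k
    assert cycle[0] == cycle[-1], "cycle must be closed"
    c_next = dict(zip(cycle, cycle[1:]))   # successor of each vertex along C
    assert sorted(c_next) == list(range(size)), "C must visit every vertex once"
    for u, v in c_next.items():
        assert v in ((2 * u) % size, (2 * u + 1) % size), "not an edge of G_k"

    outer_done = set()                     # vertices whose non-C edge is used
    labels = []
    u = cycle[0]
    for _ in range(2 * size):              # G_k has 2^(k+1) edges
        if u not in outer_done:            # non-C edge has priority
            outer_done.add(u)
            a, b = (2 * u) % size, (2 * u + 1) % size
            v = b if c_next[u] == a else a
        else:                              # second visit: follow C
            v = c_next[u]
        labels.append(str(v % 2))
        u = v
    return "".join(labels)


C = [8, 0, 1, 3, 7, 15, 14, 12, 9, 2, 5, 11, 6, 13, 10, 4, 8]
w = euler_word(C, 4)
print(w)                          # 10011000001110111110010101101000
print(is_cyclic_de_bruijn(w, 5))  # True
```

For the cycle $C$ of the example the output is the 32-symbol word $x_1 y_1 \ldots x_5 y_5$ listed there, and all of its cyclic windows of length 5 are distinct.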

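
The criterion of Observation 1 is easy to test on concrete words. The following Python sketch is a hedged illustration, not part of the paper's construction: the function names are hypothetical, the factor count is a brute-force quadratic enumeration, and the check also accepts the boundary case $n = \gamma_k$ (a linear de Bruijn word), which is an assumption of this sketch.

```python
def gamma(k):
    """Length of a linear de Bruijn word of rank k: 2^k + k - 1."""
    return 2 ** k + k - 1


def count_distinct_factors(w):
    """Number of distinct non-empty factors of w (brute force)."""
    return len({w[i:j] for i in range(len(w))
                for j in range(i + 1, len(w) + 1)})


def satisfies_observation_1(w):
    """Check the criterion of Observation 1 for a binary word w, len(w) >= 2.

    Finds k with gamma(k) <= len(w) < gamma(k + 1), then tests that all
    2^k words of length k occur and no factor of length k + 1 repeats.
    """
    n = len(w)
    k = 1
    while gamma(k + 1) <= n:
        k += 1
    factors_k = {w[i:i + k] for i in range(n - k + 1)}
    factors_k1 = [w[i:i + k + 1] for i in range(n - k)]
    return (len(factors_k) == 2 ** k
            and len(set(factors_k1)) == len(factors_k1))


linear = "0000100110101111000"            # linear de Bruijn word of Example 1
print(count_distinct_factors(linear))     # 150
print(satisfies_observation_1(linear))    # True
```

The value 150 agrees with the elementary upper bound obtained by summing $\min(2^m, n - m + 1)$ over all factor lengths $m$ for $n = 19$.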