On Words with Maximal Number of Distinct Subwords
(preliminary draft)
Wojciech Rytter
Warsaw University, Warsaw, Poland
Abstract
We say that a string is factor-maximal iff it contains the largest number of distinct factors
among strings of the same length and over the same alphabet. We show a family of binary strings
which are factor-maximal and which are closely related to de Bruijn words. One de Bruijn word
of rank k represents, in a compact way, exponentially many factor-maximal words. As a by-product,
we give a simplified linear-time construction of a factor-maximal string of a given length n.
We assume in this paper that the alphabet of the considered words is binary; this simplifies the
presentation, and the extension to general finite alphabets is straightforward. Factor-maximal words
are closely related to de Bruijn words and de Bruijn graphs. A de Bruijn word of rank k is any word
of length $2^k$ that contains, cyclically, each word of length k exactly once as a subword. A linear
de Bruijn word of rank k is a cyclic de Bruijn word concatenated with its prefix of size k − 1.
Denote $\gamma_k = 2^k + k - 1$. The number $\gamma_k$ is the size of the linear version of a de Bruijn
word of rank k. Each linear de Bruijn word is factor-maximal.
Assume we have an integer n such that $\gamma_k < n < \gamma_{k+1}$; our goal is to construct a
factor-maximal binary word of size n.
Observation 1. A word w of length n, where $\gamma_k < n < \gamma_{k+1}$, is factor-maximal iff it
contains each word of length k and each subword of length k + 1 occurs in w at most once.
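The criterion of Observation 1 can be checked mechanically on small inputs. The following is a minimal sketch, not part of the paper: the function names are hypothetical, and the brute-force search over all binary words is only feasible for very small n. It compares the criterion with a direct count of distinct factors.

```python
from itertools import product


def distinct_factors(w: str) -> int:
    """Number of distinct non-empty factors (substrings) of w."""
    return len({w[i:j] for i in range(len(w)) for j in range(i + 1, len(w) + 1)})


def satisfies_observation_1(w: str, k: int) -> bool:
    """Criterion of Observation 1 for a binary word w and rank k."""
    windows_k = {w[i:i + k] for i in range(len(w) - k + 1)}
    windows_k1 = [w[i:i + k + 1] for i in range(len(w) - k)]
    # All 2^k words of length k occur, and no word of length k + 1 repeats.
    return len(windows_k) == 2 ** k and len(set(windows_k1)) == len(windows_k1)


def brute_force_max(n: int) -> int:
    """Largest number of distinct factors over all binary words of length n."""
    return max(distinct_factors("".join(bits)) for bits in product("01", repeat=n))


# Example: n = 7 lies strictly between gamma_2 = 5 and gamma_3 = 10, so k = 2.
n, k = 7, 2
best = brute_force_max(n)
agree = all(
    satisfies_observation_1("".join(bits), k) == (distinct_factors("".join(bits)) == best)
    for bits in product("01", repeat=n)
)
print(agree)
```

The choice n = 7 is only an illustration; any n with $\gamma_k < n < \gamma_{k+1}$ that is small enough for exhaustive search can be substituted.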
Example 1.
The following word is a de Bruijn word of rank 4:
0000100110101111
Its linear version is a linear de Bruijn word of size $\gamma_4 = 19$:
0000100110101111000
Both these words are factor-maximal, since they contain all words of size 3 as factors and all their
factors of size 4 are distinct.
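The claims of Example 1 can be verified with a few lines of code. This is an illustrative sketch; the helper names are assumptions introduced here and do not come from the paper.

```python
def windows(w: str, k: int) -> list[str]:
    """All factors of length k of w, in order of occurrence."""
    return [w[i:i + k] for i in range(len(w) - k + 1)]


def linear_version(w: str, k: int) -> str:
    """Cyclic word w followed by its prefix of size k - 1."""
    return w + w[:k - 1]


def is_cyclic_de_bruijn(w: str, k: int) -> bool:
    """True if w has length 2^k and contains cyclically every word of length k."""
    return len(w) == 2 ** k and len(set(windows(linear_version(w, k), k))) == 2 ** k


w = "0000100110101111"
lin = linear_version(w, 4)
print(is_cyclic_de_bruijn(w, 4))  # True
print(lin, len(lin))              # 0000100110101111000 19

for word in (w, lin):
    has_all_words_of_size_3 = len(set(windows(word, 3))) == 8
    factors_of_size_4_distinct = len(set(windows(word, 4))) == len(windows(word, 4))
    print(has_all_words_of_size_3, factors_of_size_4_distinct)  # True True
```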
A construction of a factor-maximal word of size n, $\gamma_k < n < \gamma_{k+1}$, has been given in [?]
using a rather complicated construction of simple cycles of size r in the de Bruijn graph of rank k
for any $1 \le r \le 2^k$; this construction was earlier given in [?]. In this paper we only need to
find a Hamiltonian cycle, which is much easier.
The de Bruijn graph of rank k is $G_k = (V_k, E_k)$, where $V_k = \{0, 1\}^k$. The edges are of the form:
$$c \cdot \alpha \xrightarrow{\;d\;} \alpha \cdot d, \qquad c, d \in \{0, 1\},\ \alpha \in \{0, 1\}^{k-1}.$$
The label of each such edge is the symbol d, appended to α. An example of a de Bruijn graph of rank
4 is shown in Figure 2, where binary words corresponding to nodes are converted to numbers. When
interpreting nodes as numbers we have the edges
$$i \xrightarrow{\;0\;} 2i \bmod 2^k, \qquad i \xrightarrow{\;1\;} (2i + 1) \bmod 2^k.$$
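The numeric form of the edges agrees with the word form (drop the first symbol, append d). A small sketch, with hypothetical helper names, that builds the out-edges of a node and checks this correspondence for rank 4:

```python
def successors(i: int, k: int) -> dict[int, int]:
    """Map each edge label d in {0, 1} to the target of that edge leaving node i in G_k."""
    return {d: (2 * i + d) % 2 ** k for d in (0, 1)}


def node_to_word(i: int, k: int) -> str:
    """Binary word of length k corresponding to node number i."""
    return format(i, f"0{k}b")


k = 4
edge_count = 0
for i in range(2 ** k):
    for d, target in successors(i, k).items():
        # Word form of the same edge: c.alpha --d--> alpha.d
        assert node_to_word(target, k) == node_to_word(i, k)[1:] + str(d)
        edge_count += 1
print(edge_count)  # 32
```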
A path (not necessarily simple) is a chain if all its edges are distinct. It is a cyclic-chain if the first
and the last vertex are the same.
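Denote by val(π) the sequence of labels of edges of the chain π. Then each cyclic de Bruijn word
of rank k + 1 equals val(π) for some Eulerian cycle π of the graph $G_k$, and each cyclic de Bruijn
word of rank k equals val(C) for some Hamiltonian cycle C of $G_k$.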
Lemma 1.
(a) If each vertex of $G_k$ occurs on the chain π at distance at least k from the starting vertex,
then val(π) is factor-maximal.
(b) If, additionally, π is cyclic, then appending to the end of π its prefix of length at most k
also yields a factor-maximal word.
Definition of deBruijn(C, v), where C is a Hamiltonian cycle of $G_k$.
Figure 1: The cyclic structure of $G_k$. The big cycle is a Hamiltonian cycle
$C = (y_1 \cdot y_2 \cdot y_3 \cdot y_4 \cdot y_5)$. The other (outer) cycles result from removing C
from the graph. The $x_i$'s are the values (chain labels) of the outer cycles. We start in the
starting node v of C and traverse the graph; edges that are not in C have priority. We obtain the
word deBruijn(C, v). The word $x_1$ corresponds to the largest outer cycle. We obtain an Euler cycle
by starting with the largest cycle and then traversing C and the consecutive outer cycles. The edge
labels of such an Euler cycle form the word $x_1 y_1 x_2 y_2 x_3 y_3 x_4 y_4 x_5 y_5$.
Example. Take the de Bruijn graph $G_4$ (see Figure 2); it has 16 nodes and 32 edges.
Figure 2: The de Bruijn graph $G_4$ can be decomposed into one Hamiltonian cycle and 5 edge-disjoint
cycles C1, C2, C3, C4, C5; the sizes of their labels are $|x_1| = 8$, $|x_2| = 1$, $|x_3| = 4$,
$|x_4| = 1$, $|x_5| = 2$.
There is a Hamiltonian cycle
C = (8, 0, 1, 3, 7, 15, 14, 12, 9, 2, 5, 11, 6, 13, 10, 4, 8).
After removing this cycle we have 5 disjoint cycles:
[8, 1, 2, 4, 9, 3, 6, 12, 8] [0, 0] [7, 14, 13, 11, 7] [15, 15] [5, 10, 5].
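The decomposition can be recomputed directly from C, since after removing the edges of C every node keeps exactly one outgoing edge. The sketch below is an illustration under that observation; `outer_cycles` is a hypothetical helper name, and the cycle is passed as a node list with the start node repeated at the end.

```python
def outer_cycles(cycle: list[int], k: int) -> list[list[int]]:
    """Cycles left in G_k after removing the edges of the Hamiltonian cycle."""
    size = 2 ** k
    c_next = dict(zip(cycle, cycle[1:]))  # successor of each node on C
    # The single out-edge of v that does not belong to C.
    outer = {
        v: next(t for t in ((2 * v) % size, (2 * v + 1) % size) if t != c_next[v])
        for v in c_next
    }
    seen: set[int] = set()
    result = []
    for start in cycle[:-1]:  # visit start nodes in the order they appear on C
        if start in seen:
            continue
        walk, v = [start], outer[start]
        seen.add(start)
        while v != start:
            walk.append(v)
            seen.add(v)
            v = outer[v]
        walk.append(start)
        result.append(walk)
    return result


C = [8, 0, 1, 3, 7, 15, 14, 12, 9, 2, 5, 11, 6, 13, 10, 4, 8]
for walk in outer_cycles(C, 4):
    print(walk)
```

For the cycle C of the example this gives five cycles with 8, 1, 4, 1 and 2 edges, in agreement with the sizes listed in Figure 2.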
The total structure of an Euler cycle induced by C looks as follows:
[8, 1, 2, 4, 9, 3, 6, 12, 8] [0, 0] 1, 3, [7, 14, 13, 11, 7] [15, 15] 14, 12, 9, 2, [5, 10, 5] 11, 6, 13, 10, 4, 8.
Taking the labels of consecutive edges, we obtain a de Bruijn sequence:
[10011000] 0 [0] 111 [0111] 1 [1] 00101 [01] 101000
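The traversal rule "edges which are not in C have priority" can be sketched as follows. This is an illustrative reading of the rule, not the paper's own code: on the first visit to a node the unused outer edge is taken, on the second visit the edge of C. The function name is hypothetical, and the label of an edge is recovered as the lowest bit of its target node.

```python
def euler_labels_from_hamiltonian(cycle: list[int], k: int) -> str:
    """Edge labels of the Euler cycle of G_k induced by a Hamiltonian cycle.

    `cycle` lists the nodes of C with the start node repeated at the end.
    """
    size = 2 ** k
    c_next = dict(zip(cycle, cycle[1:]))  # successor of each node on C
    outer_used: set[int] = set()          # nodes whose outer edge was already taken
    labels = []
    v = cycle[0]
    for _ in range(2 * size):             # G_k has 2^(k+1) edges
        targets = ((2 * v) % size, (2 * v + 1) % size)
        outer = next(t for t in targets if t != c_next[v])
        if v not in outer_used:
            outer_used.add(v)             # outer edge has priority on the first visit
            nxt = outer
        else:
            nxt = c_next[v]               # otherwise continue along C
        labels.append(str(nxt & 1))       # edge i -> 2i + d (mod 2^k) carries label d
        v = nxt
    return "".join(labels)


C = [8, 0, 1, 3, 7, 15, 14, 12, 9, 2, 5, 11, 6, 13, 10, 4, 8]
word = euler_labels_from_hamiltonian(C, 4)
print(word, len(word))
```

For the cycle C of the example the result has length 32 and coincides with the bracketed sequence above once the brackets and spaces are removed.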
With k = 4, it can be written as $DB_{k+1} = x_1 y_1 x_2 y_2 x_3 y_3 x_4 y_4 x_5 y_5$, where:
x1 = 10011000, y1 = 0, x2 = 0, y2 = 111, x3 = 0111, y3 = 1, x4 = 1, y4 = 00101, x5 = 01, y5 = 101000.
The sequence $DB_k = y_1 y_2 y_3 y_4 y_5$ is a de Bruijn sequence of smaller rank. We append the first
k symbols of $DB_{k+1}$ to its end and create the sequence:
$\alpha_k$ = [10011000] 0 [0] 111 [0111] 1 [1] 00101 [01] 101000 1001
Then for each $2^k < n \le 2^{k+1} + k$ we can create a factor-maximal word of size n by concatenating
a subword of $DB_{k+1}$ with a suffix of $DB_k$ and a prefix of $DB_{k+1}$ of length at most k.
Theorem 1.
For each k there is a pair of twin de Bruijn words $w_k$ of length $2^k$ and $w_{k+1}$ of length
$2^{k+1}$ such that for any $\gamma_k \le n < \gamma_{k+1}$ there is a factor-maximal subword of
$w_k w_{k+1}$ of length n.
Specifically, $\mathit{Suf}(n - p, w_k) \cdot \mathit{Pref}(p, w_{k+1})$ is factor-maximal for a parameter p = p(n).
The words $w_k$, $w_{k+1}$ and the parameter p can be computed in linear time.
Proof. Let us consider the corresponding decomposition of the de Bruijn word
$$w = x_1 \cdot y_1 \cdot x_2 \cdot y_2 \cdots x_r \cdot y_r. \tag{1}$$
Let us distinguish the positions corresponding to elements of the $x_i$.
We compute the size p of the shortest prefix of $w_{k+1}$ containing $n - \gamma_k$ distinguished
positions. Then we obtain a factor-maximal word of size n as
$$\mathit{MaxFactor}(n) = \mathit{Suf}(n - p, w_k) \cdot \mathit{Pref}(p, w_{k+1}).$$
Observation 2. The word $w_k$ and its decomposition are the same for all $\gamma_k < n < \gamma_{k+1}$
(they represent exponentially many factor-maximal words).