Boosting Textual Compression in
Optimal Linear Time
Disclaimer
The author of this presentation, henceforth
referred to as “The Author”, should not be
held accountable for any mental illness,
confusion, disorientation, or general lack of
will to live caused, directly or indirectly, by
prolonged exposure to this material.
Introduction
A boosting technique, in very informal terms, can be
seen as a method that, when applied to a particular
class of algorithms, yields improved algorithms in
terms of one or more parameters characterizing their
performance in the class.
General boosting techniques have a deep
significance for Computer Science. Using such
techniques, one can, informally, take a good
algorithm and, applying the boosting technique on it,
get a very high-quality algorithm, again in terms of
the parameters characterizing the nature of the problem.
Introduction (cont)
Over the past weeks, I am sure, we have all become
convinced of the importance of textual compression to our
field of study.
If so, we would like to come up with a boosting
technique to improve existing textual compression
algorithms, while sustaining the smallest possible
loss in the algorithm’s asymptotic time and space
complexity.
In general, such efficient boosting techniques are
very hard to come by. In this class I will present one
such boosting technique for improving textual
compression algorithms.
Presentation Outline
• For a change, this presentation will begin with the
results of the boosting technique. Only then will I
elaborate further.
• As with all previous presentations, I will have to
introduce many new definitions, and repeat a few that
we have already seen. It's not going to be easy, so bear
with me.
• Once the new definitions are all clear, we will see the
pseudocode for the boosting technique. Assuming
that the definitions are indeed clear, the technique
itself is quite straightforward.
• To conclude this presentation, I will show some
remaining open problems.

Statement of Results
Let s be a string over a finite alphabet Σ. Let H_k(s), for k ≥ 0,
denote the k-th order empirical entropy of s, and let H*_k(s)
be the k-th order modified empirical entropy of s, both of which will be
defined soon enough. Also, let us recall the Burrows-Wheeler Transform
that, given a string s, computes a permutation of that string, hereby denoted
BWT(s).
Let us consider a compression algorithm A that compresses any string z# in at most
λ|z|H₀(z) + η|z| + μ bits, where λ, η and μ are constants independent of z, and #
is a special symbol not appearing elsewhere in z. A general outline of the
boosting technique will be shown in the next slide.
Statement of Results (cont)
Here are the three major steps of the technique:
1. Compute ŝ = BWT(s^R).
2. Using the suffix tree of s^R, greedily partition ŝ so that a
suitably defined objective function is minimized.
3. Compress each substring of the partition, separately,
using algorithm A.
Statement of Results (cont)
We will show that for any k ≥ 0, the length in bits of the
string resulting from the boosting is bounded by:
λ|s|H_k(s) + log₂|s| + η|s| + g_k
where g_k is a constant depending only on k and Σ.
If we rely on the stronger assumption that A
compresses every string z# in at most
λ|z|H*₀(z) + μ
bits, then the following improved bound can be
achieved:
λ|s|H*_k(s) + log₂|s| + g_k
Definitions
Let s be a string over the alphabet Σ = {a₁, ..., a_h}
and, for each a_i ∈ Σ, let n_i be the number of
occurrences of a_i in s. We will assume that each n_i ≥ 1.
The zeroth order empirical entropy of s is:
H₀(s) = −Σ_{i=1}^{h} (n_i/|s|) log(n_i/|s|)
For any string w, we denote by w_s the string of single
symbols following the occurrences of w in s, taken
from left to right. For example, if s = mississippi and w = si then
w_s = sp. We define the k-th order entropy as:
H_k(s) = (1/|s|) Σ_{w ∈ Σ^k} |w_s| H₀(w_s)
Definitions (cont)
Now, we shall define the zeroth order modified empirical
entropy:
H*₀(s) =
  0                          if |s| = 0
  (1 + ⌊log|s|⌋)/|s|         if |s| ≠ 0 and H₀(s) = 0
  H₀(s)                      otherwise.
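The three-case definition just given, as a sketch (again repeating H₀ so the snippet stands alone; log base 2 assumed):

```python
import math
from collections import Counter

def H0(s: str) -> float:
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())

def H0_star(s: str) -> float:
    """Modified zeroth order empirical entropy, case by case."""
    n = len(s)
    if n == 0:
        return 0.0
    h = H0(s)
    if h == 0.0:                 # |s| > 0 but only one distinct symbol
        return (1 + math.floor(math.log2(n))) / n
    return h
```

The middle case guarantees a nonzero cost for compressible-but-nonempty strings: H0_star("aaaa") is (1 + 2)/4 = 0.75, while H0("aaaa") is 0.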
To define the k-th order modified empirical entropy, I will
introduce the notion of suffix cover:
we say that a set S_k of strings of length at most k
is a suffix cover of Σ^k, and write S_k ⊑ Σ^k, if every string in Σ^k
has a unique suffix in S_k.
For example, if Σ = {a, b} and k = 3 then both {a, b}
and {a, ab, abb, bbb} are suffix covers for Σ³.
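A brute-force check of the suffix-cover condition (the helper name is my invention), which confirms both example covers:

```python
from itertools import product

def is_suffix_cover(S, sigma, k):
    """True iff every string in Sigma^k has exactly one suffix in S."""
    for t in product(sigma, repeat=k):
        w = "".join(t)
        if sum(1 for x in S if w.endswith(x)) != 1:
            return False
    return True
```

Both `is_suffix_cover({"a", "b"}, "ab", 3)` and `is_suffix_cover({"a", "ab", "abb", "bbb"}, "ab", 3)` return True; dropping any element breaks the unique-suffix property.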
Definitions (cont)
We now define, for every suffix cover S_k ⊑ Σ^k:
H*_{S_k}(s) = (1/|s|) Σ_{w ∈ S_k} |w_s| H*₀(w_s)
Now we can finally define the k-th order modified
empirical entropy of s:
H*_k(s) = min_{S_k ⊑ Σ^k} H*_{S_k}(s) = H*_{S*_k}(s)
for some optimal suffix cover S*_k.
Definitions (cont)
Three more notations also worthy of mentioning, but
briefly, are H̄*_k(s), w̄_s and H̄_k(s), where w̄_s is the string of
single characters preceding the occurrences of w in s,
from left to right, H̄*_k(s) = H*_k(s^R) and H̄_k(s) = H_k(s^R).
I also wish to introduce the notion of prefix cover,
which is equivalent to the notion of suffix cover, just
with prefixes instead of suffixes. That is, P_k
is a prefix cover of Σ^k if every string in Σ^k has a unique
prefix in P_k.
Definitions (cont)
Let us recall that BWT(s) constructs a matrix whose
rows are the cyclic shifts of s$, sorted in lexicographical
order, and returns the last column of that matrix.
Let w be a substring of s. Then, by the matrix's
construction, all of the rows prefixed by w are
consecutive (because the matrix is sorted in
lexicographical order). This means that the single
symbols preceding every occurrence of w in s are
grouped together in a set of consecutive positions of
the string ŝ = BWT(s). We denote this substring ŝ[w].
It is easy to see that ŝ[w] is a permutation of w̄_s.
Definitions (cont)
Example: BWT matrix for the string t = mississippi.
Let w = s. The four occurrences of s in t are in the
last four rows of the BWT matrix. Then t̂[s] = ssii
and that is indeed a permutation of w̄_t = isis.
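A naive quadratic BWT sketch (fine for examples, far too slow for real inputs; the function names are mine) reproduces the t̂[s] = ssii example:

```python
def bwt_rows(s: str, end: str = "$"):
    """Lexicographically sorted cyclic shifts of s+end."""
    t = s + end
    return sorted(t[i:] + t[:i] for i in range(len(t)))

def bwt(s: str, end: str = "$") -> str:
    """Last column of the sorted cyclic-shift matrix."""
    return "".join(row[-1] for row in bwt_rows(s, end))

def bwt_substring(s: str, w: str, end: str = "$") -> str:
    """s_hat[w]: last-column symbols of the rows prefixed by w."""
    return "".join(row[-1] for row in bwt_rows(s, end) if row.startswith(w))
```

Here `bwt("mississippi")` gives `"ipssm$pissii"`, `bwt_substring("mississippi", "s")` gives `"ssii"` (a permutation of isis), and `bwt_substring("mississippi", "i")` gives `"pssm"`, the string appearing again in the later suffix-tree example.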
Definitions (cont)
Let T be the suffix tree of the string s$. We assume
that the suffix tree edges are sorted lexicographically.
If so, then the i-th leaf (counting from the left) of the
suffix tree corresponds to the i-th row of the BWT
matrix. We associate the i-th leaf of T with the i-th
symbol of the string ŝ. I'll denote the i-th leaf of T by
ℓ_i and the symbol associated with it by ŝ_i.
By definition, ŝ = ŝ₁ ⋯ ŝ_{|s|+1}.
Definitions (cont)
Let w be a substring of s. The locus of w, denoted
ℓ[w], is the node of T that has associated the shortest
string prefixed by w.
Definitions (cont)
Example: suffix tree for the string s = mississippi$.
The locus of the substrings ss and ssi is the node
reachable by the path labelled by ssi.
Definitions (cont)
Another very important notion I would like to
introduce is that of the leaf cover. Given a suffix tree
T, we say that a subset L of its nodes is a leaf cover if
every leaf of the suffix tree has a unique ancestor in
L.
For every node u of T we will denote by ŝ⟨u⟩ the
substring of ŝ obtained by concatenating, from left to right, the
symbols associated with the leaves descending from
node u. For example, in the suffix tree from the
previous slide, ŝ⟨ℓ[i]⟩ = pssm.
Definitions (cont)
Note that these symbols are exactly the single
symbols preceding i in mississippi$. That is, for any
string w we have ŝ⟨ℓ[w]⟩ = ŝ[w].
Definitions (cont)
A key observation in this article is the natural relation
between leaf covers and prefix covers.
Let P*_k = {w₁, ..., w_p} be the optimal prefix cover defining
H̄*_k(s), and let Π_k be the set of nodes {ℓ[w₁], ..., ℓ[w_p]}.
Since P*_k is a prefix cover of Σ^k, we get that every leaf
of T corresponding to a suffix of length greater than k
has a unique ancestor in Π_k. On the other hand,
leaves of T corresponding to suffixes of length
smaller than k might not have an ancestor in Π_k.
We would like to enhance Π_k in a way that will make
it a leaf cover of T.
Definitions (cont)
We will denote by Q_k the set of leaves corresponding
to suffixes of s$ of length at most k which are not
prefixed by a string in P*_k. We set L*_k = Π_k ∪ Q_k.
|Q_k| ≤ k because s$ has at most k suffixes of length
at most k.
This relation is exploited next.
Definitions (cont)
The Cost of a Leaf Cover:
Let C denote the function which associates to every
string x over Σ ∪ {$}, with at most one occurrence of $,
the positive real value
C(x) = λ|x′| H*₀(x′) + μ
where λ, μ are constants and x′ is the string x with the
symbol $ removed, if it was present.
We will now define the value of C for a leaf cover L:
C(L) = Σ_{u ∈ L} C(ŝ⟨u⟩)
Definitions (cont)
In this section, I only have the following lemma left to
prove:
For any given k ≥ 0 there exists a constant g_k
such that, for any string s:
C(L*_k) ≤ λ|s| H̄*_k(s) + g_k
The next three slides detail the proof of the lemma.
Definitions (cont)
Let us recall that L*_k = Π_k ∪ Q_k and that, by definition,
Π_k ∩ Q_k = ∅.
If so, then the following equation obviously holds:
C(L*_k) = Σ_{u ∈ L*_k} C(ŝ⟨u⟩) = Σ_{u ∈ Π_k} C(ŝ⟨u⟩) + Σ_{u ∈ Q_k} C(ŝ⟨u⟩)
Call the first summation on the right-hand side (1) and the second (2).
Observe that every u ∈ Q_k is a leaf of T. By the
definition of C we get that, for every u ∈ Q_k:
C(ŝ⟨u⟩) = λ|ŝ⟨u⟩′| H*₀(ŝ⟨u⟩′) + μ ≤ λ + μ
since ŝ⟨u⟩ is a single symbol, so |ŝ⟨u⟩′| ≤ 1 and H*₀(ŝ⟨u⟩′) ≤ 1.
Definitions (cont)
Also, recall that |Q_k| ≤ k. Combined, we get that
summation (2) is bounded by k(λ + μ).
For us to evaluate summation (1), recall that
every u ∈ Π_k is the locus of a string w ∈ P*_k.
By the relation between the suffix tree and the BWT
matrix we have that ŝ⟨u⟩ = ŝ[w]. Also, |P*_k| ≤ |Σ|^k.
Then we get:
Σ_{u ∈ Π_k} C(ŝ⟨u⟩) = Σ_{w ∈ P*_k} C(ŝ[w]) = Σ_{w ∈ P*_k} ( λ|ŝ[w]| H*₀(ŝ[w]) + μ )
Definitions (cont)
For the last step, recall that ŝ[w] is a permutation of
w̄_s and therefore H*₀(ŝ[w]) = H*₀(w̄_s) and, obviously,
|ŝ[w]| = |w̄_s|.
Finally, we get:
C(L*_k) ≤ ( λ Σ_{w ∈ P*_k} |w̄_s| H*₀(w̄_s) + |P*_k| μ ) + k(λ + μ) ≤ λ|s| H̄*_k(s) + g_k
where g_k = |Σ|^k μ + k(λ + μ), using that P*_k is the optimal
prefix cover defining H̄*_k(s).
Computing the Optimal Leaf Cover
Now that we are finally done with all of the required
definitions, we can get down to business.
Perhaps the most important aspect of this boosting
technique is that the optimal leaf cover can be
computed in time linear in |s|.
In the following slides I will present an algorithm that
computes that optimal leaf cover in linear time, and
prove its correctness and time complexity.
Computing the Optimal Leaf Cover (cont)
Before I show the actual algorithm, I will prove the
following lemma:
An optimal leaf cover for the subtree rooted at u
consists of either the single node u, or of the union of
optimal leaf covers of the subtrees rooted at the
children of u in T.
Computing the Optimal Leaf Cover (cont)
Proof: Let Lmin (u) denote the optimal leaf cover for the
subtree of T rooted at u.
If u is a leaf then the result obviously holds. We
assume then that u is an internal node and that
u1 ,..., uc are its children.
It's obvious that {u} and ∪_{i=1}^{c} L_min(u_i) are both leaf
covers of the subtree rooted at u.
I will show that one of them is optimal.
Computing the Optimal Leaf Cover (cont)
Let's assume that L_min(u) ≠ {u}. We can then say that
L_min(u) = ∪_{i=1}^{c} L(u_i), where each L(u_i) is a leaf cover (not
necessarily the optimal one) for the subtree rooted at
u_i. Then the following holds:
C(L_min(u)) = Σ_{i=1}^{c} C(L(u_i)) ≥ Σ_{i=1}^{c} C(L_min(u_i))
Computing the Optimal Leaf Cover (cont)
Since the cost of the optimal leaf cover is smaller than or
equal to that of any other leaf cover, we get that:
C(L_min(u)) = Σ_{i=1}^{c} C(L_min(u_i))
which means that the union of the optimal leaf
covers of the subtrees rooted at the children of u is
indeed an optimal leaf cover for the subtree rooted at u.
Computing the Optimal Leaf Cover (cont)
The following algorithm computes the optimal leaf
cover in linear time:
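The slide's algorithm figure is not reproduced in this transcript. The following Python sketch implements the recursion the lemma justifies: at each node, compare the cost of covering with {u} against the summed cost of the children's optimal covers. The tree encoding, the constants λ = μ = 1, and all function names are illustrative assumptions, not the paper's code:

```python
import math
from collections import Counter

LAM, MU = 1.0, 1.0          # illustrative values for the constants of C

def H0_star_from_counts(counts):
    """Modified zeroth order entropy computed from a symbol-count table."""
    n = sum(counts.values())
    if n == 0:
        return 0.0
    h = -sum((c / n) * math.log2(c / n) for c in counts.values() if c)
    return (1 + math.floor(math.log2(n))) / n if h == 0.0 else h

def cost(counts):
    """C(x) = LAM*|x'|*H0_star(x') + MU, with '$' removed from x."""
    counts = Counter({a: c for a, c in counts.items() if a != "$"})
    return LAM * sum(counts.values()) * H0_star_from_counts(counts) + MU

def optimal_leaf_cover(children, leaf_symbol, u):
    """children: node -> list of child nodes; leaf_symbol: leaf -> its BWT symbol.
    Returns (symbol counts of s_hat<u>, optimal cost, optimal cover)."""
    if not children[u]:                       # u is a leaf
        counts = Counter(leaf_symbol[u])
        return counts, cost(counts), [u]
    counts, child_cost, cover = Counter(), 0.0, []
    for v in children[u]:
        cv, costv, coverv = optimal_leaf_cover(children, leaf_symbol, v)
        counts += cv                          # occurrence counts add up bottom-up
        child_cost += costv
        cover += coverv
    if cost(counts) <= child_cost:            # lemma: {u} vs union of children
        return counts, cost(counts), [u]
    return counts, child_cost, cover
```

For instance, a root with two leaves both labelled a: each leaf costs λ·1·1 + μ = 2, the root costs λ·2·1 + μ = 3 < 4, so the returned cover is the root alone.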
The algorithm’s correctness follows immediately from
the previous lemma. I will show that it runs in O(|s|)
time.
Computing the Optimal Leaf Cover (cont)
The only nontrivial operation in the algorithm is the
calculation of C(ŝ⟨u⟩) at each step.
To do that, we have to know the number of
occurrences of each symbol of the alphabet in the
string ŝ⟨u⟩ (because in order to calculate the cost of a
string, we have to calculate H*₀(ŝ⟨u⟩)).
Doing this is possible in constant time for each node,
because if u is a leaf then each symbol in the
alphabet appears either once or never in ŝ⟨u⟩.
Computing the Optimal Leaf Cover (cont)
If u is not a leaf, then the number of occurrences of
each symbol in ŝ⟨u⟩ is the sum of the numbers of its
occurrences in ŝ⟨u₁⟩, ..., ŝ⟨u_c⟩, where u₁, ..., u_c are the children
of u (recall that ŝ⟨u⟩ is the concatenation of
ŝ⟨u₁⟩, ..., ŝ⟨u_c⟩).
Now we are finally ready to see the actual algorithm
describing the boosting technique.
The Boosting Technique
The following algorithm describes the technique:
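The transcript omits the algorithm figure here as well. As an end-to-end illustration only: the sketch below follows the three stated steps, but takes the partition of ŝ as given instead of computing the optimal leaf cover, and uses zlib as a stand-in for A (zlib does not satisfy the entropy property stated next; every name here is an assumption of mine):

```python
import zlib

def bwt(s: str, end: str = "$") -> str:
    """Last column of the lexicographically sorted cyclic shifts of s+end."""
    t = s + end
    return "".join(row[-1] for row in sorted(t[i:] + t[:i] for i in range(len(t))))

def boost(s: str, piece_lengths, A=zlib.compress):
    """Step 1: s_hat = BWT(s^R).  Step 2: cut s_hat along the given
    partition.  Step 3: compress each piece separately with A."""
    s_hat = bwt(s[::-1])
    assert sum(piece_lengths) == len(s_hat), "partition must cover s_hat"
    pieces, i = [], 0
    for n in piece_lengths:
        pieces.append(s_hat[i:i + n])
        i += n
    return [A(p.encode()) for p in pieces]
```

Decompressing and concatenating the returned pieces recovers BWT(s^R) exactly, from which s itself is recoverable by inverting the transform.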
The Boosting Technique
First, any compression algorithm we wish to use
the boosting technique on has to satisfy the following
property:
A is a compression algorithm such that, given an
input string x ∈ Σ*, A first appends an end-of-string
symbol # to x and then compresses x# with the
following space and time bounds:
1. A compresses x# in at most λ|x| H*₀(x) + μ bits.
2. The running time of A is T(|x|) and its working space is S(|x|),
where T is convex and S is non-decreasing.
The Boosting Technique
The boosting algorithm can be used on any algorithm
satisfying the previous property to boost its
compression up to the k-th order entropy, for any k,
without any asymptotic loss in time efficiency and
with a slightly larger working space complexity.
The Boosting Technique
Theorem: Given a compression algorithm A that
satisfies the aforementioned property, our boosting
technique yields the following results:
1. If applied to s, it compresses it within
λ|s| H̄*_k(s) + log|s| + g_k bits, for any k.
2. If applied to s^R, it compresses it within
λ|s| H*_k(s) + log|s| + g_k bits, for any k.