Boosting Textual Compression in
Optimal Linear Time
Disclaimer
The author of this presentation, henceforth
referred to as “The Author”, should not be
held accountable for any mental illness,
confusion, disorientation, or general lack of
will to live caused, directly or indirectly, by
prolonged exposure to this material.
Introduction
A boosting technique, in very informal terms, can be
seen as a method that, when applied to a particular
class of algorithms, yields improved algorithms in
terms of one or more parameters characterizing their
performance in the class.
General boosting techniques have a deep
significance for Computer Science. Using such
techniques, one can, informally, take a good
algorithm and, applying the boosting technique on it,
get a very high-quality algorithm, again in terms of
the parameters characterizing the nature of the problem.
Introduction (cont)
Over the past weeks, I am sure, we have all become
convinced of the importance of textual compression to our
field of study.
If so, we would like to come up with a boosting
technique to improve existing textual compression
algorithms, while sustaining the smallest possible
loss in the algorithm’s asymptotic time and space
complexity.
In general, such efficient boosting techniques are
very hard to come by. In this class I will present one
such boosting technique for improving textual
compression algorithms.
Presentation Outline
• For a change, this presentation will begin with the
results of the boosting technique. Only then will I
elaborate further.
• As with all previous presentations, I will have to
introduce many new definitions, and repeat a few that
we have already seen. It's not going to be easy, so bear
with me.
• Once the new definitions are all clear, we will see the
pseudocode for the boosting technique. Assuming
that the definitions are indeed clear, the technique
itself is quite straightforward.
• To conclude this presentation, I will show some
remaining open problems.

Statement of Results
Let s be a string over a finite alphabet Σ. Let H_k(s), for k ≥ 0,
denote the k-th order empirical entropy of s, and let H*_k(s)
be the k-th order modified empirical entropy of s, both of which will be
defined soon enough. Also, let us recall the Burrows-Wheeler Transform
that, given a string s, computes a permutation of that string, hereby denoted
BWT(s).
Let us consider a compression algorithm A that compresses any string z# in at most
λ|z|H₀(z) + η|z| + μ bits, where λ, η and μ are constants independent of z, and #
is a special symbol not appearing elsewhere in z. A general outline of the
boosting technique will be shown in the next slide.
Statement of Results (cont)
Here are the three major steps of the technique:
1. Compute ŝ = BWT(s^R).
2. Using the suffix tree of s^R, greedily partition ŝ so that a
suitably defined objective function is minimized.
3. Compress each substring of the partition, separately,
using algorithm A.
Statement of Results (cont)
We will show that for any k ≥ 0, the length in bits of the
string resulting from the boosting is bounded by:
λ|s|H_k(s) + log₂|s| + η|s| + g_k
where g_k is a constant depending only on k and Σ.
If we rely on the stronger assumption that A
compresses every string z# in at most
λ|z|H*₀(z) + μ
bits, then the following improved bound can be
achieved:
λ|s|H*_k(s) + log₂|s| + g_k
Definitions
Let s be a string over the alphabet Σ = {a₁, ..., a_h}
and, for each a_i ∈ Σ, let n_i be the number of
occurrences of a_i in s. We will assume that each n_i ≥ 1.
The zeroth order empirical entropy of s is:
H₀(s) = −Σ_{i=1}^{h} (n_i/|s|) log(n_i/|s|)
For any string w, we denote by w_s the string of single
symbols following the occurrences of w in s, taken
from left to right. For example, if s = mississippi and w = si then
w_s = sp. We define the k-th order entropy as:
H_k(s) = (1/|s|) Σ_{w ∈ Σ^k} |w_s| H₀(w_s)
Definitions (cont)
Now, we shall define the zeroth order modified empirical
entropy:
H*₀(s) =
  0                          if |s| = 0
  (1 + ⌊log|s|⌋)/|s|         if |s| ≠ 0 and H₀(s) = 0
  H₀(s)                      otherwise.
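The three-case definition just given, as a sketch (again repeating H₀ so the snippet stands alone; log base 2 assumed):

```python
import math
from collections import Counter

def H0(s: str) -> float:
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())

def H0_star(s: str) -> float:
    """Modified zeroth order empirical entropy, case by case."""
    n = len(s)
    if n == 0:
        return 0.0
    h = H0(s)
    if h == 0.0:                 # |s| > 0 but only one distinct symbol
        return (1 + math.floor(math.log2(n))) / n
    return h
```

The middle case guarantees a nonzero cost for compressible-but-nonempty strings: H0_star("aaaa") is (1 + 2)/4 = 0.75, while H0("aaaa") is 0.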
To define the k-th order modified empirical entropy, I will
introduce the notion of suffix cover:
we say that a set S_k of strings of length at most k
is a suffix cover of Σ^k, and write S_k ⊑ Σ^k, if every string in Σ^k
has a unique suffix in S_k.
For example, if Σ = {a, b} and k = 3 then both {a, b}
and {a, ab, abb, bbb} are suffix covers for Σ³.
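A brute-force check of the suffix-cover condition (the helper name is my invention), which confirms both example covers:

```python
from itertools import product

def is_suffix_cover(S, sigma, k):
    """True iff every string in Sigma^k has exactly one suffix in S."""
    for t in product(sigma, repeat=k):
        w = "".join(t)
        if sum(1 for x in S if w.endswith(x)) != 1:
            return False
    return True
```

Both `is_suffix_cover({"a", "b"}, "ab", 3)` and `is_suffix_cover({"a", "ab", "abb", "bbb"}, "ab", 3)` return True; dropping any element breaks the unique-suffix property.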
Definitions (cont)
We now define, for every suffix cover S_k ⊑ Σ^k:
H*_{S_k}(s) = (1/|s|) Σ_{w ∈ S_k} |w_s| H*₀(w_s)
Now we can finally define the k-th order modified
empirical entropy of s:
H*_k(s) = min_{S_k ⊑ Σ^k} H*_{S_k}(s) = H*_{S*_k}(s)
for some optimal suffix cover S*_k.
Definitions (cont)
Three more notations also worthy of mentioning, but
briefly, are H̄*_k(s), w̄_s and H̄_k(s), where w̄_s is the string of
single characters preceding the occurrences of w in s,
from left to right, H̄*_k(s) = H*_k(s^R) and H̄_k(s) = H_k(s^R).
I also wish to introduce the notion of prefix cover,
which is equivalent to the notion of suffix cover, just
with prefixes instead of suffixes. That is, P_k
is a prefix cover of Σ^k if every string in Σ^k has a unique
prefix in P_k.
Definitions (cont)
Let us recall that BWT(s) constructs a matrix whose
rows are the cyclic shifts of s$, sorted in lexicographical
order, and returns the last column of that matrix.
Let w be a substring of s. Then, by the matrix's
construction, all of the rows prefixed by w are
consecutive (because the matrix is sorted in
lexicographical order). This means that the single
symbols preceding every occurrence of w in s are
grouped together in a set of consecutive positions of
the string ŝ = BWT(s). We denote this substring ŝ[w].
It is easy to see that ŝ[w] is a permutation of w̄_s.
Definitions (cont)
Example: BWT matrix for the string t = mississippi.
Let w = s. The four occurrences of s in t are in the
last four rows of the BWT matrix. Then t̂[s] = ssii
and that is indeed a permutation of w̄_t = isis.
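A naive quadratic BWT sketch (fine for examples, far too slow for real inputs; the function names are mine) reproduces the t̂[s] = ssii example:

```python
def bwt_rows(s: str, end: str = "$"):
    """Lexicographically sorted cyclic shifts of s+end."""
    t = s + end
    return sorted(t[i:] + t[:i] for i in range(len(t)))

def bwt(s: str, end: str = "$") -> str:
    """Last column of the sorted cyclic-shift matrix."""
    return "".join(row[-1] for row in bwt_rows(s, end))

def bwt_substring(s: str, w: str, end: str = "$") -> str:
    """s_hat[w]: last-column symbols of the rows prefixed by w."""
    return "".join(row[-1] for row in bwt_rows(s, end) if row.startswith(w))
```

Here `bwt("mississippi")` gives `"ipssm$pissii"`, `bwt_substring("mississippi", "s")` gives `"ssii"` (a permutation of isis), and `bwt_substring("mississippi", "i")` gives `"pssm"`, the string appearing again in the later suffix-tree example.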
Definitions (cont)
Let T be the suffix tree of the string s$. We assume
that the suffix tree edges are sorted lexicographically.
If so, then the i-th leaf (counting from the left) of the
suffix tree corresponds to the i-th row of the BWT
matrix. We associate the i-th leaf of T with the i-th
symbol of the string ŝ. I'll denote the i-th leaf of T by
ℓ_i and the symbol associated with it by ŝ_i.
By definition, ŝ = ŝ₁ ⋯ ŝ_{|s|+1}.
Definitions (cont)
Let w be a substring of s. The locus of w, denoted
ℓ[w], is the node of T that has associated the shortest
string prefixed by w.
Definitions (cont)
Example: suffix tree for the string s = mississippi$.
The locus of the substrings ss and ssi is the node
reachable by the path labelled by ssi.
Definitions (cont)
Another very important notion I would like to
introduce is that of the leaf cover. Given a suffix tree
T, we say that a subset L of its nodes is a leaf cover if
every leaf of the suffix tree has a unique ancestor in
L.
For every node u of T we will denote by ŝ⟨u⟩ the
substring of ŝ obtained by concatenating, from left to right, the
symbols associated with the leaves descending from
node u. For example, in the suffix tree from the
previous slide, ŝ⟨ℓ[i]⟩ = pssm.
Definitions (cont)
Note that these symbols are exactly the single
symbols preceding i in mississippi$. That is, for any
string w we have ŝ⟨ℓ[w]⟩ = ŝ[w].
Definitions (cont)
A key observation in this article is the natural relation
between leaf covers and prefix covers.
Let P*_k = {w₁, ..., w_p} be the optimal prefix cover defining
H̄*_k(s), and let Π_k be the set of nodes {ℓ[w₁], ..., ℓ[w_p]}.
Since P*_k is a prefix cover of Σ^k, we get that every leaf
of T corresponding to a suffix of length greater than k
has a unique ancestor in Π_k. On the other hand,
leaves of T corresponding to suffixes of length
smaller than k might not have an ancestor in Π_k.
We would like to enhance Π_k in a way that will make
it a leaf cover of T.
Definitions (cont)
We will denote by Q_k the set of leaves corresponding
to suffixes of s$ of length at most k which are not
prefixed by a string in P*_k. We set L*_k = Π_k ∪ Q_k.
|Q_k| ≤ k because s$ has at most k suffixes of length
at most k.
This relation is exploited next.
Definitions (cont)
The Cost of a Leaf Cover:
Let C denote the function which associates to every
string x over Σ ∪ {$}, with at most one occurrence of $,
the positive real value
C(x) = λ|x′| H*₀(x′) + μ
where λ, μ are constants and x′ is the string x with the
symbol $ removed, if it was present.
We will now define the value of C for a leaf cover L:
C(L) = Σ_{u ∈ L} C(ŝ⟨u⟩)
Definitions (cont)
In this section, I only have the following lemma left to
prove:
For any given k ≥ 0 there exists a constant g_k
such that, for any string s:
C(L*_k) ≤ λ|s| H̄*_k(s) + g_k
The next three slides detail the proof of the lemma.
Definitions (cont)
Let us recall that L*_k = Π_k ∪ Q_k and that, by definition,
Π_k ∩ Q_k = ∅.
If so, then the following equation obviously holds:
C(L*_k) = Σ_{u ∈ L*_k} C(ŝ⟨u⟩) = Σ_{u ∈ Π_k} C(ŝ⟨u⟩) + Σ_{u ∈ Q_k} C(ŝ⟨u⟩)
Call the first summation on the right-hand side (1) and the second (2).
Observe that every u ∈ Q_k is a leaf of T. By the
definition of C we get that, for every u ∈ Q_k:
C(ŝ⟨u⟩) = λ|ŝ⟨u⟩′| H*₀(ŝ⟨u⟩′) + μ ≤ λ + μ
since ŝ⟨u⟩ is a single symbol, so |ŝ⟨u⟩′| ≤ 1 and H*₀(ŝ⟨u⟩′) ≤ 1.
Definitions (cont)
Also, recall that |Q_k| ≤ k. Combined, we get that
summation (2) is bounded by k(λ + μ).
For us to evaluate summation (1), recall that
every u ∈ Π_k is the locus of a string w ∈ P*_k.
By the relation between the suffix tree and the BWT
matrix we have that ŝ⟨u⟩ = ŝ[w]. Also, |P*_k| ≤ |Σ|^k.
Then we get:
Σ_{u ∈ Π_k} C(ŝ⟨u⟩) = Σ_{w ∈ P*_k} C(ŝ[w]) = Σ_{w ∈ P*_k} ( λ|ŝ[w]| H*₀(ŝ[w]) + μ )
Definitions (cont)
For the last step, recall that ŝ[w] is a permutation of
w̄_s and therefore H*₀(ŝ[w]) = H*₀(w̄_s) and, obviously,
|ŝ[w]| = |w̄_s|.
Finally, we get:
C(L*_k) ≤ ( λ Σ_{w ∈ P*_k} |w̄_s| H*₀(w̄_s) + |P*_k| μ ) + k(λ + μ) ≤ λ|s| H̄*_k(s) + g_k
where g_k = |Σ|^k μ + k(λ + μ), using that P*_k is the optimal
prefix cover defining H̄*_k(s).
Computing the Optimal Leaf Cover
Now that we are finally done with all of the required
definitions, we can get down to business.
Perhaps the most important aspect of this boosting
technique is that the optimal leaf cover can be
computed in time linear in |s|.
In the following slides I will present an algorithm that
computes that optimal leaf cover in linear time, and
prove its correctness and time complexity.
Computing the Optimal Leaf Cover (cont)
Before I show the actual algorithm, I will prove the
following lemma:
An optimal leaf cover for the subtree rooted at u
consists of either the single node u, or of the union of
optimal leaf covers of the subtrees rooted at the
children of u in T.
Computing the Optimal Leaf Cover (cont)
Proof: Let Lmin (u) denote the optimal leaf cover for the
subtree of T rooted at u.
If u is a leaf then the result obviously holds. We
assume then that u is an internal node and that
u1 ,..., uc are its children.
It's obvious that {u} and ∪_{i=1}^{c} L_min(u_i) are both leaf
covers of the subtree rooted at u.
I will show that one of them is optimal.
Computing the Optimal Leaf Cover (cont)
Let's assume that L_min(u) ≠ {u}. We can then say that
L_min(u) = ∪_{i=1}^{c} L(u_i), where each L(u_i) is a leaf cover (not
necessarily the optimal one) for the subtree rooted at
u_i. Then the following holds:
C(L_min(u)) = Σ_{i=1}^{c} C(L(u_i)) ≥ Σ_{i=1}^{c} C(L_min(u_i))
Computing the Optimal Leaf Cover (cont)
Since the cost of the optimal leaf cover is smaller than or
equal to that of any other leaf cover, we get that:
C(L_min(u)) = Σ_{i=1}^{c} C(L_min(u_i))
which means that the union of the optimal leaf
covers of the subtrees rooted at the children of u is
indeed an optimal leaf cover for the subtree rooted at u.
Computing the Optimal Leaf Cover (cont)
The following algorithm computes the optimal leaf
cover in linear time:
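The slide's algorithm figure is not reproduced in this transcript. The following Python sketch implements the recursion the lemma justifies: at each node, compare the cost of covering with {u} against the summed cost of the children's optimal covers. The tree encoding, the constants λ = μ = 1, and all function names are illustrative assumptions, not the paper's code:

```python
import math
from collections import Counter

LAM, MU = 1.0, 1.0          # illustrative values for the constants of C

def H0_star_from_counts(counts):
    """Modified zeroth order entropy computed from a symbol-count table."""
    n = sum(counts.values())
    if n == 0:
        return 0.0
    h = -sum((c / n) * math.log2(c / n) for c in counts.values() if c)
    return (1 + math.floor(math.log2(n))) / n if h == 0.0 else h

def cost(counts):
    """C(x) = LAM*|x'|*H0_star(x') + MU, with '$' removed from x."""
    counts = Counter({a: c for a, c in counts.items() if a != "$"})
    return LAM * sum(counts.values()) * H0_star_from_counts(counts) + MU

def optimal_leaf_cover(children, leaf_symbol, u):
    """children: node -> list of child nodes; leaf_symbol: leaf -> its BWT symbol.
    Returns (symbol counts of s_hat<u>, optimal cost, optimal cover)."""
    if not children[u]:                       # u is a leaf
        counts = Counter(leaf_symbol[u])
        return counts, cost(counts), [u]
    counts, child_cost, cover = Counter(), 0.0, []
    for v in children[u]:
        cv, costv, coverv = optimal_leaf_cover(children, leaf_symbol, v)
        counts += cv                          # occurrence counts add up bottom-up
        child_cost += costv
        cover += coverv
    if cost(counts) <= child_cost:            # lemma: {u} vs union of children
        return counts, cost(counts), [u]
    return counts, child_cost, cover
```

For instance, a root with two leaves both labelled a: each leaf costs λ·1·1 + μ = 2, the root costs λ·2·1 + μ = 3 < 4, so the returned cover is the root alone.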
The algorithm’s correctness follows immediately from
the previous lemma. I will show that it runs in O(|s|)
time.
Computing the Optimal Leaf Cover (cont)
The only nontrivial operation in the algorithm is the
calculation of C(ŝ⟨u⟩) at each step.
To do that, we have to know the number of
occurrences of each symbol of the alphabet in the
string ŝ⟨u⟩ (because in order to calculate the cost of a
string, we have to calculate H*₀(ŝ⟨u⟩)).
Doing this is possible in constant time for each node,
because if u is a leaf then each symbol in the
alphabet appears either once or never in ŝ⟨u⟩.
Computing the Optimal Leaf Cover (cont)
If u is not a leaf, then the number of occurrences of
each symbol in ŝ⟨u⟩ is the sum of the numbers of its
occurrences in ŝ⟨u₁⟩, ..., ŝ⟨u_c⟩, where u₁, ..., u_c are the children
of u (recall that ŝ⟨u⟩ is the concatenation of
ŝ⟨u₁⟩, ..., ŝ⟨u_c⟩).
Now we are finally ready to see the actual algorithm
describing the boosting technique.
The Boosting Technique
The following algorithm describes the technique:
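The transcript omits the algorithm figure here as well. As an end-to-end illustration only: the sketch below follows the three stated steps, but takes the partition of ŝ as given instead of computing the optimal leaf cover, and uses zlib as a stand-in for A (zlib does not satisfy the entropy property stated next; every name here is an assumption of mine):

```python
import zlib

def bwt(s: str, end: str = "$") -> str:
    """Last column of the lexicographically sorted cyclic shifts of s+end."""
    t = s + end
    return "".join(row[-1] for row in sorted(t[i:] + t[:i] for i in range(len(t))))

def boost(s: str, piece_lengths, A=zlib.compress):
    """Step 1: s_hat = BWT(s^R).  Step 2: cut s_hat along the given
    partition.  Step 3: compress each piece separately with A."""
    s_hat = bwt(s[::-1])
    assert sum(piece_lengths) == len(s_hat), "partition must cover s_hat"
    pieces, i = [], 0
    for n in piece_lengths:
        pieces.append(s_hat[i:i + n])
        i += n
    return [A(p.encode()) for p in pieces]
```

Decompressing and concatenating the returned pieces recovers BWT(s^R) exactly, from which s itself is recoverable by inverting the transform.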
The Boosting Technique
First, any compression algorithm we wish to use
the boosting technique on has to satisfy the following
property:
A is a compression algorithm such that, given an
input string x ∈ Σ*, A first appends an end-of-string
symbol # to x and then compresses x# with the
following space and time bounds:
1. A compresses x# in at most λ|x| H*₀(x) + μ bits.
2. The running time of A is T(|x|) and its working space is S(|x|),
where T is convex and S is non-decreasing.
The Boosting Technique
The boosting algorithm can be used on any algorithm
satisfying the previous property to boost its
compression up to the k-th order entropy, for any k,
without any asymptotic loss in time efficiency and
with a slightly larger working space complexity.
The Boosting Technique
Theorem: Given a compression algorithm A that
satisfies the aforementioned property, our boosting
technique yields the following results:
1. If applied to s, it compresses it within
λ|s| H̄*_k(s) + log|s| + g_k bits, for any k.
2. If applied to s^R, it compresses it within
λ|s| H*_k(s) + log|s| + g_k bits, for any k.