A Lempel-Ziv text index on secondary storage
Diego Arroyuelo and Gonzalo Navarro
Combinatorial Pattern Matching 2007
1
Introduction
The full-text searching problem:
• find all the occ occurrences of a pattern P[1..m] in a text T[1..u]
  (both over an alphabet Σ of size σ)
We are interested in indexed text searching:
• an index on T lets us find the pattern occurrences quickly
In our work the index
• replaces the text (self-indexing)
• is compressed (LZ)
[Figure: patterns P are searched through an index built on the text T (compression + search).]
2
Applications and goals
Main applications of text searching:
• Computational biology (DNA and protein sequences)
• Oriental language texts (Japanese, Chinese, Korean, etc.)
• “Natural language” texts (English, Spanish, etc.)
• Music (MIDI pitch sequences)
• Program code
• Etc.
Compressed self-indexes:
• Reduce the space requirement (the text is not stored, and the index is compressed)
• Are useful in cases where accessing the text is expensive (for example, web search engines)
3
Motivations
• The use of a compressed self-index may totally remove the need to use the disk
• However… huge texts:
  • Sequential text searching + compression: improves disk performance
  • Compressed self-indexes: more disk accesses, but smaller seek time
4
Motivations
By reducing the space of the index we aim at:
• Saving disk space (important for storage media of limited size)
• Reducing the seek time when searching (because the index is smaller)
5
Model of computation
We assume a model of computation where:
• A disk page of size B can be transferred to main memory in a single disk access
• We can hold a constant number of disk pages in main memory
• We count every disk access
• The text is static
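A minimal sketch of this cost model, assuming a hypothetical `PagedArray` class (not part of the talk) that keeps a single page in memory and counts every page transfer. It illustrates why the access pattern matters: contiguous items cost one access per page, while scattered items can cost one access each.

```python
class PagedArray:
    """Hypothetical array stored in pages of B items; counts page transfers."""

    def __init__(self, data, B):
        self.data = data
        self.B = B
        self.cached_page = None  # assumption: only one page is held in memory
        self.accesses = 0        # number of simulated disk accesses

    def __getitem__(self, i):
        page = i // self.B
        if page != self.cached_page:
            self.cached_page = page  # simulate transferring the page
            self.accesses += 1
        return self.data[i]


def count_accesses(positions, n=10_000, B=100):
    """Read the given positions and return how many pages were transferred."""
    arr = PagedArray(list(range(n)), B)
    for i in positions:
        arr[i]
    return arr.accesses


sequential = count_accesses(range(0, 100))         # 100 contiguous items
scattered = count_accesses(range(0, 10_000, 100))  # 100 items, one per page
print(sequential, scattered)
```

With these illustrative sizes, the contiguous read touches a single page, while the scattered read touches a different page for every item.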
6
Related Works

• String B-trees [FG, JACM 1999]: 3–4 times the text size
• Compact Pat Trees [CM, SODA 1996]: 5–6 times the text size
• Compressed Suffix Arrays [MNS, ISAAC 2003]:
  • about 0.25–0.5 times the text size
  • 2(1 + m · log_B u) accesses for counting
  • O(log u) extra accesses per occurrence!
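As a rough worked example of the counting bound quoted for Compressed Suffix Arrays, the sketch below evaluates 2(1 + m · log_B u) for a few pattern lengths. The values of u and B are assumptions chosen for illustration (a 200-megabyte text of one-byte symbols and pages of 8,192 items), and the function name is hypothetical.

```python
import math


def csa_count_accesses(m, u, B):
    """Evaluate 2(1 + m * log_B u) for pattern length m, text length u, page size B."""
    return 2 * (1 + m * math.log(u, B))


u = 200 * 2**20  # assumed text length: 200 MiB of one-byte symbols
B = 8192         # assumed page size, counted in items

for m in (5, 10, 20):
    print(m, round(csa_count_accesses(m, u, B), 1))
```

The bound grows linearly with the pattern length m, which is the behaviour the question that follows is reacting to.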
Can we define a small and efficient index on secondary storage?
7
Searching LZ78 compressed texts: the LZ-index
[Figure: the LZTrie and the RevTrie, and the different types of occurrences.]
LZ78 parses the text into phrases
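A minimal sketch of LZ78 parsing, assuming the text is a Python string. Each new phrase is the longest previously seen phrase extended by one character. The function name and the dictionary-based trie are illustrative choices, not the index's actual representation.

```python
def lz78_parse(text):
    """Split text into LZ78 phrases; phrase 0 is the empty phrase."""
    trie = {0: {}}   # node id -> {char: child id}; node 0 is the root
    phrases = [""]   # phrases[i] is the string of phrase i
    node, start = 0, 0
    for i, c in enumerate(text):
        child = trie[node].get(c)
        if child is not None:
            node = child  # keep extending the current match
            continue
        # No child labelled c: the match plus c becomes a new phrase.
        new_id = len(phrases)
        trie[node][c] = new_id
        trie[new_id] = {}
        phrases.append(text[start:i + 1])
        node, start = 0, i + 1
    if start < len(text):
        phrases.append(text[start:])  # trailing phrase may repeat an earlier one
    return phrases


print(lz78_parse("alabar a la alabarda$")[1:])
```

For this example text the phrases after the empty one are a, l, ab, ar, " ", "a ", la, " a", lab, ard, a$. Every phrase is an earlier phrase plus one character, which is why the set of phrases can be stored as a trie.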
8
Occurrences of Type 1
Occurrences contained in a single phrase
[Figure: LZTrie. The marked nodes are the shortest possible LZ78 phrases containing P; by LZ78, P is a suffix of such phrases. The subtrees below them contain the occurrences of type 1.]
9
Occurrences of Type 1
Occurrences contained in a single phrase
• As P is a suffix of such phrases, P^r (the reverse of P) is a prefix of the corresponding reverse phrases
• We need the Reverse Trie (RevTrie) to solve this problem
[Figure: P^r in the RevTrie and the matching phrases ending with P in the LZTrie; this requires navigation between tries!]
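The reasoning for type 1 occurrences can be sketched with a plain Python list standing in for both tries. This is an assumption made for brevity: the real index navigates the RevTrie and the LZTrie instead of scanning all phrases. The phrase list is the LZ78 parse of the example text "alabar a la alabarda$", and the function name is hypothetical.

```python
# LZ78 phrases of "alabar a la alabarda$"; index = phrase identifier.
PHRASES = ["", "a", "l", "ab", "ar", " ", "a ", "la", " a", "lab", "ard", "a$"]


def type1_phrases(phrases, P):
    """Return the identifiers of the phrases that contain P."""
    rev_P = P[::-1]
    # RevTrie role: reversed phrases that start with reversed P,
    # i.e. phrases that end with P.
    ending = [i for i, ph in enumerate(phrases) if ph[::-1].startswith(rev_P)]
    # LZTrie role: every phrase in the subtree of such a phrase
    # (every phrase having it as a prefix) also contains P.
    found = set()
    for i in ending:
        for j, ph in enumerate(phrases):
            if ph.startswith(phrases[i]):
                found.add(j)
    return sorted(found)


print(type1_phrases(PHRASES, "ab"))
print(type1_phrases(PHRASES, "a"))
```

Because every prefix of an LZ78 phrase is itself a phrase, collecting the subtrees of the phrases that end with P finds all phrases containing P.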
10
Occurrences of Type 2
Occurrences spanning two consecutive phrases
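[Figure: P = P1 P2, with P1 ending phrase k-1 and P2 starting phrase k. RevTrie: the subtree of P1^r holds the phrases ending with P1. LZTrie: the subtree of P2 holds the phrases starting with P2. Node and RNode map the phrase identifiers k-1 and k to their trie nodes.]
11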
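A brute-force sketch of the type 2 search, assuming a plain list of phrases in place of the tries and the Node/RNode mappings. For every split P = P1 P2 it pairs a phrase k-1 ending with P1 with the next phrase k starting with P2. The phrase list is the LZ78 parse of the example text "alabar a la alabarda$", and the function name is hypothetical.

```python
# LZ78 phrases of "alabar a la alabarda$"; index = phrase identifier.
PHRASES = ["", "a", "l", "ab", "ar", " ", "a ", "la", " a", "lab", "ard", "a$"]


def type2_occurrences(phrases, P):
    """Return (k-1, k, cut) triples: P[:cut] ends phrase k-1, P[cut:] starts phrase k."""
    occurrences = []
    for cut in range(1, len(P)):
        P1, P2 = P[:cut], P[cut:]
        # RevTrie role: phrases ending with P1.
        ends_with_P1 = {k for k, ph in enumerate(phrases) if ph.endswith(P1)}
        # LZTrie role: phrases starting with P2.
        starts_with_P2 = {k for k, ph in enumerate(phrases) if ph.startswith(P2)}
        for k in sorted(starts_with_P2):
            if k - 1 in ends_with_P1:
                occurrences.append((k - 1, k, cut))
    return occurrences


print(type2_occurrences(PHRASES, "la"))
print(type2_occurrences(PHRASES, "abar"))
```

In the index itself the two candidate sets are trie subtrees, so the pairing step is where the phrase identifiers k-1 and k have to be matched across the two structures.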
Occurrences of Type 3
Occurrences spanning more than two consecutive phrases
O(m^2) occurrences of type 3 in the worst case
O(m^2) random accesses in the worst case
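To make the three occurrence types concrete, the brute-force sketch below locates P in the concatenated phrases and classifies each occurrence by the number of phrases it touches. It is not the index's search algorithm, only a reference classifier. The phrase list is the LZ78 parse of the example text "alabar a la alabarda$" (without the empty phrase), and the function name is hypothetical.

```python
from bisect import bisect_right

# LZ78 phrases of "alabar a la alabarda$", empty phrase omitted.
PHRASES = ["a", "l", "ab", "ar", " ", "a ", "la", " a", "lab", "ard", "a$"]


def classify_occurrences(phrases, P):
    """Return (text position, type) pairs; type 3 means more than two phrases."""
    starts, pos = [], 0
    for ph in phrases:
        starts.append(pos)  # text position where each phrase begins
        pos += len(ph)
    text = "".join(phrases)
    result = []
    i = text.find(P)
    while i != -1:
        first = bisect_right(starts, i) - 1               # phrase holding the first char
        last = bisect_right(starts, i + len(P) - 1) - 1   # phrase holding the last char
        span = last - first + 1
        result.append((i, span if span <= 2 else 3))
        i = text.find(P, i + 1)
    return result


print(classify_occurrences(PHRASES, "la"))
print(classify_occurrences(PHRASES, "r a l"))
```

Here "la" occurs once across two phrases and twice inside a single phrase, while "r a l" spans four consecutive phrases and is therefore of type 3.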
12
The LZ-index
• A compressed full-text self-index based on the LZTrie [Navarro, JDA 2004]
• Four data structures compose the LZ-index:
  • LZTrie: the trie formed by all the LZ78 phrases B_0, …, B_n
  • RevTrie: the trie formed by all the reverse LZ78 phrases B_0^r, …, B_n^r
  • Node: a mapping from phrase identifiers to their node in LZTrie
  • RNode: a mapping from phrase identifiers to their node in RevTrie
• Overall, the LZ-index requires 4n log n (1 + o(1)) = 4uH_k + o(u log σ) bits, for k = o(log_σ u)
• We don’t need to store the text!
13
The LZ-index on secondary storage
• The LZ-index was originally designed for main memory
• It has a non-regular pattern of access to the index components
• We define a version of the LZ-index for secondary storage
• We divide the problem as follows:
  • Solving the basic trie operations
  • Reducing the navigation between structures
14
Solving the basic trie operations
• We cut the tries into disjoint blocks of size at most B, using the Clark and Munro strategy
• Every block stores a subtree of the whole trie
• We arrange these blocks in a tree by adding inter-block pointers
We are able to compute
• parent(x)
• child(x, a)
• depth(x)
• subtreesize(x)
• preorder(x)
• ancestor(x, y)
with one extra disk access in the worst case
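The idea of cutting a trie into blocks can be illustrated with a simplified greedy bottom-up partition. This is a stand-in written for illustration and is not claimed to be the Clark and Munro strategy itself. It assumes the block size B is counted in nodes and the trie is given as a dictionary of child lists; all names are hypothetical.

```python
def partition_trie(children, root, B):
    """Split a tree into connected blocks of at most B nodes.

    Returns {block_root: [nodes in the block]}.
    """
    blocks = {}

    def visit(v):
        # Still-open connected piece hanging from each child of v.
        open_pieces = {c: visit(c) for c in children.get(v, [])}
        # Close the largest child pieces until v plus the rest fits in one block.
        while 1 + sum(len(p) for p in open_pieces.values()) > B:
            largest = max(open_pieces, key=lambda c: len(open_pieces[c]))
            blocks[largest] = open_pieces.pop(largest)
        piece = [v]
        for p in open_pieces.values():
            piece.extend(p)
        return piece

    blocks[root] = visit(root)
    return blocks


# Small example tree: node -> list of children.
tree = {0: [1, 2], 1: [3, 4], 2: [5], 5: [6]}
for block_root, nodes in sorted(partition_trie(tree, 0, 3).items()):
    print(block_root, nodes)
```

Each block is a connected piece of the tree rooted at its block root, so a pointer from the parent block to each closed child block is enough to link the blocks into a tree.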
15
Reducing the navigation between structures
Occurrences contained in a single phrase
[Figure: P^r in the RevTrie and the corresponding subtrees for P in the LZTrie. Label: "For counting..."]
We avoid random accesses to report only one occurrence
We would need a data structure capable of finding all these subtrees without random accesses
16
Reducing the navigation between structures
Occurrences spanning two consecutive phrases
[Figure: P = P1 P2 over phrases k-1 and k; P2 in the LZTrie, P1^r in the RevTrie; nodes y and y', phrase identifiers k and k-1, and the LR mapping.]
17
Reducing the navigation between structures
• We add some redundancy to reduce the number of accesses between index components
  • Many random accesses now become a single access + sequential scanning
    (please read the paper for other technical details)
• The overall space requirement is 8uH_k + o(u log σ) bits, for any k = o(log_σ u)
• The space can be reduced to 6uH_k + o(u log σ) bits if we only need to count pattern occurrences
18
Experimental results
• We indexed:
  • An XML file from the Pizza&Chili Corpus (200 megabytes)
    (http://pizzachili.dcc.uchile.cl)
• We searched for 5,000 random patterns
  • count and locate queries
• We assume a disk page of 32 kilobytes (i.e., 8,192 integers of 32 bits)
19
Experimental results
• We compared against:
  • Suffix Arrays for secondary storage: the two-level hierarchy of [BYBZ, 1996]
  • String B-trees: we use the model provided in [FG, 1996]
  • Compact Pat Trees (CPT) [CM, 1996]
20
Experimental results (count)
[Figure: count queries for the LZ-index, String B-trees, Suffix Array, and CPT. Annotation: 3.3 times smaller than String B-trees.]
21
Experimental results (count)
[Figure: count queries for the LZ-index, String B-trees, Suffix Array, and CPT.]
22
Experimental results (locate)
[Figure: locate queries for CPT, Suffix Array, String B-trees, and the LZ-index. Annotation: 2.6 times smaller than String B-trees.]
Average number of accesses to report the first occurrence:
• LZ-index ≈ 11
• String B-trees ≈ 12
23
Experimental results (locate)
[Figure: locate queries for CPT, String B-trees, Suffix Array, and the LZ-index.]
24
Conclusions
• The LZ-index can be adapted to work on secondary storage
• Requiring up to 8uH_k + o(u log σ) bits, for any k = o(log_σ u)
• Our index is significantly smaller than any other practical secondary-memory data structure
• The LZ-index requires more disk accesses
• But a smaller index would have a smaller seek time
25
Future work
• We assumed a constant main-memory space, but…
• To implement our index in a real practical setting
• Handling dynamism
  (String B-trees require 13.5 times the text size!)
• Direct construction on secondary storage
• Adapting [AN, ISAAC 2005] to work on disk
26
Questions?
Contact
[email protected]
[email protected]
27
Thanks!
28