A Lempel-Ziv text index on secondary storage
Diego Arroyuelo and Gonzalo Navarro
Combinatorial Pattern Matching 2007

Introduction
• The full-text searching problem: to find all the occ occurrences of a pattern P[1..m] in a text T[1..u] (both over an alphabet Σ of size σ)
• We are interested in indexed text searching: an index on T allows us to find the pattern occurrences quickly
• In our work the index:
  • replaces the text (self-indexing)
  • is compressed (LZ)
[Figure: patterns P are searched in an index built on T (compression + search)]

Applications and goals
• Main applications of text searching:
  • Computational Biology (DNA and protein sequences)
  • Oriental language texts (Japanese, Chinese, Korean, etc.)
  • “Natural language” texts (English, Spanish, etc.)
  • Music (MIDI pitch sequences)
  • Program code
  • Etc.
• Compressed self-indexes:
  • Reduce the space requirement (not storing the text + compressing)
  • Are useful in cases where accessing the text is expensive (for example, web search engines)

Motivations
• The use of a compressed self-index may totally remove the need to use the disk
• However…
  • Huge texts
  • Sequential text searching + compression
  • Compressed self-indexes improve disk performance
  • More disk accesses but smaller seek time

Motivations
• By reducing the space of the index we aim at:
  • Saving disk space (important for storage media of limited size)
  • Reducing the seek time when searching (because the index is smaller)

Model of computation
• We assume a model of computation where:
  • A disk page of size B can be transferred to main memory in a single disk access
  • We can hold a constant number of disk pages in main memory
  • We count every disk access
  • The text is static

Related Works
• String B-trees [FG, JACM 1999]: 3 – 4 times text size
• Compact Pat Trees [CM, SODA 1996]: 5 – 6 times text size
• Compressed Suffix Arrays [MNS, ISAAC 2003]:
  • About 0.25 – 0.5 times text size
  • 2(1 + m · log_B u) accesses for counting
  • O(log u) extra accesses per occurrence!
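
To get a feel for the Compressed Suffix Array counting bound quoted above, the short sketch below simply evaluates 2(1 + m · log_B u). The concrete values are assumptions chosen for illustration: u is taken as 200 megabytes of text and B as 8,192 (the page size in 32-bit integers used later in the experiments), while m = 10 is an arbitrary pattern length. The function name is hypothetical, and whether B should be measured in integers or symbols is not stated on the slide.

```python
import math

def csa_count_accesses(m, u, B):
    """Evaluate the slide's bound 2(1 + m * log_B u) for a counting query.

    m: pattern length, u: text length, B: disk page size.
    The units of B are an assumption; the slide does not fix them.
    """
    return 2 * (1 + m * math.log(u, B))

# Illustrative (assumed) parameters: 200 MB text, 8,192-entry pages, m = 10.
u = 200 * 2**20
B = 8192
bound = csa_count_accesses(10, u, B)
print(f"about {bound:.1f} disk accesses for counting")
```

Under these assumed values the bound comes to a few dozen accesses per counting query, before the O(log u) extra accesses per reported occurrence.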
Can we define a small and efficient index on secondary storage?

Searching LZ78 compressed texts: the LZ-index
• LZ78 parses the text into phrases
• Different types of occurrences…
[Figure: LZTrie and RevTrie]

Occurrences of Type 1
• Occurrences contained in a single phrase
• Shortest possible LZ78 phrases containing P
• By LZ78, P is a suffix of such phrases
[Figure: LZTrie, with the subtrees containing occurrences of type 1]

Occurrences of Type 1
• Occurrences contained in a single phrase
• As P is a suffix of such phrases, P^r is a prefix of the corresponding reverse phrases
• We need the Reverse Trie (RevTrie) to solve this problem
• Navigation between tries!
[Figure: P^r located in RevTrie, then the matching nodes located in LZTrie]

Occurrences of Type 2
• Occurrences spanning two consecutive phrases
[Figure: P split into P1 and P2 across phrases k-1 and k. LZTrie: phrases starting with P2. RevTrie: phrases ending with P1 (P1^r). Node and RNode relate k-1 and k.]

Occurrences of Type 3
• Occurrences spanning more than two consecutive phrases
• O(m^2) occurrences of type 3 in the worst case
• O(m^2) random accesses in the worst case

The LZ-index
• A compressed full-text self-index based on the LZTrie [Navarro, JDA 2004]
• Four data structures compose the LZ-index:
  • LZTrie: the trie formed by all the LZ78 phrases B_0, …, B_n
  • RevTrie: the trie formed by all the reverse LZ78 phrases B_0^r, …, B_n^r
  • Node: a mapping from phrase identifiers to their node in LZTrie
  • RNode: a mapping from phrase identifiers to their node in RevTrie
• Overall: the LZ-index requires 4n log n (1 + o(1)) = 4uH_k + o(u log σ) bits, for k = o(log_σ u)
• We don’t need to store the text!
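
The LZ78 parsing and the three occurrence types can be illustrated with a small sketch. This is not the LZ-index search algorithm: it parses a text into LZ78 phrases with a dictionary-based trie, then classifies each occurrence of a pattern by brute force according to how many phrases it spans. The function names and the sample text are assumptions made for the example, and the text is assumed to end with a unique terminator.

```python
def lz78_parse(text):
    """Parse text into LZ78 phrases.

    Returns (phrases, trie): phrases[i] is the string of phrase i
    (phrase 0 is the empty phrase) and trie maps (node_id, char) to the
    child node_id, i.e. a plain-dictionary stand-in for the LZTrie.
    """
    trie = {}
    phrases = [""]
    node, start = 0, 0
    for i, c in enumerate(text):
        child = trie.get((node, c))
        if child is not None:
            node = child                      # keep extending the current phrase
        else:
            trie[(node, c)] = len(phrases)    # new phrase = known phrase + c
            phrases.append(text[start:i + 1])
            node, start = 0, i + 1
    if start < len(text):                     # leftover repeats an old phrase
        phrases.append(text[start:])
    return phrases, trie


def occurrence_types(text, pattern):
    """Return (position, type) for each occurrence of a non-empty pattern.

    Type 1: inside one phrase; type 2: spans two consecutive phrases;
    type 3: spans more than two phrases. Brute-force scan, illustration only.
    """
    phrases, _ = lz78_parse(text)
    phrase_of = []                            # phrase id of every text position
    for pid, phrase in enumerate(phrases):
        phrase_of.extend([pid] * len(phrase))
    result = []
    pos = text.find(pattern)
    while pos != -1:
        spanned = phrase_of[pos + len(pattern) - 1] - phrase_of[pos] + 1
        result.append((pos, min(spanned, 3)))
        pos = text.find(pattern, pos + 1)
    return result


text = "alabar a la alabarda$"                # sample text, chosen for illustration
phrases, _ = lz78_parse(text)
print(phrases[1:])
print(occurrence_types(text, "la"))
print(occurrence_types(text, "alabar"))
```

On this sample, "la" occurs once across two phrases and twice inside a single phrase, while both occurrences of "alabar" span more than two phrases.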

The LZ-index on secondary storage
• The LZ-index was originally designed for main memory
• It has a non-regular pattern of access to the index components
• We define a version of the LZ-index for secondary storage
• We divide the problem as follows:
  • Solving the basic trie operations
  • Reducing the navigation between structures

Solving the basic trie operations
• We cut the tries into disjoint blocks of size at most B, using the Clark and Munro strategy
• Every block stores a subtree of the whole trie
• We arrange these blocks in a tree by adding inter-block pointers
• We are able to compute the following with one extra disk access in the worst case:
  • parent(x)
  • child(x, a)
  • depth(x)
  • subtreesize(x)
  • preorder(x)
  • ancestor(x, y)

Reducing the navigation between structures
• Occurrences contained in a single phrase
• For counting, we avoid random accesses to report only one occurrence
• We would need a data structure able to find all these subtrees without random accesses
[Figure: P^r in RevTrie and the corresponding subtrees for P in LZTrie]

Reducing the navigation between structures
• Occurrences spanning two consecutive phrases
[Figure: P1 and P2 across phrases k-1 and k; P2 in LZTrie, P1^r in RevTrie; nodes y and y’ connected through the LR mapping]

Reducing the navigation between structures
• We add some redundancy to reduce the number of accesses between index components
• Many random accesses now become a single access + sequential scanning (please read the paper for other technical details)
• The overall space requirement is 8uH_k + o(u log σ) bits, for any k = o(log_σ u)
• The space can be dropped to 6uH_k + o(u log σ) bits if we only need to count pattern occurrences

Experimental results
• We indexed an XML file from the Pizza&Chili Corpus (200 megabytes) (http://pizzachili.dcc.uchile.cl)
• We searched for 5,000 random patterns (count and locate queries)
• We assume a disk page of 32 kilobytes (i.e., 8,192 integers of 32 bits)

Experimental results
• We compared against:
  • Suffix Arrays for secondary storage: the two-level hierarchy of [BYBZ, 1996]
  • String B-trees: we use the model provided in [FG, 1996]
  • Compact Pat Trees (CPT) [CM, 1996]

Experimental results (count)
[Plot comparing LZ-index, String B-trees, Suffix Array and CPT; annotation: 3.3 times smaller than String B-trees]

Experimental results (count)
[Plot comparing LZ-index, String B-trees, Suffix Array and CPT]

Experimental results (locate)
[Plot comparing CPT, Suffix Array, String B-trees and LZ-index; annotation: 2.6 times smaller than String B-trees]
• Average number of accesses to report the first occurrence:
  • LZ-index: 11
  • String B-trees: 12

Experimental results (locate)
[Plot comparing CPT, String B-trees, Suffix Array and LZ-index]

Conclusions
• The LZ-index can be adapted to work on secondary storage
• Requiring up to 8uH_k + o(u log σ) bits, for any k = o(log_σ u)
• Our index is significantly smaller than any other practical secondary-memory data structure
• The LZ-index requires more disk accesses
• But a smaller index would have a smaller seek time

Future work
• We assumed a constant main-memory space, but…
• To implement our index in a real practical setting
• Handling dynamism (String B-trees require 13.5 times the text size!)
• Direct construction on secondary storage, adapting [AN, ISAAC 2005] to work on disk

Questions?
Contact [email protected] [email protected]

Thanks!