A Lempel-Ziv text index on secondary storage
Diego Arroyuelo and Gonzalo Navarro
Combinatorial Pattern Matching 2007

Introduction
• The full-text searching problem: to find all the occ occurrences of a pattern P[1..m] in a text T[1..u] (both over an alphabet Σ of size σ)
• We are interested in indexed text searching: an index on T allows us to find the pattern occurrences quickly
• In our work the index:
  • replaces the text (self-indexing)
  • is compressed (LZ)
[Figure: index on T answering queries for P (compression + search)]

Applications and goals
• Main applications of text searching:
  • Computational Biology (DNA and protein sequences)
  • Oriental language texts (Japanese, Chinese, Korean, etc.)
  • "Natural language" texts (English, Spanish, etc.)
  • Music (MIDI pitch sequences)
  • Program code
  • Etc.
• Compressed self-indexes:
  • Reduce the space requirement (not storing the text + compressing)
  • Are useful in cases where accessing the text is expensive (for example, web search engines)

Motivations
• The use of a compressed self-index may totally remove the need to use the disk
• However… huge texts:
  • Sequential text searching + compression: improves disk performance
  • Compressed self-indexes: more disk accesses but smaller seek time

Motivations
• By reducing the space of the index we aim at:
  • Saving disk space (important for storage media of limited size)
  • Reducing the seek time when searching (because the index is smaller)

Model of computation
• We assume a model of computation where:
  • A disk page of size B can be transferred to main memory in a single disk access
  • We can hold a constant number of disk pages in main memory
  • We count every disk access
  • The text is static

Related Works
• String B-trees [FG, JACM 1999]: 3 – 4 times text size
• Compact Pat Trees [CM, SODA 1996]: 5 – 6 times text size
• Compressed Suffix Arrays [MNS, ISAAC 2003]:
  • About 0.25 – 0.5 times text size
  • 2(1 + m · log_B u) accesses for counting
  • O(log u) extra accesses per occurrence!
Can we define a small and efficient index on secondary storage?

Searching LZ78 compressed texts: the LZ-index
• LZ78 parses the text into phrases
• Different types of occurrences…
[Figure: LZTrie and RevTrie]

Occurrences of Type 1
Occurrences contained in a single phrase
• Shortest possible LZ78 phrases containing P: by LZ78, P is a suffix of such phrases
[Figure: LZTrie; subtrees containing occurrences of type 1]

Occurrences of Type 1
Occurrences contained in a single phrase
• As P is a suffix of such phrases, P^r is a prefix of the corresponding reverse phrases
• We need the Reverse Trie (RevTrie) to solve this problem
• Navigation between tries!
[Figure: LZTrie (P) and RevTrie (P^r)]

Occurrences of Type 2
Occurrences spanning two consecutive phrases, k-1 and k, with P = P1 P2
• RevTrie: phrases ending with P1 (found by searching for P1^r)
• LZTrie: phrases starting with P2
[Figure: RevTrie and LZTrie nodes for phrases k-1 and k, connected through Node and RNode]

Occurrences of Type 3
Occurrences spanning more than two consecutive phrases
• O(m^2) occurrences of type 3 in the worst case
• O(m^2) random accesses in the worst case

The LZ-index
• A compressed full-text self-index based on the LZTrie [Navarro, JDA 2004]
• Four data structures compose the LZ-index:
  • LZTrie: the trie formed by all the LZ78 phrases B_0, …, B_n
  • RevTrie: the trie formed by all the reverse LZ78 phrases B^r_0, …, B^r_n
  • Node: a mapping from phrase identifiers to their node in LZTrie
  • RNode: a mapping from phrase identifiers to their node in RevTrie
• Overall: the LZ-index requires 4n log n (1 + o(1)) = 4uH_k + o(u log σ) bits, for k = o(log_σ u)
• We don't need to store the text!
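
The LZ78 parsing and the three occurrence types described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the LZ-index search itself: the names `lz78_parse` and `occurrence_types` are hypothetical, the trie is a plain dictionary rather than the compressed LZTrie, and occurrences are located by brute force with `str.find` so that they can be classified by the number of phrases they span.

```python
def lz78_parse(text):
    """Parse text into LZ78 phrases and return (phrases, trie).

    Each new phrase is the longest previously seen phrase plus one character.
    The trie maps (parent_phrase_id, char) -> phrase_id; id 0 is the empty phrase.
    """
    trie = {}
    phrases = []          # phrase strings, in text order
    node, current = 0, ""
    next_id = 1
    for ch in text:
        key = (node, ch)
        if key in trie:
            # Still inside a known phrase: descend in the trie.
            node = trie[key]
            current += ch
        else:
            # New phrase = known phrase + ch: add a trie node and restart.
            trie[key] = next_id
            next_id += 1
            phrases.append(current + ch)
            node, current = 0, ""
    if current:
        # The text ended inside a known phrase (no unique terminator assumed).
        phrases.append(current)
    return phrases, trie


def occurrence_types(text, pattern):
    """Return (start, type) for each occurrence of pattern in text.

    Type 1: inside a single phrase; type 2: spans two consecutive phrases;
    type 3: spans more than two phrases. Brute force, for illustration only.
    """
    if not pattern:
        return []
    phrases, _ = lz78_parse(text)
    # phrase_of[i] = index of the phrase that covers text position i
    phrase_of = []
    for k, phrase in enumerate(phrases):
        phrase_of.extend([k] * len(phrase))
    result = []
    start = text.find(pattern)
    while start != -1:
        end = start + len(pattern) - 1
        spanned = phrase_of[end] - phrase_of[start] + 1
        result.append((start, min(spanned, 3)))
        start = text.find(pattern, start + 1)
    return result


if __name__ == "__main__":
    phrases, _ = lz78_parse("abababab")
    print(phrases)
    print(occurrence_types("abababab", "ab"))
    print(occurrence_types("abababab", "baba"))
```

In this toy text the type 1 occurrences of "ab" fall in the phrase "ab" itself and in its trie descendant "aba", which matches the slide's picture of type 1 occurrences living in subtrees of the LZTrie.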

The LZ-index on secondary storage
• The LZ-index was originally designed for main memory
• It has a non-regular pattern of access to the index components
• We define a version of the LZ-index for secondary storage
• We divide the problem as follows:
  • Solving the basic trie operations
  • Reducing the navigation between structures

Solving the basic trie operations
• We cut the tries into disjoint blocks of size at most B, using the Clark and Munro strategy
• Every block stores a subtree of the whole trie
• We arrange these blocks in a tree by adding inter-block pointers
• We are able to compute the following with one extra disk access in the worst case:
  • parent(x)
  • child(x, a)
  • depth(x)
  • subtreesize(x)
  • preorder(x)
  • ancestor(x, y)

Reducing the navigation between structures
Occurrences contained in a single phrase
• We avoid random accesses to report only one occurrence
• We would need a data structure capable of finding all these subtrees without random accesses
[Figure: LZTrie (P) and RevTrie (P^r); label: "For counting..."]

Reducing the navigation between structures
Occurrences spanning two consecutive phrases
[Figure: phrases k-1 and k with P1 and P2; LZTrie (P2) and RevTrie (P1^r); nodes y and y'; LR mapping]

Reducing the navigation between structures
• We add some redundancy to reduce the number of accesses between index components
• Many random accesses now become a single access + sequential scanning (please read the paper for other technical details)
• The overall space requirement is 8uH_k + o(u log σ) bits, for any k = o(log_σ u)
• The space can be dropped to 6uH_k + o(u log σ) bits if we only need to count pattern occurrences

Experimental results
• We indexed:
  • XML file from Pizza&Chili Corpus (200 megabytes) (http://pizzachili.dcc.uchile.cl)
• We searched for 5,000 random patterns
  • count and locate queries
• We assume a disk page of 32 kilobytes (i.e., 8,192 integers of 32 bits)

Experimental results
• We compared against:
  • Suffix Arrays for secondary storage: the two-level hierarchy of [BYBZ, 1996]
  • String B-trees: we use the model provided in [FG, 1996]
  • Compact Pat Trees (CPT) [CM, 1996]

Experimental results (count)
[Figure: plot comparing LZ-index, String B-trees, Suffix Array, and CPT]
• 3.3 times smaller than String B-trees

Experimental results (count)
[Figure: plot comparing LZ-index, String B-trees, Suffix Array, and CPT]

Experimental results (locate)
[Figure: plot comparing CPT, Suffix Array, String B-trees, and LZ-index]
• Average number of accesses to report the first occurrence:
  • LZ-index: 11
  • String B-trees: 12
• 2.6 times smaller than String B-trees

Experimental results (locate)
[Figure: plot comparing CPT, String B-trees, Suffix Array, and LZ-index]

Conclusions
• The LZ-index can be adapted to work on secondary storage
  • Requiring up to 8uH_k + o(u log σ) bits, for any k = o(log_σ u)
• Our index is significantly smaller than any other practical secondary-memory data structure
• The LZ-index requires more disk accesses
  • But a smaller index would have a smaller seek time

Future work
• We assumed a constant main-memory space, but…
  • To implement our index in a real practical setting
• Handling dynamism (String B-trees require 13.5 times the text size!)
• Direct construction on secondary storage
  • Adapting [AN, ISAAC 2005] to work on disk

Questions?
Contact: [email protected], [email protected]

Thanks!