Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
A Simple Optimal Representation for Balanced Parentheses Richard Geary, Naila Rahman, Rajeev Raman (University of Leicester, UK) and Venkatesh Raman (Institute for Mathematical Sciences, Chennai, India) 5th July 2004 CPM 2004 1 A Parentheses Data Structure • Given: Balanced string of 2n parentheses. ( ( ( ( ))) ( ) ( ) ) • Support operations: – ENCLOSE ( i ) – FINDCLOSE ( i ), FINDOPEN( i ) – EXCESS ( i ) • Applications to suffix tree, ordinal trees and stacksortable permutations. 5th July 2004 CPM 2004 2 Parentheses Representation 2n bits, O(n) time. Θ(n lg n ) bits, O(1) time. O(n) bits, O(1) time. [Jacobson, `89] 2n+o(n) bits, O(1) time. [Munro, Raman, `01] 2n+o(n) bits, O(1) time. New data structure. • Our new DS – is simpler (no perfect hash tables), – smaller o(n) term, – uniform o(n) time and space construction algorithm. • Implemented and shown to be quite practical – far more compact than D/S using naïve representation, – speed comparable to D/S using naïve representation. 5th July 2004 CPM 2004 3 XML • XML: eXtensible Markup Language – de facto standard for electronic data interchange. • Document Object Model (DOM) standard API for manipulating XML documents – holds all data in memory, – large memory usage. 5th July 2004 CPM 2004 4 Example XML document person <person> <name> <first>Bill</first> <surname>Bloggs</surname> </name> name dob <dob> <day>1</day> <month>April</month> <year>1961</year> </dob> </person> firstname surname day month year • DOM NODE interface has methods PARENT(x), NEXTSIB(x), PREVSIB(x), LASTCHILD(x),FIRSTCHILD(x) 5th July 2004 CPM 2004 5 Obvious representation • 2n pointers – DOM: 3n. • Ω(n log n) bits. 5th July 2004 CPM 2004 6 Using parentheses 1 <person> <name> <first>Bill</first> <surname>Bloggs</surname> </name> <dob> <day>1</day> <month>April</month> <year>1961</year> </dob> </person> 2 3 4 5 6 7 8 parentheses representation: ( ( ( ) ( ) ) ( ( ) ( ) ( ) ) ) 12 3 4 5 6 7 8 2n + o(n) bits for tree structure. 5th July 2004 CPM 2004 7 Node interface ops using Parentheses DS Node interface PARENT NEXTSIB PREVSIB LASTCHILD 5th July 2004 Parentheses DS ENCLOSE FINDCLOSE FINDOPEN FINDCLOSE, FINDOPEN CPM 2004 8 Succinct DOM • Succinct DOM: – uses far less space than standard DOM, – performance competitive with DOM. • Node interface implemented by natural parentheses ops. • Operations supported by parentheses data structures – Jacobson `89, – Munro and Raman `01, – Our new data structure. 5th July 2004 CPM 2004 9 Our new D/S Input: balanced string of 2n parentheses. Assume recursive data structure to store balanced string of 2N 2n parentheses. If N is O(n / lg2 n) store answers explicitly for every pair of parentheses. Otherwise Divide into blocks of size B (lg N ) / 2 Number of blocks 4 N / lg N 5th July 2004 CPM 2004 10 FINDCLOSE(x) ( ( ( ) ( ( ( ) ) ) ( ) ( 1 2 3 4 5 6 8 9 7 ( ) ) ( ) ) ) 10 • FINDCLOSE(3)? • Matching parenthesis inside block – near parenthesis. • Pre-computed table stores position of matching parentheses for all near parentheses. – O(1) time if near parenthesis. – Table size is O( N (lg N 2 )) 5th July 2004 CPM 2004 11 Pioneer Parentheses ( ( ( ) 1 2 3 ( ( ( ) ) ) ( ) 4 5 6 7 ( ( ) ) ( ) ) ) 8 9 10 FINDCLOSE(5)? Matching parenthesis outside block – far parenthesis. b(p) = block# of parenthesis at position p ( p) = position of match of p q is 1st far parenthesis before p p is pioneer if b( ( p))! b( (q)) At most 2β-3 open pioneers. Similarly at most 2β-3 close pioneers. 5th July 2004 CPM 2004 12 Pioneer Family ( ( ( ) 1 2 3 ( ( ( ) ) ) ( ) 4 5 6 7 ( ( ) ) ( ) ) ) 8 9 10 ( ( ) ) • Pioneer family: set of all opening and closing pioneers along with their matching parentheses. • Balanced string of size at most 4β-6. 5th July 2004 CPM 2004 13 Our D/S 2N ( ( ( ) ( ( ( ) ) ) ( ) ( ( ) ) ( ) ) ) NND O(N / lg N) ( ( ) ) Two levels of recursion. When pioneer family is O(N/lg2N) we store explicit answers. 5th July 2004 CPM 2004 14 Space usage NND uses O(N lg lg N / lg N) bits. Tables use O( N lg lg N / lg N) bits. O(N lg N) if N is O( n / lg 2 n) S(N ) 2 N S( 16 N/ lg N) O(N lg lg N/ lg N) otherwise S(n) = 2n+ O(n lg lg n / lg n) = 2n +o(n) bits. 5th July 2004 CPM 2004 15 Pseudo-pioneers • Near blocks: blocks which have no pioneers. • Insert pseudo-pioneers at start and end of every near block. – Pseudo-pioneers do not effect FINDOPEN(x), FINDCLOSE(x), ENCLOSE(x) • Gap between pioneers now at most 2B = O(lg N). 5th July 2004 CPM 2004 16 NND • • • ( ( ( ) ( ( ( ) ) ) ( ) ( ( ) ) ( ) ) ) 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 2n-bit vector used to find the pioneer for a far parenthesis. If pioneer at pos i in parentheses string then 1 at i in NND. Operations we need: – – • • Find address of most recent 1 at position i r = Rank(i) p = Select(r) Find ith 1in bit vector p = Select(i) We want succinct representation. D/S should be simple and fast. 5th July 2004 CPM 2004 17 NND Bit vector of length M with N 1s. Gap between 1s at most (lg M)c. t = lg M / 2 c lg lg M. 5th July 2004 CPM 2004 18 Select(i) • Find ith 1 in bit vector. • Array A1 stores position of every tth 1 – Space is N lg M / t O( M lg lg M / lg M ) bits • Array A2 stores gaps between consecutive 1s – Space is O( N lg lg M ) or O( M lg lg M / lg M ) bits. • Table T1 allows us to lookup sum of upto t gap. – Space is O( M lg lg M ) bits. SELECT(i) i’ = (i 1)/t i’’ = (i+1)/mod t y = concat of A2[i’+1],..,A2[i’+i’’] return A1[x] + T1[y] 5th July 2004 CPM 2004 19 Rank(i) • Prefix sum at position i. • Need two more arrays and tables of size at most O(M lg lg M / lg M) bits. 5th July 2004 CPM 2004 20 Implementation Details • • • • C++ on Sun UltraSparc-III and Pentium 4. Implemented new and optimised Jacobson D/S. CenterPoint XML for DOM. Sample of 12 XML documents of varying sizes and node counts. • Blocksizes 32, 64, 128 and 256. • Test was depth first tree walk, counting nodes of a given XML type. 5th July 2004 CPM 2004 21 Space usage and performance • Space usage for tree structure – Std DOM: 96 bits per node. – Jacobson: 3.3 – 16 bits per node. – New D/S: 2.9 – 12.8 bits per node. • Avg performance for succinct D/S relative to std DOM – UltraSparc: 1 to 2.5 times slower. – Pentium 4: 1.7 to 4 times slower. 5th July 2004 CPM 2004 22 Conclusions and Future work • Conceptually simple succinct representation for balanced parentheses with O(1) time ops. • o(n) time and space construction algorithm. • Improved lower bound term for space bound. • Relative performance very good on UltraSparc but poorer on Pentium 4, which has small cache – Cache optimisation is an interesting problem. • Complete set of D/S for succinct DOM. 5th July 2004 CPM 2004 23