Index construction: Compression of postings
Paolo Ferragina
Dipartimento di Informatica, Università di Pisa
Reading: 5.3 and a paper

γ-code for integer encoding
• γ(x) = (Length - 1) zeros, followed by x in binary
• x > 0 and Length = ⌊log2 x⌋ + 1
• e.g., 9 is represented as <000,1001>
• The γ-code for x takes 2⌊log2 x⌋ + 1 bits (i.e., a factor of 2 from optimal)
• Optimal for Pr(x) = 1/(2x^2), and i.i.d. integers
• It is a prefix-free encoding…
• Exercise: given the following sequence of γ-coded integers, reconstruct the original sequence:
  0001000001100110000011101100111
  Answer: 8 6 3 59 7

δ-code for integer encoding
• δ(x) = γ(Length), followed by x in binary
• Use γ-coding to reduce the length of the first field
• Useful for medium-sized integers
• e.g., 19 is represented as <00,101,10011>
• δ-coding x takes about log2 x + 2 log2(log2 x) + 2 bits
• Optimal for Pr(x) = 1/(2x (log x)^2), and i.i.d. integers

Variable-byte codes [10.2 bits per TREC12]
• Goal: very fast (de)compression, hence byte-aligned codes
• Given the binary representation of an integer:
  - Prepend 0s to get a multiple-of-7 number of bits
  - Form groups of 7 bits each
  - Prepend the tag bit 0 to the last group and the tag bit 1 to all the other groups
• e.g., v = 2^14 + 1, binary(v) = 100000000000001, encoded as 10000001 10000000 00000001
• Note: we waste 1 bit per byte, and on average 4 bits in the first byte. But it is a prefix code, and it also encodes the value 0!

PForDelta coding
• Use b (e.g., 2) bits to encode each of 128 numbers, or create exceptions
• Example with b = 2 (figure): in the input 3 42 2 3 3 1 1 … 3 3 23 1 2, the values that fit in 2 bits are packed as 11 10 11 11 01 01 … 11 11 01 10, while 42 and 23 are stored as exceptions
• A block of 128 numbers = 256 bits = 32 bytes
• Translate data: [base, base + 2^b - 1] → [0, 2^b - 1]
• Encode exceptions: ESC or pointers
• Choose b to encode 90% of the values, or trade off: a larger b wastes more bits, a smaller b creates more exceptions

Random access to postings lists and other data types (e.g., encoding skips?)
Paolo Ferragina
Dipartimento di Informatica, Università di Pisa

A basic problem!
T = Abaco#Battle#Car#Cold#Cod#Defense#Google#Yahoo#....
• Array of pointers
• log m bits per string = n log m bits in total = 32n bits
• We could drop the separating NULL

Independent of the string-length distribution:
• It is effective for few strings
• It is bad for medium/large sets of strings

A basic problem!
T = Abaco#Battle#Car#Cold#Cod#Defense#Google#Yahoo#....
X = AbacoBattleCarColdCodDefenseGoogleYahoo....
B = 10000100000100100010010000001000010000....

A = 10#2#5#6#20#31#3#3#....
X = 1010101011101010111111111....
B = 1000101001001000100001010....
• We could drop the msb of each binary value (it is always 1)
• We aim at achieving ≈ n log(m/n) bits < n log m

Another textDB: Labeled Graph

Rank/Select
Wish to index the bit vector B compressed, where m = |B| and n = #1s.
B = 00101001010101011111110000011010101....
• Rank_b(i) = number of b in B[1,i], e.g., Rank1(6) = 2
• Select_b(i) = position of the i-th b in B, e.g., Select1(3) = 8
There exist data structures that solve this problem in O(1) query time and efficient space (i.e., +o(m) additional bits).

The Bit-Vector Index: m + o(m)
m = |B|, n = #1s
Goal: B is read-only, and the additional index takes o(m) bits.

Rank
B = 00101001010101011 1111100010110101 0101010111000....
(Figure: B is split into buckets of Z bits, each storing the absolute Rank1 at its boundary (8, 18, …); every bucket is split into blocks of z bits, each storing the bucket-relative Rank1 (4, 5, 8 in the second bucket).)
Setting Z = poly(log m) and z = (1/2) log m:
• Space is |B| + (m/Z) log m + (m/z) log Z + o(m) = m + O(m loglog m / log m) bits
• The o(m) term is a lookup table over all blocks of z bits:
    block   pos   #1
    0000    1     0
    ....    ...   ...
    1011    2     1
    ....
• Rank time is O(1)
• The o(m) term is crucial in practice; B is untouched (not compressed)

The Bit-Vector Index (Select)
m = |B|, n = #1s
B = 0010100101010101111111000001101010101010111001....
• Split B into pieces that each contain k consecutive 1s; the size r of a piece is variable
• Sparse case: if r > k^2, store explicitly the positions of the k 1s
• Dense case: if k ≤ r ≤ k^2, recurse... One level is enough!! ... still need a table of size o(m)
• Setting k ≈ polylog m: space is m + o(m), and B is not touched!
• Select time is O(1)

There exists a Bit-Vector Index taking o(m) extra bits and constant time for Rank/Select. B is read-only!
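The two-level Rank directory described above can be sketched as follows. This is a minimal illustration, not the O(1) structure of the slides: the class name `RankSelect` and the default sizes `Z=16`, `z=4` are assumptions made for the example, the in-block count uses `str.count` as a stand-in for the o(m)-bit lookup table, and `select1` is a plain binary search over `rank1` (O(log m) time) rather than the sparse/dense Select index.

```python
class RankSelect:
    """Two-level rank directory over a read-only bit string (1-based positions)."""

    def __init__(self, bits, Z=16, z=4):
        assert Z % z == 0, "bucket size must be a multiple of the block size"
        self.bits, self.Z, self.z = bits, Z, z
        self.bucket_abs = []  # absolute number of 1s before each bucket
        self.block_rel = []   # number of 1s before each block, relative to its bucket
        total = rel = 0
        for start in range(0, len(bits), z):
            if start % Z == 0:          # a new bucket begins here
                self.bucket_abs.append(total)
                rel = 0
            self.block_rel.append(rel)
            ones = bits.count("1", start, start + z)
            total += ones
            rel += ones
        self.ones = total

    def rank1(self, i):
        """Number of 1s in B[1, i]."""
        if i <= 0:
            return 0
        last = i - 1                    # 0-based index of the last bit counted
        block = last // self.z
        # Stand-in for the lookup table: count the 1s inside the current block.
        in_block = self.bits.count("1", block * self.z, i)
        return self.bucket_abs[last // self.Z] + self.block_rel[block] + in_block

    def select1(self, j):
        """Position of the j-th 1, found by binary search on rank1."""
        if not 1 <= j <= self.ones:
            raise ValueError("j out of range")
        lo, hi = 1, len(self.bits)
        while lo < hi:
            mid = (lo + hi) // 2
            if self.rank1(mid) >= j:
                hi = mid
            else:
                lo = mid + 1
        return lo


B = "00101001010101011111110000011010101"
index = RankSelect(B)
print(index.rank1(6))    # 2, as in the Rank/Select slide
print(index.select1(3))  # 8, as in the Rank/Select slide
```

B itself is never modified; only the two small count arrays are added on top of it, which mirrors the "B is read-only" goal stated above.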
Elias-Fano index&compress
(Figure: example with z = 3, w = 2; each position of a 1 is split into z high bits and w low bits, the low parts form L, and the buckets 0 1 2 3 4 5 6 7 of high parts are written in unary to form H.)
If w = log(m/n) and z = log n, where m = |B| and n = #1s, then
- L takes n w = n log(m/n) bits
- H takes n 1s + n 0s = 2n bits
• Select1(i) on B uses L and (Select1(H,i) - i), i.e., a Select1 on H, in +o(n) space
• Actually you can do binary search over B, but compressed!

If you wish to play with Rank and Select
• m/10 + n log(m/n) bits
• Rank in 0.4 msec, Select in < 1 msec
• vs 32n bits of explicit pointers
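The Elias-Fano split into L and H can be sketched as below. Several things here are assumptions made for illustration: the function names `ef_encode` and `ef_access`, the use of 0-based integer values in [0, m) instead of 1-based positions in B, the omission of the trailing 0s of H, and a linear scan in place of the o(n)-space Select1 structure on H. The sample values are hypothetical, chosen so that m = 32 and n = 8 give z = 3 and w = 2.

```python
from math import floor, log2


def ef_encode(values, m):
    """Elias-Fano encode a non-empty sorted list of integers in [0, m)."""
    n = len(values)
    w = max(0, floor(log2(m / n)))        # number of low bits per element
    low = [x & ((1 << w) - 1) for x in values]
    high_bits = []                        # H: unary-coded high parts
    prev = 0
    for x in values:
        h = x >> w
        high_bits.extend([0] * (h - prev))  # one 0 per bucket boundary crossed
        high_bits.append(1)                 # one 1 per element
        prev = h
    return w, low, high_bits


def ef_access(w, low, high_bits, i):
    """Return the i-th smallest value (i is 1-based), i.e. Select1(i) on B."""
    seen = 0
    for pos, bit in enumerate(high_bits, start=1):  # naive Select1(H, i)
        seen += bit
        if bit and seen == i:
            # The i-th 1 of H is preceded by (pos - i) zeros: that is the high part.
            return ((pos - i) << w) | low[i - 1]
    raise IndexError(i)


values = [1, 4, 7, 18, 24, 26, 30, 31]    # hypothetical sorted values, m = 32
w, low, high_bits = ef_encode(values, 32)
decoded = [ef_access(w, low, high_bits, i) for i in range(1, len(values) + 1)]
print(w, decoded == values)
```

Since the decoded values come out in sorted order and each is reachable by index, a binary search can run directly on the compressed representation, which is the point made in the last bullet of the slide.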