Index construction: Compression of postings
Paolo Ferragina
Dipartimento di Informatica
Università di Pisa
Sec. 3.1

Delta encoding
Gaps: 1 1 2 7 20 14 …
Then you compress the resulting integers with variable-length prefix-free codes, as follows…

Variable-byte codes
• Wish to get very fast (de)compression: byte-align
• Given the binary representation of an integer:
  - Prepend 0s to get a multiple-of-7 number of bits
  - Form groups of 7 bits each
  - Prepend the bit 0 to the last group, and the bit 1 to the other groups (tagging)
• e.g., v = 2^14 + 1, binary(v) = 100000000000001
  → 10000001 10000000 00000001
• Note: we waste 1 bit per byte, and on average 4 bits in the first byte. But it is a prefix code, and it also encodes the value 0!
• t-nibble: we could design this code over t bits, not just t = 8

PForDelta coding
• Use b (e.g. 2) bits to encode each of 128 numbers, or create exceptions
• A block of 128 numbers = 256 bits = 32 bytes
[Figure: a block of 128 numbers packed with b = 2 bits each; the values 42 and 23 do not fit and are stored as exceptions]
• Translate data: [base, base + 2^b - 2] → [0, 2^b - 2]
• Encode exceptions with the value 2^b - 1
• Choose b to encode 90% of the values, or trade off: a larger b wastes more bits, a smaller b creates more exceptions

γ-code
Layout: 000...0 (Binary length - 1 zeros) followed by x in binary (Binary length bits)
• x > 0 and Binary length = ⌊log2 x⌋ + 1
• e.g., 9 is represented as 0001001
• The γ-code for x takes 2 ⌊log2 x⌋ + 1 bits (i.e. a factor of 2 from optimal)
• Optimal for Pr(x) = 1/(2x^2), and i.i.d. integers
• It is a prefix-free encoding…
• Given the following sequence of γ-coded integers, reconstruct the original sequence:
  0001000001100110000011101100111
  → 8 6 3 59 7

δ-code
Layout: γ(Binary length) followed by x in binary
• Use γ-coding to reduce the length of the first field
• Useful for medium-sized integers
• e.g., 19 is represented as 0010110011
• δ-coding x takes about log2 x + 2 log2(log2 x) + 2 bits.
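As an illustration of the variable-byte, γ- and δ-codes described above, here is a minimal Python sketch. It is not taken from the slides: the function names (vbyte_encode, gamma_encode, gamma_decode_all, delta_encode) are hypothetical, and codewords are kept as strings of '0'/'1' characters for readability instead of being packed into machine words. The δ-coder follows the slide's own example, which writes x in full after γ(Binary length).

```python
def vbyte_encode(v):
    """Variable-byte code of an integer v >= 0, as a bit string."""
    bits = bin(v)[2:]                         # binary representation without '0b'
    bits = "0" * (-len(bits) % 7) + bits      # pad in front to a multiple of 7 bits
    groups = [bits[i:i + 7] for i in range(0, len(bits), 7)]
    last = len(groups) - 1
    # Tag bit in front of each group: 0 for the last group, 1 for the others.
    return "".join(("0" if i == last else "1") + g for i, g in enumerate(groups))


def gamma_encode(x):
    """gamma-code of x > 0: (Binary length - 1) zeros, then x in binary."""
    if x <= 0:
        raise ValueError("gamma-code is defined for x > 0 only")
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b


def gamma_decode_all(code):
    """Decode a concatenation of well-formed gamma-codewords."""
    out, i = [], 0
    while i < len(code):
        zeros = 0
        while code[i] == "0":                 # count the leading zeros
            zeros += 1
            i += 1
        out.append(int(code[i:i + zeros + 1], 2))   # next zeros+1 bits are x
        i += zeros + 1
    return out


def delta_encode(x):
    """delta-code as in the slide example: gamma(Binary length), then x in binary."""
    if x <= 0:
        raise ValueError("delta-code is defined for x > 0 only")
    b = bin(x)[2:]
    return gamma_encode(len(b)) + b


print(vbyte_encode(2**14 + 1))   # 100000011000000000000001
print(gamma_encode(9))           # 0001001
print(gamma_decode_all("0001000001100110000011101100111"))   # [8, 6, 3, 59, 7]
print(delta_encode(19))          # 0010110011
```

The four printed values match the worked examples on the slides. A production coder would operate on packed bits or bytes; the string representation is chosen here only to keep the grouping visible.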
• The δ-code is optimal for Pr(x) = 1/(2x (log x)^2), and i.i.d. integers

Elias-Fano
If w = log (m/n) and z = log n, where m = |B| and n = #1s, then
- L takes n w = n log (m/n) bits
- H takes n 1s + n 0s = 2n bits
[Figure: example with z = 3, w = 2; H encodes in unary the buckets 0 … 7, and is accessed via Select1 on H]
How to get the i-th number? Take the i-th group of w bits in L, and then represent the value (Select1(H,i) - i) in z bits.

Rank and Select data structures

A basic problem!
D = Abaco, Battle, Car, Cold, Cod ....
Array of n string pointers to strings of total length m
• n log m bits = 32 n bits
• it depends on the number of strings
• it is independent of string length

D = Abaco Battle Car Cold Cod ....
B = 10000 100000 100 1000 100 ....
(Spaces are introduced for simplicity)

Rank/Select
Wish to index the bit vector B (possibly compressed).
B = 00101001010101011111110000011010101....
• Rankb(i) = number of b in B[1,i], e.g. Rank1(6) = 2
• Selectb(i) = position of the i-th b in B, e.g. Select1(3) = 8
Two approaches, where m = |B| and n = #1s:
(1) Takes |B| + o(|B|) bits of space
(2) Aims at achieving n log (m/n) bits

The Bit-Vector Index: |B| + o(|B|)
Goal. B is read-only, and the additional index takes o(m) bits.

Rank
B = 00101001010101011 1111100010110101 0101010111000....
[Figure: B is split into big blocks of Z bits, each storing an absolute Rank1, and small blocks of z bits, each storing a bucket-relative Rank1]
Setting Z = poly(log m) and z = (1/2) log m, a table gives the number of 1s for every block of z bits and every position in it:

  block   pos   #1
  0000    1     0
  ....    ...   ...
  1011    2     1
  ....

• Extra space is + (m/Z) log m + (m/z) log Z + o(m) = + O(m loglog m / log m) = o(m) bits
• Rank time is O(1)
• The term o(m) is crucial in practice; B is untouched (not compressed)
There exists a Bit-Vector Index taking o(m) extra bits and constant time for Rank/Select. B is needed and read-only!

The Select operation
m = |B|, n = #1s
B = 0010100101010101111111000001101010101010111001....
• B is split into subarrays whose size r is variable: each one is extended until it includes k 1s
• Sparse case: if r > k^2, we store explicitly the positions of the k 1s
• Dense case: k ≤ r ≤ k^2, recurse... One level is enough!! ... still need a table of size o(m).
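To make the two-level Rank directory described above concrete, here is a minimal Python sketch. It rests on several assumptions that are not in the slides: the class name RankIndex and the helper select1 are hypothetical, the block sizes Z = 16 and z = 4 are toy values rather than poly(log m) and (1/2) log m, and the in-block count uses str.count as a stand-in for the precomputed lookup table. The select1 helper uses binary search over rank1, which is not the constant-time Select scheme of the slides.

```python
class RankIndex:
    """Two-level Rank1 directory over a read-only bit string B (1-based positions)."""

    def __init__(self, B, Z=16, z=4):
        assert Z % z == 0, "big-block size must be a multiple of the small-block size"
        self.B, self.Z, self.z = B, Z, z
        self.big = []     # absolute Rank1 at the start of each big block
        self.small = []   # Rank1 at the start of each small block, relative to its big block
        total = rel = 0
        for start in range(0, len(B), z):
            if start % Z == 0:            # a new big block begins here
                self.big.append(total)
                rel = 0
            self.small.append(rel)
            ones = B[start:start + z].count("1")
            total += ones
            rel += ones

    def rank1(self, i):
        """Number of 1s in B[1, i]."""
        i = min(i, len(self.B))
        if i <= 0:
            return 0
        s = (i - 1) // self.z                       # small block containing position i
        inside = self.B[s * self.z:i].count("1")    # stands in for the table lookup
        return self.big[(s * self.z) // self.Z] + self.small[s] + inside


def select1(idx, k):
    """Position of the k-th 1, by binary search on rank1 (O(log m), not O(1))."""
    if k < 1 or k > idx.rank1(len(idx.B)):
        raise IndexError("fewer than k ones in B")
    lo, hi = 1, len(idx.B)
    while lo < hi:
        mid = (lo + hi) // 2
        if idx.rank1(mid) >= k:
            hi = mid
        else:
            lo = mid + 1
    return lo


idx = RankIndex("00101001010101011111110000011010101")
print(idx.rank1(6))      # 2
print(select1(idx, 3))   # 8
```

The two printed values agree with the Rank1(6) = 2 and Select1(3) = 8 examples given for this bit vector. Note that B itself is only read, never modified, which mirrors the read-only requirement of the index.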
• For Select, setting k ≈ polylog m: the extra space is + o(m), and B is not touched!
• Select time is O(1)
There exists a Bit-Vector Index taking o(m) extra bits and constant time for Rank/Select. B is needed and read-only!

Via Elias-Fano (B is not needed)
Recall that by setting w = log (m/n) and z = log n, where m = |B| and n = #1s, then
- Space = n log (m/n) bits + 2n bits
[Figure: the example with z = 3, w = 2 and buckets 0 … 7]
• Build Select1 on H, so we need extra |H| + o(|H|) bits = 2n + o(n) bits
• Select1(i) on B uses L and (Select1(H,i) - i), in + o(n) space
• Rank1(i) on B needs binary search over B

If you wish to play with Rank and Select
• m/10 + n log (m/n) bits, vs 32n bits of explicit pointers
• Rank in 0.4 μsec, Select in < 1 μsec
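The Elias-Fano access rule above (take the i-th entry of L and combine it with Select1(H,i) - i) can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the slides' implementation: the function names are hypothetical, w is rounded down to an integer, values are treated as 0-based positions in [0, m), Select1 on H is a linear scan instead of an o(n)-space constant-time structure, and the sample values are made up so that z = 3 and w = 2 as in the slide's example.

```python
def ef_encode(values, m):
    """Elias-Fano encoding of a sorted, non-empty list of integers in [0, m)."""
    n = len(values)
    w = max(0, (m // n).bit_length() - 1)      # floor(log2(m/n)) low bits per value
    low_mask = (1 << w) - 1
    L = [v & low_mask for v in values]         # the n groups of w low bits
    buckets = ((m - 1) >> w) + 1               # number of distinct high parts
    counts = [0] * buckets
    for v in values:
        counts[v >> w] += 1
    # Each bucket is written in unary: one 1 per element, closed by a 0.
    H = "".join("1" * c + "0" for c in counts)
    return L, H, w


def select1_bits(H, i):
    """1-based position of the i-th 1 in H (linear scan, for illustration only)."""
    seen = 0
    for pos, bit in enumerate(H, start=1):
        if bit == "1":
            seen += 1
            if seen == i:
                return pos
    raise IndexError("fewer than i ones in H")


def ef_access(L, H, w, i):
    """The i-th encoded value (1-based): high part from H, low part from L."""
    high = select1_bits(H, i) - i              # number of 0s before the i-th 1
    return (high << w) | L[i - 1]


def ef_rank(L, H, w, x):
    """Number of encoded values <= x, by binary search over ef_access."""
    lo, hi = 0, len(L)
    while lo < hi:
        mid = (lo + hi) // 2
        if ef_access(L, H, w, mid + 1) <= x:
            lo = mid + 1
        else:
            hi = mid
    return lo


values = [1, 4, 7, 18, 24, 26, 30, 31]         # hypothetical data: n = 8, m = 32
L, H, w = ef_encode(values, 32)
print(w, H)                                    # 2 1011000100110110
print(ef_access(L, H, w, 4))                   # 18
print(ef_rank(L, H, w, 18))                    # 4
```

With n = 8 and m = 32, H has 8 ones and 8 zeros, i.e. 2n bits, and L holds n groups of w = 2 bits, in line with the space bound stated above. Rank is answered by binary search because the encoding supports direct access to the i-th value but not a direct lookup by position.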