Download 2 - DidaWiki - Università di Pisa

Index construction:
Compression of postings
Paolo Ferragina
Dipartimento di Informatica
Università di Pisa
Sec. 3.1
Delta encoding
[Figure: example sequence of integers: 1 1 2 7 20 14 …]
Then you compress the resulting integers with
variable-length prefix-free codes, as follows…
Variable-byte codes

Wish to get very fast (de)compression → byte-align

Given a binary representation of an integer:
• Prepend 0s to the front, to get a multiple-of-7 number of bits
• Form groups of 7 bits each
• Prepend the bit 0 to the last group, and the bit 1 to the other
  groups (tagging)

e.g., v = 2^14 + 1 → binary(v) = 100000000000001
10000001 10000000 00000001
Note: We waste 1 bit per byte, and on average 4 bits in the first byte.
But it is a prefix code, and it also encodes the value 0!
T-nibble: We could design this code over t bits, not just t = 8
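The tagging steps above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names vbyte_encode and vbyte_decode are made up here, and the byte order (most significant group first) is chosen to match the example on the slide.

```python
def vbyte_encode(v):
    """Variable-byte encode a non-negative integer, most significant group first."""
    if v < 0:
        raise ValueError("v must be non-negative")
    groups = [v & 0x7F]              # the lowest 7 bits form the last group
    v >>= 7
    while v > 0:
        groups.append(v & 0x7F)
        v >>= 7
    groups.reverse()                 # most significant group first
    # tag bit 1 on every group except the last one, which keeps tag bit 0
    return bytes([g | 0x80 for g in groups[:-1]] + [groups[-1]])


def vbyte_decode(data):
    """Decode a concatenation of variable-byte codes into a list of integers."""
    values, cur = [], 0
    for byte in data:
        cur = (cur << 7) | (byte & 0x7F)
        if byte & 0x80 == 0:         # tag 0 marks the last byte of a value
            values.append(cur)
            cur = 0
    return values
```

With these definitions, vbyte_encode(2**14 + 1) yields the three bytes 10000001 10000000 00000001 of the example, and the value 0 is encoded as the single byte 00000000.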
PForDelta coding
Use b (e.g. 2) bits to encode each of 128 numbers, or create exceptions
[Figure: a block of small integers packed into 2-bit codes (10 01 10 01 01 … 10 10); the out-of-range values 42 and 23 are stored separately as exceptions.]
a block of 128 numbers = 256 bits = 32 bytes
Translate data: [base, base + 2^b - 2] → [0, 2^b - 2]
Encode exceptions with value 2^b - 1
Choose b to encode 90% of the values, or trade off:
larger b wastes more bits, smaller b creates more exceptions
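A minimal sketch of the translate-and-escape idea, under simplifying assumptions: the names pfor_encode and pfor_decode are hypothetical, base is taken as the minimum of the block, the codes are kept as a Python list standing in for the packed b-bit fields, and exceptions are simply appended to a side list in order of appearance (real implementations differ in how they lay out exceptions).

```python
def pfor_encode(block, b):
    """Return (base, codes, exceptions) for one block, using b bits per code."""
    base = min(block)
    escape = (1 << b) - 1            # the code 2^b - 1 marks an exception
    codes, exceptions = [], []
    for x in block:
        d = x - base
        if d < escape:               # d lies in [0, 2^b - 2]: store it directly
            codes.append(d)
        else:                        # too large: emit the escape code
            codes.append(escape)
            exceptions.append(x)
    return base, codes, exceptions


def pfor_decode(base, codes, exceptions, b):
    """Invert pfor_encode, reading exceptions in order of appearance."""
    escape = (1 << b) - 1
    pending = iter(exceptions)
    return [next(pending) if c == escape else base + c for c in codes]
```

For instance, encoding the (made-up) block [2, 3, 42, 2, 1, 2, 1, 1, 2, 2, 23, 1] with b = 2 keeps ten values as 2-bit codes and moves 42 and 23 to the exception list.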
γ-code
γ(x) = 0000...0 (Length - 1 zeros) followed by x in binary (Length bits)

x > 0 and Length = ⌊log2 x⌋ + 1
e.g., 9 represented as 0001001.

γ-code for x takes 2 ⌊log2 x⌋ + 1 bits
(i.e., a factor of 2 from optimal)

Optimal for Pr(x) = 1/(2x^2), and i.i.d. integers
It is a prefix-free encoding…

Given the following sequence of γ-coded
integers, reconstruct the original sequence:
0001000001100110000011101100111
Answer: 8, 6, 3, 59, 7
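The exercise can be checked with a short sketch of γ-coding over strings of '0'/'1' characters. The function names are illustrative, and the decoder assumes a well-formed input.

```python
def gamma_encode(x):
    """γ-code of x > 0: (Length - 1) zeros followed by x in binary."""
    if x <= 0:
        raise ValueError("gamma code is defined only for x > 0")
    binary = bin(x)[2:]
    return "0" * (len(binary) - 1) + binary


def gamma_decode(bits):
    """Decode a concatenation of γ-codes (assumed well-formed)."""
    values, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":        # count the leading zeros: Length - 1
            zeros += 1
            i += 1
        values.append(int(bits[i:i + zeros + 1], 2))   # next Length bits are x
        i += zeros + 1
    return values
```

Here gamma_encode(9) gives 0001001, and decoding the bit string of the exercise gives 8, 6, 3, 59, 7.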
δ-code
δ(x) = γ(Length) followed by x in binary (Length bits)

Use γ-coding to reduce the length of the first field
Useful for medium-sized integers
e.g., 19 represented as 0010110011.

δ-coding x takes about
log2 x + 2 log2(log2 x) + 2 bits.
Optimal for Pr(x) = 1/(2x (log x)^2), and i.i.d. integers
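A sketch of δ-coding as drawn on the slide, where the second field is the full binary representation of x (the example 19 → 0010110011 follows this layout; other presentations drop the leading 1 of x to save a bit). Function names are illustrative and the decoder assumes a well-formed input.

```python
def gamma_encode(x):
    """γ-code of x > 0: (Length - 1) zeros followed by x in binary."""
    binary = bin(x)[2:]
    return "0" * (len(binary) - 1) + binary


def delta_encode(x):
    """δ-code of x > 0: γ(Length) followed by x in binary (slide layout)."""
    if x <= 0:
        raise ValueError("delta code is defined only for x > 0")
    binary = bin(x)[2:]
    return gamma_encode(len(binary)) + binary


def delta_decode(bits):
    """Decode a concatenation of δ-codes (assumed well-formed)."""
    values, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":                    # γ-decode the Length field
            zeros += 1
            i += 1
        length = int(bits[i:i + zeros + 1], 2)
        i += zeros + 1
        values.append(int(bits[i:i + length], 2))   # then read Length bits of x
        i += length
    return values
```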
Elias-Fano
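[Figure: example with z = 3, w = 2. Each number is split into its z high bits and its w low bits; the low parts are concatenated in L, while the high parts, over the buckets 0, 1, …, 7, are written in unary in H and retrieved with Select1 on H.]
If w = log (m/n) and z = log n, where m = |B| and n = #1s,
then
- L takes n w = n log (m/n) bits
- H takes n 1s + n 0s = 2n bits
How to get the i-th number? Take the i-th group of w bits in L as its low part,
and represent the value (Select1(H,i) – i) in z bits as its high part.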
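A small sketch of this scheme, with several assumptions made for illustration: the names ef_encode, select1 and ef_access are hypothetical, L is kept as a list of integers rather than a packed bit array, H writes one 1 per element and one 0 to close each bucket, and Select1 is a naive linear scan rather than a constant-time structure.

```python
def ef_encode(values, w, z):
    """values: increasing integers, each fitting in w + z bits. Returns (L, H)."""
    low = [v & ((1 << w) - 1) for v in values]   # L: the w low bits of each value
    H, bucket = [], 0
    for v in values:
        high = v >> w
        while bucket < high:         # close every bucket before `high` with a 0
            H.append(0)
            bucket += 1
        H.append(1)                  # one 1 per element falling in this bucket
    H.extend([0] * ((1 << z) - bucket))          # close the remaining buckets
    return low, H


def select1(H, i):
    """1-based position of the i-th 1 in H (naive scan)."""
    seen = 0
    for pos, bit in enumerate(H, start=1):
        seen += bit
        if bit == 1 and seen == i:
            return pos
    raise ValueError("fewer than i ones in H")


def ef_access(low, H, w, i):
    """The i-th number (1-based): high part from H, low part from L."""
    high = select1(H, i) - i         # number of 0s before the i-th 1
    return (high << w) | low[i - 1]
```

With the made-up input [1, 4, 7, 18, 24, 26, 30, 31], w = 2 and z = 3, H has 8 ones and 8 zeros (2n = 16 bits) and ef_access(low, H, 2, 4) recovers 18.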
Rank and Select data structures
A basic problem !
D Abaco, Battle, Car, Cold, Cod ....
Array of n string pointers to strings of total length m
• (n log m) bits = 32 n bits.
• it depends on the number of strings
• it is independent of string length
D = Abaco Battle Car Cold Cod ....
B = 10000 100000 100 1000 100 ....
Spaces are introduced for simplicity
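To illustrate how B replaces the pointer array, here is a sketch in which the i-th string is located with two Select1 calls. The names build, select1 and get_string are hypothetical, strings are assumed non-empty, and Select1 is a naive scan standing in for a constant-time structure.

```python
def build(strings):
    """Concatenate the strings into D and mark each string start with a 1 in B."""
    D = "".join(strings)
    B = []
    for s in strings:
        B.extend([1] + [0] * (len(s) - 1))
    return D, B


def select1(B, i):
    """1-based position of the i-th 1 in B (naive scan)."""
    seen = 0
    for pos, bit in enumerate(B, start=1):
        seen += bit
        if bit == 1 and seen == i:
            return pos
    raise ValueError("fewer than i ones in B")


def get_string(D, B, i):
    """Return the i-th string (1-based) using only D and B."""
    start = select1(B, i)
    # the string ends right before the next 1, or at the end of D
    end = select1(B, i + 1) - 1 if i < sum(B) else len(D)
    return D[start - 1:end]
```

On the strings of the slide, get_string(D, B, 4) returns Cold.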
Rank/Select
Wish to index the bit vector B (possibly compressed).
B = 00101001010101011111110000011010101....
Rank1(6) = 2
Select1(3) = 8
• Rankb(i) = number of b in B[1,i]
• Selectb(i) = position of the i-th b in B
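The two definitions can be stated directly as naive linear-time code, useful as a reference against which faster indexes can be tested. The function names are illustrative; B is a string of '0'/'1' characters and positions are 1-based as on the slide.

```python
def rank(B, b, i):
    """Number of occurrences of bit b in B[1, i]."""
    return B[:i].count(b)


def select(B, b, i):
    """Position (1-based) of the i-th occurrence of bit b in B."""
    seen = 0
    for pos, bit in enumerate(B, start=1):
        if bit == b:
            seen += 1
            if seen == i:
                return pos
    raise ValueError("B contains fewer than i occurrences of b")
```

On the bit vector of the slide, rank(B, "1", 6) is 2 and select(B, "1", 3) is 8.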
Two approaches, where m = |B| and n = #1s:
(1) Takes |B| + o(|B|) bits of space
(2) Aims at achieving n log(m/n) bits
The Bit-Vector Index: |B| + o(|B|)
Goal. B is read-only, and the additional index takes o(m) bits.
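Rank
B = 00101001010101011 1111100010110101 0101010111000....
[Figure: B is split into big blocks of Z bits, each with its (absolute) Rank1 (8 and 18 at the first two boundaries); each big block is split into small blocks of z bits, each with its (bucket-relative) Rank1 (4, 5, 8 in the figure).]
• Setting Z = poly(log m) and z = (1/2) log m:
Lookup table over all possible small blocks and positions:

block   pos   #1
0000    1     0
....    ...   ...
1011    2     1
....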
Extra space is (m/Z) log m + (m/z) log Z + o(m)
  = O(m loglog m / log m) = o(m) bits
Rank time is O(1)
Term o(m) is crucial in practice
B is needed, untouched (not compressed) and read-only!
There exists a Bit-Vector Index
taking o(m) extra bits
and constant time for Rank/Select.
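A sketch of the two-level Rank directory, under stated simplifications: the names build_rank_index and rank1 are hypothetical, B is a Python list of 0/1 values, Z is required to be a multiple of z, and the final in-block count is done by summing at most z bits where the slide uses the precomputed lookup table.

```python
def build_rank_index(B, Z, z):
    """Absolute counts per big block (Z bits) and relative counts per small block (z bits)."""
    if Z % z != 0:
        raise ValueError("Z must be a multiple of z")
    super_counts = []    # Rank1 before the start of each big block
    block_counts = []    # Rank1 before each small block, relative to its big block
    total = rel = 0
    for start in range(0, len(B), z):
        if start % Z == 0:           # a new big block begins here
            super_counts.append(total)
            rel = 0
        block_counts.append(rel)
        ones = sum(B[start:start + z])
        total += ones
        rel += ones
    return super_counts, block_counts


def rank1(B, index, Z, z, i):
    """Number of 1s in B[1, i] (1-based), for 0 <= i <= len(B)."""
    if i == 0:
        return 0
    super_counts, block_counts = index
    k = (i - 1) // z                 # small block containing position i
    in_block = sum(B[k * z:i])       # stands in for the table lookup
    return super_counts[(i - 1) // Z] + block_counts[k] + in_block
```

Each query touches one big-block counter, one small-block counter and one block of at most z bits, which is the source of the O(1) time once the last step is a table lookup.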
The Select operation
m = |B|
n = #1s
B 0010100101010101111111000001101010101010111001....
Split B into subarrays of variable size r, each extended until it includes k 1s
• Sparse case: if r > k^2, we store explicitly the positions of the k 1s
• Dense case: k ≤ r ≤ k^2, recurse... One level is enough!!
  ... still need a table of size o(m).
• Setting k ≈ polylog m

Extra space is o(m), and B is not touched!

Select time is O(1)
There exists a Bit-Vector Index
taking o(m) extra bits
and constant time for Rank/Select.
B is needed and read-only!
Via Elias-Fano (B is not needed)
Recall that by setting w = log (m/n) and z = log n,
where m = |B| and n = #1s, then
- Space = n log (m/n) bits + 2n bits
[Figure: the Elias-Fano example with z = 3, w = 2, high parts over the buckets 0, 1, …, 7.]
(Build Select1 on H, so we need extra |H| + o(|H|) bits = 2n + o(n) bits)
Select1(i) on B → uses L and (Select1(H,i) – i), in +o(n) space
Rank1(i) on B → needs a binary search over the positions of the 1s of B
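The binary search for Rank1 can be sketched as follows, assuming only that some select1(i) function returns the position of the i-th 1 of B (for example, the Elias-Fano access above). The name rank1_by_binary_search is made up, and in the usage below a plain sorted list stands in for the compressed structure.

```python
def rank1_by_binary_search(select1, n, p):
    """Count the 1s in B[1, p], given select1(i) for 1 <= i <= n."""
    lo, hi = 0, n                    # the answer always lies in [lo, hi]
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if select1(mid) <= p:        # the mid-th 1 is at or before position p
            lo = mid
        else:
            hi = mid - 1
    return lo


# Usage with a hypothetical list of 1-positions standing in for Select1 on B.
ones = [3, 5, 8, 10, 12, 14, 16]
count = rank1_by_binary_search(lambda i: ones[i - 1], len(ones), 6)
```

The search makes O(log n) calls to select1, so Rank1 is no longer constant time in this representation.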
If you wish to play with Rank and Select
Space: m/10 + n log (m/n) bits
Rank in 0.4 msec, Select in < 1 msec
vs 32n bits of explicit pointers