Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Signature files Signature Files • Important alternative to inverted indexes. • Given a document, the signature is calculated as follows. - First, each word (term) is hashed into a bit-vector. - Then, these bit-vectors, are OR-ed to form that document’s signature. • Three main issues related to using signature files: 1. Generating signatures 2. Boolean logic on signatures 3. Accessing signatures Signatures Computing Signatures • W: width of signatures (in the range of 1,000 to 10,000). • Each term (word) sets b bits out of these W bits. • To generate the hash string of a term: for i = 1 to b sig[ hi(term) % W ] = 1 • Each hi() is a hash function. • The signature of the document is generated by OR-ing the hash strings of all its terms. Example • Sometimes some hash strings end up with less then b bits being set. Why? • Because a term may get hashed to the same location by two hash functions. Query logic on signature files • Fact 1. A document contains term T only if all the bits that are set by T’s hash string are also set in the document’s signature. • Fact 2. However, a document’s signature that has all these bits set doesn’t necessarily mean that T appears in that document. Why? - Because the particular “1”-bits can be set by some other terms. Query logic on signature files Query T Not T All bits set Maybe Maybe Some bits missing No Yes Three Valued Logic Search efficiency How to search for a set of given terms? • Naïve way: Access the signatures of all the documents. - For each document, the signature is compared with the OR-ed hash string of the query • to see whether for each “1”-bit of that hash string, the descriptor has its corresponding bits set. - This implies reading the entire signature set! • Not affordable in practice. • Better: Use bitslices. Bitslices • Signature files have to be stored on disk in transposed form. • Example: Search for “cold.” • Retrieve the bitslices for “cold” and then AND them. What should be the signature width? W = width of the signature (we are trying to determine best) b = bit slices per query (equals number of accesses, we specify what we tolerate) z = expected number of false match documents (we specify what we tolerate) f = number of (term, document) pairs N = number of documents B = average of "on"-bits per document. B = b * (f/N) p = probability that a random bit in a document signature is "on" p = 1- [(W-1)/W]B Probability for a bit to remain "off" is: [(W-1)/W]B since it must avoid selection B times, and the probability of not being selected once is (W-1)/W. z = expected number of false matches z=pb*N A false match document (FMD) requires that the bit slices of the query agree on the "on"-bit for the FMD. So, the probability for a random document to be a false match is pb (see note). The expected number of FMDs is z=pb*N. Random document – note • “Probability” of a good document to be a match is of course “1” (that’s a certain event). • Probability for a false match is the probability for a random document to end up being a match in the index (pb). What should be the signature width? W = width of the signature (we are trying to determine best) b = bit slices per query (equals number of accesses, we specify what we tolerate) z = expected number of false match documents (we specify what we tolerate) f = number of (term, document) pairs N = number of documents B = average of "on"-bits per document. p = probability that a random bit in a document signature is "on" B = b * (f/N) p = 1-[(W-1)/W]B z=pb*N (1) (2) (3) We can derive W from (2): W = 1/[1-(1-p)1/B] and substitute B using (1) and p using (3). TREC Collection example W = width of the signature (we are trying to determine best) b = bit slices per query (equals number of accesses, we specify what we tolerate) z = expected number of false match documents (we specify what we tolerate) f = number of (term, document) pairs N = number of documents B = average of "on"-bits per document. p = probability that a random bit in a document signature is "on" b = 8, z=1, N=741,856, f=135,017,792 We derive: p=0.185 f/N = 182 unique terms for the average document. B = 1,456 So, W = 7,134 This collection of 741,856 documents would need: 7,134 * 741,856 bits, that is 7,134 * 741,856 / 8 = 661,550,088 bytes 631Mb.