Detecting Near-Duplicates for Web Crawling
Gurmeet Singh Manku, Arvind Jain, Anish Das Sarma
WWW '07: Proceedings of the 16th International Conference on World Wide Web
Speaker: Kunwar Aditya Raghuwanshi

Near-Duplicates
• Near-duplicates are documents that offer identical content but differ in small portions. These differing portions may be attributed to advertisements, banners, timestamps, news feeds, etc.

Why do they need to be detected?
• To save storage space for crawled data.
• To improve search quality (PageRank) by fine-tuning incoming/outgoing link counts.
• To produce better top results.
• Politeness.

Near-duplicate detection considerations
• Near-duplicate pages need to be detected during the crawl phase.
• The scale of the World Wide Web is very large: tens of billions of documents need to be indexed while millions of pages are crawled every day.
• Crawl decisions need to be made quickly.
• Both single-document and multiple-document (batch) queries should be supported.

How to detect Near-Duplicates?
• Use Charikar's simhash:
  – Obtain an f-bit fingerprint for each document/webpage.
  – Documents are near-duplicates iff their fingerprints differ in at most k bits, i.e. the Hamming distance between the two simhashes is at most k.
  – The paper demonstrates that the algorithm performs well for f = 64 bits and k = 3.
• Exact or focused probes into a single pre-sorted list of fingerprints are too slow to be practical.

Simhash
• The feature set can be shingles, a document vector, connectivity information, phrases, anchor text, an anchor window, etc.
• (A short code sketch of simhash and the Hamming-distance test follows the Batch mode slide.)

Efficient Approach
• Consider a sorted table of 2^d f-bit fingerprints. Because simhash fingerprints behave almost randomly:
  – almost all 2^d combinations of the d most-significant bits exist, and
  – very few d-bit combinations are duplicated.
• Choose d′ such that |d′ − d| is small, so an exact probe on the top d′ bits returns only a few candidates and stays quick.

Example
• Split the 64 bits into 4 pieces and build 4 tables of permuted fingerprints.
• Perform exact probes on 16 bits in each table.
• Selecting the right d requires a trade-off between the space needed to store the tables and the time needed for matching.
• Optimal number of tables given by: [equation not preserved]
• (A code sketch of this probing scheme appears at the end of the deck.)

Example (contd.)
[figure not preserved]

Optimization
• Tables can be compressed because the first d bits are shared by successive (sorted) fingerprints.
• Compute a Huffman code for the distribution of the position of the most-significant 1-bit in the XOR of two successive fingerprints.
• Record the first fingerprint as-is.
• For each successive fingerprint, XOR it with its predecessor and append the Huffman code for the position of the most-significant 1-bit.
• This roughly halves the storage needed for the crawled set.

Batch mode
• Crawlers fetch millions of web pages every day, so a quick decision is needed for each new page.
• Use the MapReduce framework for batch queries:
  – Map: each mapper works on a chunk of the main fingerprint set together with the whole query file, and outputs the near-duplicate fingerprints it finds.
  – Reduce: collects all outputs, removes duplicates, and produces a single sorted file.
  – Each chunk is small, so it can be scanned sequentially; tables for the query file can be built in memory, so no compression is needed.
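Below is a minimal Python sketch (not taken from the paper) of the simhash construction and the Hamming-distance test described in the slides above, with f = 64 and k = 3. The whitespace tokenisation, unit feature weights, and truncated MD5 feature hash are simplifying assumptions; the paper leaves the feature set open (shingles, anchor text, document vectors, etc.).

```python
import hashlib

F_BITS = 64  # fingerprint size f
K = 3        # Hamming-distance threshold k


def simhash(text, f=F_BITS):
    """Return an f-bit simhash fingerprint of `text`."""
    v = [0] * f
    for token in text.split():  # assumed features: whitespace tokens, weight 1
        # f-bit hash of the feature (MD5 truncated; any good hash would do)
        h = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[: f // 8], "big")
        for i in range(f):
            v[i] += 1 if (h >> i) & 1 else -1  # per-bit vote, weighted by the feature
    # final bit i is 1 exactly when the i-th counter is positive
    return sum(1 << i for i in range(f) if v[i] > 0)


def hamming_distance(a, b):
    return bin(a ^ b).count("1")


def near_duplicates(doc_a, doc_b, k=K):
    """Near-duplicates iff the fingerprints differ in at most k bits."""
    return hamming_distance(simhash(doc_a), simhash(doc_b)) <= k
```

Comparing every pair of pages this way is of course too slow at web scale; that is what the permuted-table probing of the earlier slides (sketched at the end of the deck) addresses.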
Map-Reduce
[figure not preserved]

Why choose f = 64, k = 3?
[figure not preserved]

Related Works
• Techniques differ when they deal with web documents, files, or domain-specific corpora, as the end goals vary from detecting web mirrors, clustering, and plagiarism to spam detection.
• An efficient theoretical study exists for d = 1; progress has been reported for large d.
• Signature schemes:
  – Mod-p shingles
  – Min-hash for Jaccard similarity of sets
  – Signatures/fingerprints over IR-based document vectors
  – Checksums

Conclusions
• The paper addresses the problem of near-duplicate detection specifically for web crawlers.
• It proposes practical approaches and algorithms for single and batch-mode queries.
• It validates the use of simhash's properties.
• It surveys existing de-duplication methodologies.
• No comparison with prior techniques is available, since those have not been tested on such a large corpus (8 billion documents).

Pros & Cons
• Pros
  – Provides a good set of results on real data (100M queries on a corpus of 8B documents); previous work in this field does not report results for a real-world query size.
  – Practical and efficient; solves the space issue with compression.
• Cons
  – Based on similarity rather than exact content matching, so limited accuracy is expected.
  – Web pages still need to be downloaded, so crawl bandwidth is not reduced.
  – Does not provide a comparison with existing techniques.

Relation to topics from Class
• Search Engine Basics
• Crawlers & Crawling
• Deduplication
• MapReduce
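For reference, here is a sketch of the 4-table probing scheme from the Example slide, under the simplifying assumption that an in-memory hash table per 16-bit block stands in for the paper's sorted, permuted (and compressed) fingerprint tables. The point illustrated is the candidate-then-verify logic: any fingerprint within Hamming distance 3 of the query must agree with it exactly on at least one of the four 16-bit blocks.

```python
from collections import defaultdict

F_BITS, BLOCKS, K = 64, 4, 3
BLOCK_BITS = F_BITS // BLOCKS          # 16-bit probe keys, as in the Example slide
MASK = (1 << BLOCK_BITS) - 1


def block(fp, i):
    """The i-th 16-bit block of a 64-bit fingerprint."""
    return (fp >> (i * BLOCK_BITS)) & MASK


def build_tables(fingerprints):
    """One table per block position, keyed by that block's 16-bit value."""
    tables = [defaultdict(list) for _ in range(BLOCKS)]
    for fp in fingerprints:
        for i in range(BLOCKS):
            tables[i][block(fp, i)].append(fp)
    return tables


def near_duplicates_of(tables, fp, k=K):
    """All stored fingerprints within Hamming distance k of fp.

    If two 64-bit fingerprints differ in at most 3 bits, the differing bits
    fall into at most 3 of the 4 blocks, so at least one block matches
    exactly; probing all 4 tables therefore misses no candidate.
    """
    hits = set()
    for i in range(BLOCKS):
        for candidate in tables[i].get(block(fp, i), ()):
            if bin(candidate ^ fp).count("1") <= k:  # verify on all 64 bits
                hits.add(candidate)
    return hits
```

The same pigeonhole argument drives the space/time trade-off noted on the Example slide: splitting into more blocks allows more tables with longer probe keys, which return fewer false candidates to verify at the cost of extra storage.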