iRobot: An Intelligent Crawler for Web Forums
Rui Cai, Jiang-Ming Yang, Wei Lai, Yida Wang, and Lei Zhang
Microsoft Research, Asia
May 22, 2017

Outline
• Motivation & Challenge
• iRobot – Our Solution
  – System Overview
  – Module Details
• Evaluation

Why Web Forums are Important
• Forums are a huge resource of human knowledge
  – Popular all over the world
  – Contain any conceivable topics and issues
• Forum data can benefit many applications
  – Improve the quality of search results
  – Various kinds of data mining on forum data
• Collecting forum data
  – Is the basis of all forum-related research
  – Is not a trivial task

Why Forum Crawling is Difficult
• Duplicate pages
  – Forums have complex in-site structures
  – Many shortcuts for browsing
• Invalid pages
  – Most forums have access control
  – Some pages can only be visited after registration
• Page-flipping
  – Long threads are shown across multiple pages
  – Deep navigation levels

The Limitation of Generic Crawlers
• In general crawling, each page is treated independently
  – Fixed crawling depth
  – Cannot avoid duplicates before downloading
  – Fetch lots of invalid pages, such as login prompts
  – Ignore the relationships between pages from the same thread
• Forum crawling needs a site-level perspective!

Statistics on Some Forums
• Around 50% of crawled pages are useless
• A waste of both bandwidth and storage

What is the Site-Level Perspective?
• Understand the organization structure
• Find an optimal crawling strategy
[Figure: the site-level perspective of "forums.asp.net" – node types include Entry, List-of-Board, List-of-Thread, Post-of-Thread, Login, Portal, Search Result, Digest, and Browse-by-Tag]

iRobot: An Intelligent Forum Crawler
[Figure: system overview – general web crawling restarts forum crawling; modules include Sitemap Construction, Traversal Path Selection, the Crawler, and Segmentation & Archiving, producing raw pages and metadata]

Module Details: Sitemap Construction & Traversal Path Selection
• How many kinds of pages?
• How do these pages link with each other?
• Which pages are valuable?
• Which links should be followed?

Page Clustering
• Forum pages are generated from a database & templates
• Layout is a robust way to describe a template
  – Repetitive regions are everywhere on forum pages
  – Layout can be characterized by repetitive regions
[Figures: example pages (a)–(d) and page clustering results]

Link Analysis
• URL patterns can distinguish links, but are not reliable on all sites
• Location can also distinguish links
• A Link = URL Pattern + Location
[Figure: example page with numbered link locations such as Login, Thread List, and Thread]
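The slides do not spell out how page clustering is computed, so the following is only a rough Python sketch of the idea on the Page Clustering slide: approximate a page's template by the DOM tag paths that repeat on it (a crude stand-in for repetitive regions) and greedily group pages whose signatures overlap. The signature, the similarity threshold, and the greedy grouping are illustrative assumptions, not iRobot's actual algorithm.

```python
# Sketch: cluster forum pages by a crude layout signature.
from collections import Counter
from html.parser import HTMLParser

class TagPathCollector(HTMLParser):
    """Collects root-to-node tag paths such as 'html/body/table/tr'."""
    VOID = {"br", "img", "hr", "input", "meta", "link"}  # never closed, so never pushed

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = Counter()

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:
            return
        self.stack.append(tag)
        self.paths["/".join(self.stack)] += 1

    def handle_endtag(self, tag):
        if tag in self.stack:
            # pop back to the matching open tag (tolerates sloppy forum HTML)
            while self.stack and self.stack.pop() != tag:
                pass

def layout_signature(html):
    """Tag paths occurring more than once approximate the page's repetitive regions."""
    collector = TagPathCollector()
    collector.feed(html)
    return frozenset(p for p, n in collector.paths.items() if n > 1)

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 1.0

def cluster_pages(pages, threshold=0.8):
    """Greedy single-pass clustering: a page joins the first cluster whose
    representative signature is similar enough, otherwise it starts a new cluster."""
    clusters = []  # list of (representative_signature, [page_ids])
    for page_id, html in pages:
        sig = layout_signature(html)
        for rep, members in clusters:
            if jaccard(sig, rep) >= threshold:
                members.append(page_id)
                break
        else:
            clusters.append((sig, [page_id]))
    return clusters
```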
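In the same spirit, here is a minimal sketch of the "A Link = URL Pattern + Location" idea from the Link Analysis slide. It assumes a URL pattern is simply the URL with digit runs generalized, and a location is the tag path of the anchor element; both choices are simplifications for illustration rather than iRobot's exact definitions.

```python
# Sketch: describe each outgoing link by (URL pattern, DOM location).
import re
from html.parser import HTMLParser

def url_pattern(url):
    """Generalize volatile parts (numbers) so viewthread.php?tid=42 and
    viewthread.php?tid=97 map to the same pattern."""
    return re.sub(r"\d+", "*", url)

class LinkExtractor(HTMLParser):
    """Emits (url_pattern, dom_location) descriptors for every anchor on a page."""
    def __init__(self):
        super().__init__()
        self.stack = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append((url_pattern(href), "/".join(self.stack)))

    def handle_endtag(self, tag):
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

# Usage: links sharing the same (pattern, location) pair are treated as one edge type.
extractor = LinkExtractor()
extractor.feed('<html><body><td><a href="viewthread.php?tid=42">t</a></td></body></html>')
print(extractor.links)  # [('viewthread.php?tid=*', 'html/body/td/a')]
```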
Informativeness Evaluation
• Which kinds of pages (nodes) are valuable?
• Some heuristic criteria
  – A larger node is more likely to be valuable
  – Pages with a large size are more likely to be valuable
  – A diverse node is more likely to be valuable
• Based on content de-duplication

Traversal Path Selection
• Clean the sitemap
  – Remove valueless nodes
  – Remove duplicate nodes
  – Remove links to valueless / duplicate nodes
• Find an optimal path
  – Construct a spanning tree
  – Use depth as cost
• User browsing behaviors
  – Identify page-flipping links (number, prev/next)

Evaluation Criteria
• Duplicate ratio
• Invalid ratio
• Coverage ratio
[Charts: duplicate ratio, invalid ratio, and coverage ratio for mirrored pages vs. iRobot on Biketo, Asp, Baidu, Douban, CQZG, Tripadvisor, and Hoopchina]

Effectiveness and Efficiency
• Effectiveness
[Charts: (a) a generic crawler vs. (b) iRobot – counts of invalid, duplicate, and valuable pages per forum]
• Efficiency
[Charts: (a) a generic crawler vs. (b) iRobot – counts of invalid, duplicate, and valuable pages per forum]

Performance vs. Sampled Page #
[Chart: coverage ratio, duplicate ratio, and invalid ratio as the number of sampled pages grows from 10 to 1000]

Preserved Discussion Threads

Forum         Mirrored   Crawled by iRobot   Correctly Recovered
Biketo        1584       1313                1293
Asp           600        536                 536
Baidu         −          −                   −
Douban        62         60                  37
CQZG          1393       1384                1311
Tripadvisor   326        272                 272
Hoopchina     2935       2829                2593

Overall, 94.5% of the crawled threads and 87.6% of the mirrored threads are correctly recovered.

Conclusions
• An intelligent forum crawler based on site-level structure analysis
  – Page template identification / valuable page detection / link analysis / traversal path selection
• Some modules can still be improved
  – More automated & mature algorithms in SIGIR'08
• More future work directions
  – Queue management
  – Refresh strategies

Thanks!
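For readers who want something more concrete than the slides, here is a rough Python sketch of the heuristics on the Informativeness Evaluation slide: score a sitemap node by its size, its pages' average size, and its content diversity, with diversity estimated via shingle-based near-duplicate detection. The shingle length, the thresholds, and the weights are illustrative assumptions, not values from the talk or the paper.

```python
# Sketch: heuristic informativeness score for one sitemap node (a page cluster).
import hashlib

def shingles(text, k=8):
    """k-word shingles hashed to integers for cheap near-duplicate checks."""
    words = text.split()
    return {
        int(hashlib.md5(" ".join(words[i:i + k]).encode()).hexdigest(), 16)
        for i in range(max(1, len(words) - k + 1))
    }

def diversity(pages_text, near_dup_threshold=0.9):
    """Fraction of pages that are NOT near-duplicates of an earlier page."""
    seen, unique = [], 0
    for text in pages_text:
        s = shingles(text)
        if all(len(s & t) / max(1, len(s | t)) < near_dup_threshold for t in seen):
            unique += 1
        seen.append(s)
    return unique / max(1, len(pages_text))

def informativeness(pages_text, page_bytes):
    """Larger nodes, bigger pages, and more diverse content score higher.
    The normalizers and weights below are arbitrary illustrative values."""
    size_score = min(1.0, len(pages_text) / 100)                               # larger node
    bytes_score = min(1.0, (sum(page_bytes) / max(1, len(page_bytes))) / 50_000)  # bigger pages
    return 0.3 * size_score + 0.3 * bytes_score + 0.4 * diversity(pages_text)
```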
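Likewise, a small sketch of the Traversal Path Selection slide: after valueless and duplicate nodes are removed, keep for each remaining node only the shallowest way to reach it from the entry page (a BFS spanning tree, i.e. depth as cost), and recognize page-flipping links by numeric or prev/next anchor text so multi-page threads are still followed. The sitemap representation and the anchor-text rules are assumptions for illustration, not the exact iRobot algorithm.

```python
# Sketch: minimum-depth spanning tree over a cleaned sitemap, plus page-flipping detection.
from collections import deque
import re

def spanning_tree(sitemap, entry):
    """sitemap: {node: [successor nodes]}. Returns {node: parent}, where each
    node is reached along a minimum-depth path from the entry (plain BFS)."""
    parent = {entry: None}
    queue = deque([entry])
    while queue:
        node = queue.popleft()
        for nxt in sitemap.get(node, []):
            if nxt not in parent:          # first (shallowest) way in wins
                parent[nxt] = node
                queue.append(nxt)
    return parent

def is_page_flipping(anchor_text):
    """Page-flipping links are recognized by numeric or prev/next anchors."""
    text = anchor_text.strip().lower()
    return bool(re.fullmatch(r"\d+", text)) or text in {"next", "prev", "previous", ">", "<", ">>", "<<"}

# Usage on a toy sitemap (node names mirror the page types in the slides):
sitemap = {
    "Entry": ["List-of-Board", "Login", "Search Result"],
    "List-of-Board": ["List-of-Thread"],
    "List-of-Thread": ["Post-of-Thread", "Digest"],
}
print(spanning_tree(sitemap, "Entry"))
print(is_page_flipping("2"), is_page_flipping("Next"))  # True True
```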