National Yunlin University of Science and Technology
Intelligent Database Systems Lab, N.Y.U.S.T. I.M.

Intelligent crawling on the World Wide Web with arbitrary predicates

Advisor: Dr. Hsu
Graduate: Chien-Shing Chen
Authors: Charu C. Aggarwal, Fatima Al-Garawi, Philip S. Yu
Proceedings of the tenth international conference on World Wide Web, ACM, April 2001

Outline
  Motivation
  Objective
  Introduction
  General Framework: Context, URL Token, Link, Sibling
  Implementation Issues
  Experimental Results
  Conclusions
  Personal Opinion
  Review

Motivation
With the rapid growth of the world wide web, the problem of resource collection on the web has become very relevant in the past few years. Among the ideas proposed, a key technique is focused crawling, which is able to crawl particular topical portions of the world wide web quickly without having to explore all web pages.

Objective
The learning crawler is capable of reusing the knowledge gained in a given crawl in order to provide more efficient crawling for closely related predicates. It starts at a few general starting points and collects all web pages which are relevant to the user-specified predicate.

Introduction
The work on focused crawling assumes two key properties:
  Linkage Locality: web pages on a given topic are more likely to link to those on the same topic.
  Sibling Locality: if a web page points to certain web pages on a given topic, then it is likely to point to other pages on the same topic.

The intelligent crawler uses the inlinking web page content, candidate URL structure, or other behaviors of the inlinking web pages or siblings in order to estimate the probability that a candidate is useful for a given crawl.
A predicate is implemented as a subroutine which uses the content and URL string of a web page in order to determine whether or not it is relevant to the crawl.

Users may not be aware of the best possible starting points which are representative of the predicate. A crawler which is intelligent enough to start at a few general pages and still selectively mine the web pages most suitable to a given predicate is therefore more valuable from the perspective of resource discovery.

Each candidate web page can be characterized by a large number of features, such as the content of the inlinking pages, the tokens in the candidate URL, the predicate satisfaction of the inlinking web pages, and sibling predicate satisfaction.

General Framework
A web page is said to be a candidate when it has not yet been crawled, but some web page which links to it has already been crawled.

2.1 Statistical Model Creation
In order to create an effective statistical model of the linkage structure, what are we really searching for? We are looking for specific features of a web page which make it more likely that the page links to a given topic.

3. Overview of Statistical Model
The model input consists of the features and linkage structure of the portion of the world wide web which has been crawled so far, and the output is a priority order which determines how the candidate URLs are to be visited.
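The candidate definition and the priority order described above can be illustrated with a small frontier structure. This is only a sketch under stated assumptions: the names `ski_predicate`, `CandidateFrontier`, `push`, and `pop` are hypothetical, the priority values are supplied by the caller, and a binary heap is one convenient choice rather than the data structure used in the paper.

```python
import heapq
from typing import Optional


def ski_predicate(url: str, content: str) -> bool:
    """Hypothetical predicate: a subroutine over the URL string and page content."""
    return "ski" in url.lower() or "skiing" in content.lower()


class CandidateFrontier:
    """Holds candidates (uncrawled URLs linked from crawled pages) in priority order."""

    def __init__(self) -> None:
        self._heap = []        # entries are (-priority, insertion counter, url)
        self._best = {}        # highest priority pushed so far for each URL
        self._crawled = set()  # URLs already handed out by pop()
        self._counter = 0      # tie-breaker so equal priorities pop in insertion order

    def push(self, url: str, priority: float) -> None:
        """Add or re-prioritize a candidate; already crawled URLs are ignored."""
        if url in self._crawled:
            return
        if priority <= self._best.get(url, float("-inf")):
            return
        self._best[url] = priority
        # heapq is a min-heap, so the priority is negated to pop the largest first.
        heapq.heappush(self._heap, (-priority, self._counter, url))
        self._counter += 1

    def pop(self) -> Optional[str]:
        """Return the highest-priority uncrawled candidate, or None if there is none."""
        while self._heap:
            neg_priority, _, url = heapq.heappop(self._heap)
            if url in self._crawled or -neg_priority != self._best.get(url):
                continue  # stale entry left behind by a later re-prioritization
            self._crawled.add(url)
            return url
        return None
```

Re-pushing a URL with a higher priority models the fact that the estimate for a candidate can change as more of its inlinking pages and siblings are crawled.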
The model takes a set of features of the given web page and computes a priority order for that page using this information. The set of features may be any of the types enumerated above, including content information, URL tokens, linkage, or sibling information:
  The content of the web pages which are known to link to the candidate URL (the set of words).
  URL tokens from the candidate URL.
  The nature of the inlinking web pages of a given candidate URL.
  The number of siblings of a candidate which have already been crawled and satisfy the predicate.

3.1 Probabilistic Model for Priorities
We discuss the probabilistic model for calculation of the priorities.

Example: 0.1% ... 5% ... The interest ratio for the word occurring is 5% / 0.1% = 50.

P(C) is the probability that the web page will indeed satisfy the user-defined predicate if it is crawled. Our knowledge of an event E may increase the probability that the web page satisfies the predicate. When the event E is favorable to the candidate satisfying the predicate, the interest ratio I(C,E) is larger than 1.

We will now proceed to examine the different factors which are used for the purpose of intelligent crawling.

3.2 Content Based Learning
In order to identify the value of the content in determining the predicate satisfaction of a given candidate page, we find the set of words in the web pages which link to it (the inlinking web pages). These words are then used to decide the importance of the candidate page in terms of whether or not it satisfies the predicate. We define the event Qi to be true when the word i is present in one of the web pages pointing to the candidate.
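The interest ratio of section 3.1 can be checked numerically. The sketch below assumes I(C, E) = P(C | E) / P(C), which is the reading suggested by the 5% / 0.1% = 50 example; the function names and the count-based estimator are hypothetical and not taken from the slides.

```python
import math


def interest_ratio(p_c_given_e: float, p_c: float) -> float:
    """I(C, E) = P(C | E) / P(C); values above 1 mean E favors the predicate."""
    if not 0.0 < p_c <= 1.0:
        raise ValueError("P(C) must be in (0, 1]")
    return p_c_given_e / p_c


def interest_ratio_from_counts(n_crawled: int, n_satisfy: int,
                               n_event: int, n_event_and_satisfy: int) -> float:
    """Estimate I(C, E) from counts gathered over the pages crawled so far."""
    if n_crawled == 0 or n_satisfy == 0 or n_event == 0:
        return 1.0  # no evidence yet: treat the event as neutral (an assumption)
    p_c = n_satisfy / n_crawled
    p_c_given_e = n_event_and_satisfy / n_event
    return interest_ratio(p_c_given_e, p_c)


# The numbers from the slide: 5% against a 0.1% baseline.
example = interest_ratio(0.05, 0.001)
```

Estimating both probabilities from the pages crawled so far is what lets the crawler learn during the crawl itself rather than from a training set fixed in advance.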
Since most words are unlikely to have much bearing on the probability of predicate satisfaction, filtering out such features is important for reducing noise. Therefore, we use only those words which have high statistical significance. For each word, we calculate the level of significance at which candidates are more likely to satisfy the predicate when the word is present.

The interest ratio for content based learning is denoted by Ic(C), and is calculated as the product of the interest ratios of the different words in any of the inlinking web pages which also satisfy the statistical significance condition. A threshold value of t = 2 results in about a 95% level of statistical significance.

Term weight
  Term        English gloss            Weight
  政權        regime                   0.00356069
  伊拉克      Iraq                     0.00330150
  總統        president                0.00231992
  巴格達      Baghdad                  0.00181096
  美國        United States            0.00172695
  戰爭        war                      0.00172395
  聯軍        coalition forces         0.00157083
  美軍        U.S. military            0.00151840
  報導        report                   0.00111369
  毀滅性      destructive              0.00104308
  飛彈        missile                  0.00099588
  季辛吉      Kissinger                0.00088652
  聯合國      United Nations           0.00083930
  動武        use of force             0.00079925
  罷黜        depose                   0.00067016
  盟邦        allies                   0.00061155
  反戰        anti-war                 0.00053638
  攻擊        attack                   0.00052048
  恐怖份子    terrorist                0.00051897
  克里特      Crete                    0.00049826
  攻打        assault                  0.00045837
  垮台        fall from power          0.00045536
  生擒        capture alive            0.00043996
  國際        international            0.00043659
  決議案      resolution               0.00042998
  專電        special dispatch         0.00042519
  波斯灣      Persian Gulf             0.00042404
  落網        arrested                 0.00041105
  戰事        hostilities              0.00040360
  中央社      Central News Agency      0.00039791
  推翻        overthrow                0.00039642
  核子武器    nuclear weapons          0.00038653
  相關        related                  0.00037714
  資訊        information              0.00037649
  外電        foreign wire report      0.00036076
  白宮        White House              0.00034855
  新聞        news                     0.00034673
  編譯        compile and translate    0.00034366
  台灣        Taiwan                   0.00031297
  軍事行動    military action          0.00030796
  獨裁者      dictator                 0.00028048

3.3 URL Token Based Learning
A URL which contains the word "ski" is more likely to be a web page about skiing related information, for example www.skiing.com/. We define the event Ri to be true when token i is present in the URL pointing to the candidate.

3.4 Link Based Learning
The idea in link based learning is somewhat similar to the focused crawler discussed earlier. The intelligent crawler tries to learn the significance of link based information during the crawl itself.
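Returning to the word and token factors of sections 3.2 and 3.3, the filtered product can be sketched as follows. The slides do not preserve the significance formula, so the statistic used here, |P(C | Qi) - P(C)| divided by the standard error sqrt(P(C)(1 - P(C)) / n_i), is an assumption chosen because a threshold of t = 2 then corresponds to roughly 95% significance. The function names and the `word_stats` layout are hypothetical; the same code applies to URL tokens by passing token counts instead of word counts.

```python
import math


def significance(p_c: float, p_c_given_word: float, n_word: int) -> float:
    """Assumed z-score style statistic: deviation from P(C) in standard errors."""
    standard_error = math.sqrt(p_c * (1.0 - p_c) / n_word)
    return abs(p_c_given_word - p_c) / standard_error


def content_interest_ratio(p_c: float, word_stats: dict, t: float = 2.0) -> float:
    """Product of interest ratios over the words that pass the significance test.

    word_stats maps each word seen in the inlinking pages to a pair
    (n_word, n_word_and_satisfy): how many crawled candidates had the word in
    an inlinking page, and how many of those satisfied the predicate.
    Requires 0 < p_c < 1.
    """
    ratio = 1.0
    for n_word, n_word_and_satisfy in word_stats.values():
        if n_word == 0:
            continue  # word never observed: no evidence either way
        p_c_given_word = n_word_and_satisfy / n_word
        if significance(p_c, p_c_given_word, n_word) >= t:
            ratio *= p_c_given_word / p_c
    return ratio
```

Words that are common but uninformative, or informative but seen too rarely, fall below the threshold and contribute a neutral factor of 1.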
Consider a web page which is pointed to by k other web pages, m of which satisfy the predicate, and k - m of which do not. For each of the m web pages which satisfy the predicate, the corresponding interest ratio is given by [equation not preserved]. Similarly, for each of the k - m web pages which do not satisfy the predicate, the corresponding interest ratio is given by [equation not preserved]. The final interest ratio [equation not preserved].

3.5 Sibling Based Learning
For instance, consider a candidate that has 15 already visited siblings, of which 9 satisfy the predicate. If the web were random, and if P(C) = 0.1, the number of siblings we would expect to satisfy the predicate is 15 × P(C) = 1.5. Since a higher number of siblings satisfy the predicate (9 > 1.5), this indicates that one or more parents might be a hub, which increases the probability of the candidate satisfying the predicate.

If s is the number of siblings that satisfy the predicate, and e is the number expected under the random assumption, then s / e > 1 suggests that the candidate is likely to satisfy the predicate.

3.6 Combining the Preferences
The aggregate interest ratio is a (weighted) product of the interest ratios for each of the individual factors. The weights are used in order to normalize the different factors, which are indexed as follows:
  c: Content based learning
  u: URL token based learning
  l: Link based learning
  s: Sibling based learning

By increasing the weight of a given factor, we can increase the importance of the corresponding priority. In our particular implementation, we chose weights such that each priority value was almost equally balanced when averaged over all the currently available candidates.

predicate, starting seed

4. Implementation Issues

5. Empirical Results
The primary factor used to evaluate the performance of the crawling system was the harvest rate P(C), which is the percentage of the web pages crawled which satisfy the predicate. We ran experiments using the different learning factors over different kinds of predicates and starting points (seeds).

In addition, we tested the crawling system with a wide variety of predicates and starting points, and ran the system for about 10000 page crawls in each case.

6. Conclusions and Summary
In this paper, we proposed an intelligent crawling technique which uses a self-learning mechanism that can dynamically adapt to the particular structure of the relevant predicate. Based on these different factors, we were able to create a composite crawler which is able to perform robustly across different predicates.

Review
Combining: Context, URL Token, Link, Sibling
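As a recap of the link, sibling, and combination steps (sections 3.4 to 3.6), the sketch below works through the numbers from the sibling example. Several parts are assumptions rather than the paper's formulas, which this transcript does not preserve: the link based ratio is taken to be one factor P(C | parent type) / P(C) per inlinking page, multiplied over all k pages, and the "weighted product" is read as each factor's ratio raised to its weight. All function and parameter names are hypothetical.

```python
import math


def link_interest_ratio(m: int, k: int, p_c_given_sat_parent: float,
                        p_c_given_unsat_parent: float, p_c: float) -> float:
    """Assumed form: one ratio per inlinking page, multiplied over all k pages."""
    ratio_sat = p_c_given_sat_parent / p_c      # parent satisfies the predicate
    ratio_unsat = p_c_given_unsat_parent / p_c  # parent does not satisfy it
    return ratio_sat ** m * ratio_unsat ** (k - m)


def sibling_interest_ratio(s: int, n_siblings: int, p_c: float) -> float:
    """s / e, where e = n_siblings * P(C) is the count expected on a random web."""
    expected = n_siblings * p_c
    if expected == 0:
        return 1.0  # no visited siblings: treat as neutral (an assumption)
    return s / expected


def aggregate_interest_ratio(ratios: dict, weights: dict) -> float:
    """Weighted product, read here as each factor's ratio raised to its weight."""
    return math.prod(ratios[f] ** weights.get(f, 1.0) for f in ratios)


# Sibling example from section 3.5: 15 visited siblings, 9 satisfy, P(C) = 0.1.
sibling_ratio = sibling_interest_ratio(9, 15, 0.1)
```

With every weight equal to 1 the aggregate is the plain product of the four factors; lowering a weight toward 0 pulls that factor toward a neutral contribution of 1.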