國立雲林科技大學
National Yunlin University of Science and Technology
Intelligent Crawling on the World
Wide Web with Arbitrary Predicates
Advisor: Dr. Hsu
Graduate: Chien-Shing Chen
Authors: Charu C. Aggarwal,
Fatima Al-Garawi,
Philip S. Yu
April 2001, Proceedings of the Tenth International Conference on World Wide Web, ACM
Intelligent Database Systems Lab
N.Y.U.S.T.
I.M.
Outline
• Motivation
• Objective
• Introduction
• General Framework
• Content, URL Token, Link, Sibling
• Implementation Issues
• Experimental Results
• Conclusions
• Personal Opinion
• Review
Motivation


With the rapid growth of the World Wide Web, the
problem of resource collection on the web has
become very relevant in the past few years.
A key technique among the proposed ideas is
focused crawling, which is able to crawl particular
topical portions of the World Wide Web quickly
without having to explore all web pages.
Objective


The learning crawler is capable of reusing the
knowledge gained in a given crawl in order to
provide more efficient crawling for closely
related predicates.
It starts at a few general starting points and
collects all web pages which are relevant to the
user-specified predicate.
Introduction

The work on focused crawling assumes two key properties:
• Linkage Locality: Web pages on a given topic are more
likely to link to those on the same topic.
• Sibling Locality: If a web page points to certain web pages
on a given topic, then it is likely to point to other pages on
the same topic.
Introduction


The intelligent crawler uses the inlinking web page content,
candidate URL structure, or other behaviors of the inlinking
web pages or siblings in order to estimate the probability that
a candidate is useful for a given crawl.
A predicate is implemented as a subroutine which uses the
content and URL string of a web page in order to determine
whether or not it is relevant to the crawl.
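To make the idea of a predicate as a subroutine concrete, here is a minimal sketch. The keyword test, the `Predicate` type alias, and the names `make_keyword_predicate` and `is_ski_page` are illustrative assumptions, not taken from the paper; any function of the page content and URL string would serve.

```python
from typing import Callable, Iterable

# A predicate takes a page's text content and its URL string and
# returns True when the page is relevant to the crawl.
Predicate = Callable[[str, str], bool]


def make_keyword_predicate(keywords: Iterable[str]) -> Predicate:
    """Build a hypothetical predicate: a page is relevant if any keyword
    appears in its content or in its URL (case-insensitive)."""
    lowered = [k.lower() for k in keywords]

    def predicate(content: str, url: str) -> bool:
        text = content.lower()
        address = url.lower()
        return any(k in text or k in address for k in lowered)

    return predicate


# Example: a toy predicate for skiing-related pages.
is_ski_page = make_keyword_predicate(["ski"])
```

Because the crawler only calls the predicate, it can be swapped for a classifier or any other test without changing the rest of the crawl.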
Introduction


Users may not be aware of the best possible starting points
which are representatives of the predicate.
It is also clear that a crawler which is intelligent enough to
start at a few general pages and is still able to selectively mine
web pages which are most suitable to a given predicate is
more valuable from the perspective of resource discovery.
Introduction

Each (candidate) web page can be characterized by a large
number of features such as the content of the inlinking pages,
tokens in a given candidate URL, the predicate satisfaction of
the inlinking web pages, and sibling predicate satisfaction.
General Framework

A web page is said to be a candidate when it has not yet been
crawled, but some web page which links to it has already been
crawled.
2.1 Statistical Model Creation


In order to create an effective statistical model of the
linkage structure, what are we really searching for?
We are looking for specific features of a web page which
make it more likely that the page links to a given topic.
3. Overview of Statistical Model



The model input consists of the features and linkage structure
of that portion of the world wide web which has already been
crawled so far, and the output is a priority order which
determines how the candidate URLs are to be visited.
The model takes a set of features of the given web page and
computes a priority order for that web page using this information.
The set of features may be any of the types which have been
enumerated above including the content information, URL
tokens, linkage or sibling information.
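The priority order described above could be kept in a heap-based frontier of candidate URLs. The following is a minimal sketch, assuming priorities are plain floats where larger means "visit sooner"; the class name `Frontier` and its methods are illustrative and not taken from the paper.

```python
import heapq


class Frontier:
    """Candidate URLs ordered by priority, highest priority first."""

    def __init__(self):
        self._heap = []        # entries: (-priority, insertion_order, url)
        self._best = {}        # url -> best priority currently queued
        self._visited = set()  # URLs already handed out for crawling
        self._counter = 0      # tie-breaker so equal priorities pop FIFO

    def add(self, url, priority):
        """Queue a candidate, or raise its priority if already queued."""
        if url in self._visited:
            return
        if url in self._best and self._best[url] >= priority:
            return
        self._best[url] = priority
        heapq.heappush(self._heap, (-priority, self._counter, url))
        self._counter += 1

    def pop(self):
        """Return the highest-priority uncrawled URL, or None if empty."""
        while self._heap:
            neg_priority, _, url = heapq.heappop(self._heap)
            if url in self._visited or self._best.get(url) != -neg_priority:
                continue  # stale entry superseded by a later add()
            self._visited.add(url)
            del self._best[url]
            return url
        return None
```

Re-adding a URL with a higher priority leaves the old heap entry in place and skips it lazily on `pop`, which keeps updates cheap as the learned statistics change.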




The content of the web pages which are known to link to the
candidate URL (the set of words)
URL tokens from the candidate URL.
The nature of the inlinking web pages of a given candidate
URL.
The number of siblings of a candidate which have already
been crawled that satisfy the predicate.
3.1 Probabilistic Model for Priorities


We discuss the probabilistic model for calculation of the
priorities.
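For example, if 0.1% of all web pages satisfy the predicate, but
5% of candidates satisfy it when a particular word occurs in an
inlinking page, then the interest ratio for that word occurring
is 5% / 0.1% = 50.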



P(C) is equal to the probability that the web page will indeed
satisfy the user-defined predicate if it is crawled.
Our knowledge of the event E may increase the probability
that the web page satisfies the predicate.
When the event E is favorable to the probability of the
candidate satisfying the predicate, then the interest ratio I(C,E)
is larger than 1.
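The interest ratio can be read as I(C,E) = P(C|E) / P(C), which is consistent with the 5% / 0.1% = 50 example. A minimal sketch of that calculation follows; the function names and the count-based estimator are illustrative assumptions.

```python
def interest_ratio(p_c_given_e: float, p_c: float) -> float:
    """I(C, E) = P(C | E) / P(C).

    Values above 1 mean the event E makes predicate satisfaction more
    likely; values below 1 mean it makes it less likely.
    """
    if not 0.0 < p_c <= 1.0:
        raise ValueError("P(C) must be in (0, 1]")
    return p_c_given_e / p_c


def interest_ratio_from_counts(n_total, n_c, n_e, n_c_and_e):
    """Estimate I(C, E) from crawl counts (an assumed estimator):
    n_total pages crawled, n_c satisfying the predicate, n_e for which
    E holds, and n_c_and_e for which both hold."""
    if n_total == 0 or n_c == 0 or n_e == 0:
        return 1.0  # no evidence yet: treat E as neutral
    return interest_ratio(n_c_and_e / n_e, n_c / n_total)
```

For instance, `interest_ratio(0.05, 0.001)` is approximately 50.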

We will now proceed to examine the different factors which
are used for the purpose of intelligent crawling.
3.2 Content Based Learning



In order to identify the value of the content in determining
the predicate satisfaction of a given candidate page, we find
the set of words in the web pages which link to it (inlinking
web pages).
These words are then used in order to decide the importance
of the candidate page in terms of determining whether or not
it satisfies the predicate.
We define the event Qi to be true when the word i is present
in one of the web pages pointing to the candidate.



In fact, since most words are unlikely to have much bearing
on the probability of predicate satisfaction, the filtering of
such features is important in reduction of the noise effects.
Therefore, we use only those words which have high
statistical significance.
We calculate the level of significance at which it is more
likely for them to satisfy the predicate.

The interest ratio for content based learning is denoted by
Ic(C), and is calculated as the product of the interest ratios of
the different words in any of the inlinking web pages which
also happen to satisfy the statistical significance condition.

A threshold value of t = 2 corresponds to about a 95% level of
statistical significance.
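The content based calculation can be sketched as a filtered product. The significance test below is an assumption (the number of standard errors by which the observed P(C|Qi) differs from P(C)), chosen so that t = 2 matches the roughly 95% level mentioned above; the function names and the `word_stats` layout are hypothetical.

```python
import math


def significance(p_c_given_q: float, p_c: float, n_q: int) -> float:
    """Assumed test: how many standard errors the observed P(C | Qi),
    measured over n_q pages containing word i, lies from P(C)."""
    std_err = math.sqrt(p_c * (1.0 - p_c) / n_q)
    return abs(p_c_given_q - p_c) / std_err


def content_interest_ratio(word_stats, p_c: float, t: float = 2.0) -> float:
    """Ic(C): product of per-word interest ratios P(C | Qi) / P(C) over
    words in the inlinking pages whose significance is at least t.

    word_stats maps each word to (P(C | Qi), n_q).
    """
    if not 0.0 < p_c < 1.0:
        raise ValueError("P(C) must be strictly between 0 and 1")
    ratio = 1.0
    for p_c_given_q, n_q in word_stats.values():
        if n_q > 0 and significance(p_c_given_q, p_c, n_q) >= t:
            ratio *= p_c_given_q / p_c
    return ratio
```

Words that fail the threshold contribute a factor of 1, so noisy words leave the priority unchanged.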
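Term (English gloss)    Weight
政權 (regime)    0.00356069
伊拉克 (Iraq)    0.00330150
總統 (president)    0.00231992
巴格達 (Baghdad)    0.00181096
美國 (United States)    0.00172695
戰爭 (war)    0.00172395
聯軍 (coalition forces)    0.00157083
美軍 (U.S. military)    0.00151840
報導 (report)    0.00111369
毀滅性 (destructive)    0.00104308
飛彈 (missile)    0.00099588
季辛吉 (Kissinger)    0.00088652
聯合國 (United Nations)    0.00083930
動武 (use of force)    0.00079925
罷黜 (depose)    0.00067016
盟邦 (allies)    0.00061155
反戰 (anti-war)    0.00053638
攻擊 (attack)    0.00052048
恐怖份子 (terrorist)    0.00051897
克里特 (Crete)    0.00049826
攻打 (assault)    0.00045837
垮台 (collapse)    0.00045536
生擒 (capture alive)    0.00043996
國際 (international)    0.00043659
決議案 (resolution)    0.00042998
專電 (special dispatch)    0.00042519
波斯灣 (Persian Gulf)    0.00042404
落網 (apprehended)    0.00041105
戰事 (hostilities)    0.00040360
中央社 (Central News Agency)    0.00039791
推翻 (overthrow)    0.00039642
核子武器 (nuclear weapons)    0.00038653
相關 (related)    0.00037714
資訊 (information)    0.00037649
外電 (foreign wire reports)    0.00036076
白宮 (White House)    0.00034855
新聞 (news)    0.00034673
編譯 (compiled translation)    0.00034366
台灣 (Taiwan)    0.00031297
軍事行動 (military action)    0.00030796
獨裁者 (dictator)    0.00028048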
3.3 URL Token Based Learning


A URL which contains the word “ski” in it is more likely to
be a web page about skiing related information.
• Example: www.skiing.com/
We define the event Ri to be true when token i is present in
the candidate URL.
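One way to obtain the tokens is to split the URL on non-alphanumeric characters. This is a minimal sketch under that assumption; the slides do not state the exact tokenization rule, and `url_tokens` is an illustrative name.

```python
import re
from typing import Set


def url_tokens(url: str) -> Set[str]:
    """Split a URL into lowercase tokens on any non-alphanumeric
    character, after dropping a leading scheme such as 'http://'."""
    without_scheme = re.sub(r"^[A-Za-z][A-Za-z0-9+.-]*://", "", url)
    return {
        token.lower()
        for token in re.split(r"[^A-Za-z0-9]+", without_scheme)
        if token
    }
```

Under this rule `www.skiing.com/` yields the tokens `www`, `skiing`, and `com`. Note that the token is `skiing`, not `ski`, so matching the word "ski" would need substring matching or stemming on top of this.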
3.4 Link Based Learning


The idea in link based learning is somewhat similar to that of
the focused crawler discussed earlier.
The intelligent crawler tries to learn the significance of link
based information during the crawl itself.




Consider a web page which is pointed to by k other web
pages, m of which satisfy the predicate, and k - m of which do
not.
For each of the m web pages which satisfy the predicate, the
corresponding interest ratio is given by [equation not preserved in the transcript].
Similarly, for each of the k - m web pages which do not satisfy
the predicate, the corresponding interest ratio is given by [equation not preserved in the transcript].
The final interest ratio: [equation not preserved in the transcript].
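The link based formulas did not survive in this transcript, so the following is only a sketch of one plausible reading. It assumes the per-link ratios combine multiplicatively, as in the content based case, and it estimates each per-link ratio as P(C | parent type) / P(C) from link counts observed so far. The symbols `p` and `q` and both function names are hypothetical.

```python
def per_link_ratio(links_to_satisfying: int, links_total: int,
                   p_c: float) -> float:
    """Assumed estimator: the fraction of links leaving one type of
    parent (satisfying or not) that reach a satisfying page, divided
    by P(C)."""
    if links_total == 0 or p_c <= 0.0:
        return 1.0  # no evidence yet: treat the link as neutral
    return (links_to_satisfying / links_total) / p_c


def link_interest_ratio(k: int, m: int, p: float, q: float) -> float:
    """Assumed combination: p for each of the m satisfying parents and
    q for each of the k - m non-satisfying parents, multiplied."""
    if not 0 <= m <= k:
        raise ValueError("m must be between 0 and k")
    return (p ** m) * (q ** (k - m))
```

With p = 3.0 and q = 0.5, a candidate with k = 3 parents of which m = 2 satisfy the predicate gets 3.0 * 3.0 * 0.5 = 4.5.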
3.5 Sibling Based Learning


For instance, consider a candidate that has 15 (already) visited
siblings of which 9 satisfy the predicate. If the web were
random, and if P(C) = 0.1, the number of siblings we expect to
satisfy the predicate is 15 · P(C) = 1.5.
Since a higher number of siblings satisfy the predicate (i.e.
9 > 1.5), this is indicative that one or more parents might be a
hub, and this increases the probability of the candidate
satisfying the predicate.

If s is the number of siblings that satisfy the
predicate, and e is the number expected under the
random assumption, then s / e > 1 suggests
that the candidate is likely to satisfy the
predicate.
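Taking s / e as the sibling based interest ratio, the calculation can be sketched as follows. The function name and the neutral value returned when there is no evidence are illustrative assumptions.

```python
def sibling_interest_ratio(satisfying_siblings: int,
                           visited_siblings: int,
                           p_c: float) -> float:
    """s / e, where s is the number of visited siblings satisfying the
    predicate and e = visited_siblings * P(C) is the number expected
    if the web were random."""
    expected = visited_siblings * p_c
    if expected == 0:
        return 1.0  # no visited siblings (or P(C) = 0): stay neutral
    return satisfying_siblings / expected
```

For the example of 15 visited siblings, 9 of which satisfy the predicate, with P(C) = 0.1, this gives 9 / 1.5 = 6, well above 1.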
3.6 Combining the Preferences

The aggregate interest ratio is a (weighted) product of the
interest ratios for each of the individual factors.

Here the values wc, wu, wl, and ws are weights which are
used in order to normalize the different factors.




c : Content based learning
u : URL token based learning
l : Link based learning
s : Sibling based learning
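A weighted product of the four factors can be computed as a weighted sum of logarithms. The sketch below assumes the weights act as exponents on the individual interest ratios; the dictionary keys follow the c, u, l, s labels above, and the function name is illustrative.

```python
import math


def aggregate_interest_ratio(ratios, weights):
    """Weighted product of interest ratios: the product over factors x
    of ratios[x] ** weights[x], computed as exp(sum of w * log(I)).

    Both arguments are dicts keyed by factor label ("c", "u", "l", "s").
    """
    log_sum = 0.0
    for factor, ratio in ratios.items():
        if ratio <= 0.0:
            return 0.0  # a zero ratio rules the candidate out entirely
        log_sum += weights[factor] * math.log(ratio)
    return math.exp(log_sum)
```

For example, ratios of 4, 2, 1, and 9 for c, u, l, and s with weights 0.5, 1, 1, and 0.5 combine to 2 * 2 * 1 * 3 = 12. Working with the log sum also makes it easy to rank candidates without exponentiating.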
3.6 Combining the Preferences



By increasing the weight of a given factor, we can increase
the importance of the corresponding priority.
In our particular implementation, we chose to use weights
such that each priority value was almost equally balanced,
when averaged over all the currently available candidates.
predicate, starting seed
4. Implementation Issues
5. Empirical Results


The primary factor which was used to evaluate the
performance of the crawling system was the harvest rate P(C),
which is the percentage of the web pages crawled satisfying
the predicate.
We ran experiments using the different learning factors over
different kinds of predicates and starting points (seeds).
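The harvest rate is simple to track during a crawl. A minimal sketch, assuming each crawled page is recorded as a boolean predicate outcome; `harvest_rate` is an illustrative name.

```python
def harvest_rate(outcomes) -> float:
    """Fraction of crawled pages that satisfied the predicate.

    outcomes is an iterable of booleans, one per crawled page.
    """
    results = list(outcomes)
    if not results:
        return 0.0  # nothing crawled yet
    return sum(1 for satisfied in results if satisfied) / len(results)
```

Multiplying the result by 100 gives the percentage form used above.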

In addition, we tested the crawling system with a
wide variety of predicates and starting points
and ran the system for about 10,000 page
crawls in each case.
6. Conclusions and Summary


In this paper, we proposed an intelligent
crawling technique which uses a self-learning
mechanism that can dynamically adapt to the
particular structure of the relevant predicate.
Based on these different factors, we were able
to create a composite crawler which is able to
perform robustly across different predicates.
Review

Combining
• Content
• URL Token
• Link
• Sibling