Download Multiple-goal Search Algorithms and their Application to Web Crawling

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Yield curve wikipedia , lookup

Fixed-income attribution wikipedia , lookup

Transcript
Multiple-goal Search Algorithms
and their Application to Web
Crawling
Dmitry Davidov and Shaul Markovitch
Computer Science Department
Technion, Haifa 32000, Israel
Introduction
What does the paper talk about?

Subject: Multiple goal Search

Application Domain: Web Crawling
Representation
Web is viewed as a large graph
Page
 Link

: nodes
: arcs
Web Crawling
Graph Searching
(Kumar et al. 2000)
Do the traditional graph search algorithms work?
characteristics
A set of goal states
Success criteria

Don’t complete as soon as a single goal is found

To collect as many goals as possible
Heuristics
How about traditional heuristic ?

Most are based on a heuristic function that
estimates the distance from a node to the
nearest goal node.

Not useful for multiple-goal search
Example
Method of Experimentation and
Evaluation
Two alternative stopping criteria

When it spends a given allocation of resource

When it finds a given portion of the goal states
Accordingly we have two evaluation methods


Number of goal states being found using the given
resource
Resources spent for finding the required portion
Sum-of-distance heuristic
Distance heuristic does not take into account goal
concentration
Given SG, we use the sum of distance to SG
One problem: we tend to try to progress towards all the goals
simultaneously.
One possible remedy to the problem: giving higher weight to
progress
Front Advancement
Given either explicitly goal list SG or a set of
distance heuristics to goals or goal groups
Instead of measuring the global progress towards
the whole goal set, we measure the global progress
towards each of the goals or goal groups and
prefer steps that lead to progress towards more
goals.
H (n)  g  SG | h(n, g )  min cClosed h(c, g )
Yield Heuristic
Definition:

Deal explicitly with the expected cost and expected benefit of
searching from the given node.

We prefer subgraphs where the cost is low and the benefit is high.

We would like high return for our resource investment.

A heuristic that tries to estimate this return is a yield heuristic
Yield Heuristic Application
It can be used in the traditional heuristic search
algorithms such as best-first search
One difference is the stopping criteria: when a
goal is encountered, the algorithm collects it and
continues.
Multiple-goal best-first search
Two Simple Yield Estimation
Pessimistic estimation
Y (n, SG )  N  SG N
Optimistic estimation
Y (n, SG )  N  SG g (n, N , SG )

Can include a depth limit d , on both methods
Side-effect of Yield Heuristic
The found goals continue to attract the search
front while we would have preferred that the
search would would progress towards
undiscovered goals
Reduce the weight of the subset of the discovered
goal.
Goal Elimination
Learning yield heuristics
In many domains, such heuristics are very difficult to
design
We can use learning approach to acquire such yield
heuristics


Accumulate partial yield information for every node in the search graph.
Assume that the yield of the explored part of a subtree is a good predictor
for the yield of the unexplored part
Accumulate yield statistics for explored nodes. Create a set of examples
of nodes with high yield and nodes with low yield. Apply an induction
algorithm to infer a classifier for identifying nodes with high yield.
Inferring from partial yield
Partial yield of node n at time t
Need a predefined depth limit D
Partial yield is used to do the estimation



We use the partial yield of a node to estimate the expected yield of
its brothers and their descendants.
For computing the partial yield we keep in the node, for each depth,
the number of nodes generated and the number of goals discovered
so far to this depth.
When generating a new node, we initialize the counters for the
node and recursively update its ancestors up to D levels above.
Inferring from partial yield


Remember to avoid updating an updated ancestor twice due to
multiple paths. We must mark already updated ancestors.
The algorithm is shown in Fig. 3
Partial yield estimation



The estimated yield of a node is the average yields of its supported
children(those with sufficiently large expanded subtrees).
If there are no such children, the yield is estimated(recursively) by the
average yield of the node parents. The depth values are adjusted
appropriately.
The algorithm is shown in Fig. 4
Updating the Partial yield
Yield Estimation Algorithm
Generalizing yield
Key idea : to explore domain-specific features and use
it to induct the estimated yield with some induction
algorithm.
Method : to infer the yield function; Or simply
distinguish between states with high yield and low
yield.
Learning Cost Discussion :
Application to WWW domain
Subject : Focused Crawling in the Web
Task : to find as many goal pages as
possible using limited resources.
Various other issues related to
implementation.
Experimental Methodology
Most are done on Web domain

Experiment is not done on the real web
Dynamic, changing over time
 Need enormous time

A reduced web collection from stanford.edu

350,000 valid and accessible HTML pages
Experimental Methodology
Algorithm Compared:
Standard BFS
 Best First minimal distance
 Best First sum & goal elimination
 Best First front advancement & goal
elimination

Front Advancement in WWW
Learning yield
Combining heuristic and yield
Conclusions
Presents a new framework for heuristic
search : Multiple goal search
Introduces the yield heuristic, and two
methods of online learning of the yield
heuristic
The framework is applicable for a wide
range of problems
The End
Thank You !!!