Department of Computer Science and Engineering, IIT Delhi

“Report on ‘Towards a Query Optimizer for Text-Centric Tasks’ by
Panos Ipeirotis, Eugene Agichtein, Pranay Jain and Luis Gravano”
Avinandan Sengupta, Varun Malhotra
1 Introduction
Processing textual data to derive structured relations from unstructured text forms an important task in
information extraction applications as well as in focused crawlers that explore the Web to locate pages
relevant to specific topics. Such text-centric tasks can be broadly grouped into two categories based on the
technique employed to retrieve the information content. In the first category, a crawler based approach is
adopted, in which automated agents scan the documents in the text database; whereas in the second category
a query based technique is used in which queries are submitted to search engines and the relevant information
is extracted from the obtained results.
The choice between crawl- and query-based execution plans can have a substantial impact on both execution time and recall. Nevertheless, this choice is typically ad hoc, based on heuristics or plain intuition.
In this article, the authors introduce fundamental building blocks for the optimization of text-centric tasks and propose a disciplined methodology that can be used to create a query optimizer for text-centric tasks.
1.1 Motivation
Instead of relying solely on intuition or empirical knowledge, the authors develop models for analyzing
query and crawl based techniques for a task in terms of both execution time and output recall, and use the
analysis to determine the right approach for a particular text centric task [3].
To analyze crawl-based plans, the authors apply techniques from statistics to model crawling as a document sampling process. To analyze query-based plans, the authors first abstract the querying process as
a random walk on a querying graph, and then apply results from the theory of random graphs to discover
relevant properties of the querying process. The resultant cost model reflects the fact that the performance
of the execution plans depends on fundamental task-specific properties of the underlying text databases. The
authors identify these properties and present efficient techniques for estimating the associated parameters of
the cost model.
1.2 Classification of Text Centric Tasks
Text centric tasks can be broadly classified into the following types:
• Task 1 - Information extraction: This task is specifically associated with extracting structured information embedded within unstructured text. Such information can be used for answering relational
queries or for data mining. Information extraction systems typically rely on patterns (either manually
created or learned from training examples) to extract the structured information from the documents
in a database.
• Task 2 - Content summary construction: Often valuable information in text databases is not available publicly and is hidden behind search interfaces. This prevents general web search engines (e.g. Google) from accessing and displaying results from such databases. To provide effective search over such databases,
metasearchers are used. Metasearch tools allow users to search over many databases at once through a
unified query interface [1]. A critical step for a metasearcher to process a query efficiently and effectively
is the selection of the most promising databases for the query. This step typically relies on statistical
summaries of the database contents. The content summary of a database generally lists each word that
appears in the database, together with its frequency.
If full access is allowed to the contents of a database, a crawl (scan) based strategy can be applied to derive these simple content summaries. On the other hand, a query-based strategy is applied for constructing the content summary if access to the database contents is through a limited search interface.
• Task 3 - Focused resource discovery: Text databases often contain documents on a variety of topics.
Focused resource discovery is the identification of the database documents that are about the topic of a
specialized search engine (pertaining to a particular subject, e.g. computer science). An expensive strategy in this case would crawl all documents on the Web and apply a document classifier [5] to each crawled page to decide whether it is about the subject in question (and hence should be indexed) or not
(and hence should be ignored). As an alternative execution strategy, focused crawlers [2] concentrate
their effort on documents and hyperlinks that are on-topic, or likely to lead to on-topic documents, as
determined by a number of heuristics. Focused crawlers can then address the focused resource discovery
task efficiently at the expense of potentially missing relevant documents. As yet another alternative, a
query-based approach can be used for this task, where search engine indexes are exploited using queries
derived from a document classifier.
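The word-plus-frequency content summary described under Task 2 can be sketched in a few lines. The whitespace tokenization and document-frequency counting below are simplifying assumptions made only for this illustration, not the method of any particular metasearcher:

```python
from collections import Counter

def content_summary(documents):
    """Build a simple content summary: each word that appears in the
    database, together with its document frequency (the number of
    documents that contain the word)."""
    summary = Counter()
    for doc in documents:
        # Count each word at most once per document (document frequency).
        summary.update(set(doc.lower().split()))
    return summary

db = ["databases store data", "query optimizers speed up databases"]
print(content_summary(db)["databases"])  # appears in 2 documents
```

A metasearcher would consult such summaries to route a query to the databases whose vocabularies best match it.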
1.3 Modelling Text-Centric Tasks
1.3.1 Execution Time
Consider a text-centric task, a database of text documents D, and an execution strategy S for the task, with an underlying document processor P. The execution time of S over D, Time(S, D), is defined as:
Time(S, D) = tT(S) + Σ_{q∈Qsent} tQ(q) + Σ_{d∈Dretr} (tR(d) + tF(d)) + Σ_{d∈Dproc} tP(d)   (1)
where
• Qsent : set of queries sent by S,
• Dretr : set of documents retrieved by S (Dretr ⊆ D),
• Dproc : set of documents that S processes with document processor P (Dproc ⊆ D),
• tT(S) : time for training the execution strategy S,
• tQ(q) : time for evaluating a query q,
• tR(d) : time for retrieving a document d,
• tF(d) : time for filtering a retrieved document d, and
• tP(d) : time for processing a document d with P.
Assuming that the time to evaluate a query is constant across queries, i.e., tQ = tQ(q), ∀q ∈ Qsent, and that the time to retrieve, filter, or process a single document is constant across documents, i.e., tR = tR(d), tF = tF(d), tP = tP(d), ∀d ∈ D, we have:
Time(S, D) = tT(S) + tQ · |Qsent| + (tR + tF) · |Dretr| + tP · |Dproc|   (2)
1.3.2 Recall
Recall of an execution strategy S, with a document processor P, on a database of text documents D is defined as:
Recall(S, D) = |Tokens(P, Dproc)| / |Tokens(P, D)|   (3)
where
• D : database of text documents
• P : document processor
• Dproc : set of documents from D that S processes with P
• Tokens(P, D) : set of tokens that the document processor P extracts from the set of documents D
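Under this definition, recall can be computed directly once a document processor is fixed. The sketch below uses a toy processor (whitespace tokenization) purely for illustration:

```python
def recall(processor, d_proc, d_all):
    """Recall(S, D) = |Tokens(P, D_proc)| / |Tokens(P, D)|  (Equation 3)."""
    def tokens(docs):
        # Union of the token sets extracted from each document.
        out = set()
        for doc in docs:
            out |= set(processor(doc))
        return out
    return len(tokens(d_proc)) / len(tokens(d_all))

proc = lambda doc: doc.split()      # toy document processor P
d_all = ["a b c", "c d", "e f"]     # the whole database D
d_proc = ["a b c", "c d"]           # documents processed by strategy S
print(recall(proc, d_proc, d_all))  # 4 of the 6 distinct tokens seen
```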
1.3.3 Problem Formulation
Based on the definitions of execution time and recall for text-centric tasks, the selection of an execution strategy S from a set of alternative strategies S1, ..., Sn given a target recall τ is governed by the following equations:
Recall(S, D) ≥ τ   (4)
and
Time(S, D) ≤ Time(Sj, D)   ∀ Sj : Recall(Sj, D) ≥ τ   (5)
In other words, the goal is to identify an execution strategy S that is the fastest across the alternative
strategies that reach the recall target τ for the task.
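The selection rule in Equations (4) and (5) amounts to a constrained argmin. A minimal sketch, assuming time and recall estimates are already available for each candidate strategy (the numbers below are hypothetical):

```python
def select_strategy(estimates, tau):
    """Pick the fastest strategy among those whose estimated recall
    reaches the target tau (Equations 4 and 5).

    estimates: dict mapping strategy name -> (est_time, est_recall)
    """
    feasible = {s: t for s, (t, r) in estimates.items() if r >= tau}
    if not feasible:
        return None  # no strategy reaches the recall target
    return min(feasible, key=feasible.get)

est = {"SC": (500.0, 1.00), "FS": (200.0, 0.85),
       "ISE": (80.0, 0.60), "AQG": (120.0, 0.70)}
print(select_strategy(est, 0.7))  # AQG: cheapest with recall >= 0.7
```

Note that the answer depends on τ: with a stricter target (say τ = 0.95) only the exhaustive Scan remains feasible.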
1.4 Execution Strategies
1.4.1 Scan
The Scan (SC) strategy is a crawl-based strategy that processes each document in a database D exhaustively until the number of tokens extracted satisfies the target recall τ. The Scan execution strategy does not need training and does not send any queries to the database. Hence, tT(SC) = 0 and |Qsent| = 0. Furthermore, Scan does not apply any filtering, hence tF = 0 and |Dproc| = |Dretr|. Therefore, the execution time of Scan is:
Time(SC, D) = |Dretr| · (tR + tP)   (6)
1.4.2 Filtered Scan
The Filtered Scan (FS) strategy is a variation of the basic Scan strategy. Filtered Scan first uses a classifier C to decide whether a document d is useful. The training time tT(FS) for Filtered Scan is equal to the time required to build the classifier C for a specific task. Training represents a one-time cost for a task, so in a repeated execution of the task the classifier will be available with tT(FS) = 0. Since Filtered Scan does not send any queries, |Qsent| = 0. Though Filtered Scan retrieves and classifies |Dretr| documents, it processes only Cσ · |Dretr| documents, where Cσ is the selectivity of the classifier C, defined as the fraction of database documents that C judges as useful. Thus the execution time of Filtered Scan is:
Time(FS, D) = |Dretr| · (tR + tF + Cσ · tP)   (7)
1.4.3 Iterative Set Expansion
Iterative Set Expansion (ISE) is a query-based strategy that queries a database with tokens as they are discovered, starting with a typically small set of user-provided seed tokens Tokensseed. The intuition behind this strategy is that known tokens might lead to unseen tokens via documents that contain both seen and unseen tokens. Queries are derived from the tokens in a task-specific way. Iterative Set Expansion has no training phase; hence tT(ISE) = 0. We assume that Iterative Set Expansion has to send |Qsent| queries to reach the target recall. Since Iterative Set Expansion processes all the documents that it retrieves, tF = 0 and |Dproc| = |Dretr|. Thus:
Time(ISE, D) = |Qsent| · tQ + |Dretr| · (tR + tP)   (8)
1.4.4 Automatic Query Generation
Automatic Query Generation (AQG) is a query-based strategy for retrieving useful documents for a task.
AQG works in two stages:
1. Query Generation Stage: In this stage a classifier is trained to categorize documents as useful or not
for the task; then, rule-extraction algorithms derive queries from the classifier.
2. Execution Stage: In this stage AQG searches a database using queries that are expected to retrieve
useful documents.
The training time for Automatic Query Generation involves downloading a training set Dtrain of documents and processing them with P, incurring a cost of |Dtrain| · (tR + tP). Training time also includes the time for the actual training of the classifier. Training represents a one-time cost for a task, so in a repeated execution of the task the classifier will be available with tT(AQG) = 0. During execution, AQG sends |Qsent| queries and retrieves |Dretr| documents, which are then all processed by P. Thus:
Time(AQG, D) = |Qsent| · tQ + |Dretr| · (tR + tP)   (9)
1.5 Previous Work
Information extraction tasks traditionally use the scan strategy where every document is processed by the
information extraction system, whereas some systems use the Filtered Scan strategy, based on preliminary
regular-expression-based URL pattern matchers. In their prior work, the authors presented query-based execution strategies for Task 1 using the Iterative Set Expansion and Automatic Query Generation strategies.
For Task 2, variants of Iterative Set Expansion and Automatic Query Generation have been employed.
In these cases it has been observed that over large crawl-able databases, where both query and crawl-based
strategies are possible, query-based strategies outperform crawl-based approaches for a related database
classification task.
2 Proposed Solutions
In this paper the authors propose algorithms that select the best possible execution strategy (or strategies)
for a given text-centric task, and apply that strategy to perform the task. The selection of an execution
strategy is done by estimating the time taken to complete the task using the various alternative strategies
(i.e. SC, FS, ISE, AQG) followed by selection of the strategy which takes the minimum time to reach the
required recall level.
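The per-strategy cost formulas (Equations 6-9) can be collected into one small helper for this estimation step. The parameter values below are hypothetical, chosen only to exercise the formulas:

```python
def time_sc(d_retr, tR, tP):
    return d_retr * (tR + tP)                      # Equation (6)

def time_fs(d_retr, tR, tF, tP, c_sigma):
    return d_retr * (tR + tF + c_sigma * tP)       # Equation (7)

def time_ise(q_sent, d_retr, tQ, tR, tP):
    return q_sent * tQ + d_retr * (tR + tP)        # Equation (8)

def time_aqg(q_sent, d_retr, tQ, tR, tP):
    return q_sent * tQ + d_retr * (tR + tP)        # Equation (9)

# Hypothetical per-query and per-document times (seconds).
tQ, tR, tF, tP = 1.0, 2.0, 1.0, 10.0
print(time_sc(1000, tR, tP))                       # 12000.0
print(time_fs(1000, tR, tF, tP, c_sigma=0.5))      # 8000.0
print(time_ise(50, 400, tQ, tR, tP))               # 4850.0
```

The hard part, of course, is not evaluating these formulas but estimating |Qsent| and |Dretr| for a given recall target, which is what the models in this section provide.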
2.1 Costs of Execution Strategies
To estimate the time taken by each execution strategy, the authors propose execution models for SC, FS, ISE,
and AQG. These models are based on the document degree g(d), token degree t(d), and query degree q(d)
for a given document corpus D [3]. Since the exact distributions of these degrees are not known a priori, the authors instead rely on the distribution families that these degrees tend to follow. Based on this knowledge, the authors argue that identifying the actual distribution is a matter of estimating a few parameters. Subsequently, the authors provide estimates for the number of tokens retrieved, E[Tokensretr], and the number of documents retrieved, |Dretr|, to achieve a target recall τ for the four execution strategies discussed in the
previous sections.
2.1.1 Cost of SC
To compute the number of documents retrieved by SC, the authors observe that SC retrieves documents in
no particular order and does not retrieve the same document twice, and that this process is equivalent to
sampling from a finite population [4].
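This sampling-without-replacement view is easy to simulate. The toy database below spreads tokens evenly across documents, which is an assumption made only for this illustration (real token degrees are skewed, which is precisely why the authors need the estimation machinery):

```python
import random

def docs_needed_for_recall(num_docs, tokens_per_doc, tau, seed=0):
    """Simulate Scan: draw documents uniformly without replacement and
    count how many draws are needed before a fraction tau of all
    tokens has been seen."""
    rng = random.Random(seed)
    docs = list(range(num_docs))
    rng.shuffle(docs)                   # random scan order
    total = num_docs * tokens_per_doc
    seen = 0
    for retrieved, _d in enumerate(docs, start=1):
        seen += tokens_per_doc          # each document contributes its tokens
        if seen / total >= tau:
            return retrieved
    return num_docs

# With tokens spread evenly, recall grows linearly in |D_retr|.
print(docs_needed_for_recall(1000, 5, 0.5))  # 500
```

With a skewed (e.g. power-law) token distribution the recall curve is no longer linear, and the expected |Dretr| must be derived from the distribution's parameters, as the authors do.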
2.1.2 Cost of FS
The cost of FS is based on the cost model of SC with two additional properties of the classifier, namely the
classifier selectivity Cσ and classifier recall Cr , taken into consideration.
2.1.3 Cost of ISE
The cost of ISE is derived from a graph-based representation of the querying process. To estimate the number of tokens and documents retrieved using ISE, the authors compute the relevant properties of the querying graph using the theory of random graphs. The analysis follows the methodology suggested by Newman et al., and uses generating functions to describe the properties of querying graphs [6].
2.1.4 Cost of AQG
The cost of AQG is derived by estimating the recall after a set of Q queries has been sent and the number of documents retrieved at that point. The authors assume that the queries are biased only towards retrieving useful documents and not towards any other particular set of documents, and propose that the queries are conditionally independent within the set of documents Duseful and within the rest of the documents Duseless. Based on this, the authors derive an estimate of the number of documents retrieved by AQG.
To compute the recall of AQG after issuing Q queries, the authors use an approach similar to the one used for FS, and subsequently model AQG as sampling with replacement, where the sampling is over the Duseful set instead of D.
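Sampling with replacement over Duseful yields a standard expected-coverage formula. The sketch below is our own illustration of that modelling step, not the authors' exact derivation:

```python
def expected_distinct(useful, draws):
    """Expected number of distinct useful documents retrieved after
    `draws` independent uniform draws with replacement from a set of
    `useful` documents: n * (1 - (1 - 1/n)^k)."""
    return useful * (1.0 - (1.0 - 1.0 / useful) ** draws)

# Diminishing returns: later queries mostly re-retrieve known documents.
print(round(expected_distinct(1000, 1000), 1))   # ~632.3
print(round(expected_distinct(1000, 3000), 1))   # ~950.3
```

This diminishing-returns behaviour is what makes query-based strategies cheap at low recall targets but increasingly expensive as the target approaches 1.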
2.2 Parameter Estimation
Using the cost models presented for the various execution strategies, the authors introduce a global optimization algorithm in which an execution plan is selected that will reach the target recall in the minimum amount of time. The optimizer starts by choosing one of the execution plans and switches to the lowest-cost plan as the execution proceeds.
The cost models of the strategies presented rely on a number of parameters, which are generally unknown
before executing a task. Some of these parameters, such as classifier selectivity and recall, can be estimated
efficiently before the execution of the task; whereas other parameters, namely the token and document
distributions, are challenging to estimate. Rather than attempting to estimate these distributions without
prior information, the authors rely on the fact that for many text-centric tasks the general families of these distributions are known. Thus the estimation task reduces to estimating a few parameters of well-known distribution families. To estimate the parameters of a distribution family for a concrete text-centric task and database, instead of resorting to a preprocessing estimation phase before the start of the actual execution, the authors piggyback the estimation onto the initial steps of an actual execution of the task, exploiting the retrieved documents for on-the-fly parameter estimation.
2.3 Global Optimization
The authors propose a global optimization strategy in which the system starts off with an initial strategy
(which may not be optimal as this choice is made without accurate parameter estimates for the token and
document degree distributions), and as documents are retrieved and tokens extracted, the optimizer updates
the distribution parameters and cross validates the estimates.
At any point in time, if the estimated execution time for reaching the target recall, Time(S, D), of a
competing strategy S is smaller than that of the current strategy, then the optimizer switches to executing
the less expensive strategy, continuing from the execution point reached by the current strategy. The statistics
are refined after every N (=100) documents are processed.
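The switching loop described above can be sketched as follows. This is a toy sketch, not the authors' pseudocode: the ToyStrategy class, the per-batch recall gains, and the cost estimator are all hypothetical stand-ins:

```python
class ToyStrategy:
    """Toy strategy: each processed batch adds a fixed amount of recall."""
    def __init__(self, gain_per_batch):
        self.gain = gain_per_batch

    def run(self, recall_so_far, batch):
        # Progress is shared: a switch continues from the point reached.
        return min(1.0, recall_so_far + self.gain)

def global_optimizer(strategies, tau, estimate_times, batch=100):
    """After every `batch` documents, refine the time estimates and
    switch to the cheapest strategy if it beats the current one."""
    recall, history = 0.0, []
    while recall < tau:
        times = estimate_times(recall)         # refined estimates
        current = min(times, key=times.get)    # cheapest plan right now
        history.append(current)
        recall = strategies[current].run(recall, batch)
    return recall, history

strategies = {"SC": ToyStrategy(0.125), "ISE": ToyStrategy(0.25)}
# Hypothetical estimator: ISE looks cheap at low recall, SC at high recall.
est = lambda r: {"SC": 100 - 50 * r, "ISE": 60 + 80 * r}
recall, history = global_optimizer(strategies, tau=0.8, estimate_times=est)
print(history)   # ['ISE', 'ISE', 'SC', 'SC', 'SC']
```

The trace shows the behaviour the authors describe: the optimizer starts with the query-based plan while it looks cheap and switches to scanning once the refined estimates favour it.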
Global Optimizer Algorithm
2.4 Local Optimization
Rather than choosing the best strategy for reaching a target recall τ , the local optimization approach partitions the execution into recall stages and successively identifies the best strategy for each stage. Therefore,
the local optimization approach chooses the best execution strategy for extracting the first k tokens, for some
predefined value of k, then identifies the best execution strategy for extracting the next k tokens, and so on,
until the target recall τ is reached. Hence, the local optimization approach can be regarded as invoking the
global optimization approach repeatedly, each time to find the best strategy for extracting the next k tokens.
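The staged structure can be sketched as below. For simplicity the stages here are expressed in recall units rather than the paper's k tokens (the two are equivalent up to normalization by the total token count), and the per-stage chooser stands in for a full invocation of the global optimizer:

```python
def local_optimizer(tau, best_for_stage, k):
    """Partition the execution into stages of k recall each and pick
    the best strategy for every stage."""
    recall, plan = 0.0, []
    while recall < tau:
        stage_target = min(tau, recall + k)
        plan.append((best_for_stage(recall), stage_target))
        recall = stage_target
    return plan

# Hypothetical per-stage choice: query-based early, crawl-based later.
choose = lambda r: "ISE" if r < 0.3 else "SC"
print(local_optimizer(tau=0.5, best_for_stage=choose, k=0.125))
# [('ISE', 0.125), ('ISE', 0.25), ('ISE', 0.375), ('SC', 0.5)]
```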
Local Optimizer Algorithm
3 Discussion
The paper introduces several new ideas and novel techniques and is very well written. Its strength lies in the mathematical modelling of the various text-centric tasks and in the experimental evidence that backs up the formulated models. Also interesting is the experimental data on the behaviour of the global and local optimizer algorithms, which is aligned with the claims the authors make.
It has been pointed out in the paper that the ISE strategy can be employed to quickly proceed towards the
target recall, but suffers from the drawback that certain sections of the querying graph might not be reachable
from the set of initial seed tokens. To ameliorate the problem one might consider a hybrid approach, where
the AQG technique is employed to obtain queries that can be used to retrieve documents, and subsequently
tokens, to help the ISE executor recover from the reachability issue.
Still, there are certain assumptions which could be challenged, certain decisions which could be questioned.
We list some of our observations.
3.1 Observations
• Target Problem: Throughout the paper, the target the authors consider is minimizing execution time to achieve a given recall value. Other possible targets that could be desirable include:
1. Maximizing recall given a fixed execution time
2. Given a fixed recall target, minimizing the time to achieve some good fraction (say 90 percent) of it, i.e., solving an approximate version of the problem.
• Document Processor: The following assumptions have been made about the document processor:
1. Perfection: The processor is assumed to retrieve all the tokens, and all of them are assumed to be correct. The case of a noisy processor, and what should be done then, is not considered.
2. Sequential: Sequential processing of documents is assumed; how concurrency and parallelism could affect performance is not explored.
• Theoretical Bounds: The authors do not provide any theoretical bounds or formal proofs to establish the efficiency of the two algorithms they propose. In fact, the only basis for claiming that the algorithms perform better is experimental evaluation; thus the proposed algorithms are, in principle, heuristic in nature.
• Implementation Details: The authors are vague about implementation details. They do not discuss the data structures, etc., that were used for the proposed algorithms.
• Execution Strategies: Only four execution strategies are considered in the paper, and all subsequent analysis is based on these four. There are various other strategies one could think of; for example, new strategies can be formed by combining the ones mentioned, such as a hybrid of ISE and AQG.
• Magic Numbers: N (the number of documents after which the global optimizer reconsiders its execution plan) and maxD (a constant used in ISE and AQG) have been magically set to the value 100 in the experimental phase.
• Probability Distribution: The authors assume that the token, document, and query degree distributions are already known to be power-law distributions, and thus take a parametric approach. What if the degrees are not distributed as assumed? A non-parametric approach, i.e., one not assuming a prior distribution, could have been taken instead.
• Parameter Estimation: Estimating the parameters of the assumed distributions also consumes computational resources, which is not incorporated anywhere in the execution time analysis. The authors use a Maximum Likelihood Estimation (MLE) approach to estimate parameters, but could also have used potentially more accurate alternatives such as the Maximum A Posteriori (MAP) estimate or the Bayesian estimate.
• Training: All the training phases in the various strategies are assumed to be offline. The option of online training has not been considered; e.g., in Task 3 (Focused Resource Discovery), training time will be significant if we consider change of topics as a possibility.
• Calculations: Many simplifying assumptions are made in the various calculations involved; e.g., the time of querying is considered independent of the query, and the times of processing, filtering, and retrieving are considered independent of the document.
• Query Processing Unit: No details are given regarding how the queries are processed (Boolean model / inverted index); the query processing time is simply taken as a constant.
4 Conclusion
In this paper the authors discuss three different text-centric tasks (information extraction, content summary
construction, focused resource discovery) and present disciplined approaches towards selection of execution
strategies for these tasks. They establish mathematical models for these strategies and employ statistical and
graph-theoretic techniques to establish the costs associated with each of these strategies. Based on these, the authors propose the global optimizer and local optimizer algorithms, which help achieve a specific recall target in the minimum possible time.
This work establishes a framework for selecting execution strategies based on mathematical models, a selection hitherto made on the basis of empirical knowledge or pure intuition. A strong experimental section backs up the claims made by the authors.
References
[1] James P. Callan, Zhihong Lu, and W. Bruce Croft. Searching distributed collections with inference
networks. In Proceedings of the 18th annual international ACM SIGIR conference on Research and
development in information retrieval, SIGIR ’95, pages 21–28, New York, NY, USA, 1995. ACM.
[2] Soumen Chakrabarti, Martin van den Berg, and Byron Dom. Focused crawling: a new approach to
topic-specific web resource discovery. Comput. Netw., 31:1623–1640, May 1999.
[3] Panagiotis G. Ipeirotis, Pranay Jain, and Luis Gravano. Towards a query optimizer for text-centric tasks.
ACM Transactions on Database Systems, 32, 2007.
[4] Sheldon M. Ross. Introduction to Probability Models, Ninth Edition. Academic Press, Inc., Orlando, FL,
USA, 2006.
[5] Fabrizio Sebastiani. Machine learning in automated text categorization. ACM Comput. Surv., 34:1–47,
March 2002.
[6] Herbert S. Wilf. generatingfunctionology. Academic Press Professional, Inc., San Diego, CA, USA, 1990.