Department of Computer Science and Engineering, IIT Delhi

“Report on ‘Towards a Query Optimizer for Text-Centric Tasks’ by
Panos Ipeirotis, Eugene Agichtein, Pranay Jain and Luis Gravano”
Avinandan Sengupta, Varun Malhotra
1 Introduction
Processing textual data to derive structured relations from unstructured text forms an important task in
information extraction applications as well as in focused crawlers that explore the Web to locate pages
relevant to specific topics. Such text-centric tasks can be broadly grouped into two categories based on the
technique employed to retrieve the information content. In the first category, a crawler based approach is
adopted, in which automated agents scan the documents in the text database; whereas in the second category
a query based technique is used in which queries are submitted to search engines and the relevant information
is extracted from the obtained results.
The choice between crawl- and query-based execution plans can have a substantial impact on both execution time and recall. Nevertheless, this choice is typically ad hoc, based on heuristics or plain intuition.
In this article, the authors introduce fundamental building blocks for the optimization of text-centric tasks and propose a disciplined methodology that can be used to create a query optimizer for text-centric tasks.
1.1 Motivation
Instead of relying solely on intuition or empirical knowledge, the authors develop models for analyzing
query and crawl based techniques for a task in terms of both execution time and output recall, and use the
analysis to determine the right approach for a particular text centric task [3].
To analyze crawl-based plans, the authors apply techniques from statistics to model crawling as a document sampling process. To analyze query-based plans, the authors first abstract the querying process as
a random walk on a querying graph, and then apply results from the theory of random graphs to discover
relevant properties of the querying process. The resultant cost model reflects the fact that the performance
of the execution plans depends on fundamental task-specific properties of the underlying text databases. The
authors identify these properties and present efficient techniques for estimating the associated parameters of
the cost model.
1.2 Classification of Text Centric Tasks
Text centric tasks can be broadly classified into the following types:
• Task 1 - Information extraction: This task is specifically associated with extracting structured information embedded within unstructured text. Such information can be used for answering relational
queries or for data mining. Information extraction systems typically rely on patterns (either manually
created or learned from training examples) to extract the structured information from the documents
in a database.
• Task 2 - Content summary construction: Often valuable information in text databases is not available publicly and is hidden behind search interfaces. This prevents general web search engines (e.g. Google) from accessing and displaying results from such databases. To provide effective search over such databases,
metasearchers are used. Metasearch tools allow users to search over many databases at once through a
unified query interface [1]. A critical step for a metasearcher to process a query efficiently and effectively
is the selection of the most promising databases for the query. This step typically relies on statistical
summaries of the database contents. The content summary of a database generally lists each word that
appears in the database, together with its frequency.
If full access is allowed to the contents of a database, a crawl (scan) based strategy can be applied to derive these simple content summaries. On the other hand, a query-based strategy is applied for constructing the content summary if access to the database contents is through a limited search interface.
• Task 3 - Focused resource discovery: Text databases often contain documents on a variety of topics.
Focused resource discovery is the identification of the database documents that are about the topic of a
specialized search engine (pertaining to a particular subject, e.g. computer science). An expensive strategy in this case would crawl all documents on the Web and apply a document classifier [5] to each crawled page to decide whether it is about the subject in question (and hence should be indexed) or not
(and hence should be ignored). As an alternative execution strategy, focused crawlers [2] concentrate
their effort on documents and hyperlinks that are on-topic, or likely to lead to on-topic documents, as
determined by a number of heuristics. Focused crawlers can then address the focused resource discovery
task efficiently at the expense of potentially missing relevant documents. As yet another alternative, a
query-based approach can be used for this task, where search engine indexes are exploited using queries
derived from a document classifier.
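The word-plus-frequency content summary described under Task 2 can be sketched in a few lines. The whitespace tokenization and document-frequency counting below are simplifying assumptions made only for this illustration, not the method of any particular metasearcher:

```python
from collections import Counter

def content_summary(documents):
    """Build a simple content summary: each word that appears in the
    database, together with its document frequency (the number of
    documents that contain the word)."""
    summary = Counter()
    for doc in documents:
        # Count each word at most once per document (document frequency).
        summary.update(set(doc.lower().split()))
    return summary

db = ["databases store data", "query optimizers speed up databases"]
print(content_summary(db)["databases"])  # appears in 2 documents
```

A metasearcher would consult such summaries to route a query to the databases whose vocabularies best match it.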
1.3 Modelling Text-Centric Tasks
1.3.1 Execution Time
Consider a text-centric task, a database of text documents D, and an execution strategy S for the task, with an underlying document processor P. The execution time of S over D, Time(S, D), is defined as:
Time(S, D) = tT(S) + Σ_{q∈Qsent} tQ(q) + Σ_{d∈Dretr} (tR(d) + tF(d)) + Σ_{d∈Dproc} tP(d)   (1)
where
• Qsent : set of queries sent by S,
• Dretr : set of documents retrieved by S (Dretr ⊆ D),
• Dproc : set of documents that S processes with document processor P (Dproc ⊆ D),
• tT(S) : time for training the execution strategy S,
• tQ(q) : time for evaluating a query q,
• tR(d) : time for retrieving a document d,
• tF(d) : time for filtering a retrieved document d, and
• tP(d) : time for processing a document d with P.
Assuming that the time to evaluate a query is constant across queries, i.e., tQ = tQ(q), ∀q ∈ Qsent, and that the time to retrieve, filter, or process a single document is constant across documents, i.e., tR = tR(d), tF = tF(d), tP = tP(d), ∀d ∈ D, we have:
Time(S, D) = tT(S) + tQ · |Qsent| + (tR + tF) · |Dretr| + tP · |Dproc|   (2)
1.3.2 Recall
Recall of an execution strategy S, with a document processor P, on a database of text documents D is defined as:
Recall(S, D) = |Tokens(P, Dproc)| / |Tokens(P, D)|   (3)
where
• D : database of text documents
• P : document processor
• Dproc : set of documents from D that S processes with P
• Tokens(P, D) : set of tokens that the document processor P extracts from the set of documents D
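Under this definition, recall can be computed directly once a document processor is fixed. The sketch below uses a toy processor (whitespace tokenization) purely for illustration:

```python
def recall(processor, d_proc, d_all):
    """Recall(S, D) = |Tokens(P, D_proc)| / |Tokens(P, D)|  (Equation 3)."""
    def tokens(docs):
        # Union of the token sets extracted from each document.
        out = set()
        for doc in docs:
            out |= set(processor(doc))
        return out
    return len(tokens(d_proc)) / len(tokens(d_all))

proc = lambda doc: doc.split()      # toy document processor P
d_all = ["a b c", "c d", "e f"]     # the whole database D
d_proc = ["a b c", "c d"]           # documents processed by strategy S
print(recall(proc, d_proc, d_all))  # 4 of the 6 distinct tokens seen
```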
1.3.3 Problem Formulation
Based on the definitions of execution time and recall for text-centric tasks, the selection of an execution strategy S from a set of alternative strategies S1, ..., Sn given a target recall τ is governed by the following equations:
Recall(S, D) ≥ τ   (4)
and
Time(S, D) ≤ Time(Sj, D)   ∀ Sj : Recall(Sj, D) ≥ τ   (5)
In other words, the goal is to identify an execution strategy S that is the fastest across the alternative
strategies that reach the recall target τ for the task.
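The selection rule in Equations (4) and (5) amounts to a constrained argmin. A minimal sketch, assuming time and recall estimates are already available for each candidate strategy (the numbers below are hypothetical):

```python
def select_strategy(estimates, tau):
    """Pick the fastest strategy among those whose estimated recall
    reaches the target tau (Equations 4 and 5).

    estimates: dict mapping strategy name -> (est_time, est_recall)
    """
    feasible = {s: t for s, (t, r) in estimates.items() if r >= tau}
    if not feasible:
        return None  # no strategy reaches the recall target
    return min(feasible, key=feasible.get)

est = {"SC": (500.0, 1.00), "FS": (200.0, 0.85),
       "ISE": (80.0, 0.60), "AQG": (120.0, 0.70)}
print(select_strategy(est, 0.7))  # AQG: cheapest with recall >= 0.7
```

Note that the answer depends on τ: with a stricter target (say τ = 0.95) only the exhaustive Scan remains feasible.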
1.4 Execution Strategies
1.4.1 Scan
The Scan (SC) strategy is a crawl-based strategy that processes each document in a database D exhaustively until the number of tokens extracted satisfies the target recall τ. The Scan execution strategy does not need training and does not send any queries to the database. Hence, tT(SC) = 0 and |Qsent| = 0. Furthermore, Scan does not apply any filtering, hence tF = 0 and |Dproc| = |Dretr|. Therefore, the execution time of Scan is:
Time(SC, D) = |Dretr| · (tR + tP)   (6)
1.4.2 Filtered Scan
The Filtered Scan (FS) strategy is a variation of the basic Scan strategy. Filtered Scan first uses a classifier C to decide whether a document d is useful. The training time tT(FS) for Filtered Scan is equal to the time required to build the classifier C for a specific task. Training represents a one-time cost for a task, so in a repeated execution of the task the classifier will be available with tT(FS) = 0. Since Filtered Scan does not send any queries, |Qsent| = 0. Though Filtered Scan retrieves and classifies |Dretr| documents, it processes only Cσ · |Dretr| documents, where Cσ is the selectivity of the classifier C, defined as the fraction of database documents that C judges as useful. Thus the execution time of Filtered Scan is:
Time(FS, D) = |Dretr| · (tR + tF + Cσ · tP)   (7)
1.4.3 Iterative Set Expansion
Iterative Set Expansion (ISE) is a query-based strategy that queries a database with tokens as they are discovered, starting with a typically small set of user-provided seed tokens Tokensseed. The intuition behind this strategy is that known tokens might lead to unseen tokens via documents that contain both seen and unseen tokens. Queries are derived from the tokens in a task-specific way. Iterative Set Expansion has no training phase; hence tT(ISE) = 0. We assume that Iterative Set Expansion has to send |Qsent| queries to reach the target recall. Since Iterative Set Expansion processes all the documents that it retrieves, tF = 0 and |Dproc| = |Dretr|. Thus:
Time(ISE, D) = |Qsent| · tQ + |Dretr| · (tR + tP)   (8)
1.4.4 Automatic Query Generation
Automatic Query Generation (AQG) is a query-based strategy for retrieving useful documents for a task.
AQG works in two stages:
1. Query Generation Stage: In this stage a classifier is trained to categorize documents as useful or not
for the task; then, rule-extraction algorithms derive queries from the classifier.
2. Execution Stage: In this stage AQG searches a database using queries that are expected to retrieve
useful documents.
The training time for Automatic Query Generation involves downloading a training set Dtrain of documents and processing them with P, incurring a cost of |Dtrain| · (tR + tP). Training time also includes the time for the actual training of the classifier. Training represents a one-time cost for a task, so in a repeated execution of the task the classifier will be available with tT(AQG) = 0. During execution, AQG sends |Qsent| queries and retrieves |Dretr| documents, which are then all processed by P. Thus:
Time(AQG, D) = |Qsent| · tQ + |Dretr| · (tR + tP)   (9)
1.5 Previous Work
Information extraction tasks traditionally use the scan strategy where every document is processed by the
information extraction system, whereas some systems use the Filtered Scan strategy, based on preliminary
regular-expression-based URL pattern matchers. In their prior work, the authors presented query-based execution strategies for Task 1 using the Iterative Set Expansion and Automatic Query Generation strategies.
For Task 2, variants of Iterative Set Expansion and Automatic Query Generation have been employed.
In these cases it has been observed that over large crawl-able databases, where both query and crawl-based
strategies are possible, query-based strategies outperform crawl-based approaches for a related database
classification task.
2 Proposed Solutions
In this paper the authors propose algorithms that select the best possible execution strategy (or strategies)
for a given text-centric task, and apply that strategy to perform the task. The selection of an execution
strategy is done by estimating the time taken to complete the task using the various alternative strategies
(i.e. SC, FS, ISE, AQG) followed by selection of the strategy which takes the minimum time to reach the
required recall level.
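The per-strategy cost formulas (Equations 6-9) can be collected into one small helper for this estimation step. The parameter values below are hypothetical, chosen only to exercise the formulas:

```python
def time_sc(d_retr, tR, tP):
    return d_retr * (tR + tP)                      # Equation (6)

def time_fs(d_retr, tR, tF, tP, c_sigma):
    return d_retr * (tR + tF + c_sigma * tP)       # Equation (7)

def time_ise(q_sent, d_retr, tQ, tR, tP):
    return q_sent * tQ + d_retr * (tR + tP)        # Equation (8)

def time_aqg(q_sent, d_retr, tQ, tR, tP):
    return q_sent * tQ + d_retr * (tR + tP)        # Equation (9)

# Hypothetical per-query and per-document times (seconds).
tQ, tR, tF, tP = 1.0, 2.0, 1.0, 10.0
print(time_sc(1000, tR, tP))                       # 12000.0
print(time_fs(1000, tR, tF, tP, c_sigma=0.5))      # 8000.0
print(time_ise(50, 400, tQ, tR, tP))               # 4850.0
```

The hard part, of course, is not evaluating these formulas but estimating |Qsent| and |Dretr| for a given recall target, which is what the models in this section provide.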
2.1 Costs of Execution Strategies
To estimate the time taken by each execution strategy, the authors propose execution models for SC, FS, ISE,
and AQG. These models are based on the document degree g(d), token degree t(d), and query degree q(d)
for a given document corpus D [3]. Since the exact distributions of these degrees are not known a priori, the authors instead rely on the distribution families that these degrees tend to follow. Based on this knowledge, the authors argue that identifying the actual distribution is a matter of estimating a few parameters. Subsequently, the authors provide estimates for the number of tokens retrieved, E[Tokensretr], and the number of documents retrieved, |Dretr|, to achieve a target recall τ for the four execution strategies discussed in the
previous sections.
2.1.1 Cost of SC
To compute the number of documents retrieved by SC, the authors observe that SC retrieves documents in
no particular order and does not retrieve the same document twice, and that this process is equivalent to
sampling from a finite population [4].
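This sampling-without-replacement view is easy to simulate. The toy database below spreads tokens evenly across documents, which is an assumption made only for this illustration (real token degrees are skewed, which is precisely why the authors need the estimation machinery):

```python
import random

def docs_needed_for_recall(num_docs, tokens_per_doc, tau, seed=0):
    """Simulate Scan: draw documents uniformly without replacement and
    count how many draws are needed before a fraction tau of all
    tokens has been seen."""
    rng = random.Random(seed)
    docs = list(range(num_docs))
    rng.shuffle(docs)                   # random scan order
    total = num_docs * tokens_per_doc
    seen = 0
    for retrieved, _d in enumerate(docs, start=1):
        seen += tokens_per_doc          # each document contributes its tokens
        if seen / total >= tau:
            return retrieved
    return num_docs

# With tokens spread evenly, recall grows linearly in |D_retr|.
print(docs_needed_for_recall(1000, 5, 0.5))  # 500
```

With a skewed (e.g. power-law) token distribution the recall curve is no longer linear, and the expected |Dretr| must be derived from the distribution's parameters, as the authors do.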
2.1.2 Cost of FS
The cost of FS is based on the cost model of SC with two additional properties of the classifier, namely the
classifier selectivity Cσ and classifier recall Cr , taken into consideration.
2.1.3 Cost of ISE
The cost of ISE is derived from a graph-based representation of the querying process. To estimate the number of tokens and documents retrieved using ISE, the authors compute the relevant properties of the querying graph using the theory of random graphs. The analysis follows the methodology suggested by Newman et al., and uses generating functions to describe the properties of querying graphs [6].
2.1.4 Cost of AQG
The cost of AQG is derived by estimating the recall after a set of Q queries has been sent and the number of documents retrieved at that point. The authors assume that the queries are biased only towards retrieving useful documents and not towards any other particular set of documents, and propose that the queries are conditionally independent within the set of documents Duseful and within the rest of the documents Duseless. Based on this, the authors derive an estimate of the number of documents retrieved by AQG.
To compute the recall of AQG after issuing Q queries, the authors use an approach similar to the one used for FS, and subsequently model AQG as sampling with replacement, where the sampling is over the Duseful set instead of D.
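Sampling with replacement over Duseful yields a standard expected-coverage formula. The sketch below is our own illustration of that modelling step, not the authors' exact derivation:

```python
def expected_distinct(useful, draws):
    """Expected number of distinct useful documents retrieved after
    `draws` independent uniform draws with replacement from a set of
    `useful` documents: n * (1 - (1 - 1/n)^k)."""
    return useful * (1.0 - (1.0 - 1.0 / useful) ** draws)

# Diminishing returns: later queries mostly re-retrieve known documents.
print(round(expected_distinct(1000, 1000), 1))   # ~632.3
print(round(expected_distinct(1000, 3000), 1))   # ~950.3
```

This diminishing-returns behaviour is what makes query-based strategies cheap at low recall targets but increasingly expensive as the target approaches 1.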
2.2 Parameter Estimation
Using the cost models presented for the various execution strategies, the authors introduce a global optimization algorithm in which an execution plan is selected that will reach the target recall in the minimum amount of time. The optimizer starts by choosing one of the execution plans and switches to the lowest-cost plan as the execution proceeds.
The cost models of the strategies presented rely on a number of parameters, which are generally unknown
before executing a task. Some of these parameters, such as classifier selectivity and recall, can be estimated
efficiently before the execution of the task; whereas other parameters, namely the token and document
distributions, are challenging to estimate. Rather than attempting to estimate these distributions without
prior information, the authors rely on the fact that for many text-centric tasks the general families of these distributions are known. Thus the estimation task reduces to estimating a few parameters of well-known distribution families. To estimate the parameters of a distribution family for a concrete text-centric task and database, instead of resorting to a preprocessing estimation phase before the start of the actual execution, the authors piggyback the estimation onto the initial steps of an actual execution of the task, exploiting the retrieved documents for on-the-fly parameter estimation.
2.3 Global Optimization
The authors propose a global optimization strategy in which the system starts off with an initial strategy
(which may not be optimal as this choice is made without accurate parameter estimates for the token and
document degree distributions), and as documents are retrieved and tokens extracted, the optimizer updates
the distribution parameters and cross validates the estimates.
At any point in time, if the estimated execution time for reaching the target recall, Time(S, D), of a
competing strategy S is smaller than that of the current strategy, then the optimizer switches to executing
the less expensive strategy, continuing from the execution point reached by the current strategy. The statistics
are refined after every N (=100) documents are processed.
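The switching loop described above can be sketched as follows. This is a toy sketch, not the authors' pseudocode: the ToyStrategy class, the per-batch recall gains, and the cost estimator are all hypothetical stand-ins:

```python
class ToyStrategy:
    """Toy strategy: each processed batch adds a fixed amount of recall."""
    def __init__(self, gain_per_batch):
        self.gain = gain_per_batch

    def run(self, recall_so_far, batch):
        # Progress is shared: a switch continues from the point reached.
        return min(1.0, recall_so_far + self.gain)

def global_optimizer(strategies, tau, estimate_times, batch=100):
    """After every `batch` documents, refine the time estimates and
    switch to the cheapest strategy if it beats the current one."""
    recall, history = 0.0, []
    while recall < tau:
        times = estimate_times(recall)         # refined estimates
        current = min(times, key=times.get)    # cheapest plan right now
        history.append(current)
        recall = strategies[current].run(recall, batch)
    return recall, history

strategies = {"SC": ToyStrategy(0.125), "ISE": ToyStrategy(0.25)}
# Hypothetical estimator: ISE looks cheap at low recall, SC at high recall.
est = lambda r: {"SC": 100 - 50 * r, "ISE": 60 + 80 * r}
recall, history = global_optimizer(strategies, tau=0.8, estimate_times=est)
print(history)   # ['ISE', 'ISE', 'SC', 'SC', 'SC']
```

The trace shows the behaviour the authors describe: the optimizer starts with the query-based plan while it looks cheap and switches to scanning once the refined estimates favour it.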
Global Optimizer Algorithm
2.4 Local Optimization
Rather than choosing the best strategy for reaching a target recall τ , the local optimization approach partitions the execution into recall stages and successively identifies the best strategy for each stage. Therefore,
the local optimization approach chooses the best execution strategy for extracting the first k tokens, for some
predefined value of k, then identifies the best execution strategy for extracting the next k tokens, and so on,
until the target recall τ is reached. Hence, the local optimization approach can be regarded as invoking the
global optimization approach repeatedly, each time to find the best strategy for extracting the next k tokens.
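The staged structure can be sketched as below. For simplicity the stages here are expressed in recall units rather than the paper's k tokens (the two are equivalent up to normalization by the total token count), and the per-stage chooser stands in for a full invocation of the global optimizer:

```python
def local_optimizer(tau, best_for_stage, k):
    """Partition the execution into stages of k recall each and pick
    the best strategy for every stage."""
    recall, plan = 0.0, []
    while recall < tau:
        stage_target = min(tau, recall + k)
        plan.append((best_for_stage(recall), stage_target))
        recall = stage_target
    return plan

# Hypothetical per-stage choice: query-based early, crawl-based later.
choose = lambda r: "ISE" if r < 0.3 else "SC"
print(local_optimizer(tau=0.5, best_for_stage=choose, k=0.125))
# [('ISE', 0.125), ('ISE', 0.25), ('ISE', 0.375), ('SC', 0.5)]
```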
Local Optimizer Algorithm
3 Discussion
The paper introduces several new ideas and novel techniques and is very well written. Its strength lies in the mathematical modelling of the various text-centric tasks and in the experimental evidence that backs up the formulated models. Also interesting is the experimental data on the behaviour of the global and local optimizer algorithms, which is aligned with the claims the authors make.
It has been pointed out in the paper that the ISE strategy can be employed to quickly proceed towards the
target recall, but suffers from the drawback that certain sections of the querying graph might not be reachable
from the set of initial seed tokens. To ameliorate the problem one might consider a hybrid approach, where
the AQG technique is employed to obtain queries that can be used to retrieve documents, and subsequently
tokens, to help the ISE executor recover from the reachability issue.
Still, there are certain assumptions which could be challenged, certain decisions which could be questioned.
We list some of our observations.
3.1 Observations
• Target Problem: Throughout the paper, the target the authors consider is minimizing execution time to achieve a given recall value. Other possible targets that could be desirable include:
1. Maximizing recall given a fixed execution time
2. Given a fixed recall target, minimizing the time to achieve some good fraction (say 90 percent) of it, i.e., solving an approximate version of the problem.
• Document Processor: The following assumptions have been made about the document processor:
1. Perfection: The processor is assumed to retrieve all the tokens, and all of them are assumed to be correct. The case of a noisy processor, and what should be done then, is not considered.
2. Sequential: Sequential processing of documents is assumed; how concurrency and parallelism could affect performance is not explored.
• Theoretical Bounds: The authors do not provide any theoretical bounds or formal proofs to establish the efficiency of the two algorithms they propose. In fact, the only basis for claiming that the algorithms perform better is experimental evaluation; thus the proposed algorithms are, in principle, heuristic in nature.
• Implementation Details: The authors are vague about implementation details. They do not discuss the data structures, etc., that were used for the proposed algorithms.
• Execution Strategies: Only four execution strategies are considered in the paper, and all subsequent analysis is based on these four. There are various other strategies one could think of; for example, new strategies can be formed by combining the ones mentioned, such as a hybrid of ISE and AQG.
• Magic Numbers: N (the number of documents after which the global optimizer reconsiders its execution plan) and maxD (a constant used in ISE and AQG) have been magically set to the value 100 in the experimental phase.
• Probability Distribution: The authors assume that the token, document, and query degree distributions are already known to be power-law distributions, and thus take a parametric approach. What if the degrees are not distributed as assumed? A non-parametric approach, i.e., one not assuming a prior distribution, could have been taken instead.
• Parameter Estimation: Estimating the parameters of the assumed distributions also consumes computational resources, which is not incorporated anywhere in the execution time analysis. The authors use a Maximum Likelihood Estimation (MLE) approach to estimate parameters, but could also have used potentially more accurate alternatives such as the Maximum A Posteriori (MAP) estimate or the Bayesian estimate.
• Training: All the training phases in the various strategies are assumed to be offline. The option of online training has not been considered; e.g., in Task 3 (Focused Resource Discovery), training time will be significant if we consider change of topics as a possibility.
• Calculations: Many simplifying assumptions are made in the various calculations involved; e.g., the time of querying is considered independent of the query, and the times of processing, filtering, and retrieving are considered independent of the document.
• Query Processing Unit: No details are given regarding how the queries are processed (Boolean model / inverted index); the query processing time is simply taken as a constant.
4 Conclusion
In this paper the authors discuss three different text-centric tasks (information extraction, content summary
construction, focused resource discovery) and present disciplined approaches towards selection of execution
strategies for these tasks. They establish mathematical models for these strategies and employ statistical and
graph-theoretic techniques to establish the costs associated with each of these strategies. Based on these, the authors propose the global optimizer and local optimizer algorithms, which help achieve a specific recall target in the minimum possible time.
This work establishes a framework for selecting execution strategies based on mathematical models, a selection hitherto made on the basis of empirical knowledge or pure intuition. A strong experimental section backs up the claims made by the authors.
References
[1] James P. Callan, Zhihong Lu, and W. Bruce Croft. Searching distributed collections with inference
networks. In Proceedings of the 18th annual international ACM SIGIR conference on Research and
development in information retrieval, SIGIR ’95, pages 21–28, New York, NY, USA, 1995. ACM.
[2] Soumen Chakrabarti, Martin van den Berg, and Byron Dom. Focused crawling: a new approach to
topic-specific web resource discovery. Comput. Netw., 31:1623–1640, May 1999.
[3] Panagiotis G. Ipeirotis, Pranay Jain, and Luis Gravano. Towards a query optimizer for text-centric tasks.
ACM Transactions on Database Systems, 32, 2007.
[4] Sheldon M. Ross. Introduction to Probability Models, Ninth Edition. Academic Press, Inc., Orlando, FL,
USA, 2006.
[5] Fabrizio Sebastiani. Machine learning in automated text categorization. ACM Comput. Surv., 34:1–47,
March 2002.
[6] Herbert S. Wilf. generatingfunctionology. Academic Press Professional, Inc., San Diego, CA, USA, 1990.