To Search or to Crawl?
Towards a Query Optimizer for Text-Centric Tasks
Panos Ipeirotis – New York University
Eugene Agichtein – Microsoft Research
Pranay Jain – Columbia University
Luis Gravano – Columbia University
Text-Centric Task I: Information Extraction

Information extraction applications extract structured relations from unstructured text.

  May 19 1995, Atlanta -- The Centers for Disease Control
  and Prevention, which is in the front line of the world's
  response to the deadly Ebola epidemic in Zaire,
  is finding itself hard pressed to cope with the crisis…

An information extraction system (e.g., NYU's Proteus) turns such articles into the relation Disease Outbreaks in The New York Times:

  Date        Disease Name      Location
  Jan. 1995   Malaria           Ethiopia
  July 1995   Mad Cow Disease   U.K.
  Feb. 1995   Pneumonia         U.S.
  May 1995    Ebola             Zaire

(See also yesterday's Information Extraction tutorial by AnHai Doan, Raghu Ramakrishnan, and Shivakumar Vaithyanathan.)
Other Text-Centric Tasks

 Task II: Database Selection
 Task III: Focused Crawling

Details in the paper.
An Abstract View of Text-Centric Tasks

[Diagram: text database → extraction system → output tokens]

1. Retrieve documents from database
2. Process documents
3. Extract output tokens

What counts as an output token depends on the task:

  Task                    Token
  Information Extraction  Relation Tuple
  Database Selection      Word (+ Frequency)
  Focused Crawling        Web Page about a Topic

This abstract view is used for the rest of the talk.
Executing a Text-Centric Task

There are two major execution paradigms:
 Scan-based: Retrieve and process documents sequentially
 Index-based: Query the database (e.g., [case fatality rate]), then retrieve and process the documents in the results

As in the relational world, the underlying data distribution dictates which plan is best.

Unlike the relational world:
 Indexes are only "approximate": the index is on keywords, not on the tokens of interest
 The choice of execution plan affects output completeness (not only speed)
Execution Plan Characteristics

Question: How do we choose the fastest execution plan for reaching a target recall?

Execution plans have two main characteristics:
 Execution Time
 Recall (the fraction of tokens retrieved)

"What is the fastest plan for discovering 10% of the disease outbreaks mentioned in The New York Times archive?"
Outline

 Description and analysis of crawl- and query-based plans
   – Crawl-based: Scan, Filtered Scan
   – Query-based (index-based): Iterative Set Expansion, Automatic Query Generation
 Optimization strategy
 Experimental results and conclusions
Scan

Scan retrieves and processes documents sequentially (until reaching the target recall):
1. Retrieve docs from database
2. Process documents
3. Extract output tokens

Execution time = |Retrieved Docs| · (R + P)

where R is the time for retrieving a document and P is the time for processing a document.

Question: How many documents does Scan retrieve to reach the target recall?

Filtered Scan uses a classifier to identify and process only promising documents (details in the paper).
Estimating Recall of Scan

Modeling Scan for a token t (e.g., <SARS, China>):
 What is the probability of seeing t (with frequency g(t)) after retrieving S of the database's D documents?
 This is a "sampling without replacement" process
 After retrieving S documents, the frequency of token t follows a hypergeometric distribution
 Recall for token t is the probability that the frequency of t in the S documents is greater than 0
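In symbols (our restatement of the standard hypergeometric identity, with D the total number of documents and g(t) the number of documents that contain t):

\[
\mathrm{Recall}(t, S) \;=\; \Pr\bigl[\text{$t$ appears in the $S$ retrieved documents}\bigr] \;=\; 1 \;-\; \frac{\binom{D - g(t)}{S}}{\binom{D}{S}}
\]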
Estimating Recall of Scan (continued)

Modeling Scan over all tokens t1, …, tM (e.g., <SARS, China>, <Ebola, Zaire>):
 Multiple "sampling without replacement" processes run in parallel, one for each token
 Overall recall is the average recall across tokens

→ We can therefore compute the number of documents required to reach the target recall, and from it:

Execution time = |Retrieved Docs| · (R + P)
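A minimal sketch (ours, not the authors' code) of how both quantities follow from the token frequencies g(t); `token_freqs`, `D`, and the function names are illustrative:

```python
from math import comb

def scan_recall(token_freqs, S, D):
    """Expected recall after Scan retrieves S of the D documents:
    the average, over tokens, of the hypergeometric probability that
    a token appearing in g documents shows up at least once."""
    return sum(1 - comb(D - g, S) / comb(D, S)
               for g in token_freqs) / len(token_freqs)

def docs_for_target_recall(token_freqs, D, target):
    """Smallest S whose expected recall reaches the target; recall
    is monotone in S, so binary search applies."""
    lo, hi = 1, D
    while lo < hi:
        mid = (lo + hi) // 2
        if scan_recall(token_freqs, mid, D) >= target:
            hi = mid
        else:
            lo = mid + 1
    return lo
```

The plan's estimated execution time is then docs_for_target_recall(...) · (R + P).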
Outline

 Description and analysis of crawl- and query-based plans
   – Crawl-based: Scan, Filtered Scan
   – Query-based: Iterative Set Expansion, Automatic Query Generation
 Optimization strategy
 Experimental results and conclusions
Iterative Set Expansion

1. Query the database with seed tokens, transformed into keyword queries (e.g., [Ebola AND Zaire])
2. Process the retrieved documents
3. Extract tokens from the documents (e.g., <Malaria, Ethiopia>)
4. Augment the seed tokens with the new tokens, and repeat

Execution time = |Retrieved Docs| · (R + P) + |Queries| · Q

where R is the time for retrieving a document, P is the time for processing a document, and Q is the time for answering a query.

Question: How many queries and how many documents does Iterative Set Expansion need to reach the target recall?
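The loop itself is simple; here is a rough sketch (query_db, extract, and all names are hypothetical hooks for the search interface and the extraction system, not the paper's API):

```python
from collections import deque

def iterative_set_expansion(query_db, extract, seeds):
    """Sketch: query with known tokens, process the returned
    documents, and feed newly extracted tokens back as queries."""
    tokens = set(seeds)
    frontier = deque(seeds)
    seen_docs = set()
    while frontier:
        query = frontier.popleft()      # 1. query database, e.g. [Ebola AND Zaire]
        for doc in query_db(query):
            if doc in seen_docs:
                continue
            seen_docs.add(doc)          # 2. process retrieved documents
            for tok in extract(doc):    # 3. extract tokens from docs
                if tok not in tokens:
                    tokens.add(tok)     # 4. augment seed tokens
                    frontier.append(tok)
    return tokens
```

The analysis question above amounts to predicting |seen_docs| and |tokens| as this loop runs.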
Querying Graph

The querying graph is a bipartite graph containing tokens and documents:
 Each token (transformed into a keyword query) retrieves documents
 Documents contain tokens

[Diagram: tokens t1 <SARS, China>, t2 <Ebola, Zaire>, t3 <Malaria, Ethiopia>, t4 <Cholera, Sudan>, t5 <H5N1, Vietnam> linked to documents d1–d5]
Using the Querying Graph for Analysis

We need to compute:
 the number of documents retrieved after sending Q tokens as queries (estimates time), and
 the number of tokens that appear in the retrieved documents (estimates recall).

To estimate these, we need to compute:
 the degree distribution of the tokens discovered by retrieving documents, and
 the degree distribution of the documents retrieved by the tokens.
(These are not the same as the degree distributions of a randomly chosen token or document: it is easier to discover documents and tokens with high degrees.)

An elegant analysis framework based on generating functions handles this; details are in the paper.
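Very roughly (our gloss of the generating-function style of analysis, not the paper's exact derivation): if p_k is the probability that a token retrieves k documents, the distribution is encoded as

\[
G(x) \;=\; \sum_{k} p_k\, x^k ,
\]

and composing the token-side and document-side generating functions yields the expected number of documents retrieved, and of new tokens discovered, per round of querying.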
Recall Limit: Reachability Graph

The querying graph induces a reachability graph on tokens: t1 retrieves document d1, which contains t2, so t1 reaches t2.

[Diagram: querying graph over t1–t5 and d1–d5, and the induced reachability graph]

Upper recall limit: determined by the size of the biggest connected component.
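A small sketch of how this recall ceiling could be computed if the graph were known; retrieves and contains are hypothetical adjacency maps (token → documents it retrieves, document → tokens it contains):

```python
from collections import deque

def recall_ceiling(retrieves, contains, tokens):
    """Fraction of tokens in the largest set reachable from any
    single starting token: an upper bound on the recall of
    Iterative Set Expansion."""
    best = 0
    for start in tokens:
        seen = {start}
        queue = deque([start])
        while queue:
            t = queue.popleft()
            for d in retrieves.get(t, ()):     # t retrieves document d ...
                for v in contains.get(d, ()):  # ... which contains token v
                    if v not in seen:
                        seen.add(v)
                        queue.append(v)
        best = max(best, len(seen))
    return best / len(tokens)
```

In practice the graph is not known up front, which is why the paper estimates its degree distributions instead.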
Automatic Query Generation

 Iterative Set Expansion has a recall limitation due to the iterative nature of its query generation
 Automatic Query Generation avoids this problem by creating queries offline (using machine learning) that are designed to return documents containing tokens

Details in the paper.
Outline

 Description and analysis of crawl- and query-based plans
 Optimization strategy
 Experimental results and conclusions
Summary of Cost Analysis

Our analysis so far:
 takes as input a target recall, and
 gives as output the time for each plan to reach that target recall (time = infinity if a plan cannot reach it).

Time and recall depend on task-specific properties of the database:
 the token degree distribution, and
 the document degree distribution.

Next, we show how to estimate the degree distributions on the fly.
Estimating Cost Model Parameters

Token and document degree distributions belong to known distribution families:

  Task                          Document Distribution   Token Distribution
  Information Extraction        Power-law               Power-law
  Content Summary Construction  Lognormal               Power-law (Zipf)
  Focused Resource Discovery    Uniform                 Uniform

[Log-log plots: number of documents vs. document degree and number of tokens vs. token degree, with power-law fits y = 43060·x^-3.3863 and y = 5492.2·x^-2.0254]

So we can characterize the distributions with only a few parameters!
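As a rough sketch of how such fits can be obtained (ordinary least squares on log-log scale, matching the straight-line fits on the slide; maximum-likelihood estimators are generally preferable for heavy tails):

```python
import numpy as np

def fit_power_law(degrees, counts):
    """Fit counts ≈ a * degrees**b by linear regression in log-log
    space; returns (a, b), in the style of the slide's fitted
    curves y = 43060·x^-3.3863 and y = 5492.2·x^-2.0254."""
    b, log_a = np.polyfit(np.log(degrees), np.log(counts), 1)
    return np.exp(log_a), b
```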
Parameter Estimation

Naïve solution for parameter estimation:
 Start with a separate "parameter-estimation" phase
 Perform random sampling on the database
 Stop when cross-validation indicates high confidence

We can do better than this:
 There is no need for a separate sampling phase
 Sampling is equivalent to executing the task
→ Piggyback parameter estimation onto task execution
On-the-fly Parameter Estimation

 Pick the most promising execution plan for the target recall, assuming "default" parameter values
 Start executing the task
 Update the parameter estimates during execution (moving from the initial default estimates toward the correct, but unknown, distribution)
 Switch plans if the updated statistics indicate so

Important: only Scan acts as "random sampling"; all other execution plans need parameter adjustment (see paper).
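Sketched as Python pseudocode (plans, estimate_cost, and execute_step are hypothetical hooks; the paper gives the actual adjustments each plan needs):

```python
def optimize_and_execute(plans, estimate_cost, target_recall, params):
    """Pick the plan that looks fastest under the current parameter
    estimates, execute it incrementally, refine the estimates, and
    switch plans whenever the updated statistics favor another one."""
    current = min(plans, key=lambda p: estimate_cost(p, params, target_recall))
    recall = 0.0
    while recall < target_recall:
        recall, params = current.execute_step(params)  # also updates estimates
        best = min(plans, key=lambda p: estimate_cost(p, params, target_recall))
        if best is not current:
            current = best                             # switch execution plans
    return current
```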
Outline

 Description and analysis of crawl- and query-based plans
 Optimization strategy
 Experimental results and conclusions
Correctness of Theoretical Analysis

[Plot: execution time in seconds (log scale, 100 to 100,000) vs. recall (0.0 to 1.0) for Scan, Filtered Scan, Automatic Query Generation, and Iterative Set Expansion; solid lines show actual time, dotted lines show the time predicted with correct parameters]

Task: Disease Outbreaks
 Snowball IE system
 182,531 documents from The New York Times
 16,921 tokens
Experimental Results (Information Extraction)

[Plot: execution time in seconds (log scale) vs. recall for Scan, Filtered Scan, Iterative Set Expansion, Automatic Query Generation, and OPTIMIZED; solid lines show actual time, and the green line shows the time with the optimizer]

(Results are similar in the other experiments; see paper.)
Conclusions

 Common execution plans for multiple text-centric tasks
 Analytic models for predicting the execution time and recall of various crawl- and query-based plans
 Techniques for on-the-fly parameter estimation
 An optimization framework that picks, on the fly, the fastest plan for a target recall
Future Work

 Incorporate the precision and recall of the extraction system into the framework
 Create a non-parametric optimizer (i.e., one with no assumptions about distribution families)
 Examine other text-centric tasks and analyze new execution plans
 Create an adaptive, "next-K" optimizer
Thank you!

Representative prior work, by task and execution plan:
 Information Extraction: Filtered Scan – Grishman et al., J. of Biomed. Inf. 2002; Iterative Set Expansion – Agichtein and Gravano, ICDE 2003; Automatic Query Generation – Agichtein and Gravano, ICDE 2003
 Content Summary Construction: Iterative Set Expansion – Callan et al., SIGMOD 1999; Automatic Query Generation – Ipeirotis and Gravano, VLDB 2002
 Focused Resource Discovery: Filtered Scan – Chakrabarti et al., WWW 1999; Automatic Query Generation – Cohen and Singer, AAAI WIBIS 1996