Scalable Information Extraction
Eugene Agichtein
1
Example: Angina treatments
guideline for unstable angina
unstable angina management
herbal treatment for angina pain
medications for treating angina
alternative treatment for angina pain
treatment for angina
angina treatments
Sources: structured databases (e.g., drug info, WHO drug adverse effects DB, etc.), MedLine, PDR, medical reference and literature, and web search results
2
Research Goal
Accurate, intuitive, and efficient access
to knowledge in unstructured sources
Approaches:
- Information Retrieval: retrieve the relevant documents or passages; question answering
- Human Reading: construct domain-specific "verticals" (e.g., MedLine)
- Machine Reading: extract entities and relationships; build a network of relationships (Semantic Web)
3
Semantic Relationships “Buried” in Unstructured Text
"... A number of well-designed and executed large-scale clinical trials have now shown that treatment with statins reduces recurrent myocardial infarction, reduces strokes, and lessens the need for revascularization or hospitalization for unstable angina pectoris ..."

RecommendedTreatment relation:
Drug | Condition
statins | recurrent myocardial infarction
statins | strokes
statins | unstable angina pectoris

Sources: web, newsgroups, web logs; text databases (PubMed, CiteSeer, etc.); newspaper archives.
Example MUC (Message Understanding Conferences) tasks: corporate mergers, succession, location; terrorist attacks.
4
What Structured Representation
Can Do for You:
Large Text Collection → Structured Relation:
- ... allows precise and efficient querying
- ... allows returning answers instead of documents
- ... supports powerful query constructs
- ... allows data integration with (structured) RDBMS
- ... provides useful content for the Semantic Web
5
Challenges in Information Extraction
- Portability
  - Reduce effort to tune for new domains and tasks
  - MUC systems: experts would take 8-12 weeks to tune
- Scalability, Efficiency, Access
  - Enable information extraction over large collections
  - 1 sec/document * 5 billion docs = 158 CPU years
- Approach: learn from data ("Bootstrapping")
  - Snowball: Partially Supervised Information Extraction
  - Querying Large Text Databases for Efficient Information Extraction
6
Outline
- Snowball: partially supervised information extraction (overview and key results)
- Effective retrieval algorithms for information extraction (in detail)
- Current: mining user behavior for web search
- Future work
7
The Snowball System: Overview
[Diagram: Text Database → Snowball → extracted relation with confidence scores]

Organization | Location | Conf
Microsoft | Redmond | 1
IBM | Armonk | 1
Intel | Santa Clara | 1
AG Edwards | St Louis | 0.9
Air Canada | Montreal | 0.8
7th Level | Richardson | 0.8
3Com Corp | Santa Clara | 0.8
3DO | Redwood City | 0.7
3M | Minneapolis | 0.7
MacWorld | San Francisco | 0.7
... | ... | ...
157th Street | Manhattan | 0.52
15th Party Congress | China | 0.3
15th Century Europe | Dark Ages | 0.1
8
Snowball: Getting User Input
Pipeline stages: Get Examples → Find Example Occurrences in Text → Tag Entities → Generate Extraction Patterns → Extract Tuples → Evaluate Tuples. [ACM DL 2000]

User input:
- a handful of example instances, e.g.:
  Organization | Headquarters
  Microsoft | Redmond
  IBM | Armonk
  Intel | Santa Clara
- integrity constraints on the relation, e.g., Organization is a "key", Age > 0, etc.
9
Snowball: Finding Example Occurrences
Can use any full-text search engine to find occurrences of the seed tuples (Microsoft/Redmond, IBM/Armonk, Intel/Santa Clara) in the text database:

- Computer servers at Microsoft's headquarters in Redmond...
- In mid-afternoon trading, shares of Redmond, WA-based Microsoft Corp...
- The Armonk-based IBM introduced a new line...
- Change of guard at IBM Corporation's headquarters near Armonk, NY...
10
Snowball: Tagging Entities
Named entity taggers can recognize Dates, People, Locations, Organizations, ... (e.g., MITRE's Alembic, IBM's Talent, LingPipe, ...).

Tagged occurrences (organizations and locations marked):
- Computer servers at [Microsoft]'s headquarters in [Redmond]...
- In mid-afternoon trading, shares of [Redmond, WA]-based [Microsoft Corp]...
- The [Armonk]-based [IBM] introduced a new line...
- Change of guard at [IBM Corporation]'s headquarters near [Armonk, NY]...
11
Snowball: Extraction Patterns
Computer servers at Microsoft's headquarters in Redmond...

General extraction pattern model: acceptor0, Entity, acceptor1, Entity, acceptor2

Acceptor instantiations:
- String match (accepts the string "'s headquarters in")
- Vector-space (~ vector [('s, 0.5), (headquarters, 0.5), (in, 0.5)])
- Classifier (estimate P(T = valid | 's, headquarters, in))
12
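To make the vector-space acceptor concrete, here is a minimal sketch (not the original Snowball code; all names are illustrative) of scoring a candidate occurrence's middle context against a pattern vector with cosine similarity:

```python
# Illustrative sketch (not the original Snowball code): score a candidate
# occurrence's middle context against a vector-space pattern via cosine
# similarity.
import math
from collections import Counter

def to_unit_vector(terms):
    """Build a unit-length term-weight vector from a list of context terms."""
    counts = Counter(terms)
    norm = math.sqrt(sum(v * v for v in counts.values()))
    return {t: v / norm for t, v in counts.items()}

def cosine(v1, v2):
    """Cosine similarity between two sparse unit vectors."""
    return sum(w * v2.get(t, 0.0) for t, w in v1.items())

# Pattern middle context: ORGANIZATION "'s headquarters in" LOCATION
pattern_middle = to_unit_vector(["'s", "headquarters", "in"])

# Candidate: "Google's new headquarters in Mountain View are ..."
candidate_middle = to_unit_vector(["'s", "new", "headquarters", "in"])

print(round(cosine(pattern_middle, candidate_middle), 2))  # about 0.87
```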
Snowball: Generating Patterns
1. Represent occurrences as vectors of tags and terms.
2. Cluster similar occurrences.

Example occurrence vectors:
ORGANIZATION {<'s 0.57>, <headquarters 0.57>, <in 0.57>} LOCATION
ORGANIZATION {<'s 0.57>, <headquarters 0.57>, <near 0.57>} LOCATION
LOCATION {<- 0.71>, <based 0.71>} ORGANIZATION
LOCATION {<- 0.71>, <based 0.71>} ORGANIZATION
13
Snowball: Generating Patterns
1. Represent occurrences as vectors of tags and terms.
2. Cluster similar occurrences.
3. Create patterns as filtered cluster centroids:
ORGANIZATION {<'s 0.71>, <headquarters 0.71>} LOCATION
LOCATION {<- 0.71>, <based 0.71>} ORGANIZATION
14
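A hedged sketch of this clustering step, assuming a simple single-pass clustering with a cosine-similarity threshold (the actual Snowball clustering algorithm and threshold are not given on the slide); normalized cluster centroids over the middle-context vectors become candidate patterns:

```python
# Illustrative sketch, not the original implementation: single-pass clustering
# of occurrence context vectors; each sufficiently large cluster's normalized
# centroid becomes a candidate extraction pattern.
import math

def cosine(v1, v2):
    return sum(w * v2.get(t, 0.0) for t, w in v1.items())

def normalize(v):
    norm = math.sqrt(sum(w * w for w in v.values())) or 1.0
    return {t: w / norm for t, w in v.items()}

def cluster_occurrences(occurrences, sim_threshold=0.7, min_size=2):
    """occurrences: list of (left_tag, middle_vector, right_tag), e.g.,
    ("ORGANIZATION", {"'s": 0.57, "headquarters": 0.57, "in": 0.57}, "LOCATION")."""
    clusters = []  # each cluster: {"tags": (left, right), "sum": dict, "n": int}
    for left, middle, right in occurrences:
        best, best_sim = None, sim_threshold
        for c in clusters:
            if c["tags"] != (left, right):
                continue  # only merge occurrences with matching entity tags
            sim = cosine(normalize(c["sum"]), middle)
            if sim >= best_sim:
                best, best_sim = c, sim
        if best is None:
            clusters.append({"tags": (left, right), "sum": dict(middle), "n": 1})
        else:
            for t, w in middle.items():
                best["sum"][t] = best["sum"].get(t, 0.0) + w
            best["n"] += 1
    # Patterns = normalized centroids of clusters with enough support
    return [(c["tags"], normalize(c["sum"])) for c in clusters if c["n"] >= min_size]
```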
Snowball: Extracting New Tuples
Match tagged text fragments against patterns.

Text: "Google's new headquarters in Mountain View are ..."
→ ORGANIZATION {<'s 0.5>, <new 0.5>, <headquarters 0.5>, <in 0.5>} LOCATION {<are 1>}

Candidate patterns and match scores:
P1: ORGANIZATION {<'s 0.71>, <headquarters 0.71>} LOCATION, Match = 0.8
P2: ORGANIZATION {<located 0.71>, <in 0.71>} LOCATION, Match = 0.4
P3: LOCATION {<- 0.71>, <based 0.71>} ORGANIZATION, Match = 0
15
Snowball: Evaluating Patterns
Automatically estimate pattern confidence: Conf(P) = Positive / Total.

P4: ORGANIZATION {<, 1>} LOCATION

Current seed tuples:
Organization | Headquarters
IBM | Armonk
Intel | Santa Clara
Microsoft | Redmond

Matches of P4:
- "IBM, Armonk, reported..." : Positive
- "Intel, Santa Clara, introduced..." : Positive
- "'Bet on Microsoft', New York-based analyst Jane Smith said..." : Negative

Conf(P4) = 2/3 = 0.66
16
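A small sketch of this evaluation step under the "Organization is a key" constraint; the function names and the handling of organizations that are not in the seed set are assumptions of this illustration:

```python
# Hedged sketch: estimate a pattern's confidence as Positive / Total by
# checking its extractions against the current seed set. Organization acts as
# a "key": a known org paired with a conflicting headquarters is negative.
def pattern_confidence(extractions, seeds):
    """extractions: list of (org, location) produced by the pattern.
    seeds: dict mapping a known org -> its known headquarters."""
    positive = negative = 0
    for org, loc in extractions:
        if org in seeds:
            if seeds[org] == loc:
                positive += 1
            else:
                negative += 1
        # extractions with unknown orgs are ignored in this simple variant
    total = positive + negative
    return positive / total if total else 0.0

seeds = {"IBM": "Armonk", "Intel": "Santa Clara", "Microsoft": "Redmond"}
p4_extractions = [("IBM", "Armonk"), ("Intel", "Santa Clara"),
                  ("Microsoft", "New York")]   # last one violates the key
print(pattern_confidence(p4_extractions, seeds))   # 2/3 ≈ 0.66
```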
Snowball: Evaluating Tuples
Automatically evaluate tuple confidence:

Conf(T) = 1 - ∏_i (1 - Conf(P_i) · Match(P_i))

A tuple has high confidence if it is generated by high-confidence patterns.

Example: T = <3Com, Santa Clara>, matched by
P3: LOCATION {<- 0.75>, <based 0.75>} ORGANIZATION, Conf(P3) = 0.95, Match = 0.8
P4: ORGANIZATION {<, 1>} LOCATION, Conf(P4) = 0.66, Match = 0.4

Conf(T): 0.83
17
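The combination rule on this slide can be written directly as code; with the rounded confidences and match scores shown above it evaluates to roughly 0.82, close to the 0.83 reported:

```python
# Hedged sketch of the tuple-confidence combination used on this slide:
# Conf(T) = 1 - prod_i (1 - Conf(P_i) * Match(P_i)).
def tuple_confidence(supporting_patterns):
    """supporting_patterns: list of (pattern_confidence, match_score) pairs,
    one per pattern occurrence supporting the tuple."""
    prob_all_wrong = 1.0
    for conf_p, match in supporting_patterns:
        prob_all_wrong *= (1.0 - conf_p * match)
    return 1.0 - prob_all_wrong

# <3Com, Santa Clara>: matched by P3 (conf 0.95, match 0.8) and P4 (0.66, 0.4)
print(round(tuple_confidence([(0.95, 0.8), (0.66, 0.4)]), 2))  # about 0.82
```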
Snowball: Evaluating Tuples
Organization | Headquarters | Conf
Microsoft | Redmond | 1
IBM | Armonk | 1
Intel | Santa Clara | 1
AG Edwards | St Louis | 0.9
Air Canada | Montreal | 0.8
7th Level | Richardson | 0.8
3Com Corp | Santa Clara | 0.8
3DO | Redwood City | 0.7
3M | Minneapolis | 0.7
MacWorld | San Francisco | 0.7
157th Street | Manhattan | 0.52
15th Party Congress | China | 0.3
15th Century Europe | Dark Ages | 0.1
... | ... | ...

Keep only high-confidence tuples for the next iteration.
18
Snowball: Evaluating Tuples
Organization | Headquarters | Conf
Microsoft | Redmond | 1
IBM | Armonk | 1
Intel | Santa Clara | 1
AG Edwards | St Louis | 0.9
Air Canada | Montreal | 0.8
7th Level | Richardson | 0.8
3Com Corp | Santa Clara | 0.8
3DO | Redwood City | 0.7
3M | Minneapolis | 0.7
MacWorld | San Francisco | 0.7

Start a new iteration with the expanded example set; iterate until no new tuples are extracted.
19
Pattern-Tuple Duality
- A "good" tuple:
  - is extracted by "good" patterns
  - tuple weight → goodness
- A "good" pattern:
  - is generated by "good" tuples
  - extracts "good" new tuples
  - pattern weight → goodness
- Edge weight: match/similarity of the tuple context to the pattern
20
How to Set Node Weights
- Constraint violation (from before):
  Conf(P) = log(Pos) · Pos / (Pos + Neg)
  Conf(T) = 1 - ∏_i (1 - Conf(P_i) · Match(P_i))
- HITS [Hassan et al., EMNLP 2006]:
  Conf(P) = Σ Conf(T), Conf(T) = Σ Conf(P)
- URNS [Downey et al., IJCAI 2005]
- EM-Spy [Agichtein, SDM 2006]:
  - Unknown tuples = Negative
  - Compute Conf(P), Conf(T)
  - Iterate
21
Snowball: EM-based Pattern Evaluation
22
Evaluating Patterns and Tuples: Expectation Maximization
EM-Spy Algorithm:
- "Hide" the labels for some seed ("spy") tuples
- Iterate the EM algorithm to convergence on tuple/pattern confidence values
- Set a threshold t such that more than 90% of the spy tuples score above t
- Re-initialize Snowball using the new seed tuples

Organization | Headquarters | Initial | Final
Microsoft | Redmond | 1 | 1
IBM | Armonk | 1 | 0.8
Intel | Santa Clara | 1 | 0.9
AG Edwards | St Louis | 0 | 0.9
Air Canada | Montreal | 0 | 0.8
7th Level | Richardson | 0 | 0.8
3Com Corp | Santa Clara | 0 | 0.8
3DO | Redwood City | 0 | 0.7
3M | Minneapolis | 0 | 0.7
MacWorld | San Francisco | 0 | 0.7
... | ... | 0 | ...
157th Street | Manhattan | 0 | 0.52
15th Party Congress | China | 0 | 0.3
15th Century Europe | Dark Ages | 0 | 0.1
23
Adapting Snowball for New Relations
Large parameter space:
- Initial seed tuples (randomly chosen, multiple runs)
- Acceptor features: words, stems, n-grams, phrases, punctuation, POS
- Feature selection techniques: OR, NB, Freq, "support", combinations
- Feature weights: TF*IDF, TF, TF*NB, NB
- Pattern evaluation strategies: NN, constraint violation, EM, EM-Spy

Automatically estimate parameter values:
- Estimate operating parameters based on occurrences of seed tuples
- Run cross-validation on hold-out sets of seed tuples for optimal performance
- Seed occurrences that do not have close "neighbors" are discarded
24
Example Task 1: DiseaseOutbreaks
Proteus: 0.409
Snowball: 0.415
[SDM 2006]
25
Example Task 2: Bioinformatics
ISMB 2003; a.k.a. mining the "bibliome"
- 100,000+ gene and protein synonyms extracted from 50,000+ journal articles
- Approximately 40% of confirmed synonyms were not previously listed in the curated authoritative reference (SWISSPROT)
Examples: "APO-1, also known as DR6...", "MEK4, also called SEK1..."
26
Snowball Used in Various Domains
- News: NYT, WSJ, AP [DL'00, SDM'06]
  - CompanyHeadquarters, MergersAcquisitions, DiseaseOutbreaks
- Medical literature: PDRHealth, Micromedex, ... [Thesis]
  - AdverseEffects, DrugInteractions, RecommendedTreatments
- Biological literature: GeneWays corpus [ISMB'03]
  - Gene and Protein Synonyms
27
CIKM 2005
Limits of Bootstrapping for Extraction
- The task is "easy" when the context term distributions diverge from the background distribution
- Example: President George W Bush's three-day visit to India
[Figure: relative term frequencies for words such as "the", "to", "and", "said", "'s", "company", "won", "president", "mrs"]
- Quantify the divergence as relative entropy (Kullback-Leibler divergence):

  KL(LM_C || LM_BG) = Σ_{w ∈ V} LM_C(w) · log( LM_C(w) / LM_BG(w) )

- After calibration, the metric predicts whether bootstrapping is likely to work
28
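A minimal sketch of this divergence computation, assuming simple unigram language models and an ad-hoc smoothing constant (the slide does not specify the smoothing); the toy models below are truncated and purely illustrative:

```python
# Hedged sketch: how much the context language model LM_C diverges from the
# background model LM_BG, measured as KL divergence.
import math

def kl_divergence(lm_c, lm_bg, epsilon=1e-9):
    """lm_c, lm_bg: dicts mapping word -> probability. epsilon guards against
    zero background probabilities (smoothing choice is an assumption)."""
    return sum(p * math.log(p / max(lm_bg.get(w, 0.0), epsilon))
               for w, p in lm_c.items() if p > 0.0)

# Toy, truncated language models (not normalized; illustration only)
lm_background = {"the": 0.06, "to": 0.03, "and": 0.03, "said": 0.01,
                 "president": 0.001, "company": 0.001}
lm_context = {"the": 0.05, "president": 0.02, "said": 0.02, "company": 0.01}
print(round(kl_divergence(lm_context, lm_background), 3))
```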
SIGIR 2005
Few Relations Cover Common Questions
25 relations cover > 50% of question types; 5 relations cover > 55% of question instances.

Relation | Type (%) | Instance (%)
<person> discovers <concept> | 7.7 | 2.9
<person> has position <concept> | 5.6 | 4.6
<location> has location <location> | 5.2 | 1.5
<person> known for <concept> | 4.7 | 1.7
<event> has date <date> | 4.1 | 0.9
29
Outline
- Snowball, a domain-independent, partially supervised information extraction system
- Retrieval algorithms for scalable information extraction
- Current: mining user behavior for web search
- Future work
30
Extracting A Relation From a Large Text Database
Text Database → Information Extraction System → Structured Relation

- Brute-force approach: feed all docs to the information extraction system; expensive for large collections
- Only a tiny fraction of documents are often useful
- Many databases are not crawlable
- Often a search interface is available, with an existing keyword index
- How to identify "useful" documents?
31
Accessing Text DBs via Search Engines
Text Database → Search Engine → Information Extraction System → Structured Relation

Search engines impose limitations:
- Limit on documents retrieved per query
- Support simple keywords and phrases only
- Ignore "stopwords" (e.g., "a", "is")
32
QXtract: Querying Text Databases for Robust
Scalable Information EXtraction
User-Provided Seed Tuples:
DiseaseName | Location | Date
Malaria | Ethiopia | Jan. 1995
Ebola | Zaire | May 1995

Pipeline: seed tuples → Query Generation → queries → Search Engine → promising documents from the Text Database → Information Extraction System → extracted relation.

Extracted Relation:
DiseaseName | Location | Date
Malaria | Ethiopia | Jan. 1995
Ebola | Zaire | May 1995
Mad Cow Disease | The U.K. | July 1995
Pneumonia | The U.S. | Feb. 1995

Problem: learn keyword queries to retrieve "promising" documents.
33
Learning Queries to Retrieve Promising Documents
1. Get a document sample with "likely negative" and "likely positive" examples (seed sampling against the text database).
2. Label the sample documents using the information extraction system as "oracle."
3. Train classifiers to "recognize" useful documents.
4. Generate queries from the classifier model/rules.
34
Training Classifiers to Recognize “Useful” Documents
Example documents and their word features:
- D1: disease, reported, epidemic, expected, area (useful: +)
- D2: virus, reported, expected, infected, patients (useful: +)
- D3: products, made, used, exported, far (not useful: -)
- D4: past, old, homerun, sponsored, event (not useful: -)

Classifiers trained on these features:
- Ripper: rule "disease AND reported => USEFUL"
- SVM: positively weighted terms include disease, reported, epidemic, infected, virus; negatively weighted terms include products, exported, used, far
- Okapi (IR): term weights such as virus: 3, infected: 2, sponsored: -1
35
Generating Queries from Classifiers
- Ripper: "disease AND reported => USEFUL" → query [disease AND reported]
- SVM: positively weighted terms (disease, reported, epidemic, infected, virus vs. products, exported, used, far) → queries [epidemic], [virus]
- Okapi (IR): top-weighted terms (virus: 3, infected: 2, sponsored: -1) → queries [virus], [infected]
- QCombined: [disease AND reported], [epidemic], [virus]
36
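A hedged sketch of turning the trained classifiers into keyword queries; the rule and weight extraction below is a simplification, and the number of terms taken per classifier is an assumption:

```python
# Hedged sketch (not the QXtract implementation): derive keyword queries from
# rule-based and weight-based classifiers.
def queries_from_rules(rules):
    """rules: list of term conjunctions predicting USEFUL,
    e.g., [["disease", "reported"]] -> ['disease AND reported']."""
    return [" AND ".join(terms) for terms in rules]

def queries_from_weights(term_weights, k=2):
    """term_weights: dict term -> weight (e.g., linear SVM or Okapi weights);
    take the k highest-weighted terms as single-keyword queries."""
    ranked = sorted(term_weights, key=term_weights.get, reverse=True)
    return ranked[:k]

ripper_queries = queries_from_rules([["disease", "reported"]])
svm_queries = queries_from_weights(
    {"disease": 1.2, "epidemic": 1.1, "virus": 1.0, "products": -0.9}, k=2)
okapi_queries = queries_from_weights(
    {"virus": 3, "infected": 2, "sponsored": -1}, k=2)

print(ripper_queries + svm_queries + okapi_queries)
# ['disease AND reported', 'disease', 'epidemic', 'virus', 'infected']
```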
SIGMOD 2003
Demonstration
37
Tuples: A Simple Querying Strategy
DiseaseName | Location | Date
Ebola | Zaire | May 1995
Malaria | Ethiopia | Jan. 1995
hemorrhagic fever | Africa | May 1995

1. Convert the given tuples into queries (e.g., ["Ebola" AND "Zaire"])
2. Retrieve the matching documents through the search engine
3. Extract new tuples from the documents with the information extraction system, and iterate
38
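A sketch of this querying loop; `search` and `extract` stand in for the search-engine and information-extraction interfaces and are assumptions of the sketch:

```python
# Hedged sketch of the Tuples querying strategy: repeatedly turn extracted
# tuples into phrase queries and extract from whatever those queries retrieve.
def tuples_strategy(seed_tuples, search, extract, max_iterations=100):
    known = set(seed_tuples)
    frontier = list(seed_tuples)   # tuples not yet used as queries
    seen_docs = set()
    for _ in range(max_iterations):
        if not frontier:
            break                  # no unqueried tuples left
        disease, location = frontier.pop()
        query = f'"{disease}" AND "{location}"'
        for doc in search(query):  # top-K docs for this query
            if doc in seen_docs:
                continue
            seen_docs.add(doc)
            for t in extract(doc):  # new tuples found in this document
                if t not in known:
                    known.add(t)
                    frontier.append(t)
    return known
```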
Comparison of Document Access Methods
[Figure: recall (%) vs. MaxFractionRetrieved (5%, 10%, 25%) for the QXtract, Manual, Tuples, and Baseline strategies]

- QXtract: 60% of the relation extracted from 10% of the documents of a 135,000 newspaper article database
- Tuples strategy: recall of at most 46%
39
How to choose the best strategy?
- Tuples: simple, no training, but limited recall
- QXtract: robust, but has training and query overhead
- Scan: no overhead, but must process all documents
40
WebDB 2003
Predicting Recall of Tuples Strategy
[Diagram: starting from a seed tuple, one querying run ends in SUCCESS and another in FAILURE]

Can we predict whether Tuples will succeed?
41
Abstract the problem: Querying Graph
[Querying graph: tuples t1...t5 are issued as queries to the search engine and retrieve documents d1...d5; e.g., the query ["Ebola" AND "Zaire"] retrieves d1]

Note: only the top K docs are returned for each query.
- <Violence, U.S.> retrieves many documents that do not contain tuples
- Searching for an extracted tuple may not retrieve its source document
42
Information Reachability Graph
[Reachability graph over tuples t1...t5 and documents d1...d5]

t1 retrieves document d1, which contains t2; in this way t2, t3, and t4 become "reachable" from t1.
43
Connected Components
Connected components of the reachability graph (t1...t4):
- In: tuples that retrieve other tuples but are not themselves reachable
- Core (strongly connected): tuples that retrieve other tuples and themselves
- Out: reachable tuples that do not retrieve tuples in the Core
44
Sizes of Connected Components
How many tuples are in the largest Core + Out?

Conjecture:
- The degree distribution in reachability graphs follows a "power law."
- Then the reachability graph has at most one giant component.

Define Reachability as the fraction of tuples in the largest Core + Out.
45
NYT Reachability Graph:
Outdegree Distribution
[Figure: outdegree distributions for MaxResults=10 and MaxResults=50 match a power-law distribution]
46
NYT: Component Size Distribution
[Figure: component size distributions]
- MaxResults=10: not "reachable", CG / |T| = 0.297
- MaxResults=50: "reachable", CG / |T| = 0.620
47
Connected Components Visualization
DiseaseOutbreaks, New York Times 1995
48
Estimating Reachability
In a power-law random graph G, a giant component CG emerges* if d (the average outdegree) > 1.
- Estimate: Reachability ≈ |CG| / |T|
- Depends only on d (the average outdegree)
* For b < 3.457; Chung and Lu, Annals of Combinatorics, 2002
49
Estimating Reachability Algorithm
1. Pick some random tuples
2. Use the tuples to query the database
3. Extract tuples from the matching documents to compute reachability graph edges
4. Estimate the average outdegree (d = 1.5 in the small example graph)
5. Estimate reachability using the results of Chung and Lu, Annals of Combinatorics, 2002
50
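A sketch of steps 1-4 of this procedure; `search` and `extract` are assumed interfaces, and the mapping from the estimated average outdegree to the giant-component size (step 5, via Chung and Lu's results) is only indicated in a comment:

```python
# Hedged sketch: estimate the average outdegree d of the reachability graph
# from a small sample of tuple queries.
import random

def estimate_average_outdegree(known_tuples, search, extract, sample_size=50):
    sample = random.sample(list(known_tuples),
                           min(sample_size, len(known_tuples)))
    total_edges = 0
    for disease, location in sample:
        reached = set()
        for doc in search(f'"{disease}" AND "{location}"'):
            reached.update(extract(doc))        # tuples contained in retrieved docs
        reached.discard((disease, location))    # ignore self-loops
        total_edges += len(reached)
    d = total_edges / max(len(sample), 1)
    # Chung & Lu (2002): for power-law graphs (exponent b < 3.457) a giant
    # component emerges when d > 1; reachability ~ |CG| / |T| depends on d.
    return d
```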
Estimating Reachability of NYT
[Figure: estimated reachability vs. MaxResults (MR = 1, 10, 50, 100, 200, 1000) for query sample sizes S = 10, 50, 100, 200, compared with the real graph (value 0.46 marked)]

- An approximate reachability estimate is obtained after ~50 queries.
- It can be used to predict success (or failure) of the Tuples querying strategy.
51
To Search or to Crawl? Towards a Query Optimizer for Text-Centric Tasks [Ipeirotis, Agichtein, Jain, Gravano, SIGMOD 2006]
Information extraction applications extract structured relations from unstructured text.

Example: "May 19 1995, Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire, is finding itself hard pressed to cope with the crisis..."

Disease Outbreaks in The New York Times, via an Information Extraction System (e.g., NYU's Proteus):
Date | Disease Name | Location
Jan. 1995 | Malaria | Ethiopia
July 1995 | Mad Cow Disease | U.K.
Feb. 1995 | Pneumonia | U.S.
May 1995 | Ebola | Zaire
52
An Abstract View of Text-Centric Tasks
[Ipeirotis, Agichtein, Jain, Gravano, SIGMOD 2006]

Processing model: 1. retrieve documents from the database; 2. process the documents with the extraction system; 3. extract output tuples.

Task | "Tuple"
Information Extraction | Relation Tuple
Database Selection | Word (+ Frequency)
Focused Crawling | Web Page about a Topic

For the rest of the talk: Information Extraction.
53
Executing a Text-Centric Task
Text Database → Extraction System → output tuples:
1. Retrieve documents from the database
2. Process documents
3. Extract output tuples

Two major execution paradigms:
- Scan-based: retrieve and process documents sequentially
- Index-based: query the database (e.g., [case fatality rate]), retrieve and process the documents in the results

Similar to the relational world: the underlying data distribution dictates what is best.
Unlike the relational world:
- Indexes are only "approximate": the index is on keywords, not on the tuples of interest
- The choice of execution plan affects output completeness (not only speed)
54
Execution Plan Characteristics
Execution plans have two main characteristics:
- Execution time
- Recall (fraction of tuples retrieved)

Question: how do we choose the fastest execution plan for reaching a target recall?
"What is the fastest plan for discovering 10% of the disease outbreaks mentioned in The New York Times archive?"
55
Outline
- Description and analysis of crawl- and query-based plans
  - Crawl-based: Scan, Filtered Scan
  - Query-based (index-based): Iterative Set Expansion, Automatic Query Generation
- Optimization strategy
- Experimental results and conclusions
56
Scan
Scan retrieves and processes documents sequentially (until reaching the target recall).

Execution time = |Retrieved Docs| · (R + P)
where R is the time for retrieving a document and P the time for processing a document.

Question: how many documents does Scan retrieve to reach the target recall?

Filtered Scan uses a classifier to identify and process only promising documents (details in paper).
57
Estimating Recall of Scan
Modeling Scan for a tuple t (e.g., <SARS, China>):
- What is the probability of seeing t (with frequency g(t)) after retrieving S of the D documents?
- This is a "sampling without replacement" process
- After retrieving S documents, the frequency of tuple t follows a hypergeometric distribution
- Recall for tuple t is the probability that the frequency of t in the S retrieved documents is > 0
58
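This per-tuple recall model can be computed in closed form; a minimal sketch, with D, g, and S as defined on the slide:

```python
# Hedged sketch of the per-tuple Scan recall model: with D documents in total,
# g of them containing tuple t, and S retrieved, the count of t is
# hypergeometric, so recall(t) = P(count > 0) = 1 - C(D - g, S) / C(D, S).
from math import comb

def scan_recall_for_tuple(D, g, S):
    if S > D - g:          # too few documents remain to avoid t entirely
        return 1.0
    return 1.0 - comb(D - g, S) / comb(D, S)

# Toy example: a tuple appearing in 5 of 10,000 documents, after scanning 1,000
print(round(scan_recall_for_tuple(10_000, 5, 1_000), 3))
```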
Estimating Recall of Scan
Modeling Scan over all tuples t1...tM (e.g., <SARS, China>, <Ebola, Zaire>, ...):
- Multiple "sampling without replacement" processes, one for each tuple
- Overall recall is the average recall across tuples
→ We can compute the number of documents required to reach a target recall.

Execution time = |Retrieved Docs| · (R + P)
59
Iterative Set Expansion
1. Query the database with seed tuples (e.g., [Ebola AND Zaire])
2. Process the retrieved documents
3. Extract tuples from the documents (e.g., <Malaria, Ethiopia>)
4. Augment the seed tuples with the new tuples, and repeat

Execution time = |Retrieved Docs| · (R + P) + |Queries| · Q
where R is the time for retrieving a document, P the time for processing a document, and Q the time for answering a query.

Question: how many queries and how many documents does Iterative Set Expansion need to reach the target recall?
60
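A small sketch comparing the two cost formulas once the document and query counts needed for a target recall have been estimated (the counts and timing constants below are toy values, not measurements):

```python
# Hedged sketch: execution-time formulas for Scan and Iterative Set Expansion.
def scan_time(docs_needed, R, P):
    """Scan: time = |retrieved docs| * (R + P)."""
    return docs_needed * (R + P)

def iterative_set_expansion_time(docs_needed, queries_needed, R, P, Q):
    """ISE: time = |retrieved docs| * (R + P) + |queries| * Q."""
    return docs_needed * (R + P) + queries_needed * Q

# Toy numbers (assumptions): R = 0.05 s, P = 1.0 s per document, Q = 0.1 s per query
print(scan_time(100_000, 0.05, 1.0))                                # 105000.0
print(iterative_set_expansion_time(8_000, 1_500, 0.05, 1.0, 0.1))   # 8550.0
```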
Using Querying Graph for Analysis
We need to compute:
- the number of documents retrieved after sending Q tuples as queries (estimates time)
- the number of tuples that appear in the retrieved documents (estimates recall)

To estimate these we need:
- the degree distribution of the tuples discovered by retrieving documents
- the degree distribution of the documents retrieved by the tuples
- (these are not the same as the degree distribution of a randomly chosen tuple or document: it is easier to discover documents and tuples with high degrees)

[Querying graph: tuples <SARS, China>, <Ebola, Zaire>, <Malaria, Ethiopia>, <Cholera, Sudan>, <H5N1, Vietnam> connected to documents d1...d5]
61
Summary of Cost Analysis
Our analysis so far:
- takes as input a target recall
- gives as output the time for each plan to reach that recall (time = infinity if a plan cannot reach the target recall)

Time and recall depend on task-specific properties of the database:
- the tuple degree distribution
- the document degree distribution

Next, we show how to estimate the degree distributions on the fly.
62
Estimating Cost Model Parameters
Tuple and document degree distributions belong to known distribution families:

Task | Document Distribution | Tuple Distribution
Information Extraction | Power-law | Power-law
Content Summary Construction | Lognormal | Power-law (Zipf)
Focused Resource Discovery | Uniform | Uniform

[Figure: log-log plots of the number of documents vs. document degree (fit y = 43060·x^-3.3863) and the number of tokens vs. token degree (fit y = 5492.2·x^-2.0254)]

We can characterize the distributions with only a few parameters!
63
Parameter Estimation
Naïve solution for parameter estimation:
- Start with a separate "parameter-estimation" phase
- Perform random sampling on the database
- Stop when cross-validation indicates high confidence

We can do better than this!
- No need for a separate sampling phase
- Sampling is equivalent to executing the task
→ Piggyback parameter estimation onto task execution
64
On-the-fly Parameter Estimation
- Pick the most promising execution plan for the target recall assuming "default" parameter values
- Start executing the task
- Update the parameter estimates during execution
- Switch plans if the updated statistics indicate so

[Figure: the correct (but unknown) distribution vs. the initial default estimate and successively updated estimates]

Important: only Scan acts as "random sampling"; all other execution plans need parameter adjustment (see paper).
65
Outline
- Description and analysis of crawl- and query-based plans
- Optimization strategy
- Experimental results and conclusions
66
Correctness of Theoretical Analysis
[Figure: execution time (secs, log scale from 100 to 100,000) vs. recall (0.0 to 1.0) for Scan, Filtered Scan, Automatic Query Generation, and Iterative Set Expansion; solid lines show actual time, dotted lines show time predicted with the correct parameters]

Task: Disease Outbreaks; Snowball IE system; 182,531 documents from NYT; 16,921 tuples
67
Experimental Results (Information Extraction)
[Figure: execution time (secs, log scale) vs. recall for Scan, Filtered Scan, Iterative Set Expansion, Automatic Query Generation, and the OPTIMIZED plan; solid lines show actual time, the green line shows the time with the optimizer]

(Results are similar in the other experiments; see paper.)
68
Conclusions
- Common execution plans for multiple text-centric tasks
- Analytic models for predicting the execution time and recall of various crawl- and query-based plans
- Techniques for on-the-fly parameter estimation
- An optimization framework that picks, on the fly, the fastest plan for a target recall
69
Can we do better?

Yes. For some information extraction systems
70
Bindings Engine (BE) [Slides: Cafarella 2005]
Bindings Engine (BE) is a search engine where:
- there are no downloads during query processing
- disk seeks are constant in corpus size
- #queries = #phrases

BE's approach:
- a "variabilized" search query language
- pre-processes all documents before query time
- integrates variable/type data with the inverted index, minimizing query seeks
71
BE Query Support
cities such as <NounPhrase>
President Bush <Verb>
<NounPhrase> is the capital of <NounPhrase>
reach me at <phone-number>
- Any sequence of concrete terms and typed variables
- NEAR is insufficient
- Functions (e.g., "head(<NounPhrase>)")
72
BE Operation
Like a generic search engine, BE:
- downloads a corpus of pages
- creates an index
- uses the index to process queries efficiently

BE further requires:
- a set of indexed types (e.g., "NounPhrase"), with a "recognizer" for each
- string processing functions (e.g., "head()")

A BE system can only process types and functions that its index supports.
73
[Inverted index example: the terms "as", "billy", "cities", "friendly", "give", "mayors", "nickels", "seattle", "such", "words", each with a posting list of the form (#docs, docid0, docid1, ..., docid#docs-1)]
74
Query: such as
[The posting lists for "such" and "as" (e.g., 15, 99, 322, 426, ..., 1309 and 21, 104, 150, 322, ..., 2501) are merged by docid]

1. Test for equality
2. Advance the smaller pointer
3. Abort when a list is exhausted

Returned docs: the docids present in both lists (e.g., 322)
75
“such as”
[Figure: for the phrase query "such as", each posting entry is extended with positions: (docid, #posns, pos0, pos1, ..., pos#pos-1)]

In phrase queries, match positions as well.
76
Neighbor Index
- At each position in the index, store "neighbor text" that might be useful
- Let's index <NounPhrase> and <Adj-Term>

"I love cities such as Atlanta."
[Left / Right neighbors at one position, e.g., AdjT: "love"]
77
Neighbor Index
- At each position in the index, store "neighbor text" that might be useful
- Let's index <NounPhrase> and <Adj-Term>

"I love cities such as Atlanta."
Left: AdjT: "I", NP: "I" | Right: AdjT: "cities", NP: "cities"
78
Neighbor Index
Query: "cities such as <NounPhrase>"

"I love cities such as Atlanta."
Left: AdjT: "such" | Right: AdjT: "Atlanta", NP: "Atlanta"
79
“cities such as <NounPhrase>”
[Figure: the inverted index, augmented with a neighbor block per position (a block offset, the number of neighbors, and (neighbor, string) pairs); in doc 19, starting at position 8: "I love cities such as Atlanta.", the entry at "as" stores neighbors such as AdjT-left: "such" and NP-right: "Atlanta"]

1. Find the phrase query positions, as with phrase queries
2. If a term is adjacent to a variable, extract the typed value
80
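A hedged sketch (not Cafarella's actual data structures) of how a neighbor-augmented posting list lets a variabilized query be answered from the index alone; the docids, positions, field names, and dictionary layout are illustrative:

```python
# Hedged sketch: posting entries that store neighbor strings per position, so
# a query like "such as <NounPhrase>" can be answered without re-reading docs.
postings = {
    # term -> list of (docid, position, neighbors-at-that-position)
    "such": [(19, 11, {"AdjT_left": "cities", "AdjT_right": "as"})],
    "as":   [(19, 12, {"AdjT_left": "such", "NP_right": "Atlanta"})],
}

def answer_variable_query(postings, w1, w2, var_slot="NP_right"):
    """Find positions where w1 is immediately followed by w2, then read the
    typed value (e.g., the NounPhrase to the right of w2) from the index."""
    results = []
    second = {(doc, pos): nbrs for doc, pos, nbrs in postings.get(w2, [])}
    for doc, pos, _ in postings.get(w1, []):
        nbrs = second.get((doc, pos + 1))
        if nbrs and var_slot in nbrs:
            results.append((doc, nbrs[var_slot]))
    return results

print(answer_variable_query(postings, "such", "as"))  # [(19, 'Atlanta')]
```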
Current Research Directions
- Modeling explicit and implicit network structures
  - Modeling evolution of explicit structure on the web, blogspace, wikipedia
  - Modeling implicit link structures in text, collections, the web
  - Exploiting implicit & explicit social networks (e.g., for epidemiology)
- Knowledge discovery from biological and medical data
  - Automatic sequence annotation → bioinformatics, genetics
  - Actionable knowledge extraction from medical articles
- Robust information extraction, retrieval, and query processing
  - Integrating information in structured and unstructured sources
  - Robust search/question answering for medical applications
  - Confidence estimation for extraction from text and other sources
  - Detecting reliable signals from (noisy) text data (e.g., medical surveillance)
  - Accuracy (≠ authority) of online sources
- Information diffusion/propagation in online sources
  - Information propagation on the web
  - In collaborative sources (wikipedia, MedLine)
81
Page Quality: In Search of an Unbiased Web Ranking
[Cho, Roy, Adams, SIGMOD 2005]

“popular pages tend to get even more popular, while
unpopular pages get ignored by an average user”
82
Sic Transit Gloria Telae: Towards an Understanding of the
Web’s Decay [Bar-Yossef, Broder, Kumar, Tomkins, WWW 2004]
83
Modeling Social Networks for Epidemiology, Security, ...

Email exchange mapped onto cubicle locations.
84
Some Research Directions
(Same research-directions outline as slide 81, with "Query processing over unstructured text" listed in place of "Accuracy (≠ authority) of online sources".)
85
Agichtein & Eskin, PSB 2004
Mining Text and Sequence Data
ROC50 scores for each class and method
86
Some Research Directions
(Same research-directions outline as slide 81.)
87
Structure and evolution of blogspace
[Kumar, Novak, Raghavan, Tomkins, CACM 2004, KDD 2006]
Fraction of nodes in components of various sizes within the Flickr and Yahoo! 360 timegraphs, by week.
88
Current Research Directions
(Same research-directions outline as slide 81, with "Information propagation on the web" extended to "the web, news".)
89
Thank You

Details:
http://www.mathcs.emory.edu/~eugene/
90