Integration and representation of
unstructured text in relational
databases
Sunita Sarawagi
IIT Bombay
Database                                Unstructured data
--------------------------------------  --------------------------------------
Citeseer/Google Scholar: structured     Publications from homepages
records from publishers
Company database: products with         Product reviews on the web;
features                                customer emails
HR database: resumes with skills,       Text resume in an email
experience, references (emails)
Personal databases: bibtex,             Extract bibtex entries when I download
address book                            a paper; enter missing contacts via
                                        web search
R. Fagin and J. Helpern, Belief, awareness, reasoning. In AI 1988 [10] also see

Database: imprecise. Probabilistic variant entries link to canonical entries.
(3 top-level entities.)

Articles
  Id  Title             Journal  Year  Canonical
  2   Update Semantics  10       1983  2

Journals
  Id  Name                  Canonical
  10  ACM TODS              10
  17  AI                    17
  16  ACM Trans. Databases  10

Writes
  Article  Author
  2        11
  2        2
  2        3

Authors
  Id  Name            Canonical
  11  M Y Vardi       11
  2   J. Ullman       4
  3   Ron Fagin       3
  4   Jeffrey Ullman  4
R. Fagin and J. Helpern, Belief, awareness, reasoning. In AI 1988 [10] also see

Extraction:
  Author:  R. Fagin
  Author:  J. Helpern
  Title:   Belief,..reasoning
  Journal: AI
  Year:    1988

Integration:

Articles
  Id  Title                         Journal  Year  Canonical
  2   Update Semantics              10       1983  2
  7   Belief, awareness, reasoning  17       1988  7

Journals
  Id  Name                  Canonical
  10  ACM TODS              10
  17  AI                    17
  16  ACM Trans. Databases  10

Writes
  Article  Author
  2        11
  2        2
  2        3
  7        8
  7        9

Authors
  Id  Name            Canonical
  11  M Y Vardi       11
  2   J. Ullman       4
  3   Ron Fagin       3
  4   Jeffrey Ullman  4
  8   R Fagin         3
  9   J Helpern       9

Match with existing linked entities while respecting all constraints.
Outline
- Statistical models for integration
  - Extraction while fully exploiting the existing database
  - Integrate extracted entities; resolve whether an entity is already in the database
- Performance challenges
  - Entity match, entity pattern, link/relationship constraints
  - Efficient graphical-model inference algorithms
  - Indexing support
- Representing the uncertainty of integration in the DB
  - Imprecise databases and queries
Extraction using chain CRFs

R. Fagin and J. Helpern, Belief, awareness, reasoning

  t:  1       2       3      4       5        6       7          8
  x:  R.      Fagin   and    J.      Helpern  Belief  Awareness  Reasoning
  y:  Author  Author  Other  Author  Author   Title   Title      Title

Flexible overlapping features:
- identity of the word
- ends in "-ski"
- is capitalized
- is part of a noun phrase?
- is under node X in WordNet
- is in bold font
- is indented
- next two words are "and Associates"
- previous label is "Other"

It is difficult to effectively combine features from labeled unstructured data and a structured DB.
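A chain CRF scores a label sequence through binary feature functions of the current token, its context, and the adjacent labels. A minimal sketch of how the overlapping features above might be encoded as indicator functions (feature names and the exact encoding are illustrative, not from the talk):

```python
# Sketch of overlapping token-level CRF features (hypothetical names).
# Each feature fires on (previous label, current label, tokens, position)
# and may freely overlap with the others.
import re

def token_features(tokens, i, prev_label, label):
    """Return the set of binary feature names that fire at position i."""
    w = tokens[i]
    feats = {
        f"word={w.lower()}|label={label}",
        f"prev_label={prev_label}|label={label}",
    }
    if w.endswith("ski"):
        feats.add(f"ends_ski|label={label}")
    if w[:1].isupper():
        feats.add(f"capitalized|label={label}")
    if re.fullmatch(r"[A-Z]\.", w):               # initial like "R."
        feats.add(f"initial_pattern|label={label}")
    if tokens[i + 1:i + 3] == ["and", "Associates"]:
        feats.add(f"next=and_Associates|label={label}")
    return feats

feats = token_features(["R.", "Fagin", "and", "J.", "Helpern"], 1,
                       "Author", "Author")
```

Each feature gets a learned weight; the model's score for a labeling is the sum of weights of all firing features, which is what makes such arbitrary overlapping evidence easy to combine.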
CRFs for Segmentation

Token-level chain CRF:
  t:  1       2       3      4       5        6       7          8
  x:  R.      Fagin   and    J.      Helpern  Belief  Awareness  Reasoning
  y:  Author  Author  Other  Author  Author   Title   Title      Title

Here features describe the single word "Fagin".

Segment-level model:
  (l,u)         x                           y
  l1=1, u1=2    R. Fagin                    Author
  l2=u2=3       and                         Other
  l3=4, u3=5    J. Helpern                  Author
  l4=6, u4=8    Belief Awareness Reasoning  Title

Here features describe the whole segment from l to u, e.g. similarity to the author column in the database.
Features from the database
- Similarity to a dictionary entry
  - JaroWinkler, TF-IDF
- Similarity to a pattern-level dictionary
  - Regex-based pattern index for database entities
- Entity classifier
  - A multi-class regression model that gives the likelihood of a segment being a particular entity type
  - Features for the classifier: all standard entity-level extraction features
Segmentation models
- Segmentation
  - Input: sequence x = x1, x2, ..., xn; label set Y
  - Output: segmentation S = s1, s2, ..., sp
    - sj = (start position, end position, label) = (tj, uj, yj)
  - Score: F(x, S) = sum over segments of transition plus segment potentials
- Transition potentials: segment starting at i has label y and the previous label is y'
- Segment potentials: segment starting at i', ending at i, with label y; all positions from i' to i get the same label
- Probability of a segmentation: P(S | x) = exp(F(x, S)) / Z(x)
- Inference: O(nL2)
  - Most likely segmentation; marginals around segments
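The most-likely-segmentation inference can be sketched as a segment-level Viterbi dynamic program. Here `transition` and `segment` are hypothetical stand-ins for the learned transition and segment potentials, and `L` bounds the segment length (a toy sketch, not the paper's implementation):

```python
# Semi-CRF Viterbi sketch: best segmentation of x into labeled segments
# of length at most L. transition(y_prev, y) and segment(x, j, i, y)
# stand in for the learned potentials.
def viterbi_segments(x, labels, L, transition, segment):
    n = len(x)
    # best[i][y] = best score of a segmentation of x[:i] with last label y
    best = [{y: float("-inf") for y in labels} for _ in range(n + 1)]
    back = [{} for _ in range(n + 1)]
    best[0] = {"START": 0.0}
    for i in range(1, n + 1):
        for j in range(max(0, i - L), i):          # segment x[j:i]
            for y in labels:
                for y_prev, s in best[j].items():
                    score = s + transition(y_prev, y) + segment(x, j, i, y)
                    if score > best[i][y]:
                        best[i][y] = score
                        back[i][y] = (j, y_prev)
    # recover the best segmentation as (start, end, label) triples
    y = max(best[n], key=best[n].get)
    i, segs = n, []
    while i > 0:
        j, y_prev = back[i][y]
        segs.append((j, i, y))
        i, y = j, y_prev
    return list(reversed(segs))
```

The double loop over end position and segment start gives the O(nL) segment candidates, each scored against the previous label, matching the inference cost stated on the slide.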
(Slide repeated: extraction and integration of "R. Fagin and J. Helpern, Belief, awareness, reasoning. In AI 1988" into the Articles, Journals, Writes, and Authors tables; match with existing linked entities while respecting all constraints.)
CACM 2000, R. Fagin and J. Helpern, Belief, awareness, reasoning in AI

Only extraction:
  Author:  R. Fagin
  Author:  J. Helpern
  Title:   Belief,..reasoning
  Journal: AI
  Year:    2000

Combined extraction+integration:
  Author:  R. Fagin
  Author:  J. Helpern
  Title:   Belief,..reasoning in AI
  Journal: CACM
  Year:    2000

Existing Articles table:
  Id  Title                         Year  Journal  Canonical
  2   Update Semantics              1983  10       2
  7   Belief, awareness, reasoning  1988  17       7

Year mismatch! (Extraction alone takes "AI" as the journal, but the stored article in that journal has year 1988, not 2000.)
Combined extraction + matching
- Convert the predicted label to a pair y = (a, r)
- (r = 0) means none-of-the-above, i.e. a new entry

  Segment                           Label    r (id of matching entity)
  CACM.                             Journal  0
  2000                              Year     7
  Fagin                             Author   3
  Belief Awareness Reasoning In AI  Title    7

Constraints exist on the ids that can be assigned to two segments.
Constrained models
- Two kinds of constraints between arbitrary segments
  - Foreign-key constraint across their canonical-ids
  - Cardinality constraint
- Training
  - Ignore constraints, or use max-margin methods that require only MAP estimates
- Application
  - Formulate as a constrained integer programming problem (expensive)
  - Use general A-star search to find the most likely constrained assignment
Effect of database on extraction performance

               Field       L     L+DB  %D
  PersonalBib  author      75.7  79.5    4.9
               journal     33.9  50.3   48.6
               title       61.0  70.3   15.1
  Address      city_name   72.4  76.7    6.0
               state_name  13.9  33.2  138.5
               zipcode     91.6  94.3    3.0

L = only labeled structured data
L + DB = similarity to database entities and other DB features
(Mansuri and Sarawagi, ICDE 2006)
Effect of various features

[Chart: F1 (range 55 to 85) at Train=5% and Train=10%, comparing only_L (no DB) against ablations/additions: -L_edge, -L_context, -L_entity, +db_link, +db_regex, +db_classifier, +db_similarity, +cardinality.]
Full integration performance

               Field       L     L+DB  %D
  PersonalBib  author      70.8  74.0    4.5
               journal     29.6  45.5   53.6
               title       51.6  65.0   25.9
  Address      city_name   70.1  74.6    6.4
               state_name   9.0  28.3  213.8
               pincode     87.8  90.7    3.3

L = conventional extraction + matching
L + DB = technology presented here
Much higher accuracies are possible with more training data.
(Mansuri and Sarawagi, ICDE 2006)
Outline (revisited; next: performance challenges, efficient graphical-model inference algorithms, indexing support)
Inference in segmentation models

R. Fagin and J. Helpern, Belief, awareness, reasoning, In AI 1998

Surface features are cheap; database lookup features are expensive! Many large tables, e.g. Authors (Name: M Y Vardi, J. Ullman, Ron Fagin, Claire Cardie, J. Gherke, Thorsten, J Kleinberg, S Chakrabarti, Jay Shan, Jackie Chan, Bill Gates, ...), accessed through an inverted index for efficient search of the top-k most similar entities.

1. Batch up to do better than individual top-k?
2. Find the top segmentation without top-k matches for all segments?
Top-k similarity search
- Q: a query segment
- E: an entry in the database D
- Similarity score between Q and E (e.g. TF-IDF)
- Goal: get the k highest-scoring Es in D

Mechanics:
1. Fetch/merge tidlists (pointers to DB tuples on disk), with bounds on normalized tidlist subsets and cached idf values
2. Point queries

Maintain upper and lower score bounds per tuple id (t1, t2, t3, ..., tU), producing candidate matches with upper and lower bounds on dictionary match scores.
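The tidlist-based lookup can be sketched as a toy inverted-index merge: score each entry by the idf-weighted overlap of its tokens with the query segment, then keep the k best. This is only an illustration of the fetch/merge step, without the score-bounding machinery:

```python
# Toy top-k dictionary lookup with an inverted index of tidlists.
import heapq
import math
from collections import defaultdict

def build_index(entries):
    index = defaultdict(list)            # token -> list of entry ids (tidlist)
    for eid, text in enumerate(entries):
        for tok in set(text.lower().split()):
            index[tok].append(eid)
    return index

def topk(query, entries, index, k):
    n = len(entries)
    scores = defaultdict(float)
    for tok in set(query.lower().split()):
        tidlist = index.get(tok, [])
        if not tidlist:
            continue
        idf = math.log(n / len(tidlist))  # rarer tokens weigh more
        for eid in tidlist:               # merge the tidlists
            scores[eid] += idf
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])

authors = ["M Y Vardi", "J. Ullman", "Ron Fagin", "Jeffrey Ullman"]
idx = build_index(authors)
best = topk("R. Fagin", authors, idx, 2)
```

The bounds on the slide let the real system stop merging tidlists early once no remaining entry can overtake the current k-th best score.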
Best segmentation with inexact, bounded features
- Normal Viterbi: a forward pass over data positions; at each position, maintain the best segmentation ending at that position
- Modify to: best-first search with selective feature refinement

[Search-graph figure: segment states s(0,0), s(1,1), s(1,2), s(1,3), s(3,3), s(3,4), s(3,5), s(4,4), s(5,5) leading to an end state. Suffix upper/lower bounds come from a backward Viterbi with bounded features.]

(Chandel, Nagesh and Sarawagi, ICDE 2006)
Performance results

[Charts: DBLP authors and titles, 100 citations.]
(Chandel, Nagesh and Sarawagi, ICDE 2006)
Inference in segmentation models (revisited)

R. Fagin and J. Helpern, Belief, awareness, reasoning, In AI 1998

Surface features (cheap)? Not quite! Semi-CRFs are 3 to 8 times slower than chain CRFs.
Key insight
- Applications have a mix of token-level and segment-level features
- Many features are applicable to several overlapping segments
- Compactly represent the overlap through new forms of potentials
- Redesign inference algorithms to work on compact features
  - Cost is independent of the number of segments a feature applies to
(Sarawagi, ICML 2006)
Compact potentials
- Four kinds of potentials
Running time and accuracy

[Charts: F1 accuracy (78 to 92) and training time (seconds, 50 to 12050) vs. training % (10 to 70), on the Address and Cora Articles datasets, comparing Sequence-BCEU, Segment, and SegmentOpt.]
Outline (revisited; next: representing the uncertainty of integration in the DB, imprecise databases and queries)
Probabilistic Querying Systems
- Integration systems, while improving, cannot be perfect, particularly for domains like the web
- User supervision of each integration result is impossible
  => Create uncertainty-aware storage and querying engines
- Two enablers:
  - Probabilistic database querying engines over generic uncertainty models
  - Conditional graphical models produce well-calibrated probabilities
Probabilities in CRFs are well-calibrated

[Calibration plots: Cora citations and Cora headers, each against the ideal diagonal.]

Probability of a segmentation tracks the probability that it is correct; e.g. a segmentation assigned 0.5 probability is correct 50% of the time.
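The calibration claim can be checked with a simple reliability computation: bucket predictions by their confidence and compare each bucket's mean confidence to its empirical accuracy. The data below is synthetic, purely for illustration:

```python
# Reliability-curve sketch: for a well-calibrated model, mean confidence
# and empirical accuracy track each other within each bucket.
def reliability(probs, correct, n_bins=5):
    bins = [[] for _ in range(n_bins)]
    for p, c in zip(probs, correct):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, c))
    curve = []
    for b in bins:
        if b:
            mean_p = sum(p for p, _ in b) / len(b)
            acc = sum(c for _, c in b) / len(b)
            curve.append((round(mean_p, 2), round(acc, 2)))
    return curve

# e.g. predictions at 0.5 confidence should be right about half the time
curve = reliability([0.5, 0.5, 0.5, 0.5, 0.9, 0.9], [1, 0, 1, 0, 1, 1])
```

Plotting such (confidence, accuracy) pairs against the diagonal is exactly what the Cora calibration plots on the slide show.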
Uncertainty in integration systems

Unstructured text -> model (plus additional training data) -> entities with probabilities p1, p2, ..., pk -> probabilistic database system

Open questions: other, more compact models? What if the output is very uncertain?

Example queries:
  Select conference name of article RJ03?
    IEEE Intl. Conf. On data mining  0.8
    Conf. On data mining             0.2
  Find most cited author?
    D Johnson  16000  0.6
    J Ullman   13000  0.4
Segmentation-per-row model
(Rows: uncertain; columns: exact)

  HNO   AREA         CITY         PINCODE  PROB
  52    Bandra West  Bombay       400 062  0.1
  52-A  Bandra       West Bombay  400 062  0.2
  52-A  Bandra West  Bombay       400 062  0.5
  52    Bandra       West Bombay  400 062  0.2

Exact but impractical: there can be too many segmentations!
One-row model
Each column is a multinomial distribution.
(Row: exact; columns: independent, uncertain)

  HNO         AREA               CITY               PINCODE
  52 (0.3)    Bandra West (0.6)  Bombay (0.6)       400 062 (1.0)
  52-A (0.7)  Bandra (0.4)       West Bombay (0.4)

e.g. P(52-A, Bandra West, Bombay, 400 062) = 0.7 x 0.6 x 0.6 x 1.0 = 0.252

A simple model with a closed-form solution, but a poor approximation.
Multi-row model
Segmentation generated by a 'mixture' of rows.
(Rows: uncertain; columns: independent, uncertain)

  HNO           AREA               CITY               PINCODE        Prob
  52 (0.167)    Bandra West (1.0)  Bombay (1.0)       400 062 (1.0)  0.6
  52-A (0.833)
  52 (0.5)      Bandra (1.0)       West Bombay (1.0)  400 062 (1.0)  0.4
  52-A (0.5)

Excellent storage/accuracy tradeoff; populating the probabilities is challenging.
(Gupta and Sarawagi, VLDB 2006)
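The multi-row semantics can be sketched directly: the probability of a full segmentation is a mixture over rows, with independent per-column multinomials inside each row. Using the two rows from the slide, the mixture reproduces the exact segmentation-per-row probabilities:

```python
# Multi-row model: P(segmentation) = sum over rows of
# row_prob * product of per-column probabilities.
rows = [
    (0.6, {"HNO": {"52": 0.167, "52-A": 0.833},
           "AREA": {"Bandra West": 1.0},
           "CITY": {"Bombay": 1.0},
           "PINCODE": {"400 062": 1.0}}),
    (0.4, {"HNO": {"52": 0.5, "52-A": 0.5},
           "AREA": {"Bandra": 1.0},
           "CITY": {"West Bombay": 1.0},
           "PINCODE": {"400 062": 1.0}}),
]

def prob(segmentation, rows):
    total = 0.0
    for row_prob, cols in rows:
        p = row_prob
        for col, val in segmentation.items():
            p *= cols[col].get(val, 0.0)   # unseen value: probability 0
        total += p
    return total

p = prob({"HNO": "52-A", "AREA": "Bandra West",
          "CITY": "Bombay", "PINCODE": "400 062"}, rows)
# close to the 0.5 of the segmentation-per-row table (0.6 * 0.833)
```

With only two rows, all four segmentations of the exact table come out (almost exactly) right, which is the storage/accuracy tradeoff the slide refers to.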
Populating a multi-row model
- Challenge
  - Learning the parameters of a mixture model to approximate the semi-CRF, but without enumerating the instances from the model
- Solution
  - Find disjoint partitions of the string
  - Direct operation on marginal probability vectors (efficiently computable for semi-CRFs)
  - Each partition => a row
Experiments: need for multi-row
- KL divergence is very high at m=1: the one-row model is clearly inadequate.
- Even a two-row model is sufficient in many cases.
What next in data integration?
- Lots to be done in building large-scale, viable data integration systems
- Online collective inference
  - Cannot freeze the database
  - Cannot batch too many inferences
  - Need theoretically sound, practical alternatives to exact, batch inference
- Queries and mining over imprecise databases
  - Models of imprecision for the results of deduplication
Thank you.
Summary
- Data integration with statistical models: an exciting research direction and a useful problem
- Four take-home messages
  - Segmentation models (semi-CRFs) provide a more elegant way to exploit entity features and build integrated models (NIPS 2004, ICDE 2006a)
  - A-star search is adequate for link and cardinality constraints (ICDE 2006a)
  - A recipe for combining two top-k searches so that expensive DB lookup features are refined gradually (ICDE 2006b)
  - An efficient segmentation model with a succinct representation of overlapping features + message passing over partial potentials (NIPS 2005 workshop)

Software: http://crf.sourceforge.net
Outline
- Problem statement and goals
- Models for data integration
  - Information extraction
    - State-of-the-art
    - Overview: Conditional Random Fields
    - Our extensions to incorporate a database of entity names
  - Entity matching
  - Combined model for extraction and matching
  - Extending to multi-relational data
Entity resolution

Variants:            Labeled data (Authors):
  J. Ullmann           Jeffrey Ullman
  Jefry Ulman          Jeffrey Smith
  Prof. J. Ullman      Michael Stonebraker
  J Smith              Pedro Domingos
  Mike Stonebraker
  M, Stonebraker
  Domingos, P. ?

Record pairs carry labels 0 (red edges) or 1 (black edges).
Input features:
- Various kinds of similarity functions between attributes
  - Edit distance, Soundex, n-grams on text attributes
  - Jaccard, Jaro-Winkler
  - Subset match
- Classifier: any binary classifier
  - CRF for extensibility
CRFs for predicting matches
- Given a record pair (x1, x2), predict y = 1 or 0 via a log-linear model, P(y | x1, x2) proportional to exp(w . f(x1, x2, y)), over the pairwise similarity features
- Efficiency:
  - Training: filter, and include only pairs that satisfy conditions like at least one common n-gram
Link constraints in multi-relational data
- Any pair of segments in the previous output needs to satisfy two conditions
  - Foreign-key constraint across their canonical-ids
  - Cardinality constraint
- Our solution: constrained Viterbi (branch-and-bound search)
  - A modified search that retains, with the best path, the labels along the path
  - Backtracks when constraints are violated
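The backtracking idea can be sketched as a depth-first search that tries labels in best-score-first order and abandons any partial labeling that violates a constraint. Here `score` and `ok` are hypothetical stand-ins for the model potentials and the foreign-key/cardinality checks (a simplified sketch, not the actual constrained Viterbi):

```python
# Constrained search sketch: extend label sequences greedily, backtrack
# as soon as a partial assignment violates a constraint.
def constrained_best(n, labels, score, ok):
    best = {"seq": None, "val": float("-inf")}

    def extend(seq, val):
        if not ok(seq):                      # constraint violated: backtrack
            return
        if len(seq) == n:
            if val > best["val"]:
                best["seq"], best["val"] = seq, val
            return
        i = len(seq)
        # try highest-scoring labels first so good solutions are found early
        for y in sorted(labels, key=lambda y: -score(i, y)):
            extend(seq + [y], val + score(i, y))

    extend([], 0.0)
    return best["seq"]
```

A real branch-and-bound version would additionally prune any branch whose optimistic score bound already falls below the best complete assignment found so far.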
The final picture.
- Entity column names in the database: similarity-based features => semi-CRF
- Surface patterns, regular expressions: e.g. pattern X. [X.] Xx* => author name => normal CRF
- Commonly occurring words: part after "In" is the journal name => normal CRF
- Labeled data, order of attributes: Journal, IEEE => journal name; ordering of words: title before journal name
- Canonical links => compound label
- Schema-level: cardinality of attributes; links between entities (what entity is allowed to go with what) => constrained Viterbi
Summary
- Exploiting existing large databases to bridge with unstructured data: an exciting research problem with many applications
- Conditional graphical models combine all possible clues for extraction/matching in a simple framework
- Probabilistic: robust to noise, soft predictions
- Ongoing work:
  - Probabilistic output for imprecise query processing
Available clues..
- Entity column names in the database: TF-IDF similarity with stored entities
- Surface patterns, regular expressions: e.g. pattern X. [X.] Xx* => author name
- Commonly occurring words: part after "In" is the journal name
- Labeled data, order of attributes: Journal, IEEE => journal name; ordering of words: title before journal name
- Schema-level: cardinality of attributes
- Links between entities: what entity is allowed to go with what
Adding structure to unstructured data
- Extensive research in the web, NLP, machine learning, data mining, and database communities
- Most current research ignores existing structured databases
  - The database is just a store at the last step of data integration
- Our goal
  - Extend statistical models to exploit a database of entities and relationships
  - Models: persistent, part of the database, stored, indexed, evolving and improving along with the data