Download Update Summarization - NUS School of Computing

Document related concepts

World Wide Web wikipedia , lookup

Semantic Web wikipedia , lookup

Automatic summarization wikipedia , lookup

Transcript
Linking and Summarizing
Information on Entities
Presented by
Min-Yen Kan
Web IR / NLP Group (WING)
Department of Computer Science
National University of Singapore, Singapore
This talk archived as http://wing.comp.nus.edu.sg/~kanmy/talks/080407-nihLMC.htm
Entity Linkage using the Web and Graph-based Update Summarization
Singapore, the garden city
• 4M+ people, sandwiched between
Malaysia and Indonesia
• 50 km from the equator: hot and
humid year-long
• Known for: urban planning,
fondness for acronyms and aversion
to bubble gum litterers :-D
WING @ NUS
http://wing.comp.nus.edu.sg
• 1 postdoc, 6 Ph.D. students, 5
undergraduates
• Projects of in natural language
processing, digital libraries, and
information retrieval.
NIH Lister Hill Medical Center
2
Entity Linkage using the Web and Graph-based Update Summarization
Entity Centric Information Management
“Collate all studies on SBP2 that new findings in the last year.”
“Oh, I meant the PROTEIN SBP2, not the gene.”
“What other proteins does SBP2 bind to?”
“Tell me more about the contradiction from previous results.”
“Which Miller did the study on SBP2 in 2002?”
NIH Lister Hill Medical Center
3
Entity Linkage using the Web and Graph-based Update Summarization
Entity Centric Information Management
Two consequences to discuss today:
• Linkage
Joint work with Yee Fan TAN, Dongwon LEE (PSU) et al.
• Summarization
Joint work with Ziheng LIN et al.
NIH Lister Hill Medical Center
4
Entity Linkage using the Web and Graph-based Update Summarization
What’s Entity Linkage?
Aggregating data on an object together from heterogeneous
resources
Problem: Entity names are ambiguous!
–Medical terms
–Person names
–Products
–Customer records
By UV cross-linking and
immunoprecipitation, we show that SBP2
specifically
Proteinbinds selenoprotein mRNAs
both in vitro and in vivo.
These problems exist even
when we have controlled
vocabulary and lexicons
(Specialist, UMLS, MeSH)
The SBP2 clone used in this study
generates a 3173 nt transcript Gene
(2541 nt of
coding sequence plus a 632 nt 3’ UTR
truncated at the polyadenylation site).
NIH Lister Hill Medical Center
5
Entity Linkage using the Web and Graph-based Update Summarization
Examples of Split Records
Dongwon Lee, 110 E. Foster
Ave. #410, State College, PA,
16802
Honda Fix
Joint Conf. on Digital Libraries
Apple iPod Nano 4GB
Entity Linkage
LEE Dong, 110 East Foster
Avenue Apartment 410,
University Park, PA 16802-2343
Honda Jazz
JCDL
4GB iPod nano 4GB
De-duplication
Ironic, isn’t it?
NIH Lister Hill Medical Center
6
Entity Linkage using the Web and Graph-based Update Summarization
All over the web!
Jeffrey D. Ullman
(Stanford University)
NIH Lister Hill Medical Center
7
Entity Linkage using the Web and Graph-based Update Summarization
Record linkage, formally defined
• Input
– Two lists of records, A and B
• Output
– For each record a in A and for each record b in B,
does a and b refer to the same entity?
• Note:
– Entities do not come with unique identifiers
– To disambiguate (deduplicate) items in a single list L, we set A =
B=L
NIH Lister Hill Medical Center
8
Entity Linkage using the Web and Graph-based Update Summarization
Talk Outline
• Linkage using the Web
– Introduction
>> Record linkage using internal knowledge
String matching
Classification or clustering
Graphical formalisms
Blocking
– Record linkage using search engines
• Update Summarization
NIH Lister Hill Medical Center
9
Entity Linkage using the Web and Graph-based Update Summarization
Frequency of Similarity
Fellegi-Sunter
no-decision region
model
(hold for human review)
designate as
definite non-match
designate as
definite match
* true matches
○ true non-matches
false non-matches
false matches
Similarity (a, b)
NIH Lister Hill Medical Center
10
Entity Linkage using the Web and Graph-based Update Summarization
String matching
•String similarity
–Strings as ordered sequences
Edit distance
Jaro and Jaro-Winkler
([a], [b], [c]) ≠ ([c], [b], [a])
–Strings as unordered sets
Jaccard similarity
Cosine similarity
{[a], [b], [c]} = {[c], [b], [a]}
•Abbreviation matching
– Pattern detection: e.g. “National Institute of Health (NIH)”
NIH Lister Hill Medical Center
11
Entity Linkage using the Web and Graph-based Update Summarization
Machine Learning
– Create features
String similarity, relationships (e.g. collaborators)
– Then learn a model
Naïve Bayes, Support Vector Machine, K-means,
Agglomerative Clustering, …
Yoojin Hong, Byung-Won On and Dongwon Lee. System
Support for Name Authority Control Problem in
Same
Digital Libraries: OpenDBLP Approach. ECDL 2004.
Person?
Sudha Ram, Jinsoo Park and Dongwon Lee. Digital
Libraries for the Next Millennium: Challenges and
Research Directions. Information Systems Frontiers 1999.
NIH Lister Hill Medical Center
12
Entity Linkage using the Web and Graph-based Update Summarization
Graphical Methods: Social network analysis
• Nodes: entities
• Edges: relationships
J. C. Latombe
T.-H. Chiang
D. Hsu
A. Dhanik
Y. Wang
Analysis
L. Qiu
M.-Y. Kan
H. Cui
T.-S. Chua
Y. F. Tan
Connected components
Distance between nodes
Node/edge centrality
Cliques
Bipartite subgraphs
…
NIH Lister Hill Medical Center
13
Entity Linkage using the Web and Graph-based Update Summarization
Talk Outline
• Linkage using the Web
– Introduction
– Record linkage using internal knowledge
>> Record
linkage using search engines
Search Engine Features
Adaptive Queries
Query Probing
• Update Summarization
NIH Lister Hill Medical Center
14
Entity Linkage using the Web and Graph-based Update Summarization
Record linkage using search engines
Previously…
– We assumed input data records contain sufficient
information to perform linkage
What if…
– There is insufficient or only noisy information?
– e.g., linking short forms to long forms
Ask other people!
– I.e., consult external (vs. internal) sources of knowledge
– Use web as collective knowledge base
NIH Lister Hill Medical Center
15
Entity Linkage using the Web and Graph-based Update Summarization
Anatomy of Search Engine Results
Number of results
Ranked list
Title
Snippet
Programmatically
accessible
through APIs
URL
Web page
NIH Lister Hill Medical Center
16
Entity Linkage using the Web and Graph-based Update Summarization
Derivable Features
• Counts
– Co-occurrence measure
between count(q1), count(q2)
and count(q1 and q2)
• Hyperlinkage
– Count of web pages of q1
point to pages of q2, and vice
versa?
– Incorporate additional
indirect links with less weight
(e.g., q 1  p  q2)
NIH Lister Hill Medical Center
• Snippets or web pages
– (Cosine) similarity using
tokens
– Counts of specific terms
e.g. number of snippets
for q1 containing the
string q2
– Further natural language
processing
17
Entity Linkage using the Web and Graph-based Update Summarization
Web page features
•
Named entities (NE)
– We consider people, organizations, locations
– Each NE token a feature
Charles, Chelsea, Morrice,
Edward, Fox, London, …
Born Edward Charles Morrice Fox in Chelsea,
London…
•
NE-targeted (NE-T)
Charles, Morrice, …
– Motivation: middle names and titles
– For NEs having a token of target name
•
NIH Lister Hill Medical Center
Extract tokens that are not in target name as features
18
Entity Linkage using the Web and Graph-based Update Summarization
Using URLs
Where web pages are
located is also useful
Hypothesis: If web pages
of q1 and web pages of
q2 overlap a lot, q1 and
q2 are the same entity
Measure this using URL /
Host information
NIH Lister Hill Medical Center
• Caveat: Not all hosts are
equally telling
– citeseer vs. harvard.edu for
author names
– pubmed vs. diabetesinfo.com for diabetic terms
• Solution: Weight by
Inverse Host Frequency
19
Entity Linkage using the Web and Graph-based Update Summarization
URL Features (cont.)
•
Page URLs
http://www.cs.ualberta.ca/~lindek/
Hypothesis: URL itself tells quite a lot
•
•
Home page of “lindek”
CS department, University of Alberta, Canada
– MeURLin (Kan and Nguyen Thi, 2005)
•
•
•
•
•
NIH Lister Hill Medical Center
Tokens (http, www, cs, ualberta, ca, lindek)
URI parts (scheme:http, hostname:cs, user:lindek, …)
N-grams (ca ualberta, uaberta cs, cs www, www lindek)
Length of tokens
…
20
Entity Linkage using the Web and Graph-based Update Summarization
Web search engine linkage
Test whether q1 and q2
should be linked
Hypothesis: Web pages
of q1 and web pages of
q2 share some
representative data I
Similar to disconnected
triples:
q1
“Jeffrey D. Ullman” =
384K pgs
“Jeffrey D. Ullman” + “aho” =
174K pgs
“J. Ullman” =
124K pgs
“J. Ullman” + “aho” =
41K pgs
“Shimon Ullman” =
27.3K pgs
“Shimon Ullman” + “aho”=
66 pgs
q2
NIH Lister Hill Medical Center
21
Entity Linkage using the Web and Graph-based Update Summarization
Evaluation - Full web pages in WEPS
•
Goal
– To compare the usefulness of various features for the
Web People Search Task
•
Architecture
Input
web pages
NIH Lister Hill Medical Center
Feature
vectors
Clusters
Cosine similarity
+
Single link
hierarchical agglomerative
clustering
+
Minimum similarity
threshold
22
Entity Linkage using the Web and Graph-based Update Summarization
Evaluation
•
F(α = 0.5) and similarity threshold 0.2
1
0.9
0.8
0.7
0.6
ECDL
0.5
Wikipedia
0.4
Census
0.3
0.2
0.1
0
Tokens
(T)
NIH Lister Hill Medical Center
Named
Entities
(NE)
NE
targeted
(NE-T)
Host (H)
Host +
Self (H-S)
Domain
(D)
Domain + URL (U)
Self (D-S)
23
Entity Linkage using the Web and Graph-based Update Summarization
Evaluation - Author Disambiguation
•
•
Dataset
– Manually-disambiguated dataset of 24 ambiguous
names in computer science domain
– Each ambiguous name represented 2 unique
authors (k = 2) except for one where it represented 3
– Each name is attributed to 30 citations on average
– Proportion of largest class ranges from 50% to 97%
Search engine
– Google (http://www.google.com/)
NIH Lister Hill Medical Center
24
Entity Linkage using the Web and Graph-based Update Summarization
Evaluation
0.86
0.84
0.82
0.80
0.78
0.76
0.74
0.72
0.70
0.68
0.66
•
0.836
0.827
0.807
0.798
0.726
Single link
0.734
Complete link
Hostname
0.812
0.8050.811
Domain
Group average
IP address
•
Classification accuracy
averaged over all names
NIH Lister Hill Medical Center
Single link performs best
– Good for clustering
citations from different
publication pages together
(some pages list only
selected publications)
– Some authors have
disparate research areas,
not well represented by a
centroid vector
Resolving hostnames to IP
addresses give best
accuracy
25
Entity Linkage using the Web and Graph-based Update Summarization
Discussion
Per-name accuracies using single link
NIH Lister Hill Medical Center
Per-name average number of URLs
returned per citation
26
Entity Linkage using the Web and Graph-based Update Summarization
Discussion
•
Apparent correlation between accuracy and average number
of URLs returned per citation
– Author names with few URLs tend to fare poorly since results
are mainly aggregator web sites
What’s the cost?
– Lots of queries needed
– Web page downloads are expensive
– Hence, slow
Can we speed this up?
Sure thing…
NIH Lister Hill Medical Center
27
Entity Linkage using the Web and Graph-based Update Summarization
Query probing
• Consider some publication venues:
–Joint Conference on Digital Libraries
–European Conference on Digital Libraries
–Digital Libraries
• Query probing
– Use common n-gram “digital libraries” as query probe
– If we can obtain information on all three conferences, we
save two queries
NIH Lister Hill Medical Center
28
Entity Linkage using the Web and Graph-based Update Summarization
Adaptive querying
Combine two methods when
needed
• Methods
– Ms: stronger method but
very slow (e.g. web page
similarity)
– Mw: weaker method but fast
(e.g. host overlap)
• Algorithm
– Execute Mw
– If heuristic suggests that Mw
results are likely incorrect
Execute Ms
• Aim
– Accuracy close to Ms
– Significantly reduced
running time than Ms
NIH Lister Hill Medical Center
29
Entity Linkage using the Web and Graph-based Update Summarization
Entity Linkage - Conclusion
• Important problem with a rich history
• New external methods poll contextual evidence for
judgment
• Need to combine methods to obtain best aspect of
each
NIH Lister Hill Medical Center
30
Entity Linkage using the Web and Graph-based Update Summarization
Talk Outline
• Linkage using the Web
>> Graph-based Update
Summarization
– Introduction
– Timestamped Graphs
– Evaluation and Conclusions
NIH Lister Hill Medical Center
“Now that all this
data is linked, how
do we process it?’’
31
Entity Linkage using the Web and Graph-based Update Summarization
Applications of Summarization
Doing Less Work
Decision Support
NIH Lister Hill Medical Center
32
Entity Linkage using the Web and Graph-based Update Summarization
More seriously: an exciting challenge ...
...put a book on the scanner, turn the dial to ‘2 pages’, and read
the result...
...download 1000 documents from the web, send them to the
summarizer, and select the best ones by reading the summaries of
the clusters...
...forward the Japanese email to the summarizer, select ‘1 par’,
and skim the translated summary.
…get a weekly digest of new treatments and therapies for
pressure ulcers
An update task
NIH Lister Hill Medical Center
33
Entity Linkage using the Web and Graph-based Update Summarization
Simplifying summarization
Select important sentences verbatim from the input text to
form a summary
– Input: A text document with k sentences
– Output: Top n (n << k) sentences with the highest numeric
scores (each sentence in the input document is assigned a
numeric score)
Extractive Summarization
NIH Lister Hill Medical Center
34
Entity Linkage using the Web and Graph-based Update Summarization
Summarization
Heuristics for extractive summarization
– Cue/stigma phrases
– Sentence position (relative to document, section,
paragraph)
– Sentence length
– TF×IDF, TF scores
– Similarity (with title, context, query)
Machine learning to tune weights by supervised learning
Recently, graphical representations of text have shed new
light on the summarization problem
NIH Lister Hill Medical Center
35
Entity Linkage using the Web and Graph-based Update Summarization
Revisiting Social Networks: Prestige
One motivation was to model the problem as finding
prestige of nodes in a social network
• PageRank: random walk
In summarization, lead to
TextRank and LexRank
• Did we leave anything out of our
representation for summarization?
Yes, the notion of an evolving network
NIH Lister Hill Medical Center
36
Entity Linkage using the Web and Graph-based Update Summarization
Social networks change!
Natural evolving networks (Dorogovtsev and Mendes, 2001)
– Citation networks: New papers can cite old ones, but the old
network is static
– The Web: new pages are added with an old page connecting it to
the web graph, old pages may update links
NIH Lister Hill Medical Center
37
Entity Linkage using the Web and Graph-based Update Summarization
Talk Outline
• Linkage using the Web
• Graph-based Update Summarization
– Introduction
>> Timestamped Graphs
– Evaluation and Conclusion
NIH Lister Hill Medical Center
38
Entity Linkage using the Web and Graph-based Update Summarization
Evolutionary models for summarization
Writers and readers often follow conventional rhetorical
styles - articles are not written or read in an arbitrary way
Consider the evolution of texts using a very simplistic
model
– Writers write from the first sentence onwards in a text
– Readers read from the first sentence onwards of a text
A simple model: sentences get added incrementally to the
graph
NIH Lister Hill Medical Center
39
Entity Linkage using the Web and Graph-based Update Summarization
Timestamped Graph Construction
These assumptions suggest us to iteratively add sentences
into the graph in chronological order.
At each iteration, consider which edges to add to the graph.
– For single document: simple and straightforward: add 1st
sentence, followed by the 2nd, and so forth, until the last sentence is
added
– For multi-document: treat it as multiple instances of single
documents, which evolve in parallel; i.e., add 1st sentences of all
documents, followed by all 2nd sentences, and so forth
• NB: Doesn’t really model chronological ordering between articles, fix
later
NIH Lister Hill Medical Center
40
Entity Linkage using the Web and Graph-based Update Summarization
Timestamped Graph Construction
Model:
• Documents as columns
– di = document i
• Sentences as rows
–sj = jth sentence of
document
NIH Lister Hill Medical Center
41
Entity Linkage using the Web and Graph-based Update Summarization
Timestamped Graph Construction
• A multi document example
doc1
doc2
doc3
sent1
sent2
sent3
NIH Lister Hill Medical Center
42
Entity Linkage using the Web and Graph-based Update Summarization
An example TSG: DUC 2007 D0703A-A
NIH Lister Hill Medical Center
43
Entity Linkage using the Web and Graph-based Update Summarization
Timestamped Graph Construction
These are just one instance of TSGs
Let’s generalize and formalize them
Def: A timestamped graph algorithm tsg(M) is a 9-tuple
(d, e, u, f,σ, t, i, s, τ) that specifies a resulting
algorithm that takes as input the set of texts M and
outputs a graph G
Properties
of edges
NIH Lister Hill Medical Center
Properties
of nodes
Input text
transformation
function
44
Entity Linkage using the Web and Graph-based Update Summarization
Edge properties (d, e, u, f)
• Edge Direction (d)
– Forward, backward, or undirected
• Edge Number (e)
– number of edges to instantiate per timestep
• Edge Weight (u)
– weighted or unweighted edges
• Inter-document factor (f)
– penalty factor for links between documents
in multi-document sets.
NIH Lister Hill Medical Center
45
Entity Linkage using the Web and Graph-based Update Summarization
Node properties (σ, t, i, s)
• Vertex selection function σ(u, G)
– One strategy: among those nodes not yet connected to u in G,
choose the one with highest similarity according to u
– Similarity functions: Jaccard, cosine, concept links
(Ye et al.. 2005)
• Text unit type (t)
– Most extractive algorithms use sentences as elementary units
• Node increment factor (i)
– How many nodes get added at each timestep
• Skew degree (s)
– Models how nodes in multi-document graphs are added
– Skew degree = how many iterations to wait before adding the 1st
sentence of the next document
– Skip for today…
NIH Lister Hill Medical Center
46
Entity Linkage using the Web and Graph-based Update Summarization
Timestamped Graph Construction
• Representations
– We can model a number of different algorithms using this
9-tuple formalism:
(d, e, u, f,
σ, t, i, s,
τ)
– The given toy example:
(f, 1, 0, 1,
max-cosine-based, sentence, 1, 0,
null)
– LexRank graphs:
(u, N, 1, 1,
cosine-based, sentence, Lmax, 0,
null)
N = total number of sentences in the cluster; Lmax = the max document
length
i.e., all sentences are added into the graph in one timestep, each
connected to all others, and cosine scores are given to edge weights
NIH Lister Hill Medical Center
47
Entity Linkage using the Web and Graph-based Update Summarization
System Overview
• Sentence splitting
–Detect and mark sentence boundaries
–Annotate each sentence with the doc
ID and the sentence number
–E.g., XIE19980304.0061: 4 March
1998 from Xinhua News;
XIE19980304.0061-14: the 14th
sentence of this document
• Graph construction
–Construct TSG in this phase
NIH Lister Hill Medical Center
48
Entity Linkage using the Web and Graph-based Update Summarization
System Overview
• Sentence Ranking
– Apply topic-sensitive random
walk on the graph to redistribute the
weights of the nodes
• Sentence extraction
– Extract the top-ranked sentences
– Two different modified MMR rerankers are used, depending on
whether it is main or update task
NIH Lister Hill Medical Center
49
Entity Linkage using the Web and Graph-based Update Summarization
Talk Outline
• Linkage using the Web
• Graph-based Update Summarization
– Introduction
– Timestamped Graphs
>> Evaluation and Conclusion
NIH Lister Hill Medical Center
50
Entity Linkage using the Web and Graph-based Update Summarization
Evaluation
• Dataset: DUC 2005, 2006 and 2007.
• Evaluation tool: ROUGE: n-gram based automatic evaluation
• Each dataset contains 50 or 45 clusters, each cluster contains
a query and 25 documents
• Evaluate on some parameters
– Do different e values affect the summarization process?
• e = 2 works best for DUC dataset
– How do topic-sensitivity and edge weighting perform in running
PageRank?
• Applying both seems to have best effect
– How does skewing the graph affect the information flow in the
graph?
• Skew of 1 works best, but need to try other possibilities
NIH Lister Hill Medical Center
51
Entity Linkage using the Web and Graph-based Update Summarization
Holistic Evaluation in DUC 2007
Extractive-based TSG system
Used modified maximal
marginal relevance for update
tasks
– Penalize links in
previously read articles
Cluster 1
Cluster 2
Cluster 3
– Extension of interdocument factor (f)
NIH Lister Hill Medical Center
52
Entity Linkage using the Web and Graph-based Update Summarization
Evaluation Results
Main task: 10th of 32 systems
Update task: 3rd of 24 systems
Conclusion
• TSG formalism better tailored to deal with update / incremental
text tasks
• New method that may be competitive with current approaches
– Other top scoring systems may do sentence compression
(abstractive), not just extraction
NIH Lister Hill Medical Center
53
Entity Linkage using the Web and Graph-based Update Summarization
Graph-based Update Summary - Conclusion
Proposed a timestamped graph model for text
understanding and summarization
– Adds sentences in an incremental fashion
Future work:
– Freely skewed model
– Empirical and theoretical properties of TSGs
NIH Lister Hill Medical Center
54
Entity Linkage using the Web and Graph-based Update Summarization
Where do we go from here?
Organizing data around entities,
events
• How people deal with data anyways
• Understand objects and their
inter/intra-relationship
• Automation requires domainexpertise within a generic framework
“Collate all studies on SBP2 that new
findings in the last year.”
“Oh, I meant the PROTEIN SBP2, not
the gene.”
“What other proteins does SBP2 bind
to?”
“Tell me more about the contradiction
from previous results.”
“Which Miller did the study on SBP2 in
2002?”
Thank you!
http://wing.comp.nus.edu.sg/
NIH Lister Hill Medical Center
55
Backup Slides – Entity Linkage
50 Minute talk total
7 Apr 2008, 10 – 11 AM
Entity Linkage using the Web and Graph-based Update Summarization
Social network analysis
• Connected triple
• Maximum flow
x1
s
t
x2
• Random walk
• Clustering
x2
x1
x3
NIH Lister Hill Medical Center
57
Entity Linkage using the Web and Graph-based Update Summarization
Scalability Issues
•
Pairwise comparisons
– Requires O(n2) time
– Major bottleneck
•
Possible solutions
– Blocking techniques
– Avoiding pairwise
comparisons altogether
Input: d1, d2, …, dn
for i = 1 to n
for j = (i + 1) to n
compute sim(di, dj)
NIH Lister Hill Medical Center
58
Entity Linkage using the Web and Graph-based Update Summarization
Cost-utility Framework
cost of
utility of
acquiring fi acquiring fi
feature fi
f1
f2
f3
f4
f5
c1
c2
c3
c4
c5
u1
u2
u3
u4
u5
r1
r2
r3
r4
r5
r6
known value
value that can be acquired
NIH Lister Hill Medical Center
59
Entity Linkage using the Web and Graph-based Update Summarization
Record Matching
[2]
Information that can
[1]
Given information be acquired at a cost
Header-reference pair (instance)
TITLE_MIN_LEN
TITLE_MAX_LEN
AUTHOR_MIN_LEN
AUTHOR_MAX_LEN
VENUE_MIN_LEN
VENUE_MAX_LEN
NIH Lister Hill Medical Center
TITLE_SIM
AUTHOR_SIM
VENUE_SIM
Training data
Assume all feature-values
and their acquisition costs
known
Testing data
Assume [1] known, but
feature-values and their
acquisition costs in [2]
unknown
Costs
Set to MIN_LEN * MAX_LEN
MATCH/MISMATCH?
60
Entity Linkage using the Web and Graph-based Update Summarization
Costs and Utilities
•Costs
–Trained 3 models (using M5’), treat as regression
•Utilities
–Trained 2^3 = 8 classifiers (each to predict match/mismatch using
only known feature-values)
–For a test instance with a missing feature-value F
Get confidence of appropriate classifier without F
Get expected confidence of appropriate classifier with F
Utility is difference between the two confidence scores
•Note
–Similar to Saar-Tsechansky et al.
NIH Lister Hill Medical Center
61
Entity Linkage using the Web and Graph-based Update Summarization
Results
Without cleaning of header records
Normalized cost
Recall
Precision
With manual cleaning of header records
F-measure
Normalized cost
1
1
0.9
0.9
0.8
0.8
0.7
0.7
0.6
0.6
0.5
0.5
0.4
0.4
0.3
0.3
0.2
0.2
0.1
0.1
0
0
Increasing proportion of
feature-values acquired
NIH Lister Hill Medical Center
Recall
Precision
F-measure
Increasing proportion of
feature-values acquired
62
Entity Linkage using the Web and Graph-based Update Summarization
Selected Bibliography
•General and surveys
–Ivan P. Fellegi and Alan B. Sunter. A theory for record linkage. Journal of the American
Statistical Association, 64(328):1183–1210, December 1969.
–William E. Winkler and Yves Thibaudeau. An application of the Fellegi-Sunter Model of
record linkage to the 1990 U.S. Decennial Census. Technical Report RR91/09, U.S.
Bureau of the Census, 1991.
–Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, and Vassilios S. Verykios. Duplicate
record detection: A survey. IEEE Transactions on Knowledge and Data Engineering
(TKDE), 19(1):1–16, January 2007.
–William E. Winkler. Overview of record linkage and current research directions. Technical
Report RRS2006/02, U.S. Bureau of the Census, February 2006.
–Mikhail Bilenko, Raymond J. Mooney, William W. Cohen, Pradeep Ravikumar, and
Stephen E. Fienberg. Adaptive name matching in information integration. IEEE Intelligent
Systems, 18(5):16–23, January/February 2003.
–Min-Yen Kan and Yee Fan Tan. Record Matching in Digital Library Metadata. To appear
in Communications of the ACM (CACM).
NIH Lister Hill Medical Center
63
Entity Linkage using the Web and Graph-based Update Summarization
Selected Bibliography
•String matching
–Robert A. Wagner and Michael J. Fischer. The string-to-string correction problem. Journal of the Association of Computing Machinery,
21(1):168–173, January 1974.
–Saul B. Needleman and Christian D. Wunsch. 1970. A general method applicable to the search for similarities in the amino acid
sequence of two proteins. Journal of Molecular Biology, 148(3):443–453, March 1970.
–Temple F. Smith and Michael S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology,
147(1):195–197, March 1981.
–Andrés Marzal and Enrique Vidal. Computation of normalized edit distance and applications. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 15(9):926–932, September 1993.
–Alvaro E. Monge and Charles Elkan. The field matching problem: Algorithms and applications. In ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, pages 267–270, August 1996.
–Jie Wei. Markov edit distance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(3):311–321, March 2004.
–Mikhail Bilenko and Raymond J. Mooney. Adaptive duplicate detection using learnable string similarity measures. In ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining, pages 39–48, August 2003.
–Andrew McCallum, Kedar Bellare, and Fernando Pereira. A Conditional Random Field For Discriminatively-Trained Finite-State String
Edit Distance. In Conference on Uncertainty in Artificial Intelligence (UAI), July 2005.
–William. W. Cohen, Pradeep Ravikumar, and Stephen E. Fienberg. A comparison of string distance metrics for name-matching tasks.
In Information Integration on the Web (IIWeb), pages 73–78, August 2003.
–Ariel S. Schwartz and Marti A. Hearst. A simple algorithm for identifying abbreviation definitions in biomedical text. In Pacific
Symposium on Biocomputing (PSB), pages 451–462, January 2003.
–Youngja Park and Roy J. Byrd. Hybrid text mining for finding abbreviations and their definitions. In Conference on Empirical Methods
in Natural Language Processing (EMNLP), pages 126–133, June 2001.
–Jeffrey T. Chang , Hinrich Schütze, and Russ B. Altman. Creating an online dictionary of abbreviations from MEDLINE. Journal of the
American Medical Informatics Association, 9(6):612–620, November/December 2002.
–Hiroko Ao and Toshihisa Takagi. ALICE: An algorithm to extract abbreviations from MEDLINE. Journal of the American Medical
Informatics Association, 12(5):576–586, September/October 2005.
NIH Lister Hill Medical Center
64
Entity Linkage using the Web and Graph-based Update Summarization
Selected Bibliography
•Direct classification or clustering, and blocking
–Hui Han, Hongyuan Zha, and C. Lee Giles. A model-based K-means algorithm for name disambiguation. In Workshop on Semantic
Web Technologies for Searching and Retrieving Scientific Data, October 2003.
–Hui Han, C. Lee Giles, Hongyuan Zha, Cheng Li, and Kostas Tsioutsiouliklis. Two supervised learning approaches for name
disambiguation in author citations. In ACM/IEEE Joint Conference on Digital Libraries (JCDL), pages 296–305, June 2004.
–Hui Han, Wei Xu, Hongyuan Zha, and C. Lee Giles. A hierarchical naive bayes mixture model for name disambiguation in author
citations. In ACM Symposium on Applied Computing (SAC), pages 1065–1069, March 2005.
–Hui Han, Hongyuan Zha, and C. Lee Giles. Name disambiguation in author citations using a K-way spectral clustering method. In
ACM/IEEE Joint Conference on Digital Libraries (JCDL), pages 334–343, June 2005.
–Dongwon Lee, Byung-Won On, Jaewoo Kang, and Sanghyun Park. Effective and scalable solutions for mixed and split citation
problems in digital libraries. In ACM SIGMOD Workshop on Information Quality in Information Systems (IQIS), pages 69–76, June
2005.
–Byung-Won On, Dongwon Lee, Jaewoo Kang, and Prasenjit Mitra. Comparative study of name disambiguation problem using a
scalable blocking-based framework. In ACM/IEEE Joint Conference on Digital Libraries (JCDL), pages 344–353, June 2005.
–Andrew McCallum, Kamal Nigam, and Lyle Ungar. Efficient clustering of high-dimensional data sets with application to reference
matching. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 169–178, August 2000.
–Matthew Michelson and Craig A. Knoblock. Learning blocking schemes for record linkage. In National Conference on Artificial
Intelligence (AAAI), July 2006.
–Mikhail Bilenko, Beena Kamath, and Raymond J. Mooney. Adaptive Blocking: Learning to Scale Up Record Linkage and Clustering. In
IEEE International Conference on Data Mining (ICDM), December 2006.
NIH Lister Hill Medical Center
65
Entity Linkage using the Web and Graph-based Update Summarization
Selected Bibliography
•Graphical models
–Jie Wei. Markov edit distance. IEEE Transactions on Pattern Analysis and Machine Intelligence,
26(3):311–321, March 2004.
–John Lafferty, Andrew McCallum, and Fernando Pereira. Conditional random fields: Probabilistic
models for segmenting and labeling sequence data. In International Conference on Machine Learning
(ICML), pages 282–289, June/July 2001.
–Andrew McCallum and Ben Wellner. Object consolidation by graph partitioning with a conditionallytrained distance metric. In ACM SIGKDD Workshop on Data Cleaning, Record Linkage, and Object
Consolidation, pages 19–24, August 2003.
–Ben Wellner, Andrew McCallum, Fuchun Peng, and Michael Hay. An integrated, conditional model of
information extraction and coreference with application to citation matching. In Conference on
Uncertainty in Artificial Intelligence (UAI), pages 593–601, July 2004.
–Andrew McCallum, Kedar Bellare, and Fernando Pereira. A Conditional Random Field For
Discriminatively-Trained Finite-State String Edit Distance. In Conference on Uncertainty in Artificial
Intelligence (UAI), July 2005.
–Xin Dong, Alon Halevy, and Jayant Madhavan. Reference reconciliation in complex information
spaces. In ACM SIGMOD International Conference on Management of Data, pages 85–96, June 2005.
–Indrajit Bhattacharya and Lise Getoor. A latent dirichlet model for unsupervised entity resolution. In
SIAM International Conference on Data Mining, pages 47–58, April 2006.
NIH Lister Hill Medical Center
66
Entity Linkage using the Web and Graph-based Update Summarization
Selected Bibliography
•Social network analysis
–H. A. Kautz, B. Selman, and M. A. Shah. The hidden web. AI Magazine, 18(2):27–36, 1997.
–P. Mutschke. Mining networks and central entities in digital libraries. A graph theoretic approach applied to co-author networks. In
Intelligent Data Analysis (IDA), pages 155–166, August 2003.
–M. E. J. Newman. Who is the best connected scientist? A study of scientific coauthorship networks. In Complex Networks, pages 337–
370, February 2004.
–E. Otte and R. Rousseau. Social network analysis: a powerful strategy, also for the information sciences. Journal of Information
Science, 28(6), December 2002.
–T. Krichel and N. Bakkalbasi. A social network analysis of research collaboration in the economics community. In International
Workshop on Webometrics, Informetrics and Scientometrics & Seventh COLLNET Meeting, May 2006.
–R. Rousseau and M. Thelwall. Escher staircases on the world wide web. First Monday, 9(6), June 2004.
–D. G. Feitelson. On identifying name equivalences in digital libraries. Information Research, 9(4), October 2004.
–R. Bekkerman and A. McCallum. Disambiguating web appearances of people in a social network. In International conference on World
Wide Web (WWW), pages 463–470, May 2005.
–R. Holzer, B. Malin, and L. Sweeney. Email alias detection using social network analysis. In Workshop on Link Discovery: Issues,
Approaches and Applications (LinkKDD), August 2005.
–B. Malin, E. Airoldi, and K. M. Carley. A network analysis model for disambiguation of names in lists. Computational and Mathematical
Organization Theory, 11(2):119–139, July 2005.
–G. Flake, S. Lawrence, and C. L. Giles. Efficient identification of web communities. In ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, pages 150–160, August 2000.
–P. K. Reddy and M. Kitsuregawa. An approach to build a cyber-community hierarchy. In SIAM ICDM Workshop on Web Analysis, April
2002.
–Patrick Reuther. Personal name matching: New test collections and a social network based approach. Technical Report
Mathematics/Computer Science 06-01, University of Trier, March 2006.
–Yutaka Matsuo, Junichiro Mori, Masahiro Hamasaki, Keisuke Ishida, Takuichi Nishimura, Hideaki Takeda, Kôiti Hasida, and Mitsuru
Ishizuka. POLYPHONET: an advanced social network extraction system from the web. In International conference on World Wide Web
(WWW), pages 397-406, May 2006.
NIH Lister Hill Medical Center
67
Entity Linkage using the Web and Graph-based Update Summarization
Selected Bibliography
•Web-based methods
–Jamie P. Callan, Margie E. Connell, and Aiqun Du. Automatic discovery of language models for text databases. In ACM SIGMOD International
Conference on Management of Data, pages 479–490, June 1999.
–Jamie P. Callan and Margie E. Connell. Query-based sampling of text databases. ACM Transactions on Information Systems (TOIS), 19(2):97–130,
April 2001.
–Panagiotis G. Ipeirotis and Luis Gravano. Distributed search over the hidden-web: Hierarchical database sampling and selection. In International
Conference on Very Large Databases (VLDB), pages 394–405, August 2002.
–Luis Gravano, Panagiotis G. Ipeirotis, and Mehran Sahami. QProber: A system for automatic classification of hidden-web databases. ACM
Transactions on Information Systems (TOIS), 21(1):1–41, January 2003.
–Aron Culotta, Ron Bekkerman, and Andrew McCallum. Extracting social networks and contact information from email and the web. In Conference on
Email and Anti-Spam (CEAS), July 2004.
–Philipp Cimiano, Siegfried Handschuh, and Steffen Staab. Towards the self-annotating web. In International conference on World Wide Web
(WWW), pages 462–471, May 2004.
–Philipp Cimiano, Günter Ladwig, and Steffen Staab. Gimme the context: Context-driven automatic semantic annotation with C-PANKOW. In
International conference on World Wide Web (WWW), pages 332–341, May 2005.
–Yutaka Matsuo, Junichiro Mori, Masahiro Hamasaki, Keisuke Ishida, Takuichi Nishimura, Hideaki Takeda, Kôiti Hasida, and Mitsuru Ishizuka.
POLYPHONET: an advanced social network extraction system from the web. In International conference on World Wide Web (WWW), pages 397406, May 2006.
–Yee Fan Tan, Min-Yen Kan, and Dongwon Lee. Search engine driven author disambiguation. In ACM/IEEE Joint Conference on Digital Libraries
(JCDL), June 2006.
–Ergin Elmacioglu, Min-Yen Kan, Dongwon Lee, and Yi Zhang. Googled name linkage. 2007.
–Yee Fan Tan, Ergin Elmacioglu, Min-Yen Kan, and Dongwon Lee. Record Linkage of Short Forms to Long Forms: A Case Study of Publication
Venues. 2007.
–Min-Yen Kan. Web page classification without the web page. In International conference on World Wide Web (WWW), pages 262–263, May 2004.
–Min-Yen Kan and Hoang Oanh Nguyen Thi. Fast webpage classification using url features. In International Conference on Information and
Knowledge Management (CIKM), pages 325–326, October/November 2005.
–Panagiotis G. Ipeirotis, Eugene Agichtein, Pranay Jain, and Luis Gravano. To search or to crawl? Towards a query optimizer for text-centric tasks. In
ACM SIGMOD International Conference on Management of Data, pages 265–276, June 2006.
NIH Lister Hill Medical Center
68
Backup Slides - Summarization
50 Minute talk total
7 Apr 2008, 10 – 11 AM
Entity Linkage using the Web and Graph-based Update Summarization
A Summarization Machine
DOC MULTIDOCS QUERY
50%
10%
Very Brief
Brief
Headline
100%
Long
Extract
Abstract
Indicative
Informative
Generic
Query-oriented
Background
Just the news
Summary
Generate a summary given a text document
NIH Lister Hill Medical Center
70
Entity Linkage using the Web and Graph-based Update Summarization
Summarization, defined
• Definitions
Take a text document, extract content from it and present the most
important content to the user in a condensed form and in a
manner sensitive to the user’s or application’s needs
• Summarization requires:
– understanding the meaning of a text document
– generating fluent text summary
• Studies of human summarizers
–Cremmins (65) & Endres-Niggemeyer (98) showed that
professional summarizers used clues to pick summary content.
NIH Lister Hill Medical Center
71
Entity Linkage using the Web and Graph-based Update Summarization
Skew Degree Examples
time(d1) < time(d2) < time(d3) < time(d4)
d1 d2 d3 d4
d1 d2 d3 d4
d1 d2 d3 d4
Freely skewed =
Only add a new
document when it
would be linked
by some node
using vertex
function σ
Skewed by 1
NIH Lister Hill Medical Center
Skewed by 2
Freely skewed
72
Entity Linkage using the Web and Graph-based Update Summarization
Input text transformation function (τ)
• Document Segmentation
Function (τ)
– Problem observed in some
clusters where some documents
in a multi-document cluster are
very long
– Takes many timestamps to
introduce all of the sentences,
causing too many edges to be
drawn
d5
d5a d5b
–Τ(G) segments long documents
into several sub docs
• Solution is too hacked – hope to
investigate more in current and
future work
NIH Lister Hill Medical Center
73
Entity Linkage using the Web and Graph-based Update Summarization
Evaluation on number of edges (e)
Tried different e values
• Optimal performance: e = 2
• At e = 1, graph is too loosely connected, not suitable for PageRank
→ very low performance
• At e = N, a LexRank system
e=2
e=2
N
NIH Lister Hill Medical Center
N
N
74
Entity Linkage using the Web and Graph-based Update Summarization
Evaluation (other edge parameters)
• PageRank: generic vs topic-sensitive
• Edge weight (u): unweighted vs weighted
• Optimal performance: topic-sensitive PageRank and weighted edges
NIH Lister Hill Medical Center
Topicsensitive
Weighted
edges
ROUGE-1
ROUGE-2
No
No
0.39358
0.07690
Yes
No
0.39443
0.07838
No
Yes
0.39823
0.08072
Yes
Yes
0.39845
0.08282
75
Entity Linkage using the Web and Graph-based Update Summarization
Evaluation on skew degree (s)
• Different skew degrees: s = 0, 1 and 2
• Optimal performance: s = 1
• s = 2 introduces a delay interval that is too large
Skew degree
ROUGE-1
ROUGE-2
0
0.36982
0.07580
1
0.37268
0.07682
2
0.36998
0.07489
• Need to try freely skewed graphs
NIH Lister Hill Medical Center
76
Entity Linkage using the Web and Graph-based Update Summarization
Describing Summaries
Aspects of summarization
(Sparck-Jones 97,
Hovy and Lin 99)
• Input:
– Single-document vs. multi-document
• Purpose
– Situation: embedded in larger system (MT, IR) or not?
– Generic vs. query-oriented: author’s view or user’s interest?
– Indicative vs. informative: categorization or understanding?
– Background vs. just-the-news: does user have prior knowledge?
• Output
– Extract vs. abstract: use text fragments or re-phrase content?
NIH Lister Hill Medical Center
77
Entity Linkage using the Web and Graph-based Update Summarization
Differences for main and update task processing
Main task:
1.
2.
3.
Construct a TSG for
input cluster
Run topic-sensitive
PageRank on the TSG
Apply first modified
version of MMR to
extract sentences
Update task:
• Cluster A:
– Construct a TSG for cluster A
– Run topic-sensitive PageRank on the TSG
– Apply the second modified version of MMR
to extract sentences
• Cluster B:
– Construct a TSG for clusters A and B
– Run topic-sensitive PageRank on the TSG;
only retain sentences from B
– Apply the second modified version of MMR
to extract sentences
• Cluster C:
– Construct a TSG for clusters A, B and C
– Run topic-sensitive PageRank on the TSG;
only retain sentences from C
– Apply the second modified version of MMR
to extract sentences
NIH Lister Hill Medical Center
78
Entity Linkage using the Web and Graph-based Update Summarization
Sentence Ranking
• Once a timestamped graph is built, we want to compute an prestige
score for each node
• PageRank: use an iterative method that allows the weights of the
nodes to redistribute until stability is reached
• Similarities as edges → weighted edges; query → topic-sensitive
Topic
sensitive (Q)
portion
NIH Lister Hill Medical Center
Standard
random
walk term
79
Entity Linkage using the Web and Graph-based Update Summarization
Sentence Extraction – Main task
• Original MMR: integrates a penalty of the maximal similarity of the candidate
document and one selected document
• Ye et al. (2005) introduced a modified MMR: integrates a penalty of the total
similarity of the candidate sentence and all selected sentences
Penalty: All
previous
sentence
similarity
• Score(s) = PageRank score of s; S = selected sentences
• This is used in the main task
NIH Lister Hill Medical Center
80
Entity Linkage using the Web and Graph-based Update Summarization
Sentence Extraction – Update task
•Update task assumes readers already read previous cluster(s)
– implies we should not select sentences that have redundant
information with previous cluster(s)
• Propose a modified MMR for the update task:
– consider the total similarity of the candidate sentence with all
selected sentences and sentences in previously-read cluster(s)
Previous
cluster
overlap
• P contains some top-ranked sentences in previous cluster(s)
NIH Lister Hill Medical Center
81
Entity Linkage using the Web and Graph-based Update Summarization
References
• Günes Erkan and Dragomir R. Radev. 2004. LexRank: Graph-based centrality as
salience in text summari-zation. Journal of Artificial Intelligence Research, (22).
• Rada Mihalcea and Paul Tarau. 2004. TextRank: Bring-ing order into texts. In
Proceedings of EMNLP 2004.
• S.N. Dorogovtsev and J.F.F. Mendes. 2001. Evolution of networks. Submitted to
Advances in Physics on 6th March 2001.
• Sergey Brin and Lawrence Page. 1998. The anatomy of a large-scale hypertextual Web
search engine. Com-puter Networks and ISDN Systems, 30(1-7).
• Jon M. Kleinberg. 1999. Authoritative sources in a hy-perlinked environment. In
Proceedings of ACM-SIAM Symposium on Discrete Algorithms, 1999.
• Shiren Ye, Long Qiu, Tat-Seng Chua, and Min-Yen Kan. 2005. NUS at DUC 2005:
Understanding docu-ments via concepts links. In Proceedings of DUC 2005.
NIH Lister Hill Medical Center
82