Web People Search via
Connection Analysis
Authors : Dmitri V. Kalashnikov, Zhaoqi (Stella) Chen, Sharad
Mehrotra, and Rabia Nuray-Turan
From : IEEE Trans. on Knowledge and Data Engineering 2008
Presenter : 陳仲詠
Citation : 21 (Google Scholar)
1
Outline
• 1. Introduction
• 2. Overview of the approach
• 3. Generating a graph representation
• 4. Disambiguation algorithm
• 5. Interpreting clustering results
• 6. Related work
• 7. Experimental Results
• 8. Conclusions and Future work
2
Introduction (1/7)
• Searching for web pages related to a person
accounts for more than 5 percent of the
current Web searches [24].
• A search for a person, say “Andrew
McCallum,” will return pages relevant to any
person with that name.
[24] R. Guha and A. Garg, Disambiguating People in Search. Stanford Univ., 2004.
3
Introduction (2/7)
• Assume (for now) that for each such web page,
the search-engine could determine which real
entity (i.e., which Andrew McCallum) the page
refers to.
• The goal is to provide a clustered person
search capability, in which the returned
results are grouped so that each cluster
corresponds to a real person.
4
Introduction (3/7)
• The user can home in on the cluster of
interest to her and get all pages in that cluster.
• For example, only the pages associated with
that Andrew McCallum.
5
Introduction (4/7)
• In reality, it is not obvious that this is a
better option than searching for people using
keyword-based search.
• If each cluster identified by the search engine
corresponded to a single person, then the
cluster-based approach would be a good
choice.
6
Introduction (5/7)
• The key issue is the quality of the clustering
algorithm in disambiguating the web pages
of namesakes.
7
Introduction (6/7)
• 1. Develop a novel algorithm for
disambiguating among people that have the
same name.
• 2. Design a cluster-based people search
approach based on the disambiguation
algorithm.
8
Introduction (7/7)
• The main contributions of this paper are the
following :
• A new approach for Web People Search that
shows high-quality clustering.
• A thorough empirical evaluation of the
proposed solution (Section 7), and
• A new study of the impact on search of the
proposed approach (Section 7.3).
9
Overview of the approach (1/4)
• The processing of a user query consists of the
following steps:
• 1. User input : A user submits a query.
• 2. Web page retrieval : Retrieves a fixed
number (top K) of relevant web pages.
10
Overview of the approach (2/4)
• 3. Preprocessing :
– TF/IDF computation and noun phrase identification.
– Extraction of named entities (NEs) and Web-related
information.
• 4. Graph creation : The entity-relationship (ER)
graph is generated based on data extracted.
11
Overview of the approach (3/4)
• 5. Clustering : The result is a set of clusters of
these pages, the aim being to group web
pages by their association with real persons.
12
Overview of the approach (4/4)
• 6. Cluster processing :
– Sketches : A set of keywords that represent the
web pages within a cluster.
– Cluster ranking.
– Web page ranking.
• 7. Visualization of results
13
Generating a graph representation (1/6)
• Extracted from the web pages :
1) the entities
2) relationships
3) hyperlinks
4) e-mail addresses
14
Generating a graph representation (2/6)
• For example, a person “John Smith” might be
extracted from two different web pages (Doc 1
and Doc 2).
• Both pages are linked to a single “John Smith”
node, regardless of whether the two pages
refer to the same person or to two different
people.
15
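The graph-building step above can be sketched in a few lines of Python (a simplified illustration, not the paper's actual pipeline; the data structures and the "mentions" edge type are assumptions made for the example):

```python
from collections import defaultdict

def build_er_graph(extractions):
    """Build an entity-relationship graph as an adjacency map.

    `extractions` maps a document id to the list of entities (named
    entities, hyperlinks, e-mail addresses) extracted from that page.
    Each page and each extracted entity becomes a node; an edge of
    type 'mentions' links a page to every entity found on it.
    """
    graph = defaultdict(set)
    for doc, entities in extractions.items():
        for ent in entities:
            graph[doc].add((ent, "mentions"))
            graph[ent].add((doc, "mentions"))
    return dict(graph)

# "John Smith" extracted from two different pages: both pages link
# to the same shared node, regardless of whether they refer to the
# same real person or to two different people.
g = build_er_graph({"doc1": ["John Smith", "MIT"],
                    "doc2": ["John Smith"]})
```

Merging the two mentions into one node is deliberate: deciding whether they denote the same real person is exactly the job of the later disambiguation step.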
Generating a graph representation (3/6)
16
Generating a graph representation (4/6)
• The relationship edges are typed.
• Any hyperlinks and e-mail addresses
extracted from the web page are handled in
an analogous fashion.
20
Generating a graph representation (5/6)
• A hyperlink is decomposed into its domain
components d1, d2, . . . and path components p1, p2, . . .
• For example, for the URL
www.cs.umass.edu/~mccallum/
we have d3 = cs, d2 = umass, d1 = edu, and
p1 = ~mccallum.
21
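The URL decomposition from the example can be sketched as follows (a minimal illustration using only the standard library; stripping the leading "www" is an assumption made to match the slide's example):

```python
from urllib.parse import urlparse

def decompose_url(url):
    """Split a hyperlink into domain components d1, d2, ... (counted
    from the top-level domain inward) and path components p1, p2, ...
    e.g. www.cs.umass.edu/~mccallum/ -> d1=edu, d2=umass, d3=cs,
    p1=~mccallum."""
    # urlparse only fills netloc when the URL has a "//" prefix
    parsed = urlparse(url if "//" in url else "//" + url)
    host_parts = parsed.netloc.split(".")
    if host_parts and host_parts[0] == "www":
        host_parts = host_parts[1:]           # drop the "www" prefix
    domains = list(reversed(host_parts))      # d1 = TLD, then d2, d3, ...
    paths = [p for p in parsed.path.split("/") if p]  # p1, p2, ...
    return domains, paths

domains, paths = decompose_url("www.cs.umass.edu/~mccallum/")
# domains == ["edu", "umass", "cs"], paths == ["~mccallum"]
```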
Generating a graph representation (6/6)
22
Disambiguation algorithm
• 1. Takes the entity-relationship graph as input.
• 2. Uses a Correlation Clustering (CC) algorithm
to cluster the pages.
• 3. The outcome is a set of clusters with each
cluster corresponding to a person.
23
Disambiguation algorithm
Correlation Clustering (1/3)
• CC has been applied in the past to group
documents of the same topic and to other
problems.
• It assumes that there is a similarity function
s(u, v) learned on the past data.
• Each (u, v) edge is assigned a “+” (similar) or “-”
(different) label, according to the similarity
function s(u, v).
24
Disambiguation algorithm
Correlation Clustering (2/3)
• The goal is to find the partition of the graph
into clusters that agrees the most with the
assigned labels.
• The CC does not take k (the number of the
resulting clusters) as its input parameter.
25
Disambiguation algorithm
Correlation Clustering (3/3)
• Formally, the goal of CC is either to :
– maximize the agreement, or
– minimize the disagreement with the labels.
• The problem of CC is known to be NP-hard.
26
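Since CC is NP-hard, practical systems rely on approximations. One well-known scheme (used here purely as an illustration; the paper does not necessarily use this particular solver) is greedy pivoting: repeatedly pick a pivot node, put into its cluster every remaining node joined to it by a "+" edge, and recurse on the rest. Note that the number of clusters falls out of the process rather than being an input:

```python
def pivot_cc(nodes, positive):
    """Greedy pivot approximation to Correlation Clustering.

    `positive` is the set of frozenset({u, v}) edges labeled '+';
    every other pair is implicitly '-'. The number of clusters k is
    an output of the algorithm, not an input parameter.
    """
    remaining = list(nodes)
    clusters = []
    while remaining:
        pivot = remaining.pop(0)
        cluster, rest = [pivot], []
        for v in remaining:
            # nodes positively linked to the pivot join its cluster
            (cluster if frozenset((pivot, v)) in positive else rest).append(v)
        clusters.append(cluster)
        remaining = rest
    return clusters

edges = {frozenset(p) for p in [("a", "b"), ("b", "c"), ("a", "c")]}
clusters = pivot_cc(["a", "b", "c", "d"], edges)
# -> [["a", "b", "c"], ["d"]]
```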
Disambiguation algorithm
Connection Strength (1/6)
• Use the notion of the Connection Strength c(u,
v) between two objects u and v to define the
similarity function s(u, v).
• The disambiguation algorithm is based on
analyzing :
– object features and
– the ER graph for the data set.
27
Disambiguation algorithm
Connection Strength (2/6)
• A path between u and v semantically captures
interactions between them via intermediate
entities.
• If the combined attraction of all these paths
is sufficiently large, the objects are likely to be
the same.
28
Disambiguation algorithm
Connection Strength (3/6)
• Analyzing paths :
• The assumption is that each path between
two objects carries in itself a certain degree of
attraction.
29
Disambiguation algorithm
Connection Strength (4/6)
• The attraction between two nodes u and v via
paths is measured using the connection
strength measure c(u, v).
• Defined as the sum of attractions contributed
by each path : c(u, v) = Σp∈Puv wp.
30
Disambiguation algorithm
Connection Strength (5/6)
• Puv denotes the set of all L-short simple paths
between u and v.
– A path is L-short if its length does not exceed L
and is simple if it does not contain duplicate nodes.
• wp denotes the weight contributed by path p.
– The weight path p contributes is derived from the
type of that path.
31
Disambiguation algorithm
Connection Strength (6/6)
• Let Puv consist of c1 paths of type 1, c2 paths of
type 2, . . . , cn paths of type n.
• Then c(u, v) = c1w1 + c2w2 + · · · + cnwn.
32
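The definition above can be sketched directly: enumerate the L-short simple paths between u and v, classify each path by its type, and sum the per-type weights. This is a naive reference implementation for small graphs (the path-type definition here, the sequence of edge types, is a simplifying assumption for the example):

```python
def l_short_simple_paths(graph, u, v, L):
    """All simple paths from u to v of length at most L.
    `graph` maps a node to a {neighbor: edge_type} dict."""
    paths, stack = [], [(u, [u])]
    while stack:
        node, path = stack.pop()
        if node == v and len(path) > 1:
            paths.append(path)        # reached v: record and stop extending
            continue
        if len(path) - 1 >= L:
            continue                  # path already L edges long
        for nbr in graph.get(node, {}):
            if nbr not in path:       # "simple": no duplicate nodes
                stack.append((nbr, path + [nbr]))
    return paths

def connection_strength(graph, u, v, L, type_weight):
    """c(u, v): sum over L-short simple u-v paths of the weight w_p,
    where w_p is derived from the path's type."""
    total = 0.0
    for path in l_short_simple_paths(graph, u, v, L):
        ptype = tuple(graph[a][b] for a, b in zip(path, path[1:]))
        total += type_weight.get(ptype, 0.0)
    return total

# two pages connected only through a shared "John Smith" node
g = {"doc1": {"John Smith": "mentions"},
     "John Smith": {"doc1": "mentions", "doc2": "mentions"},
     "doc2": {"John Smith": "mentions"}}
c = connection_strength(g, "doc1", "doc2", 2,
                        {("mentions", "mentions"): 0.5})
# -> 0.5 (one path of that type)
```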
Disambiguation algorithm
Similarity Function (1/4)
• The goal is to design a powerful similarity
function s(u, v) that would minimize
mislabeling of the data.
• Design a flexible function s(u, v) that can
automatically tune itself to the particular
domain being processed.
33
Disambiguation algorithm
Similarity Function (2/4)
• The similarity function s(u, v) labels data by
comparing the s(u, v) value against the threshold
γ.
• The δ-band (“clear margin”) approach is used
to label the edge (u, v).
• It avoids committing to a “+” or “-” decision
when there is not enough evidence for either.
34
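The δ-band labeling rule is small enough to state as code (a direct transcription of the idea above; the "?" marker for the abstain case is an illustrative choice):

```python
def label_edge(s_uv, gamma, delta):
    """Label an edge '+' or '-' by comparing s(u, v) against the
    threshold gamma, abstaining ('?') inside the delta-band around
    gamma, where there is not enough evidence for either decision."""
    if s_uv >= gamma + delta:
        return "+"
    if s_uv <= gamma - delta:
        return "-"
    return "?"
```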
Disambiguation algorithm
Similarity Function (3/4)
• Employs the standard TF/IDF scheme to
compute its feature-based similarity f(u, v).
– Noun phrases
– Larger terms
• The entire document corpus consists of K
documents and N distinct terms T = {t1, t2, . . . , tN}.
35
Disambiguation algorithm
Similarity Function (4/4)
• Each document u is represented as a vector of
term weights, where wui is the weight of term ti in u.
36
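A minimal sketch of the feature-based similarity f(u, v): standard tf * idf term weights plus cosine similarity between the weight vectors. This assumes the plain log(K / df) idf variant, which may differ in detail from the paper's exact scheme:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: {doc_id: list of terms}. Returns {doc_id: {term: weight}}
    with weight w_ui = tf(t_i, u) * log(K / df(t_i))."""
    K = len(docs)
    df = Counter(t for terms in docs.values() for t in set(terms))
    vecs = {}
    for d, terms in docs.items():
        tf = Counter(terms)
        vecs[d] = {t: tf[t] * math.log(K / df[t]) for t in tf}
    return vecs

def cosine(u, v):
    """f(u, v): cosine similarity of two sparse weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = {"u": ["machine", "learning", "umass"],
        "v": ["machine", "learning", "music"],
        "w": ["music", "dj"]}
vecs = tfidf_vectors(docs)
```

On this toy corpus, pages "u" and "v" share weighted terms and score higher than the unrelated pair "u" and "w".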
Disambiguation algorithm
Training the Similarity Function (1/2)
• For each (u, v) edge, require that :
• In practice, s(u, v) is unlikely to be perfect,
which manifests itself in violations of the
inequalities in (5) for some of the (u, v) edges.
• This is resolved by adding a slack variable to
each inequality in (5).
37
Disambiguation algorithm
Training the Similarity Function (2/2)
• The task becomes to solve the linear
programming problem (6) to determine the
optimal values for path type weights w1,
w2,…,wn and threshold γ.
38
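The paper solves a linear program; the sketch below keeps the same objective (minimize the total slack needed to satisfy the training inequalities) but replaces the LP solver with a brute-force grid search over the weights and threshold, purely to keep the example dependency-free. The δ margin value and the edge encoding are assumptions for the illustration:

```python
import itertools

def total_slack(edges, weights, gamma, delta=0.05):
    """Total slack for the constraints of (5):
    s(u, v) >= gamma + delta on coreferent pairs,
    s(u, v) <= gamma - delta on non-coreferent pairs.
    Each edge is (path_type_counts, is_same_person), with
    s(u, v) = sum_i counts_i * w_i."""
    slack = 0.0
    for counts, same in edges:
        s = sum(c * w for c, w in zip(counts, weights))
        slack += max(0.0, (gamma + delta) - s) if same \
                 else max(0.0, s - (gamma - delta))
    return slack

def train(edges, n_types, grid=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Brute-force stand-in for solving LP (6): pick path-type
    weights w1..wn and threshold gamma minimizing total slack."""
    best = None
    for ws in itertools.product(grid, repeat=n_types):
        for gamma in grid:
            s = total_slack(edges, ws, gamma)
            if best is None or s < best[0]:
                best = (s, ws, gamma)
    return best

# one coreferent pair with 2 type-1 paths, one non-coreferent pair
# with 1 type-2 path: a zero-slack separation exists on the grid
slack, ws, gamma = train([((2, 0), True), ((0, 1), False)], 2)
```

In a real system the same constraints would go to an LP solver, which handles continuous weights and scales far beyond a grid search.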
Disambiguation algorithm
Choosing Negative Weight (1/7)
• A CC algorithm will assign an entity u to a
cluster if the positive edges between u and the
other entities in the cluster outnumber the
negative edges.
• That is, when more than half (i.e., 50 percent)
of those edges are positive.
39
Disambiguation algorithm
Choosing Negative Weight (2/7)
• To keep an entity in a cluster, it is sufficient to
have only 25 percent of positive edges.
• Using the w+=+1 weight for all positive edges
and w-=-1/3 weight for all negative edges will
achieve the desired effect.
40
Disambiguation algorithm
Choosing Negative Weight (3/7)
• One solution for choosing a good value for the
negative-edge weight w- is to learn it on past
data, based on the number of namesakes n in
the top k web pages.
– If n = 1, it is best to choose w- = 0.
– Then all the pairs connected via positive edges will
be merged.
41
Disambiguation algorithm
Choosing Negative Weight (4/7)
– If n = k, it is best to choose w- = -1.
– This would produce maximum negative evidence
for pairs not to be merged.
• Thus, the negative weight should be a function
of n : w- = w-(n).
42
Disambiguation algorithm
Choosing Negative Weight (5/7)
• This observation raises two issues :
– 1) n is not known to the algorithm beforehand.
– 2) how to choose the w-(n) function.
43
Disambiguation algorithm
Choosing Negative Weight (6/7)
• Since n is not known, compute its estimate ^n
by running the disambiguation algorithm with
a fixed value of w-.
• The algorithm outputs a certain number of
clusters ^n, which is employed as an estimate
of n.
44
Disambiguation algorithm
Choosing Negative Weight (7/7)
• The value of w-(^n) :
– when ^n < threshold, w-(^n) = 0.
– when ^n > threshold, w-(^n) = -1.
• This threshold is learned from the data.
45
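The two-pass heuristic above can be condensed into a small function (a sketch of the idea, with the clustering pass abstracted behind a callable; the real system would re-run the full CC algorithm in the first pass):

```python
def choose_negative_weight(estimate_clusters, threshold):
    """Two-pass heuristic for the negative-edge weight w-.

    Pass 1: run the disambiguation with a fixed w- and take the
    number of output clusters as the estimate n_hat of the number
    of namesakes. Then pick w-(n_hat): 0 below the learned
    threshold (merge positively-linked pairs aggressively), -1
    above it (maximum negative evidence against merging)."""
    n_hat = estimate_clusters()      # pass 1 with a fixed w-
    return 0.0 if n_hat < threshold else -1.0
```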
A brief Summary
46
Interpreting Clustering Results (1/4)
• We now describe how these clusters are used
to build people search.
• The goal is to provide the user with a set of
clusters, each associated with a real person.
– 1. Rank the clusters.
– 2. Provide a summary description with each
cluster.
47
Interpreting Clustering Results (2/4)
• Cluster rank :
– Select the highest ranked page.
• Cluster sketch :
– The set of terms above a certain threshold is
selected and used as a summary for the cluster.
48
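The cluster-sketch step above reduces to selecting the cluster's highest-weighted terms. A minimal sketch (the weighting itself, e.g. TF/IDF over the cluster's pages, is assumed to be computed elsewhere):

```python
def cluster_sketch(term_weights, top_n=3, threshold=None):
    """Summarize a cluster by its highest-weighted terms: keep the
    terms above `threshold` if one is given, otherwise the top_n."""
    ranked = sorted(term_weights.items(), key=lambda kv: -kv[1])
    if threshold is not None:
        return [t for t, w in ranked if w >= threshold]
    return [t for t, _ in ranked[:top_n]]

# illustrative weights for a cluster of Andrew McCallum's pages
sketch = cluster_sketch({"machine learning": 0.9,
                         "artificial intelligence": 0.7,
                         "umass": 0.5, "page": 0.1})
# -> ["machine learning", "artificial intelligence", "umass"]
```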
Interpreting Clustering Results (3/4)
• Web page rank :
– These pages are displayed according to their
original search engine order.
49
Interpreting Clustering Results (4/4)
• Affinity to cluster :
– Defined as the sum of the similarity values
between the page p and each page v in the cluster
C:
• The remaining pages are displayed as well, so
the user has the option to reach them too.
50
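The affinity definition above translates directly (a sketch; the similarity function s is passed in as a callable here purely for illustration):

```python
def affinity(similarity, page, cluster):
    """Affinity of page p to cluster C: the sum of similarity
    values s(p, v) over every page v in the cluster."""
    return sum(similarity(page, v) for v in cluster if v != page)

# toy similarity: 1.0 within {p1, p2, p3}, 0.0 otherwise
sim = lambda a, b: 1.0 if {a, b} <= {"p1", "p2", "p3"} else 0.0
```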
Experimental Results
Experimental Setup (1/8)
• The three data sets :
– 1. WWW 2005 data set[8] : 12 different people
names.
– 2. WEPS data set : SemEval workshop [3], consist
of :
• Trial : 9 person names.
• Training : 49 person names.
• Test : 30 person names.
[3] J. Artiles, J. Gonzalo, and S. Sekine, “The SemEval-2007 WePS Evaluation: Establishing a
Benchmark for the Web People Search Task,” Proc. Int’l Workshop Semantic Evaluations
(SemEval ’07), June 2007.
[8] R. Bekkerman and A. McCallum, “Disambiguating Web Appearances of People in a
Social Network,” Proc. Int’l World Wide Web Conf. (WWW), 2005.
51
Experimental Results
Experimental Setup (2/8)
– 3. Context data set :
• Issuing nine queries to Google, each in the form of a
person name along with context keywords.
• The top 100 returned web pages of the Web
search were gathered for each person.
52
Experimental Results
Experimental Setup (3/8)
• To get the “ground truth” for these data sets,
the pages for each person name have then
been assigned to distinct real persons by
manual examination.
53
Experimental Results
Experimental Setup (4/8)
• Used the GATE [19] system for the extraction
of NEs from the web pages in the data set.
• To train the free parameters of the algorithm,
leave-one-out cross-validation is applied on the
WWW 2005, WEPS Trial, and Context data
sets.
[19] H. Cunningham, D. Maynard, K. Bontcheva, and V. Tablan, “GATE: A Framework and
Graphical Development Environment for Robust NLP Tools and Applications,” Proc. Ann.
Meeting of the Assoc. Computational Linguistics (ACL), 2002.
54
Experimental Results
Experimental Setup (5/8)
• Before the “ground truth” for the WEPS Test
portion was released, the approach was tested
on the WEPS Training set by twofold cross
validation.
55
Experimental Results
Experimental Setup (6/8)
• After the “ground truth” of the WEPS Test
portion became available, the algorithm was
trained on the whole WEPS Training portion
and tested on the WEPS Test portion.
56
Experimental Results
Experimental Setup (7/8)
• Quality evaluation measures :
– the B-cubed [6] and FP measures.
• Baseline methods :
– the Agglomerative Vector Space clustering
algorithm with TF/IDF as the Baseline method.
– The threshold parameter for this method is
trained in the same way.
57
Experimental Results
Experimental Setup (8/8)
• Statistical significance test :
– 1-tailed paired t-test, with α = 0.05.
58
Testing Disambiguation Quality
Experiment 1 (Disambiguation quality : overall) (1/7)
59
Testing Disambiguation Quality
Experiment 1 (Disambiguation quality : overall) (2/7)
* s(u, v) = c(u, v) represents the approach where only the
connection strength is employed for disambiguation.
* Relies only on the extracted NEs and hyperlink information,
and it does not use the TF/IDF.
60
Testing Disambiguation Quality
Experiment 1 (Disambiguation quality : overall) (3/7)
* With the analysis of the features of web pages f(u, v), in the
form of their TF/ IDF similarity.
61
Testing Disambiguation Quality
Experiment 1 (Disambiguation quality : overall) (4/7)
Picks w- according to the function w-(^n) of the predicted
number of namesakes.
Gains 7.8 percent improvement in terms of B-cubed over the
baseline (WWW 2005 ).
Gets 6.1 percent improvement (WEPS Training) and 10.7
percent improvement (WEPS Test).
62
Testing Disambiguation Quality
Experiment 1 (Disambiguation quality : overall) (5/7)
Also compare the results with the top runners in the WEPS
challenge [3]. The first runner in the challenge reports 0.78 for Fp
and 0.70 for B-cubed measures.
[3] J. Artiles, J. Gonzalo, and S. Sekine, “The SemEval-2007 WePS Evaluation: Establishing a
Benchmark for the Web People Search Task,” Proc. Int’l Workshop Semantic Evaluations
(SemEval ’07), June 2007.
63
Testing Disambiguation Quality
Experiment 1 (Disambiguation quality per namesake) (6/7)
The “#” field shows the number of namesakes for a particular
name in the corresponding 100 web pages.
[4]J. Artiles, J. Gonzalo, and F. Verdejo, “A Testbed for People Searching Strategies in
the WWW,” Proc. SIGIR, 2005. (C : 39)
64
Testing Disambiguation Quality
Experiment 1 (Disambiguation quality per namesake) (7/7)
The table shows that the proposed approach outperforms that
in [4] by 9.5 percent in terms of the FP measure.
[4]J. Artiles, J. Gonzalo, and F. Verdejo, “A Testbed for People Searching Strategies in
the WWW,” Proc. SIGIR, 2005. (C : 39)
65
Testing Disambiguation Quality
Experiment 2 (Disambiguation quality : group identification)
The 1,085 web pages of the WWW 2005 data set.
The task is to find the web pages related to the N intended people.
66
Testing Disambiguation Quality
Experiment 2 (Disambiguation quality : group identification)
The field “#W” in Table 3 is the number of the to-be-found web pages related to the
namesake of interest.
The field “#C” is the number of web pages found correctly.
The field “#I” is the number of pages found incorrectly in the resulting groups.
67
Testing Disambiguation Quality
Experiment 3 (Disambiguation quality: queries with context)
Generated a data set by querying Google with a person name and context
keyword(s) that is related to that person.
Used nine different queries.
68
Testing Disambiguation Quality
Experiment 4 (Quality of generating cluster sketches)
The set of terms above a certain
threshold (or top N terms) is selected and
used as a summary for the cluster.
If the search is for UMass professor
Andrew McCallum, his cluster can easily be
identified with the terms like “machine
learning” and “artificial intelligence.”
69
Impact on Search
In the case of a traditional search interface, at each observation i,
where i = 1, 2,…,K, the user looks at the sketch provided for the i-th
returned web page.
70
Impact on Search
For the new interface, supported by a cluster-based people
search, the user first looks at the “people search” interface.
1. The user sequentially reads cluster sketches/descriptions,
until on the m-th observation the user finds the cluster of
interest.
2. Clicks on that cluster.
3. Shows the original set of K web pages returned by the
search engine.
71
Impact on Search
Measures :
Compare the quality of the new and standard
interface using Precision, Recall, and F-measure.
In general, the fewer observations are needed in
a given interface, the faster the user can find the
related pages.
72
Experiment 5 (Impact on search)
Case 1 : First-dominant cluster
Observation                                      Standard   New interface
To discover 50 percent of the relevant pages        44           33
To discover 90 percent of the relevant pages        92           55
73
Experiment 5 (Impact on search)
Case 2 : Regular cluster
Andrew McCallum the Customer Support
person. His cluster consists of three pages.
Observation                                      Standard   New interface
To discover 50 percent of the relevant pages        51           16
To discover 90 percent of the relevant pages        79           17
74
Experiment 5 (Impact on search)
Case 3 : Average
The average of Recall, Precision, and F
measures for search impact on the WWW 2005.
Some of the person names have many
namesakes.
The results show that, even with the imperfect clustering,
the curves for the new interface largely
dominate those for the standard interface.
75
Experiment 5 (Impact on search)
Impact on search with context
In that case, one can expect to see no
difference between the new and the standard
interface.
The query is “Andrew McCallum” music
The number of namesakes for that query is
surprisingly large: 23.
Andrew McCallum the UMass professor, who
is interested in music.
76
Experiment 5 (Impact on search)
Impact on search with context
Andrew McCallum the DJ/ musician.
In both cases, the new interface performs
better than the standard one.
Observation                            Standard   New interface
To discover 90 percent of the prof.        90           60
To discover 90 percent of the DJ           90           20
77
Experiment 6 (efficiency)
Downloading and preprocessing take 3.82 seconds per
web page.
The clustering algorithm itself executes in 4.7 seconds
on average per queried name.
78
CONCLUSIONS AND FUTURE WORK
•Attempted to answer the question of what maximum
quality the approach can achieve if it uses only the
information stored in the top-k web pages being
processed.
•Future work :
1. Employ external data sources for disambiguation.
2. Use more advanced extraction capabilities.
3. Work on algorithms for a generic entity search, where
entities are not limited to people.
79
Related Work
• Disambiguation and entity resolution
techniques are key to any Web people search
applications.
80
The differences among the disambiguation methodology in this paper and most related
existing work are multilevel (see Table 1).
81
Level 1: Problem type.
Two different common types of the disambiguation challenge:
(fuzzy) Lookup [27], [28], and (fuzzy) Grouping [10], [13].
82
Level 2: Data with respect to GLuv.
*The methodology is based on analyzing GLuv in this paper.
*The majority of the existing techniques do not analyze GLuv.
83
[12]I. Bhattacharya and L. Getoor, “Collective Entity Resolution in Relational
Data,” Bull. IEEE CS Technical Committee Data Eng., vol. 29, no. 2, pp. 4-12, June
2006.
Name co-occurrence.
[12] analyzes only co-occurrences of names of authors via publications for a publication data
set.
84
[10] I. Bhattacharya and L. Getoor, “Iterative Record Linkage for Cleaning and Integration,”
Proc. ACM SIGMOD Workshop Research Issues in Data Mining and Knowledge Discovery
(DMKD), 2004.
[11] I. Bhattacharya and L. Getoor, “Relational Clustering for Multi- Type Entity Resolution,”
Proc. Multi-Relational Data Mining Workshop (MRDM), 2005.
[13] I. Bhattacharya and L. Getoor, “A Latent Dirichlet Model for Unsupervised Entity
Resolution,” Proc. SIAM Data Mining Conf. (SDM), 2006.
Name co-occurrence.
When analyzing authors A1 and A5, the approach in [10], [11], and [13] would only be
interested in author A3, which is a co-occurring author in publications P1 and P2, which are
connected to A1 and A5, respectively.
85
[12]I. Bhattacharya and L. Getoor, “Collective Entity Resolution in Relational
Data,” Bull. IEEE CS Technical Committee Data Eng., vol. 29, no. 2, pp. 4-12, June
2006.
Name co-occurrence.
[12] would be interested only in the sub-graph shown in Fig. 5.
The methodology in this paper instead analyzes the whole GLuv.
86
Restrictions on types.
[12] understands only one type of relationship. The approach proposed here can analyze all
of the types of relationships and entities.
87
[26] R. Holzer, B. Malin, and L. Sweeney, “Email Alias Detection Using Social Network
Analysis,” Proc. ACM SIGKDD, 2005.
[31] B. Malin, “Unsupervised Name Disambiguation via Social Network Similarity,” Proc.
Workshop Link Analysis, Counterterrorism, and Security, 2005.
[33] E. Minkov, W. Cohen, and A. Ng, “Contextual Search and Name Disambiguation in
Email Using Graphs,” Proc. SIGIR, 2006.
*[26], [31], and [33] often still analyze just portions of GLuv.
* The adaptive approach in [33] analyzes G2uv, see Fig. 7.
88
[31] B. Malin, “Unsupervised Name Disambiguation via Social Network Similarity,”
Proc. Workshop Link Analysis, Counterterrorism, and Security, 2005.
*[31] simply looks at people and connects them via “are-related” relationships.
89
[26] R. Holzer, B. Malin, and L. Sweeney, “Email Alias Detection Using Social Network
Analysis,” Proc. ACM SIGKDD, 2005.
[27] D.V. Kalashnikov and S. Mehrotra, “Domain-Independent Data Cleaning via Analysis of
Entity-Relationship Graph,” ACM Trans. Database Systems, vol. 31, no. 2, pp. 716-767, June
2006.
[28] D.V. Kalashnikov, S. Mehrotra, and Z. Chen, “Exploiting Relationships for Domain-Independent Data Cleaning,” Proc. SIAM Int’l Conf. Data Mining (SDM ’05), Apr. 2005.
*Level 3: Analysis of GLuv.
*The methodology in this paper is based on analyzing paths in Puv and building mathematical
models for c(u, v).
* Existing work (e.g., [27], [28]) analyzes only the direct neighbors, and [26] analyzes
the shortest u-v path.
90
[10] I. Bhattacharya and L. Getoor, “Iterative Record Linkage for Cleaning and Integration,” Proc. ACM
SIGMOD Workshop Research Issues in Data Mining and Knowledge Discovery (DMKD), 2004.
[11] I. Bhattacharya and L. Getoor, “Relational Clustering for Multi-Type Entity Resolution,” Proc. Multi-Relational Data Mining Workshop (MRDM), 2005.
[27] D.V. Kalashnikov and S. Mehrotra, “Domain-Independent Data Cleaning via Analysis of Entity-Relationship Graph,” ACM Trans. Database Systems, vol. 31, no. 2, pp. 716-767, June 2006.
[28] D.V. Kalashnikov, S. Mehrotra, and Z. Chen, “Exploiting Relationships for Domain-Independent Data
Cleaning,” Proc. SIAM Int’l Conf. Data Mining (SDM ’05), Apr. 2005.
*Level 4 : Way to use c(u, v).
*[10] and [11] employ agglomerative clustering.
*[27], [28], the disambiguation problem is converted into an optimization problem, which is
then solved iteratively.
91
*Level 5: Domain independence.
*Some of the existing techniques are applicable to only certain types of domains or just one
domain.
92
Related Work
WSD (1/3)
• Word Sense Disambiguation :
– determine the exact sense of an ambiguous word
given a list of word senses.
• Word Sense Discrimination :
– determine which instances of the ambiguous
word can be clustered as sharing the same
meaning.
93
Related Work
WSD (2/3)
• External knowledge sources :
– Using lexical knowledge associated with a
dictionary and WordNet.
• Approach :
– supervised
– unsupervised
94
Related Work
WSD (3/3)
• If we view the ambiguous word as a reference
and the word sense as an entity, the two
instances of the WSD problem are similar
to the Lookup and Grouping instances of
Entity Resolution/WePS.
95
Related Work
WePS (1/4)
• WePS can be implemented in two different
settings.
– Server-side setting : the disambiguation
mechanism is integrated into the search-engine
directly.
– Middleware approach : build people search
capabilities on top of an existing search-engine
such as Google by “wrapping” the original engine.
96
Related Work
WePS (2/4)
• Clusty (http://www.clusty.com)
• Grokker (http://www.grokker.com)
• Kartoo (http://www.kartoo.com)
97
Related Work
WePS (3/4)
• ZoomInfo (http://www.zoominfo.com)
98
Related Work
WePS (4/4)
• However, this system has a high cost and low
scalability, because its person information is
collected primarily manually.
• The proposed approach does not rely on any
such pre-compiled knowledge and thus scales
to a search for any person on the Web.
99