Disambiguating Web Appearances of
People in a Social Network
Ron Bekkerman, Andrew McCallum
University of Massachusetts
WWW 2005 (Chiba, Japan)
Abstract



Looking for info. about a particular person
 the namesakes problem
 multiple people who are related in some way
Two unsupervised methods
 One based on the link structure of the Web pages
 Another using Agglomerative/Conglomerative Double
Clustering (A/CDC)
Dataset
 Over 1000 Web pages retrieved from Google queries on 12
personal names appearing together in someone’s email
folder
 Outperform traditional agglomerative clustering by more
than 20%, achieving over 80% F-measure
Introduction


Personalized tools that manage our social network
 Track people we know already
 Tell us about new people we meet
 e.g. when we receive email messages from people, such a tool
can rapidly search for any public facts about them
Public info. about a person
 A useful summary of public info. about a person could be
gathered from the Web: news articles, corporate pages, university
pages, discussion forums, etc.
 But how do we identify whether certain Web pages are about
the person in question (relevant pages) or about a different
person with the same name?
Introduction (cont.)


Example:
 David Mulford, the US Ambassador
 most of the pages retrieved are actually related to the
Ambassador; however, there are also two business
managers, a musician, a student, a scientist, and a few
others
 the task: filter out info. about the other namesakes
Previous Work
 Automatically populating a database of contact info. of
people in a user's social network
 Homepage finding to extract institution, job title, address,
phone, fax, and email; uses simple heuristics for
disambiguating person names, which sometimes fail
Introduction (cont.)

This paper
 Finding all search engine hits about a person, and separating
them from those of namesakes
 Looking beyond homepages
 Presenting two statistical frameworks: one based on linkage
structure, another based on the recently introduced multi-way
distributional clustering method
 Rather than searching for people individually, we leverage an
existing social network of people who are known to be
somewhat connected, and use this extra info. to aid the
disambiguation
Problem statement and related work


Problem statement
 Provide a function f answering whether or not a Web page d
refers to a particular person h, given a model M and
background knowledge K
Background Knowledge
 Perfect background knowledge is unavailable
 K can include training data – pages that are related or
unrelated to the person; obtaining negative instances
could be much more difficult
Problem statement and related work
(cont.)

Related Work
 The problem of disambiguating collections of Web
appearances has been explored surprisingly little (??)
 Homepage finding
 AHOY! (1997)
 TREC homepage finding competition in 2002
 primarily use heuristics and pattern matching
 Cross-document co-reference and name resolution
 All use average-link clustering methods
 Bagga and Baldwin use agglomerative clustering over the
vector space model (VSM)
 Fleischman and Hovy construct a MaxEnt classifier to learn
distances between documents, which are then clustered
Methods - Link Structure Model



Application scenario
 Given a group of people H = {h1,…,hN} who are related to
each other, we would like to identify the Web presence of all
of them simultaneously
Important observation
 Web pages of a group of acquaintances are likely to be
interconnected; the term “interconnectedness” has to be
defined precisely
Construct a model M given the set of Web pages D
 D is constructed by providing a search engine with the queries
th1,…,thN and retrieving the top K hits of each query, yielding
N·K Web pages overall
Methods - Link Structure Model (cont.)

GLS=(V,E): the Link Structure Graph
 The Maximal Connected Component
(MCC) is the core of the model
 Central cluster C0: the largest
connected component that consists
of pages retrieved by more than one query
 The Link Structure Model MLS is a pair (C, δ)
C: {C1,…,CM} (note that C0 ∈ C)
δ: the distance threshold
 Discrimination function f is defined as
$$f(d, h \mid M(K)) = \begin{cases} 1, & \text{if } d \in C_i \text{ such that } \mathrm{dist}(C_i, C_0) \le \delta,\ i = 0..M \\ 0, & \text{otherwise} \end{cases}$$
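To make the decision rule concrete, here is a minimal Python sketch. The function and parameter names are ours, not the paper's; it assumes clusters are sets of page ids and that `dist` and `delta` come from the model:

```python
# Sketch of the link-structure discrimination function f(d, h | M(K)).
# Assumptions: `clusters` is the list C, `central` is the central cluster
# C0, `dist` measures cluster distance, `delta` is the threshold.

def f_link_structure(d, clusters, central, dist, delta):
    """Return 1 if page d lies in a cluster close enough to the central one."""
    for cluster in clusters:
        if d in cluster and dist(cluster, central) <= delta:
            return 1
    return 0
```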
Methods - Link Structure Model (cont.)

Particular design choices (vary from system to system)
 How to decide whether two pages are linked or not
 Directly linked, reachable within three links, or hosted by the
same organization
 How to choose a suitable value of δ
 How to calculate the distance between two clusters C0 and Ci
 Cosine similarity or Kullback-Leibler divergence
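As an illustration of the two candidate distance measures, a minimal sketch over sparse word-weight dictionaries; the smoothing constant `eps` is our own addition, since KL divergence is undefined on zero probabilities:

```python
import math

def cosine_similarity(p, q):
    """Cosine similarity between two sparse word-weight dicts."""
    dot = sum(v * q.get(w, 0.0) for w, v in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0

def kl_divergence(p, q, eps=1e-12):
    """D(p || q) between two word distributions; eps avoids log(0)."""
    return sum(v * math.log(v / (q.get(w, 0.0) + eps))
               for w, v in p.items() if v > 0)
```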
Methods - Link Structure Model (cont.)

In their experiment
 Linked pages
 url(d): the page's domain name together with its first directory name
 links(d): the set of URLs that occur in d
 Trusted URLs, TR(D): {url(di)}\POP, where POP is a list of
highly popular domains
 Link structure, LS(d) = (links(d) ∩ TR(D)) ∪ url(d)
 Two pages d1 and d2 are linked to each other if their link
structures intersect, that is, LS(d1) ∩ LS(d2) ≠ ∅
 Distance threshold
 Set so that one third of the pages in the dataset fall within
the threshold
 Distance measure between clusters
 Cosine similarity with a variation of tfidf term weighting:
$$tfidf(w) = \frac{tf(w)}{\log\, google\_df(w)}$$
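A rough Python sketch of these definitions; the helper names and the guard in `tfidf` are ours, and `trusted` is assumed to be the precomputed set TR(D):

```python
import math
from urllib.parse import urlparse

def url_key(url):
    """url(d): domain name plus first directory name."""
    parsed = urlparse(url)
    parts = parsed.path.strip("/").split("/")
    return parsed.netloc + ("/" + parts[0] if parts and parts[0] else "")

def link_structure(page_url, outgoing_urls, trusted):
    """LS(d) = (links(d) ∩ TR(D)) ∪ url(d)."""
    return ({url_key(u) for u in outgoing_urls} & trusted) | {url_key(page_url)}

def linked(ls1, ls2):
    """Two pages are linked iff their link structures intersect."""
    return bool(ls1 & ls2)

def tfidf(tf, google_df):
    """tfidf(w) = tf(w) / log(google_df(w)); the max() guard is ours."""
    return tf / math.log(max(google_df, 2))
```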
Methods - Agglomerative/Conglomerative
Double Clustering (A/CDC) Model

Clustering Model MCL
 a pair (C,L(.)), where C is the set of clusters of documents
in D, and L(.) is the interconnectedness measure of a
cluster
 Discrimination function f is defined as follows (see the sketch
after this list)
$$f(d, h \mid M(K)) = \begin{cases} 1, & \text{if } d \in C^* \text{ where } C^* = \arg\max_{i=1..M} L(C_i) \\ 0, & \text{otherwise} \end{cases}$$
 Apply the A/CDC algorithm – an instance of the recently
introduced multi-way distributional clustering method
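A minimal sketch of this decision rule, assuming clusters are sets of page ids and `L` is the interconnectedness measure (both names are ours):

```python
def f_acdc(d, clusters, L):
    """Return 1 iff page d belongs to the cluster that maximizes
    the interconnectedness measure L."""
    best = max(clusters, key=L)
    return 1 if d in best else 0
```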
Method - Agglomerative/Conglomerative
Double Clustering (A/CDC) Model (cont.)

Main idea of A/CDC
 Employ the fact that similar documents have similar
distributions over words, while similar words are similarly
distributed over documents
 Starting with one cluster containing all words and many
clusters with one document each, we iteratively split word
clusters and merge document clusters, while conditioning
one clustering system on the other, until meaningful clusters
are obtained
 Multi-way distributional clustering stands in close
correspondence with the Multivariate Information Bottleneck
(MIB) method
Method - Agglomerative/Conglomerative
Double Clustering (A/CDC) Model (cont.)

Background
 Information Bottleneck (IB)
 a convenient information-theoretic framework for solving
clustering and information retrieval problems
 The main idea behind IB clustering is to construct an
assignment of data points X into clusters X̃ that maximizes
info. about entities Y that are interdependent with X
 The info. about Y gained from X̃ is
$$I(\tilde{X}; Y) = \sum_{\tilde{X}, Y} P(\tilde{X}, Y)\,\log \frac{P(\tilde{X}, Y)}{P(\tilde{X})\,P(Y)}$$
 Adding a compression constraint, the IB objective is stated as
$$\arg\max_{\tilde{X}} \; I(\tilde{X}; Y) - \beta\, I(\tilde{X}; X)$$
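For concreteness, a small Python sketch computing the mutual information term from joint counts; to score a clustering I(X̃;Y), aggregate the (x, y) co-occurrence counts by cluster first (the dict-of-pair-counts representation is our choice):

```python
import math
from collections import defaultdict

def mutual_information(joint_counts):
    """I(X;Y) from {(x, y): count}; probabilities are normalized counts."""
    total = sum(joint_counts.values())
    p_x, p_y = defaultdict(float), defaultdict(float)
    for (x, y), c in joint_counts.items():
        p_x[x] += c / total
        p_y[y] += c / total
    return sum((c / total) * math.log((c / total) / (p_x[x] * p_y[y]))
               for (x, y), c in joint_counts.items() if c > 0)
```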
Method - Agglomerative/Conglomerative
Double Clustering (A/CDC) Model (cont.)

Background
 Slonim and Tishby proposed a greedy agglomerative
algorithm for document clustering based on IB, where X
stands for documents and Y stands for the words in them,
and introduced the double clustering technique
 Friedman et al. propose the Multivariate Information
Bottleneck (MIB) framework: they consider clustering
instances of a set of variables X=(X1,…,Xn) into a set of
clustering systems X̃=(X̃1,…,X̃n)
 The double clustering problem thus becomes a special case
of MIB and can be derived as
$$\arg\max_{\tilde{X}, \tilde{Y}} \; I(\tilde{X}; \tilde{Y}) - \beta\left( I(\tilde{X}; X) + I(\tilde{Y}; Y) \right)$$
Method - Agglomerative/Conglomerative
Double Clustering (A/CDC) Model (cont.)

Motivation
 Setting the Lagrange multiplier β to zero, the double clustering
objective is then derived as
$$\arg\max_{\tilde{X}, \tilde{Y}} \; I(\tilde{X}; \tilde{Y}), \quad \text{subject to } |\tilde{X}| = N_{\tilde{X}},\ |\tilde{Y}| = N_{\tilde{Y}}$$
 Explore different possibilities while employing the
hierarchical structure of the clusters: agglomerative (bottom-up)
and conglomerative (top-down) clustering
 Two top-down schemes: bad, leads to a completely random split
 Two bottom-up schemes: bad, because of computational issues
 One top-down, one bottom-up: feasible, the two systems can
“bootstrap” each other
Method - Agglomerative/Conglomerative
Double Clustering (A/CDC) Model (cont.)

Motivation
 Use one top-down and one bottom-up scheme
 A/CDC is the simultaneous clustering of X by a top-down
scheme and Y by a bottom-up scheme
Method - Agglomerative/Conglomerative
Double Clustering (A/CDC) Model (cont.)

Overview of algorithm
 Break the objective function into two parts
$$\arg\max_{\tilde{X}} I(\tilde{X}; \tilde{Y}), \qquad \arg\max_{\tilde{Y}} I(\tilde{X}; \tilde{Y})$$
 Initialize the two clustering systems with one cluster x̃ that
contains all data points x, and one data point yi per cluster ỹi,
and calculate the initial mutual information I(X̃; Ỹ); at each
iteration of the algorithm, perform four operations
 Split step: randomly split a cluster x̃i into two equally sized clusters
 Sequential pass: reassign each data point xj to the cluster that
maximizes I(X̃; Ỹ)
 Merge step: randomly select a cluster ỹi and find its best
mate while applying a criterion for minimizing Bayes
classification error
 Another sequential pass: the same sequential pass as step 2
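A rough, self-contained sketch of one such iteration. Cluster assignments are dicts mapping items to integer cluster ids, `score()` is a caller-supplied closure evaluating I(X̃;Ỹ) for the current pair of systems, and the plain score-maximizing merge stands in for the paper's Bayes-error criterion; all names and simplifications are ours:

```python
import random

def sequential_pass(assign, cluster_ids, score):
    """Greedily reassign each point to the cluster id maximizing score()."""
    for x in list(assign):
        trials = {}
        for c in cluster_ids:
            assign[x] = c
            trials[c] = score()
        assign[x] = max(trials, key=trials.get)

def acdc_iteration(x_assign, y_assign, score):
    """One iteration of the four operations (a sketch, not the authors'
    exact procedure): split in the top-down X system, repair it, merge
    in the bottom-up Y system, repair it."""
    # 1. Split step: move a random half of one X cluster to a new id.
    x_ids = sorted(set(x_assign.values()))
    victim = random.choice(x_ids)
    members = [x for x, c in x_assign.items() if c == victim]
    random.shuffle(members)
    new_id = max(x_ids) + 1
    for x in members[: len(members) // 2]:
        x_assign[x] = new_id
    # 2. Sequential pass over the X system.
    sequential_pass(x_assign, x_ids + [new_id], score)
    # 3. Merge step: fold one random Y cluster into its best mate
    #    (score maximization here, standing in for the Bayes-error
    #    criterion used by the authors).
    y_ids = sorted(set(y_assign.values()))
    if len(y_ids) < 2:
        return
    a = random.choice(y_ids)
    moved = [y for y, c in y_assign.items() if c == a]
    best, best_val = None, None
    for b in y_ids:
        if b == a:
            continue
        for y in moved:
            y_assign[y] = b
        val = score()
        if best_val is None or val > best_val:
            best, best_val = b, val
    for y in moved:
        y_assign[y] = best
    # 4. Another sequential pass, this time over the Y system.
    sequential_pass(y_assign, [c for c in y_ids if c != a], score)
```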
Method - Agglomerative/Conglomerative
Double Clustering (A/CDC) Model (cont.)

Overview of algorithm
 In order to reach the global maximum, perform a number of
random restarts of steps 1-2 and then of steps 3-4
 Computational complexity is O(Nx · Ny · log Ny)
 In this case, we use the top-down scheme for clustering
words and the bottom-up scheme for clustering documents
 Continue the process until we have three document clusters
(one of which is then chosen to be the class of relevant
pages)
Method - LS+A/CDC Hybrid Model

Hybrid model
 Overlap the groups built by the two methods
 Compose a new central cluster C0* by uniting all the
connected components that overlap with the A/CDC cluster C*
$$C_0^* = \bigcup_{C_i \cap C^* \neq \emptyset,\; i = 0..M} C_i$$
 Discrimination function f is defined as
$$f(d, h \mid M(K)) = \begin{cases} 1, & \text{if } d \in C_i \text{ such that } \mathrm{dist}(C_i, C_0^*) \le \delta,\ i = 0..M \\ 0, & \text{otherwise} \end{cases}$$
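A minimal sketch of composing the hybrid central cluster, assuming the connected components and the A/CDC cluster are represented as sets of page ids:

```python
def hybrid_central_cluster(components, c_star):
    """C0* = union of all connected components overlapping C*."""
    return set().union(*(comp for comp in components if comp & c_star))
```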
Dataset


1085 Web pages
 12 person names were extracted from Melinda Gervasio's
email directory and issued as queries to Google; for each
query the first 100 hits were retrieved
The dataset is publicly
available
Results and Discussion


Baseline model
 greedy agglomerative clustering based on cosine similarity
between clusters and tfidf weighting (a rough sketch follows
this list)
A/CDC method
 The relatively high deviation in precision and recall is
caused by the fact that it never ends up with clusters of
exactly the same size
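As a point of reference, a rough approximation of such a baseline using off-the-shelf components; note this uses sklearn's standard tfidf rather than the paper's Google-based variant:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.cluster.hierarchy import linkage, fcluster

def baseline_clusters(documents, n_clusters=3):
    """Average-link agglomerative clustering of tfidf vectors
    under cosine distance."""
    vectors = TfidfVectorizer().fit_transform(documents).toarray()
    tree = linkage(vectors, method="average", metric="cosine")
    return fcluster(tree, t=n_clusters, criterion="maxclust")
```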
Results and Discussion (cont.)


Bill Mark and Fernando
Pereira are difficult cases
because both of them have
the “double” problem (more
than one prominent namesake)
Steve Hardt and Adam
Cheyer: the worst recall
 Adam's name often
appears in an industrial
context
 Steve: most of his pages
refer to an online game
he created
Results and Discussion (cont.)




This result shows that the
algorithm stopped with 3, 5,
9 or 17 clusters
With 5 clusters, the
“doubles” can be handled
within the A/CDC framework
Constructing clustering
systems with all possible
granularity levels is an
important feature of the
A/CDC algorithm
It can also solve “homepage
finding”
Conclusions and Future Work





The first attempt to approach the problem of finding Web
appearances of a group of people
Proposed two methods, both purely unsupervised – they involve a
minimum of prior knowledge about people
Built a large annotated dataset that is publicly available
Working on more sophisticated probabilistic models for solving
this problem that would capture the relational structure
Web appearance disambiguation is novel and poses a lot of
exciting challenges