Web People Search via Connection Analysis

Authors: Dmitri V. Kalashnikov, Zhaoqi (Stella) Chen, Sharad Mehrotra, and Rabia Nuray-Turan
From: IEEE Trans. on Knowledge and Data Engineering, 2008
Presenter: 陳仲詠
Citations: 21 (Google Scholar)

Outline
• 1. Introduction
• 2. Overview of the approach
• 3. Generating a graph representation
• 4. Disambiguation algorithm
• 5. Interpreting clustering results
• 6. Related work
• 7. Experimental results
• 8. Conclusions and future work

Introduction (1/7)
• Searching for web pages related to a person accounts for more than 5 percent of current Web searches [24].
• A search for a person such as "Andrew McCallum" will return pages relevant to any person with the name Andrew McCallum.

[24] R. Guha and A. Garg, Disambiguating People in Search. Stanford Univ., 2004.

Introduction (2/7)
• Assume (for now) that for each such web page, the search engine could determine which real entity (i.e., which Andrew McCallum) the page refers to.
• This enables clustered person search: the returned results are clustered by associating each cluster with a real person.

Introduction (3/7)
• The user can home in on the cluster of interest to her and get all pages in that cluster.
• For example, only the pages associated with that particular Andrew McCallum.

Introduction (4/7)
• In reality, it is not obvious that this is a better option than searching for people with keyword-based search.
• If the clusters identified by the search engine each corresponded to a single person, then the cluster-based approach would be a good choice.

Introduction (5/7)
• The key issue is the quality of the clustering algorithm in disambiguating the different web pages of namesakes.

Introduction (6/7)
• 1. Develop a novel algorithm for disambiguating among people who have the same name.
• 2. Design a cluster-based people search approach on top of the disambiguation algorithm.

Introduction (7/7)
• The main contributions of this paper are:
– a new approach to Web People Search that shows high-quality clustering,
– a thorough empirical evaluation of the proposed solution (Section 7), and
– a new study of the proposed approach's impact on search (Section 7.3).

Overview of the approach (1/4)
• The processing of a user query consists of the following steps:
• 1. User input: a user submits a query.
• 2. Web page retrieval: retrieve a fixed number (top K) of relevant web pages.

Overview of the approach (2/4)
• 3. Preprocessing:
– TF/IDF computation and noun phrase identification.
– Extraction of named entities (NEs) and Web-related information.
• 4. Graph creation: the entity-relationship (ER) graph is generated from the extracted data.

Overview of the approach (3/4)
• 5. Clustering: the result is a set of clusters of these pages, the aim being to cluster web pages by their association with real persons.

Overview of the approach (4/4)
• 6. Cluster processing:
– Sketches: a set of keywords that represents the web pages within a cluster.
– Cluster ranking.
– Web page ranking.
• 7. Visualization of results.

Generating a graph representation (1/6)
• Extracted from the web pages:
– 1) entities,
– 2) relationships,
– 3) hyperlinks, and
– 4) e-mail addresses.

Generating a graph representation (2/6)
• For example, a person "John Smith" might be extracted from two different web pages (Doc 1 and Doc 2): both mentions map to the same "John Smith" node in the graph, regardless of whether the two pages refer to the same person or to two different people.
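As an illustration of step 4 (graph creation), here is a minimal sketch of building such an ER graph with networkx. The input page format, node naming scheme, and edge-type labels are assumptions made for this sketch, not the authors' implementation.

```python
# A minimal sketch of ER-graph creation (step 4 above), assuming entities,
# hyperlinks, and e-mail addresses have already been extracted per page.
# The input format and edge-type names are illustrative assumptions.
import networkx as nx

def build_er_graph(pages):
    """pages: list of dicts such as
    {"id": "doc1",
     "entities": [("person", "John Smith"), ("org", "UMass")],
     "hyperlinks": ["www.cs.umass.edu/~mccallum/"],
     "emails": ["jsmith@example.org"]}
    """
    G = nx.Graph()
    for page in pages:
        doc = ("doc", page["id"])
        G.add_node(doc, kind="document")
        for etype, name in page.get("entities", []):
            # One node per distinct entity name, so a "John Smith" mention
            # in Doc 1 and one in Doc 2 attach to the same node.
            G.add_edge(doc, (etype, name), type="mentions-" + etype)
        for url in page.get("hyperlinks", []):
            G.add_edge(doc, ("url", url), type="links-to")
        for email in page.get("emails", []):
            G.add_edge(doc, ("email", email), type="has-email")
    return G
```

Two pages that share an entity, hyperlink, or e-mail address become connected through the shared node; this is exactly what the connection analysis below exploits.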
Generating a graph representation (3/6)
[Figure, built up over four slides: the ER graph is constructed step by step from the entities, relationships, hyperlinks, and e-mail addresses extracted from the pages.]

Generating a graph representation (4/6)
• The relationship edges are typed.
• Any hyperlinks and e-mail addresses extracted from the web page are handled in an analogous fashion.

Generating a graph representation (5/6)
• A hyperlink has the form … .d3.d2.d1/p1/p2/…, with domain labels numbered from the right and path segments numbered from the left.
• For example, for the URL www.cs.umass.edu/~mccallum/ we have d3 = cs, d2 = umass, d1 = edu, and p1 = ~mccallum.

Generating a graph representation (6/6)
[Figure: an example of the resulting ER graph.]

Disambiguation algorithm
• 1. Input: the entity-relationship graph.
• 2. Use a Correlation Clustering (CC) algorithm to cluster the pages.
• 3. The outcome is a set of clusters, each corresponding to one person.

Disambiguation algorithm: Correlation Clustering (1/3)
• CC has been applied in the past to group documents of the same topic, among other problems.
• It assumes that there is a similarity function s(u, v) learned on past data.
• Each edge (u, v) is assigned a "+" (similar) or "-" (different) label, according to the similarity function s(u, v).

Disambiguation algorithm: Correlation Clustering (2/3)
• The goal is to find the partition of the graph into clusters that agrees the most with the assigned labels.
• CC does not take k (the number of resulting clusters) as an input parameter.

Disambiguation algorithm: Correlation Clustering (3/3)
• The goal of CC is formulated formally as:
– maximize the agreement, or
– minimize the disagreement.
• The CC problem is known to be NP-hard.

Disambiguation algorithm: Connection Strength (1/6)
• Use the notion of the connection strength c(u, v) between two objects u and v to define the similarity function s(u, v).
• The disambiguation algorithm is based on analyzing:
– object features, and
– the ER graph for the data set.

Disambiguation algorithm: Connection Strength (2/6)
• A path between u and v semantically captures interactions between them via intermediate entities.
• If the combined attraction of all these paths is sufficiently large, the objects are likely to be the same.

Disambiguation algorithm: Connection Strength (3/6)
• Analyzing paths:
– the assumption is that each path between two objects carries in itself a certain degree of attraction.

Disambiguation algorithm: Connection Strength (4/6)
• The attraction between two nodes u and v via paths is measured using the connection strength measure c(u, v), defined as the sum of attractions contributed by each path:
c(u, v) = Σ_{p ∈ Puv} w_p

Disambiguation algorithm: Connection Strength (5/6)
• Puv denotes the set of all L-short simple paths between u and v.
– A path is L-short if its length does not exceed L, and simple if it does not contain duplicate nodes.
• w_p denotes the weight contributed by path p.
– The weight a path p contributes is derived from the type of that path.

Disambiguation algorithm: Connection Strength (6/6)
• Let Puv consist of c1 paths of type 1, c2 paths of type 2, ..., cn paths of type n. The sum over paths then groups by type:
c(u, v) = c1·w1 + c2·w2 + … + cn·wn
where wi is the weight associated with path type i.
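To make the definition concrete, here is a hedged sketch of computing c(u, v): it enumerates the L-short simple paths with networkx and sums per-type weights. Approximating a path's "type" by the tuple of edge types along it, and the type_weights dictionary itself, are assumptions of this sketch; in the paper the weights come out of training (see the similarity-function slides below).

```python
# Sketch of connection strength c(u, v) = sum over L-short simple paths of
# the weight of the path's type. Path "type" is approximated here by the
# tuple of edge types along the path; real weights would come from training.
import networkx as nx

def connection_strength(G, u, v, L, type_weights, default_w=0.0):
    total = 0.0
    # all_simple_paths with cutoff=L yields simple paths of at most L edges,
    # i.e., exactly the L-short simple paths between u and v.
    for path in nx.all_simple_paths(G, u, v, cutoff=L):
        edge_types = tuple(G[a][b]["type"] for a, b in zip(path, path[1:]))
        total += type_weights.get(edge_types, default_w)  # w_p by path type
    return total
```

With the toy graph from the earlier sketch, two documents that mention the same person and the same organization are connected by several short paths, so their c(u, v) grows accordingly.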
Disambiguation algorithm: Similarity Function (1/4)
• The goal is to design a powerful similarity function s(u, v) that minimizes mislabeling of the data.
• Design a flexible function s(u, v) that can automatically tune itself to the particular domain being processed.

Disambiguation algorithm: Similarity Function (2/4)
• The similarity function s(u, v) labels data by comparing the s(u, v) value against a threshold γ.
• Use the δ-band ("clear margin") approach to label the edge (u, v):
– it avoids committing to a "+" or "-" decision when there is not enough evidence for one.

Disambiguation algorithm: Similarity Function (3/4)
• Employ the standard TF/IDF scheme to compute the feature-based similarity f(u, v):
– noun phrases serve as terms (larger terms than single words).
• The entire document corpus consists of K documents and N distinct terms T = {t1, t2, ..., tN}.

Disambiguation algorithm: Similarity Function (4/4)
• Each document u is represented as a weight vector:
u = (wu1, wu2, ..., wuN)
• where wui is the weight of term ti in document u.

Disambiguation algorithm: Training the Similarity Function (1/2)
• For each (u, v) edge, require that:
– s(u, v) ≥ γ + δ if u and v refer to the same person, and
– s(u, v) ≤ γ - δ otherwise.   (5)
• In practice, s(u, v) is unlikely to be perfect, which manifests itself in cases where the inequalities in (5) are violated for some of the (u, v) edges.
• This can be resolved in a similar manner by adding slack to each inequality in (5).

Disambiguation algorithm: Training the Similarity Function (2/2)
• The task becomes solving the linear programming problem (6) to determine the optimal values for the path-type weights w1, w2, ..., wn and the threshold γ.
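The slides reference problem (6) but do not reproduce it; below is one plausible LP formulation under the stated δ-band constraints, solved with scipy.optimize.linprog. The feature layout (per-edge path-type counts), the fixed margin δ, and the nonnegativity of the weights are assumptions of this sketch.

```python
# One plausible LP for training path-type weights w_i and threshold gamma.
# For each training edge we know the per-type path counts c and the label.
# Constraints follow the delta-band requirement (5), with slack xi >= 0:
#   same person:      sum_i c_i w_i >= gamma + delta - xi
#   different person: sum_i c_i w_i <= gamma - delta + xi
# Objective: minimize total slack.
import numpy as np
from scipy.optimize import linprog

def train_weights(counts, labels, delta=0.1):
    counts = np.asarray(counts, dtype=float)   # shape (m edges, n path types)
    m, n = counts.shape
    # Variable vector: [w_1..w_n, gamma, xi_1..xi_m]
    A_ub, b_ub = [], []
    for j in range(m):
        row = np.zeros(n + 1 + m)
        row[n + 1 + j] = -1.0                  # -xi_j
        if labels[j]:                          # same person
            row[:n] = -counts[j]               # -c.w + gamma - xi <= -delta
            row[n] = 1.0
        else:                                  # different person
            row[:n] = counts[j]                # c.w - gamma - xi <= -delta
            row[n] = -1.0
        A_ub.append(row)
        b_ub.append(-delta)
    c_obj = np.concatenate([np.zeros(n + 1), np.ones(m)])  # minimize sum xi
    bounds = [(0, None)] * n + [(None, None)] + [(0, None)] * m
    res = linprog(c_obj, A_ub=np.array(A_ub), b_ub=b_ub, bounds=bounds)
    w, gamma = res.x[:n], res.x[n]
    return w, gamma
```

When the training edges are separable with margin δ, the optimal slack is zero; otherwise the violated edges contribute to the objective, which is what "adding slack to each inequality in (5)" buys.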
Disambiguation algorithm: Choosing the Negative Weight (1/7)
• A CC algorithm will assign an entity u to a cluster if the (weighted) positive edges between u and the other entities in the cluster outweigh the negative edges.
• With equal edge weights, this means the number of positive edges must be more than half (i.e., 50 percent).

Disambiguation algorithm: Choosing the Negative Weight (2/7)
• Suppose that, to keep an entity in a cluster, it should be sufficient to have only 25 percent positive edges.
• Using weight w+ = +1 for all positive edges and w- = -1/3 for all negative edges achieves the desired effect.

Disambiguation algorithm: Choosing the Negative Weight (3/7)
• One solution for choosing a good value for the negative-edge weight w- is to learn it on past data.
• It depends on the number of namesakes n in the top-k web pages:
– if n = 1, choose w- = 0; then all pairs connected via positive edges will be merged.

Disambiguation algorithm: Choosing the Negative Weight (4/7)
– if n = k, it is best to choose w- = -1; this produces maximum negative evidence against merging pairs.
• In general, the weight is a function of n: w- = w-(n).

Disambiguation algorithm: Choosing the Negative Weight (5/7)
• This observation raises two issues:
– 1) n is not known to the algorithm beforehand, and
– 2) how to choose the w-(n) function.

Disambiguation algorithm: Choosing the Negative Weight (6/7)
• Since n is not known, compute an estimate ^n by running the disambiguation algorithm once with a fixed value of w-.
• The algorithm outputs a certain number of clusters ^n, which can be employed as an estimate of n.

Disambiguation algorithm: Choosing the Negative Weight (7/7)
• The value of w-(^n):
– when ^n < threshold, w-(^n) = 0;
– when ^n > threshold, w-(^n) = -1.
• This threshold is learned from the data.

A brief summary
[Figure: summary of the overall approach, from retrieval and extraction through graph creation, clustering, and cluster processing.]
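Putting the pieces together, here is a hedged sketch of one greedy correlation-clustering pass that uses the δ-band labels and the adaptive negative weight w-(^n). The slides specify a CC algorithm but not its pseudocode; this greedy variant is an illustrative stand-in, not the paper's algorithm.

```python
# Greedy correlation-clustering sketch. Edges are labeled "+", "-", or
# abstain via the delta-band rule; each page then joins the cluster with
# the best positive weighted vote (w+ = +1), or starts a new cluster.

def label(s_uv, gamma, delta):
    if s_uv >= gamma + delta:
        return "+"
    if s_uv <= gamma - delta:
        return "-"
    return None                      # not enough evidence: abstain

def greedy_cc(pages, sim, gamma, delta, w_neg=-1/3):
    clusters = []                    # each cluster is a list of pages
    for u in pages:
        best, best_score = None, 0.0
        for cluster in clusters:
            score = 0.0
            for v in cluster:
                lab = label(sim(u, v), gamma, delta)
                if lab == "+":
                    score += 1.0     # w+ = +1
                elif lab == "-":
                    score += w_neg
            if score > best_score:   # positives must outweigh negatives
                best, best_score = cluster, score
        if best is None:
            clusters.append([u])     # new cluster, i.e., a new person
        else:
            best.append(u)
    return clusters

def adaptive_w_neg(n_hat, threshold):
    # w-(^n) as on the slides: 0 below the learned threshold, -1 above it.
    return 0.0 if n_hat < threshold else -1.0
```

Following slides (6/7) and (7/7) above, one would first run greedy_cc with a fixed w_neg to obtain ^n = len(clusters), then rerun it with adaptive_w_neg(^n, threshold).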
Interpreting clustering results (1/4)
• Now describe how these clusters are used to build people search.
• The goal is to provide the user with a set of clusters, each associated with a real person:
– 1. Rank the clusters.
– 2. Provide a summary description with each cluster.

Interpreting clustering results (2/4)
• Cluster rank:
– select the highest-ranked page in the cluster.
• Cluster sketch:
– the set of terms above a certain threshold is selected and used as a summary for the cluster.

Interpreting clustering results (3/4)
• Web page rank:
– within a cluster, pages are displayed according to their original search-engine order.

Interpreting clustering results (4/4)
• Affinity to cluster:
– defined as the sum of the similarity values between the page p and each page v in the cluster C:
aff(p, C) = Σ_{v ∈ C} s(p, v)
• The remaining pages are displayed as well, so the user has the option to get to those web pages too.

Experimental results: Experimental setup (1/8)
• Three data sets:
– 1. WWW 2005 data set [8]: 12 different person names.
– 2. WEPS data set, from the SemEval workshop [3], consisting of:
• Trial: 9 person names.
• Training: 49 person names.
• Test: 30 person names.

[3] J. Artiles, J. Gonzalo, and S. Sekine, "The SemEval-2007 WePS Evaluation: Establishing a Benchmark for the Web People Search Task," Proc. Int'l Workshop Semantic Evaluations (SemEval '07), June 2007.
[8] R. Bekkerman and A. McCallum, "Disambiguating Web Appearances of People in a Social Network," Proc. Int'l World Wide Web Conf. (WWW), 2005.

Experimental results: Experimental setup (2/8)
– 3. Context data set:
• created by issuing nine queries to Google, each in the form of a person name along with context keywords.
• The top 100 returned web pages were gathered for each person.

Experimental results: Experimental setup (3/8)
• To get the ground truth for these data sets, the pages for each person name were assigned to distinct real persons by manual examination.

Experimental results: Experimental setup (4/8)
• The GATE [19] system was used to extract NEs from the web pages in the data sets.
• To train the free parameters of the algorithm, leave-one-out cross-validation was applied on the WWW 2005, WEPS Trial, and Context data sets.

[19] H. Cunningham, D. Maynard, K. Bontcheva, and V. Tablan, "GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications," Proc. Ann. Meeting of the Assoc. Computational Linguistics (ACL), 2002.

Experimental results: Experimental setup (5/8)
• Before the ground truth for the WEPS Test portion was released, the approach was tested on the WEPS Training set by twofold cross-validation.

Experimental results: Experimental setup (6/8)
• After the ground truth of the WEPS Test portion became available, the algorithm was trained on the whole WEPS Training portion and tested on the WEPS Test portion.

Experimental results: Experimental setup (7/8)
• Quality evaluation measures:
– the B-cubed [6] and FP measures.
• Baseline method:
– the agglomerative vector-space clustering algorithm with TF/IDF.
– The threshold parameter for this method is trained the same way.

Experimental results: Experimental setup (8/8)
• Statistical significance test:
– one-tailed paired t-test, with α = 0.05.
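For reference, a small sketch of the B-cubed measure mentioned above: per-page precision and recall are computed from the overlap between a page's predicted cluster and its ground-truth class, then averaged over all pages. This follows the standard B-cubed definition; the variable names are illustrative.

```python
# B-cubed precision/recall sketch: for each page, compare the cluster it
# was placed in (pred) against its ground-truth class (gold), then average
# the per-page scores. F is the harmonic mean of the two.
def b_cubed(pred, gold):
    """pred, gold: dicts mapping page id -> cluster/class label."""
    pages = list(pred)

    def mean_overlap(a, b):
        scores = []
        for p in pages:
            group = [q for q in pages if a[q] == a[p]]    # p's group under a
            same = sum(1 for q in group if b[q] == b[p])  # also together in b
            scores.append(same / len(group))
        return sum(scores) / len(scores)

    precision = mean_overlap(pred, gold)
    recall = mean_overlap(gold, pred)
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f
```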
Testing disambiguation quality: Experiment 1 (disambiguation quality: overall) (1/7)
[Table: overall disambiguation-quality results of the approach variants and the baseline on the three data sets.]

Testing disambiguation quality: Experiment 1 (disambiguation quality: overall) (2/7)
• s(u, v) = c(u, v) represents the variant where only the connection strength is employed for disambiguation.
• It relies only on the extracted NEs and hyperlink information, and it does not use TF/IDF.

Testing disambiguation quality: Experiment 1 (disambiguation quality: overall) (3/7)
• The next variant adds the analysis of web page features f(u, v), in the form of their TF/IDF similarity.

Testing disambiguation quality: Experiment 1 (disambiguation quality: overall) (4/7)
• The full approach picks w- according to the function w-(^n) of the predicted number of namesakes.
• It gains a 7.8 percent improvement in B-cubed over the baseline on WWW 2005.
• It gets a 6.1 percent improvement on WEPS Training and a 10.7 percent improvement on WEPS Test.

Testing disambiguation quality: Experiment 1 (disambiguation quality: overall) (5/7)
• The results are also compared with the top runners in the WEPS challenge [3].
• The first runner in the challenge reports 0.78 for the FP and 0.70 for the B-cubed measure.

Testing disambiguation quality: Experiment 1 (disambiguation quality per namesake) (6/7)
• The "#" field shows the number of namesakes for a particular name in the corresponding 100 web pages.

[4] J. Artiles, J. Gonzalo, and F. Verdejo, "A Testbed for People Searching Strategies in the WWW," Proc. SIGIR, 2005. (citations: 39)

Testing disambiguation quality: Experiment 1 (disambiguation quality per namesake) (7/7)
• The table shows that the proposed approach outperforms that of [4] by 9.5 percent in terms of the FP measure.

Testing disambiguation quality: Experiment 2 (disambiguation quality: group identification)
• Uses the 1,085 web pages of the WWW 2005 data set.
• The task is to find the web pages related to the N intended people.
• The field "#W" in Table 3 is the number of to-be-found web pages related to the namesake of interest.
• The field "#C" is the number of web pages found correctly.
• The field "#I" is the number of pages found incorrectly in the resulting groups.

Testing disambiguation quality: Experiment 3 (disambiguation quality: queries with context)
• Generated a data set by querying Google with a person name plus context keyword(s) related to that person.
• Nine different queries were used.

Testing disambiguation quality: Experiment 4 (quality of generated cluster sketches)
• The set of terms above a certain threshold (or the top N terms) is selected and used as a summary for the cluster.
• If the search is for UMass professor Andrew McCallum, his cluster can easily be identified by terms like "machine learning" and "artificial intelligence."
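A hedged sketch of cluster-sketch generation as described in Experiment 4: aggregate each term's TF/IDF weight over the pages in a cluster and keep the top N terms. Summing the per-page weights is an assumption of this sketch; the slides only say that terms above a threshold, or the top N terms, are selected.

```python
# Cluster-sketch sketch: sum each term's TF/IDF weight across the pages in
# a cluster and keep the top N terms as the cluster's summary. Summing is
# an illustrative aggregation choice.
from collections import Counter

def cluster_sketch(cluster_docs, top_n=10):
    """cluster_docs: list of {term: tfidf_weight} dicts, one per page."""
    totals = Counter()
    for doc in cluster_docs:
        totals.update(doc)            # accumulates the weight of each term
    return [term for term, _ in totals.most_common(top_n)]
```

For the McCallum example, the pages in the professor's cluster would push "machine learning" and "artificial intelligence" to the top of the list.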
Impact on search
• With a traditional search interface, at each observation i, where i = 1, 2, ..., K, the user looks at the sketch provided for the i-th returned web page.

Impact on search
• With the new interface, supported by cluster-based people search, the user first looks at the "people search" interface:
– 1. The user sequentially reads cluster sketches/descriptions until, on the m-th observation, the user finds the cluster of interest.
– 2. The user clicks on that cluster.
– 3. The interface shows the original set of K web pages returned by the search engine.

Impact on search
• Measures:
– compare the quality of the new and standard interfaces using Precision, Recall, and F-measure.
– In general, the fewer observations needed in a given interface, the faster the user can find the related pages.

Experiment 5 (impact on search): Case 1, first-dominant cluster

Observations needed                             Standard   New interface
To discover 50 percent of the relevant pages       44           33
To discover 90 percent of the relevant pages       92           55

Experiment 5 (impact on search): Case 2, regular cluster
• Andrew McCallum the customer-support person; his cluster consists of three pages.

Observations needed                             Standard   New interface
To discover 50 percent of the relevant pages       51           16
To discover 90 percent of the relevant pages       79           17

Experiment 5 (impact on search): Case 3, average
• The average of Recall, Precision, and F-measures for search impact on the WWW 2005 data set.
• Some of the person names have many namesakes.
• The results show that, even with imperfect clustering, the curves for the new interface largely dominate those for the standard interface.

Experiment 5 (impact on search): with context
• When the query already contains context keywords, one might expect to see no difference between the new and the standard interface.
• The query is "Andrew McCallum" music; the number of namesakes for that query is surprisingly large: 23.
• Case A: Andrew McCallum the UMass professor, who is interested in music.
• Case B: Andrew McCallum the DJ/musician.
• In both cases, the new interface performs better than the standard one:

Observations needed                                  Standard   New interface
To discover 90 percent of the professor's pages         90           60
To discover 90 percent of the DJ's pages                90           20

Experiment 6 (efficiency)
• Downloading and preprocessing pages takes 3.82 seconds per web page.
• The clustering algorithm itself executes in 4.7 seconds on average per queried name.

Conclusions and future work
• Attempted to answer the question of what maximum quality the approach can achieve if it uses only the information stored in the top-k web pages being processed.
• Future work:
– 1. Employ external data sources for disambiguation.
– 2. Use more advanced extraction capabilities.
– 3. Work on algorithms for generic entity search, where entities are not limited to people.

Related work
• Disambiguation and entity-resolution techniques are key to any Web people search application.
• The differences between the disambiguation methodology in this paper and most related existing work are multilevel (see Table 1).

• Level 1: Problem type. There are two common types of the disambiguation challenge: (fuzzy) Lookup [27], [28] and (fuzzy) Grouping [10], [13].

• Level 2: Data analyzed with respect to G^L_uv.
– The methodology in this paper is based on analyzing G^L_uv.
– The majority of the existing techniques do not analyze G^L_uv.

• Name co-occurrence: [12] analyzes only co-occurrences of author names via publications in a publication data set.

[12] I. Bhattacharya and L. Getoor, "Collective Entity Resolution in Relational Data," Bull. IEEE CS Technical Committee Data Eng., vol. 29, no. 2, pp. 4-12, June 2006.

• When analyzing authors A1 and A5, the approaches in [10], [11], and [13] would only be interested in author A3, a co-occurring author in publications P1 and P2, which are connected to A1 and A5, respectively.

[10] I. Bhattacharya and L. Getoor, "Iterative Record Linkage for Cleaning and Integration," Proc. ACM SIGMOD Workshop Research Issues in Data Mining and Knowledge Discovery (DMKD), 2004.
[11] I. Bhattacharya and L. Getoor, "Relational Clustering for Multi-Type Entity Resolution," Proc. Multi-Relational Data Mining Workshop (MRDM), 2005.
[13] I. Bhattacharya and L. Getoor, "A Latent Dirichlet Model for Unsupervised Entity Resolution," Proc. SIAM Data Mining Conf. (SDM), 2006.

• [12] would be interested only in the subgraph shown in Fig. 5; the methodology in this paper instead analyzes the whole G^L_uv.

• Restrictions on types: [12] understands only one type of relationship, whereas the approach proposed here can analyze all types of relationships and entities.

• [26], [31], and [33] often still analyze just portions of G^L_uv.
– The adaptive approach in [33] analyzes G^2_uv (see Fig. 7).
– [31] simply looks at people and connects them via "are-related" relationships.

[26] R. Holzer, B. Malin, and L. Sweeney, "Email Alias Detection Using Social Network Analysis," Proc. ACM SIGKDD, 2005.
[31] B. Malin, "Unsupervised Name Disambiguation via Social Network Similarity," Proc. Workshop Link Analysis, Counterterrorism, and Security, 2005.
[33] E. Minkov, W. Cohen, and A. Ng, "Contextual Search and Name Disambiguation in Email Using Graphs," Proc. SIGIR, 2006.

• Level 3: Analysis of G^L_uv.
– The methodology in this paper is based on analyzing the paths in Puv and building mathematical models for c(u, v).
– Existing work (e.g., [27], [28]) analyzes only direct neighbors, and [26] analyzes the shortest u-v path.

[27] D.V. Kalashnikov and S. Mehrotra, "Domain-Independent Data Cleaning via Analysis of Entity-Relationship Graph," ACM Trans. Database Systems, vol. 31, no. 2, pp. 716-767, June 2006.
[28] D.V. Kalashnikov, S. Mehrotra, and Z. Chen, "Exploiting Relationships for Domain-Independent Data Cleaning," Proc. SIAM Int'l Conf. Data Mining (SDM '05), Apr. 2005.

• Level 4: The way c(u, v) is used.
– [10] and [11] employ agglomerative clustering.
– In [27] and [28], the disambiguation problem is converted into an optimization problem, which is then solved iteratively.

• Level 5: Domain independence.
– Some of the existing techniques are applicable only to certain types of domains, or to just one domain.

Related work: WSD (1/3)
• Word sense disambiguation:
– determine the exact sense of an ambiguous word given a list of word senses.
• Word sense discrimination:
– determine which instances of the ambiguous word can be clustered as sharing the same meaning.

Related work: WSD (2/3)
• External knowledge sources:
– lexical knowledge associated with a dictionary or WordNet.
• Approaches:
– supervised,
– unsupervised.

Related work: WSD (3/3)
• If one views the ambiguous word as a reference and the word sense as an entity, the two instances of the WSD problem are analogous to the Lookup and Grouping instances of entity resolution/WePS.

Related work: WePS (1/4)
• WePS can be implemented in two different settings:
– Server-side setting: the disambiguation mechanism is integrated into the search engine directly.
– Middleware approach: build people-search capabilities on top of an existing search engine such as Google by "wrapping" the original engine.
Related work: WePS (2/4)
• Clusty (http://www.clusty.com)
• Grokker (http://www.grokker.com)
• Kartoo (http://www.kartoo.com)

Related work: WePS (3/4)
• ZoomInfo (http://www.zoominfo.com)

Related work: WePS (4/4)
• However, such a system has high cost and low scalability, because the person information in these systems is collected primarily manually.
• The proposed approach does not rely on any such pre-compiled knowledge and thus scales to person search for any person on the Web.