Named Entity Mining From Click-Through Data Using Weakly Supervised LDA
Gu Xu (1), Shuang-Hong Yang (1, 2), Hang Li (1)
(1) Microsoft Research Asia, China
(2) College of Computing, Georgia Tech, USA

Talk Outline
• Named Entity Mining
  – Exploiting click-through data
  – Applying Latent Dirichlet Allocation
  – Developing a weakly supervised learning approach
• Weakly Supervised LDA
• Experimental Results
• Summary

Named Entity Mining
• Named Entity Mining (NEM)
  – To mine information about the named entities of a class from a large amount of data.
  – Example: mine movie titles from a textual data collection
  – Applications: Web search, etc.
• Three Challenges (and the approach taken for each)
  – Suitable data source for NEM: Click-through Data
  – Ambiguity in classes of named entities: LDA (Topic Model)
  – Supervision from human knowledge: Weakly Supervised Learning

Click-through Data
• New data source for NEM
  – Over 70% of queries contain named entities.
  – Rich context for determining the classes of entities.
    • Query context
      – [movie] trailer, [game] cheats
    • Click context
      – imdb.com for movies, gamespot.com for games
      – Wisdom-of-crowds
• Very large-scale data that keeps growing
• Frequent updates with emerging named entities
[Figure: click-through data table listing each query (Query_1, ...) with its clicked sites and click frequencies (Site_11 / Freq_11, Site_12 / Freq_12, ...)]

Latent Dirichlet Allocation
• Deal with ambiguity in classes of named entities
  – Classes of named entities are ambiguous.
    • Harry Potter: Book, Movie and Game
  – Topic models (LDA)
• Example: Harry Potter
  – Queries and clicked sites
    • harry potter trailer: imdb.com
    • harry potter dvd: movies.yahoo.com
    • harry potter cheats: cheats.ign.com
    • harry potter game: gamespots.com
• Classes of Named Entity as Topics
  – Movie
    • Query context: # trailer, # dvd, # movie
    • Click context: imdb.com, movies.yahoo.com, disney.go.com
  – Game
    • Query context: # cheats, # walkthrough, # game
    • Click context: gamespots.com, cheats.ign.com, gamefaqs.com

Weakly Supervised Learning
• Supervise LDA training with examples
  – LDA is an unsupervised model.
    • Topics in LDA are latent and do not align with predefined semantic classes such as book, movie, and game.
  – Human labels are inaccurate and partial.
    • Binary indicator rather than proportion
    • Labels only indicate that a named entity belongs to certain classes; they do not exclude the possibility that it belongs to other classes.
  – Weakly supervised LDA
    • Supervise LDA training with partial labels

Weakly Supervised LDA
• Overview
  – Inputs: seeds (e.g., Harry Potter) and click-through data
    • harry potter book, http://www.amazon.com
    • harry potter cheats, http://cheats.ign.com
    • harry potter trailer, http://www.imdb.com
    • ...
  – Create a virtual document for each seed and train WS-LDA
    • Virtual document: # book, http://www.amazon.com; # cheats, http://cheats.ign.com; # trailer, http://www.imdb.com; ...
  – Find new named entities as well as their classes by using the obtained query contexts and clicked websites
    • Outputs: contexts, websites, newly discovered entities

Weakly Supervised LDA (cont.)
• LDA with two types of virtual words
  – w1: Query context
  – w2: Click context
• Virtual document
  – Query contexts: # book, # cheats, # trailer, ...
  – Clicked websites: http://www.amazon.com, http://cheats.ign.com, http://www.imdb.com, ...

Weakly Supervised LDA (cont.)
• Introduce Weak Supervision
  – LDA log likelihood + soft constraints
    $\mathcal{L}(w, y) = \log p(w \mid \Theta) + \lambda\, C(y, \Theta)$
    • First term: LDA probability
    • Second term: soft constraints
  – Soft constraints
    $C(y, \Theta) = \sum_i y_i \bar{z}_i$
    • $y_i$: document binary label on the i-th class
    • $\bar{z}_i$: document probability on the i-th class

Experimental Results
• Dataset
  – Seed named entities
    • About 1,000 seeds for each class, and 3,767 unique named entities in total
  – Click-through data
    • 1.5 billion query-URL pairs, containing 240 million unique queries and 17 million unique URLs

Experimental Results (cont.)
• Top contexts and websites
  [Tables: top contexts and top websites for the Movie, Game, Book, and Music classes]

Experimental Results (cont.)
• Accuracy of Mined Entities

Summary
• Proposed to use click-through data as a new data source for NEM
• Employed topic model to deal with ambiguity in classes of named entities
• Devised weakly supervised LDA for modeling click-through data
  – Two types of virtual words
  – Introduce weakly supervised learning into LDA
• Experiments on large-scale data verified effectiveness of proposed approach

THANKS
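
Supplementary Sketch: Virtual Documents and Soft Constraints

The virtual-document step and the soft-constraint term described in the Weakly Supervised LDA slides can be illustrated roughly as follows. This is a minimal sketch, not the authors' implementation. The function names `build_virtual_documents` and `soft_constraint`, the `(query, url, frequency)` record format, the click frequencies, and the class order are all assumptions made for illustration. The soft constraint is read here as the sum, over classes, of the binary label times the document's probability on that class. WS-LDA training itself is not shown.

```python
from collections import Counter, defaultdict

PLACEHOLDER = "#"  # stands in for the entity inside a query context, e.g. "# trailer"


def build_virtual_documents(click_log, seeds):
    """Group query contexts (w1) and clicked sites (w2) by seed entity.

    click_log: iterable of (query, url, frequency) tuples (assumed format).
    seeds: iterable of lowercase seed entity names.
    """
    docs = defaultdict(lambda: {"w1": Counter(), "w2": Counter()})
    for query, url, freq in click_log:
        padded = f" {query.lower()} "
        for seed in seeds:
            needle = f" {seed} "
            if needle not in padded:
                continue  # whole-word matches only
            # Replace the entity with the placeholder to obtain the query context.
            context = padded.replace(needle, f" {PLACEHOLDER} ", 1).strip()
            docs[seed]["w1"][context] += freq
            docs[seed]["w2"][url] += freq
    return dict(docs)


def soft_constraint(labels, class_probs):
    """Sum over classes of (binary label) * (document probability on that class)."""
    return sum(y * z for y, z in zip(labels, class_probs))


# Illustrative records only; the frequencies are made up.
click_log = [
    ("harry potter trailer", "http://www.imdb.com", 5),
    ("harry potter cheats", "http://cheats.ign.com", 3),
    ("harry potter book", "http://www.amazon.com", 2),
]
docs = build_virtual_documents(click_log, ["harry potter"])
print(docs["harry potter"]["w1"])
print(docs["harry potter"]["w2"])

# Seed labeled as movie and game; class order assumed to be (movie, game, book).
print(soft_constraint([1, 1, 0], [0.5, 0.3, 0.2]))
```

Keeping w1 and w2 in separate counters mirrors the two types of virtual words on the slides. Because a label of 0 only removes a class from the sum, a document is not penalized for putting probability on an unlabeled class, which matches the point that labels are partial.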