Named Entity Mining
From Click-Through Data
Using Weakly Supervised LDA
Gu Xu¹, Shuang-Hong Yang¹,², Hang Li¹
¹ Microsoft Research Asia, China
² College of Computing, Georgia Tech, USA
Talk Outline
• Named Entity Mining
– Exploiting click-through data
– Applying Latent Dirichlet Allocation
– Developing a weakly supervised learning approach
• Weakly Supervised LDA
• Experimental Results
• Summary
Named Entity Mining
• Named Entity Mining (NEM)
– To mine information about the named entities of a class from a large amount of data.
– Example: mine movie titles from a text collection
– Applications: Web search, etc.
• Three Challenges
– Suitable data source for NEM → Click-through Data
– Ambiguity in classes of named entities → LDA (Topic Model)
– Supervision from human knowledge → Weakly Supervised Learning
Click-through Data
• New data source for NEM
– Over 70% of queries contain named entities.
– Rich context for determining the classes of entities.
[Figure: click-through data, in which each query (Query_1, ...) is linked to its clicked sites (Site_11, Site_12, ...) and their click frequencies (Freq_11, Freq_12, ...)]
• Query context
– [movie] trailer, [game] cheats
• Click context
– imdb.com for movies, gamespot.com for games
– Wisdom-of-crowds
• Very large-scale data that keeps growing
• Frequently updated with emerging named entities
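A minimal sketch of the click-through table described above, assuming the raw log is a list of (query, clicked site) pairs. The function name `aggregate_clicks` and the sample records are hypothetical and only illustrate the query → site → frequency layout.

```python
from collections import defaultdict


def aggregate_clicks(records):
    """Group (query, site) click records into query -> {site: frequency}."""
    table = defaultdict(lambda: defaultdict(int))
    for query, site in records:
        # Normalise case and surrounding whitespace so identical queries merge.
        table[query.strip().lower()][site] += 1
    return {query: dict(sites) for query, sites in table.items()}


# Hypothetical sample log, not taken from the talk's dataset.
records = [
    ("harry potter trailer", "imdb.com"),
    ("Harry Potter trailer", "imdb.com"),
    ("harry potter trailer", "movies.yahoo.com"),
    ("harry potter cheats", "cheats.ign.com"),
]
click_table = aggregate_clicks(records)
```

Each row of `click_table` then carries both kinds of context: the words around the entity in the query, and the sites users clicked.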
Latent Dirichlet Allocation
• Deal with ambiguity in classes of named entities
– Classes of named entities are ambiguous.
• Harry Potter: Book, Movie and Game
– Topic models (LDA)
Example: Harry Potter (classes of named entity as topics)
– harry potter trailer → imdb.com
– harry potter dvd → movies.yahoo.com
– harry potter cheats → cheats.ign.com
– harry potter game → gamespots.com
• Movie
– Query context: # trailer, # dvd, # movie
– Click context: imdb.com, movies.yahoo.com, disney.go.com
• Game
– Query context: # cheats, # walkthrough, # game
– Click context: gamespots.com, cheats.ign.com, gamefaqs.com
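To make "classes as topics" concrete, the sketch below estimates class proportions for one entity from its observed contexts, given fixed per-class context probabilities. This is a simple maximum-likelihood EM fold-in, not the LDA inference used in the talk, and every probability, the smoothing constant and the function name are made-up assumptions.

```python
def infer_class_proportions(contexts, class_context_prob, n_iter=50, smooth=1e-6):
    """Estimate class proportions for one entity by EM, keeping classes fixed."""
    if not contexts:
        raise ValueError("need at least one observed context")
    classes = list(class_context_prob)
    theta = {k: 1.0 / len(classes) for k in classes}
    for _ in range(n_iter):
        expected = {k: 0.0 for k in classes}
        for w in contexts:
            # E-step: responsibility of each class for this context.
            weights = {k: theta[k] * class_context_prob[k].get(w, smooth)
                       for k in classes}
            total = sum(weights.values())
            for k in classes:
                expected[k] += weights[k] / total
        # M-step: proportions are the normalised expected counts.
        theta = {k: expected[k] / len(contexts) for k in classes}
    return theta


# Hypothetical class-specific context distributions (illustrative values only).
class_context_prob = {
    "movie": {"# trailer": 0.3, "# dvd": 0.2, "# movie": 0.2, "imdb.com": 0.3},
    "game": {"# cheats": 0.4, "# game": 0.3, "cheats.ign.com": 0.3},
}
observed = ["# trailer", "# dvd", "# cheats", "imdb.com"]
proportions = infer_class_proportions(observed, class_context_prob)
```

With three movie-like contexts and one game-like context, the entity ends up as a mixture weighted towards the movie class rather than being forced into a single class.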
Weakly Supervised Learning
• Supervise LDA training with examples
– LDA is an unsupervised model.
• Topics in LDA are latent and do not align with predefined semantic classes, like book, movie and game.
– Human labels are inaccurate and partial.
• Binary indicators rather than proportions
• Labels only indicate that a named entity belongs to certain classes, but do not exclude the possibility that it belongs to other classes.
– Weakly-supervised LDA
• Supervise LDA training with partial labels
Weakly Supervised LDA
• Overview
– Inputs: seeds (e.g. Harry Potter) and click-through data
• harry potter book → http://www.amazon.com
• harry potter cheats → http://cheats.ign.com
• harry potter trailer → http://www.imdb.com
• ……
– Create a virtual document for each seed and train WS-LDA
• Virtual document: # book, http://www.amazon.com; # cheats, http://cheats.ign.com; # trailer, http://www.imdb.com; ……
• Obtained from training: contexts and websites
– Find new named entities as well as their classes by using obtained query contexts and clicked websites
• Output: newly discovered entities
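The two steps of the overview can be sketched as follows, assuming the click-through log is a list of (query, URL) pairs and that a context is the query with the entity replaced by "#". The helper names are hypothetical, WS-LDA training itself is omitted, and the learned contexts are simply passed in by hand.

```python
import re

PLACEHOLDER = "#"


def extract_context(query, entity):
    """Return the query with the entity replaced by '#', or None if absent."""
    pattern = r"\b" + re.escape(entity) + r"\b"
    if not re.search(pattern, query):
        return None
    return " ".join(re.sub(pattern, PLACEHOLDER, query, count=1).split())


def build_virtual_document(seed, click_log):
    """Collect (context, url) pairs from every logged query mentioning the seed."""
    doc = []
    for query, url in click_log:
        context = extract_context(query, seed)
        if context is not None and context != PLACEHOLDER:
            doc.append((context, url))
    return doc


def find_candidates(contexts, click_log):
    """Whatever fills the '#' slot of a known context is a candidate entity."""
    candidates = set()
    for context in contexts:
        prefix, suffix = context.split(PLACEHOLDER, 1)
        pattern = re.compile("^" + re.escape(prefix) + "(.+)" + re.escape(suffix) + "$")
        for query, _ in click_log:
            match = pattern.match(query)
            if match:
                candidates.add(match.group(1).strip())
    return candidates


# Hypothetical log; the first three rows mirror the slide's example.
click_log = [
    ("harry potter book", "http://www.amazon.com"),
    ("harry potter cheats", "http://cheats.ign.com"),
    ("harry potter trailer", "http://www.imdb.com"),
    ("iron man trailer", "http://www.imdb.com"),
]
virtual_doc = build_virtual_document("harry potter", click_log)
new_entities = find_candidates(["# trailer"], click_log) - {"harry potter"}
```

In the approach described in the talk, the class of a new entity would come from the trained model; this sketch only shows how candidates surface from shared contexts.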
Weakly Supervised LDA (cont.)
• LDA with two types of virtual words
– w1: Query context
– w2: Click context
Virtual Document
– w1: # book, # cheats, # trailer, ……
– w2: http://www.amazon.com, http://cheats.ign.com, http://www.imdb.com, ……
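One way to represent the two types of virtual words is to keep a separate vocabulary and count vector for each. The sketch below assumes a virtual document is a list of (query context, clicked URL) pairs; the function names and the sample document are illustrative assumptions.

```python
from collections import Counter


def split_virtual_words(virtual_doc):
    """Split a virtual document into w1 (query contexts) and w2 (click contexts)."""
    w1 = Counter(context for context, _ in virtual_doc)
    w2 = Counter(url for _, url in virtual_doc)
    return w1, w2


def to_count_vector(counts, vocabulary):
    """Turn word counts into a vector ordered by the given vocabulary."""
    return [counts.get(term, 0) for term in vocabulary]


# Hypothetical virtual document for the seed "harry potter".
virtual_doc = [
    ("# book", "http://www.amazon.com"),
    ("# cheats", "http://cheats.ign.com"),
    ("# trailer", "http://www.imdb.com"),
    ("# trailer", "http://www.imdb.com"),
]
w1_counts, w2_counts = split_virtual_words(virtual_doc)
w1_vocab = sorted(w1_counts)
w2_vocab = sorted(w2_counts)
w1_vector = to_count_vector(w1_counts, w1_vocab)
w2_vector = to_count_vector(w2_counts, w2_vocab)
```

Keeping the vocabularies apart lets each class have its own distribution over query contexts and its own distribution over websites.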
Weakly Supervised LDA (cont.)
• Introduce Weak Supervision
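– LDA log likelihood + soft constraints
L(w, y) = \log p(w \mid \alpha, \beta) + \lambda \, C(y \mid \alpha, \beta)
• \log p(w \mid \alpha, \beta): LDA probability
• \lambda \, C(y \mid \alpha, \beta): soft constraints
– Soft Constraints
C(y \mid \alpha, \beta) = \sum_i y_i z_i
• z_i: document probability on i-th class
• y_i: document binary label on i-th class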
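A small numeric sketch of the objective: the soft constraint sums the document's class probabilities over the classes its binary label marks, and is added to the LDA log likelihood. The `weight` parameter (a trade-off factor between the two terms), the class order and all numbers are assumptions for illustration.

```python
def soft_constraint(labels, class_probs):
    """C = sum over classes i of y_i * z_i."""
    if len(labels) != len(class_probs):
        raise ValueError("labels and class_probs must have the same length")
    return sum(y * z for y, z in zip(labels, class_probs))


def ws_lda_objective(log_likelihood, labels, class_probs, weight=1.0):
    """LDA log likelihood plus the weighted soft constraint."""
    return log_likelihood + weight * soft_constraint(labels, class_probs)


# Hypothetical document for "harry potter"; classes: (book, movie, game, music).
labels = [1, 1, 1, 0]                # binary label y_i per class
class_probs = [0.2, 0.5, 0.2, 0.1]   # document probability z_i per class
constraint = soft_constraint(labels, class_probs)
objective = ws_lda_objective(-12.0, labels, class_probs, weight=2.0)
```

The constraint grows as probability mass moves onto labelled classes, yet a small mass on an unlabelled class (music here) is only unrewarded, not forbidden, which matches the partial nature of the labels.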
Experimental Results
• Dataset
– Seed named entities
• About 1,000 seeds for each class, and 3,767 unique named entities in total
– Click-through data
• 1.5 billion query-URL pairs, containing 240 million
unique queries and 17 million unique URLs
Experimental Results (cont.)
• Top Contexts and websites
[Tables: top contexts and top websites for the Movie, Game, Book and Music classes]
Experimental Results (cont.)
• Accuracy of Mined Entities
Summary
• Proposed using click-through data as a new data source for NEM
• Employed a topic model to deal with ambiguity in classes of named entities
• Devised weakly supervised LDA for modeling click-through data
– Two types of virtual words
– Introduces weakly supervised learning into LDA
• Experiments on large-scale data verified the effectiveness of the proposed approach
THANKS