Context Analysis in Text Mining and Search
Qiaozhu Mei
Department of Computer Science
University of Illinois at Urbana-Champaign
http://sifaka.cs.uiuc.edu/~qmei2, [email protected]
Joint work with ChengXiang Zhai
Motivating Example: Personalized Search

Query: “MSR”
Metropolis Street Racer
Magnetic Stripe Reader
Molten Salt Reactor
Mars Sample Return
Mountain Safety Research
…
Actually looking for: Microsoft Research…
Motivating Example: Comparing Product Reviews

IBM laptop reviews vs. APPLE laptop reviews vs. DELL laptop reviews:

Common Themes   “IBM” specific      “APPLE” specific     “DELL” specific
Battery Life    Long, 3-4 hrs       Medium, 2-3 hrs      Short, 1-2 hrs
Hard disk       Large, 80-100 GB    Small, 5-10 GB       Medium, 20-50 GB
Speed           Slow, 100-200 MHz   Very Fast, 3-4 GHz   Moderate, 1-2 GHz
Unsupervised discovery of common topics and their variations
Motivating Example: Discovering Topical Trends in Literature

[Figure: topic strength over time (1980-2003) for SIGIR topics such as TF-IDF Retrieval, Language Model, IR Applications, and Text Categorization.]
Unsupervised discovery of topics and their temporal variations
Motivating Example: Analyzing Spatial Topic Patterns

• How do bloggers in different states respond to topics such as “oil price increase during Hurricane Katrina”?
• Unsupervised discovery of topics and their variations in different locations
Motivating Example: Summarizing Sentiments

Topic-sentiment summary (query: Dell Laptops):

Facet 1 (Price)
  Positive: “it is the best site and they show Dell coupon code as early as possible”
  Negative: “Even though Dell's price is cheaper, we still don't want it.”
  Neutral: “mac pro vs. dell precision: a price comparis..”; “DELL is trading at $24.66”
Facet 2 (Battery)
  Positive: “One thing I really like about this Dell battery is the Express Charge feature.”
  Negative: “my Dell battery sucks”; “Stupid Dell laptop battery”; ……
  Neutral: “i still want a free battery from dell..”; ……

Topic-sentiment dynamics (topic = Price): strength of each sentiment over time.

Unsupervised/semi-supervised discovery of topics and different sentiments of the topics
Motivating Example: Analyzing Topics on a Social Network

[Figure: a coauthor network with authors such as Bruce Croft and Gerard Salton, their publications, and topics such as information retrieval, machine learning, and data mining.]

Unsupervised discovery of topics and correlated research communities
Research Questions
• What do these problems have in common?
• Can we model all these problems generally?
• Can we solve these problems with a unified
approach?
• How can we bring humans into the loop?
Rest of Talk
• Background: Language Models in Text Mining and
Retrieval
• Definition of context
• General methodology to model context
– Models, example applications, results
• Conclusion and Discussion
Generative Models of Text
• Text as observations: words, tags, links, etc.
• Use a unified probabilistic model to explain the
appearance (generation) of observations
• Documents are generated by sampling every
observation from such a generative model
• Different generation assumptions → different models
– Document Language Models
– Probabilistic Topic Models: PLSA, LDA, etc.
– Hidden Markov Models …
Multinomial Language Models

A multinomial distribution over words as a text representation:

retrieval    0.2
information  0.15
model        0.08
query        0.07
language     0.06
feedback     0.03
……
Known as a topic model when there are k of them in text (e.g., topics such as semi-supervised learning, boosting, spectral clustering, etc.).
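
For illustration (my own addition, not from the talk): a minimal Python sketch of estimating such a multinomial language model by maximum likelihood; the whitespace tokenizer and the example string are placeholders.

    from collections import Counter

    def multinomial_lm(text):
        # Maximum-likelihood estimate: p(w) = count(w) / total count
        tokens = text.lower().split()          # placeholder tokenizer
        counts = Counter(tokens)
        total = sum(counts.values())
        return {w: c / total for w, c in counts.items()}

    lm = multinomial_lm("information retrieval model query retrieval language feedback retrieval")
    # e.g., lm["retrieval"] == 3/8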
Language Models in Information Retrieval (e.g., the KL-Divergence Method)

Document d: a text mining paper (100 words)

Doc language model (LM) θd: p(w|d)
  text       4/100 = 0.04
  mining     3/100 = 0.03
  clustering 1/100 = 0.01
  …
  data = 0
  computing = 0
  …

Smoothed doc LM θd': p(w|d')
  text = 0.039
  mining = 0.028
  clustering = 0.01
  …
  data = 0.001
  computing = 0.0005
  …

Query q: “data mining”
Query LM θq: p(w|q): data 1/2 = 0.5; mining 1/2 = 0.5
(Expanded query model p(w|q'): data 0.4; mining 0.4; clustering 0.1; …)

Similarity function (negative KL divergence):

-D(θ_q || θ_d) = -Σ_{w∈V} p(w|θ_q) · log[ p(w|θ_q) / p(w|θ_d) ]
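
A minimal sketch of this ranking scheme (illustrative, not the exact system from the talk): Dirichlet-smoothed document models scored by negative KL divergence from the query model; the collection model and the parameter mu below are assumed values.

    import math
    from collections import Counter

    def smoothed_doc_lm(doc_tokens, collection_lm, mu=2000):
        # Dirichlet smoothing: p(w|d') = (c(w,d) + mu * p(w|C)) / (|d| + mu)
        counts = Counter(doc_tokens)
        n = len(doc_tokens)
        return lambda w: (counts[w] + mu * collection_lm.get(w, 1e-9)) / (n + mu)

    def kl_score(query_lm, doc_lm):
        # Rank by -D(theta_q || theta_d); the sum of p(w|q) log p(w|q) is
        # document-independent, so ranking reduces to sum_w p(w|q) log p(w|d')
        return sum(p * math.log(doc_lm(w)) for w, p in query_lm.items())

    # Toy usage: score a document for the query "data mining"
    collection_lm = {"data": 0.01, "mining": 0.005, "text": 0.02, "clustering": 0.004}
    doc = "text mining paper about text clustering".split()
    query_lm = {"data": 0.5, "mining": 0.5}
    print(kl_score(query_lm, smoothed_doc_lm(doc, collection_lm)))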
Probabilistic Topic Models for Text Mining

Text collections → probabilistic topic modeling → topic models (multinomial distributions)

Models: PLSA [Hofmann 99], LDA [Blei et al. 03], Author-Topic [Steyvers et al. 04], Pachinko allocation [Li & McCallum 06], CPLSA [Mei & Zhai 06], CTM [Blei et al. 06], …

Example topics:
  term 0.16, relevance 0.08, weight 0.07, feedback 0.04, independ. 0.03, model 0.03, …
  web 0.21, search 0.10, link 0.08, graph 0.05, …

Applications: subtopic discovery, topical pattern analysis, summarization, opinion comparison, passage segmentation, …
Importance of Context
• Science in the year 2000 and Science in the year 1500: are we still working on the same topics?
• For a computer scientist and a gardener: does “tree, root, prune” mean the same?
• “Football” means soccer in Europe. What about in the US?
Context affects topics!
Context Features of Text (Meta-data)

Example: a weblog article comes with context features such as its author, the author's occupation, time, location, source, and communities.
Context = Partitioning of Text

Examples: papers written in 1998 (vs. 1999, ……, 2005, 2006); papers about the Web; papers written by authors in the US; papers published in WWW, SIGIR, ACL, KDD, SIGMOD, …
Rich Context Information in Text
• News articles: time, publisher, etc.
• Blogs: time, location, author, …
• Scientific Literature: author, publication year, conference,
citations, …
• Query Logs: time, IP address, user, clicks, …
• Customer reviews: product, source, time, sentiments, …
• Emails: sender, receiver, time, thread, …
• Web pages: domain, time, click rate, etc.
• More? entity-relations, social networks, ……
Categories of Context
• Some partitions of text are explicit → explicit context
– Time; location; author; conference; user; IP; etc.
– Similar to metadata
• Some partitions are implicit → implicit context
– Sentiments; missions; goals; intents; etc.
• Some partitions are at the document level
• Some are at a finer granularity
– Context of a word; an entity; a pattern; a query; etc.
– Sentences; sliding windows; adjacent words; etc.
Context Analysis
• Use context to infer semantics
– Annotating frequent patterns; labeling of topic models
• Use context to provide targeted service
– Personalized search; intent-based search; etc.
• Compare contextual patterns of topics
– Evolutionary topic patterns; spatiotemporal topic
patterns; topic-sentiment patterns; etc.
• Use context to help other tasks
– Social network analysis; impact summarization; etc.
General Methodology to Model Context
• Context → generative model
– Observations in the same context are generated with
a unified model
– Observations in different contexts are generated
with different models
– Observations in similar contexts are generated with
similar models
• Text is generated with a mixture of such generative
models
– Example Task; Model; Sample results
Model a unique context with a unified model (Generation)
Probabilistic Latent Semantic Analysis (Hofmann ’99)

Topics θ1 … θk, each a multinomial word distribution p(w|θj), e.g., for documents about “Hurricane Katrina”:
  θ1 (government): government 0.3, response 0.2, …
  θ2 (donation): donate 0.1, relief 0.05, help 0.02, …
  θ3 (New Orleans): city 0.2, new 0.1, orleans 0.05, …

A document d: “Criticism of government response to the hurricane primarily consisted of criticism of its response to … The total shut-in oil production from the Gulf of Mexico … approximately 24% of the annual production and the shut-in gas production … Over seventy countries pledged monetary donations or other assistance. …”

Generation of each word in d: choose a topic θk according to πd = p(θk|d), then draw a word from p(w|θk).

Plate notation: πd → Z_{d,n} → W_{d,n}, with topics θk (k = 1…K) shared across the D documents of N words each.

p(d, w_{d,n}) = p(d) Σ_k p(w_{d,n} | z = k, θ_k) p(z = k | d)
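
A minimal sketch of fitting PLSA with EM on a term-document count matrix (my own illustration; array shapes and variable names are assumptions, not the talk's code):

    import numpy as np

    def plsa(counts, K, iters=100, seed=0):
        # counts: (D, V) ndarray of term counts; returns p(z|d) and p(w|z)
        rng = np.random.default_rng(seed)
        D, V = counts.shape
        p_z_d = rng.random((D, K)); p_z_d /= p_z_d.sum(1, keepdims=True)  # pi_d = p(theta_k|d)
        p_w_z = rng.random((K, V)); p_w_z /= p_w_z.sum(1, keepdims=True)  # p(w|theta_k)
        for _ in range(iters):
            # E-step: responsibility p(z=k | d, w) proportional to p(z=k|d) p(w|z=k)
            resp = p_z_d[:, :, None] * p_w_z[None, :, :]      # (D, K, V)
            resp /= resp.sum(axis=1, keepdims=True) + 1e-12
            # M-step: re-estimate both distributions from expected counts
            expected = counts[:, None, :] * resp              # (D, K, V)
            p_z_d = expected.sum(axis=2)
            p_z_d /= p_z_d.sum(axis=1, keepdims=True)
            p_w_z = expected.sum(axis=0)
            p_w_z /= p_w_z.sum(axis=1, keepdims=True)
        return p_z_d, p_w_z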
Example: Topics in Science (D. Blei 05)
Label a Multinomial Topic Model

A topic discovered from SIGIR papers (Mei and Zhai 06):
  term 0.1599, relevance 0.0752, weight 0.0660, feedback 0.0372, independence 0.0311, model 0.0310, frequent 0.0233, probabilistic 0.0188, document 0.0173, …

Candidate labels: “Retrieval models”? “iPod Nano”? “じょうほうけんさく” (Japanese for “information retrieval”)? “Pseudo-feedback”? “Information Retrieval”?

A good label should be:
• Semantically close (relevant)
• Understandable (e.g., phrases)
• High coverage inside the topic
• Discriminative across topics
Automatic Labeling of Topics

Topic to label: term 0.16, relevance 0.07, weight 0.07, feedback 0.04, independence 0.03, model 0.03, …

1. Candidate label pool: extract candidate phrases from the collection (e.g., SIGIR) with an NLP chunker and n-gram statistics: information retrieval, retrieval model, index structure, relevance feedback, …

2. Relevance score: information retrieval 0.26; retrieval models 0.19; IR models 0.17; pseudo feedback 0.06; ……

3. Discrimination against other topics (e.g., filtering 0.21, collaborative 0.15, …; trec 0.18, evaluation 0.10, …): information retrieval 0.26 → 0.01; retrieval models 0.20; IR models 0.18; pseudo feedback 0.09; ……

4. Coverage: retrieval models 0.20; IR models 0.18 → 0.02; pseudo feedback 0.09; ……; information retrieval 0.01
Label Relevance: Context Comparison

• Intuition: expect a good label to have a context (distribution) similar to the topic's.

Example: a topic θ with high p(w|θ) on clustering, dimension, partition, algorithm, key, hash, … A good label is l1 = “clustering algorithm”; a bad label is l2 = “hash join”, whose contexts (“…hash join … code …hash table …search …hash join… map key…hash …algorithm…key …hash…key table…join…”) differ from the topic.

Score a label by comparing p(w|θ) with the label's context distribution, e.g., p(w | clustering algorithm) vs. p(w | hash join):

Score(l, θ) = -D(θ || l) ≈ Σ_w p(w|θ) · PMI(w, l | C)
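
A sketch of this scoring idea (my own simplification; the talk's actual estimator may differ): estimate PMI(w, l | C) from document-level co-occurrence counts in the collection C and take its expectation under p(w|θ).

    import math
    from collections import Counter

    def label_score(topic, label, docs, eps=1e-12):
        # Score(l, theta) ~= sum_w p(w|theta) * PMI(w, l | C), with PMI estimated
        # from document-level co-occurrence in the collection C (docs)
        N = len(docs)
        df_w = Counter()     # number of docs containing word w
        df_l = 0             # number of docs containing all words of the label
        df_wl = Counter()    # number of docs containing both w and the label
        for doc in docs:
            words = set(doc.lower().split())
            has_label = all(t in words for t in label.lower().split())
            df_l += has_label
            for w in words:
                df_w[w] += 1
                if has_label:
                    df_wl[w] += 1
        score = 0.0
        for w, p in topic.items():           # topic: dict word -> p(w|theta)
            pmi = math.log((df_wl[w] / N + eps) / ((df_w[w] / N) * (df_l / N) + eps))
            score += p * pmi
        return score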
Results: Sample Topic Labels

Topic: clustering 0.02, time 0.01, clusters 0.01, databases 0.01, large 0.01, performance 0.01, quality 0.005, … (plus background words: the, of, a, and, to, data, each > 0.02)
  Labels: “clustering algorithm”, “clustering structure” (vs. weaker candidates such as “large data”, “data quality”, “high data”, “data application”, …)

Topic: north 0.02, case 0.01, trial 0.01, iran 0.01, documents 0.01, walsh 0.009, reagan 0.009, charges 0.007, …
  Label: “iran contra”

Topic: tree 0.09, trees 0.08, spatial 0.08, b 0.05, r 0.04, disk 0.02, array 0.01, cache 0.01, …
  Labels: “r tree”, “b tree”, …, “indexing methods”
Model different contexts with different models (Discrimination, Comparison)
Example: Finding Evolutionary Patterns of Topics

Content variations of a topic over time contexts (KDD papers, 1999-2004):

1999: web 0.009, classification 0.007, features 0.006, topic 0.005, …
2000: SVM 0.007, criteria 0.007, classification 0.006, linear 0.005, …
2001: decision 0.006, tree 0.006, classifier 0.005, class 0.005, Bayes 0.005, …
2002: mixture 0.005, random 0.006, cluster 0.006, clustering 0.005, variables 0.005, …
2003: classification 0.015, text 0.013, unlabeled 0.012, document 0.008, labeled 0.008, learning 0.007, …
2004: topic 0.010, mixture 0.008, LDA 0.006, semantic 0.005, …; information 0.012, web 0.010, social 0.008, retrieval 0.007, distance 0.005, networks 0.004, …
Example: Finding Evolutionary Patterns of Topics (II)

[Figure from (Mei ’05): normalized theme strength over time (1999-2004) for themes including Biology Data, Web Information, Time Series, Classification, Association Rule, Clustering, and Business.]

Strength variations over contexts
View of Topics: Context-Specific Versions of Views

One context → one view; a document selects from a mix of views.

Topic 1 (Retrieval Model): retrieve, model, relevance, document, query, …
  Context 2 view (1977 ~ 1998, i.e., before “Language Modeling”): vector, space, TF-IDF, LSI, retrieval, Rocchio, weighting, term, …
  Context 1 view (1998 ~ 2006, e.g., after “Language Modeling”): language, model, smoothing, query, estimate, EM, …

Topic 2 (Feedback):
  Context 2 view: feedback, Okapi, judge, expansion, pseudo, query, …
  Context 1 view: mixture, feedback, pseudo, generation, …
Coverage of Topics: Distribution over Topics

• A coverage of topics: a (strength) distribution over the topics, e.g., over {Oil Price, Government Response, Aid and Donation, Background}.
• One context → one coverage.
• A document selects from a mix of multiple coverages.

Example document: “Criticism of government response to the hurricane primarily consisted of criticism of its response to … The total shut-in oil production from the Gulf of Mexico … approximately 24% of the annual production and the shut-in gas production … Over seventy countries pledged monetary donations or other assistance. …”

Context: Texas → one coverage over the four topics; Context: Louisiana → a different coverage.
A General Solution: CPLSA

• CPLSA = Contextual Probabilistic Latent Semantic Analysis
• An extension of the PLSA model ([Hofmann 99]) by
– Introducing context variables
– Modeling views of topics
– Modeling coverage variations of topics
• Process of contextual text mining
– Instantiation of CPLSA (context, views, coverage)
– Fit the model to text data (EM algorithm)
– Compare a topic from different views
– Compute strength dynamics of topics from coverages
– Compute other probabilistic topic patterns
The “Generation” Process

Document context: Time = July 2005; Location = Texas; Author = Eric Brill; Occupation = Sociologist; Age = 45+

Topics (each with several context-specific views, View1/View2/View3):
  government: government 0.3, response 0.2, …
  donation: donate 0.1, relief 0.05, help 0.02, …
  New Orleans: city 0.2, new 0.1, orleans 0.05, …

Topic coverages, one per context (e.g., Texas; July 2005; sociologist; ……): each a distribution over topics 1-4.

To generate each word of the document: choose a view; choose a coverage; choose a theme according to the coverage; draw a word from the theme.

Document: “Criticism of government response to the hurricane primarily consisted of criticism of its response to … The total shut-in oil production from the Gulf of Mexico … approximately 24% of the annual production and the shut-in gas production … Over seventy countries pledged monetary donations or other assistance. …”
An Intuitive Example

• Two topics: web search; machine learning.
• I am writing a WWW paper, so I will cover more about “web search” than about “machine learning” (a coverage over topics).
– But of course I have my own taste.
• I am from a search engine company, so when I write about “web search”, I will focus on “search engine” and “online advertisements” (a view of the topic).
The Probabilistic Model

• A probabilistic model explaining the generation of a document D and its context features C: if an author wants to write such a document, he will
– Choose a view v_i according to the view distribution p(v_i | D, C)
– Choose a coverage κ_j according to the coverage distribution p(κ_j | D, C)
– Choose a theme θ_il according to the coverage κ_j
– Generate a word using θ_il
• The likelihood of the document collection is:

log p(D) = Σ_{(D,C)∈D} Σ_{w∈V} c(w, D) log( Σ_{i=1}^{m} p(v_i | D, C) Σ_{j=1}^{n} p(κ_j | D, C) Σ_{l=1}^{k} p(l | κ_j) p(w | θ_il) )
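
To make the generation story concrete, here is a small illustrative sampler for one word under these assumptions (all distributions below are toy placeholders, not fitted values):

    import numpy as np

    rng = np.random.default_rng(0)

    def generate_word(p_view, p_cov, p_theme_given_cov, p_word_given_view_theme, vocab):
        i = rng.choice(len(p_view), p=p_view)                    # view v_i ~ p(v | D, C)
        j = rng.choice(len(p_cov), p=p_cov)                      # coverage kappa_j ~ p(kappa | D, C)
        l = rng.choice(len(p_theme_given_cov[j]), p=p_theme_given_cov[j])  # theme l ~ p(l | kappa_j)
        w = rng.choice(len(vocab), p=p_word_given_view_theme[i][l])        # word ~ p(w | theta_il)
        return vocab[w]

    # Toy example: 2 views, 2 coverages, 2 themes, 3 words
    vocab = ["government", "donate", "orleans"]
    p_view = [0.7, 0.3]
    p_cov = [0.6, 0.4]
    p_theme_given_cov = [[0.8, 0.2], [0.3, 0.7]]
    p_word_given_view_theme = [
        [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]],   # view 0: theme 0, theme 1
        [[0.5, 0.3, 0.2], [0.2, 0.3, 0.5]],   # view 1: theme 0, theme 1
    ]
    print(generate_word(p_view, p_cov, p_theme_given_cov, p_word_given_view_theme, vocab))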
Example Results: Query Log Analysis
Context = Days of the Week

[Figure: day-week pattern of search difficulty, Jan 2006 (Jan 1st is a Sunday): total clicks per day (up to ~10,000,000) and the conditional entropy H(Url | IP, Q) (~0.9-1.25).]

• Queries and clicks: more queries/clicks on weekdays.
• Search difficulty: clicks are more difficult to predict on weekends.
Query Log Analysis
Context = Type of Query

[Figure: query frequency over time, Jan 1-24, 2006 (Jan 1st is a Sunday), for business queries (yahoo, mapquest, cnn) and consumer queries (sex, movie, mp3).]

• Business queries: clear day-week pattern; weekdays more frequent than weekends.
• Consumer queries: no clear day-week pattern; weekends are comparable, even more frequent than weekdays.
Bursting Topics in SIGMOD:
Context = Time (Years)

[Figure: strength of bursting topics in SIGMOD papers, 1975-2005, including Sensor data, XML data, Web data, Data Streams, and Ranking/Top-K.]
Spatiotemporal Text Mining:
Context = Time & Location

Theme: Government Response in Hurricane Katrina
• Week 1: the theme is the strongest along the Gulf of Mexico.
• Week 2: the discussion moves towards the north and west.
• Week 3: the theme distributes more uniformly over the states.
• Week 4: the theme is again strong along the east coast and the Gulf of Mexico.
• Week 5: the theme fades out in most states.
Faceted Opinions
Context = Sentiments

Example: “The Da Vinci Code”

Topic 1: Movie
  Neutral: “... Ron Howards selection of Tom Hanks to play Robert Langdon.”; “Directed by: Ron Howard Writing credits: Akiva Goldsman ...”; “After watching the movie I went online and some research on ...”
  Positive: “Tom Hanks stars in the movie, who can be mad at that?”; “Tom Hanks, who is my favorite movie star act the leading role.”
  Negative: “But the movie might get delayed, and even killed off if he loses.”; “protesting ... will lose your faith by ... watching the movie.”; “... so sick of people making such a big deal about a FICTION book and movie.”

Topic 2: Book
  Neutral: “I remembered when i first read the book, I finished the book in two days.”; “I’m reading “Da Vinci Code” now.”; “Anybody is interested in it?”
  Positive: “Awesome book.”; “So still a good book to past time.”
  Negative: “... so sick of people making such a big deal about a FICTION book and movie.”; “This controversy book cause lots conflict in west society.”
Sentiment Dynamics
Context = Time & Sentiments

Query: “the da vinci code”
• Facet: the book “the da vinci code” (bursts during the movie, Pos > Neg).
• Facet: the impact on religious beliefs (bursts during the movie, Neg > Pos).
Event Impact Analysis: IR Research

Theme: retrieval models (from SIGIR papers): term 0.1599, relevance 0.0752, weight 0.0660, feedback 0.0372, independence 0.0311, model 0.0310, frequent 0.0233, probabilistic 0.0188, document 0.0173, …

Event: the start of the TREC conferences (1992)
  Before: probabilist 0.0778, model 0.0432, logic 0.0404, ir 0.0338, boolean 0.0281, algebra 0.0200, estimate 0.0119, weight 0.0111, …
  After: xml 0.0678, email 0.0197, model 0.0191, collect 0.0187, judgment 0.0102, rank 0.0097, subtopic 0.0079, …

Event: publication of the paper “A language modeling approach to information retrieval” (1998)
  Before: vector 0.0514, concept 0.0298, extend 0.0297, model 0.0291, space 0.0236, boolean 0.0151, function 0.0123, feedback 0.0077, …
  After: model 0.1687, language 0.0753, estimate 0.0520, parameter 0.0281, distribution 0.0268, probable 0.0205, smooth 0.0198, markov 0.0137, likelihood 0.0059, …
Model similar contexts with similar models (Smoothing, Regularization)
Personalization with Backoff
• Ambiguous query: MSG
– Madison Square Garden
– Monosodium Glutamate
• Disambiguate based on user’s prior clicks
• We don’t have enough data for everyone!
– Backoff to classes of users
• Proof of Concept:
– Classes defined by IP addresses
• Better:
– Market Segmentation (Demographics)
– Collaborative Filtering (Other users who click like me)
Context = IP

Full personalization: every context has a different model → sparse data!
No personalization: all contexts share the same model.
Personalization with backoff: similar contexts have similar models.

P(Url | IP, Q) = λ4 P(Url | IP4, Q) + λ3 P(Url | IP3, Q) + λ2 P(Url | IP2, Q) + λ1 P(Url | IP1, Q) + λ0 P(Url | IP0, Q)

where the IP classes back off from specific to general:
156.111.188.243 → 156.111.188.* → 156.111.*.* → 156.*.*.* → *.*.*.*
Lambdas

[Figure: bar chart of the estimated weights λ4 … λ0, annotated “Sparse Data”, “Missed Opportunity”, and “Backing Off by IP”.]

P(Url | IP, Q) = Σ_{i=0}^{4} λ_i P(Url | IP_i, Q)

• λs estimated with EM and cross-validation
• λ4: weights for the first 4 bytes of the IP; λ3: weights for the first 3 bytes; λ2: weights for the first 2 bytes; ……
• A little bit of personalization is better than too much, or too little.
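
A small sketch of the backoff interpolation (illustrative; the per-level models, weights, and the toy query below are placeholders, and the real system estimates the λs with EM):

    def ip_classes(ip):
        # a.b.c.d -> [a.b.c.d, a.b.c.*, a.b.*.*, a.*.*.*, *.*.*.*]  (IP4 .. IP0)
        parts = ip.split(".")
        return [".".join(parts[:i] + ["*"] * (4 - i)) for i in range(4, -1, -1)]

    def p_url_backoff(url, ip, query, level_models, lambdas):
        # P(Url | IP, Q) = sum_i lambda_i * P(Url | IP_i, Q)
        # lambdas[0] = lambda_4 (full IP) ... lambdas[4] = lambda_0 (*.*.*.*)
        # level_models[k] maps (ip_class, query) -> {url: probability}
        total = 0.0
        for lam, cls, model in zip(lambdas, ip_classes(ip), level_models):
            total += lam * model.get((cls, query), {}).get(url, 0.0)
        return total

    # Toy usage
    models = [
        {("156.111.188.243", "msg"): {"madisonsquaregarden.com": 0.9}},
        {("156.111.188.*", "msg"): {"madisonsquaregarden.com": 0.8}},
        {("156.111.*.*", "msg"): {"madisonsquaregarden.com": 0.6}},
        {("156.*.*.*", "msg"): {"madisonsquaregarden.com": 0.5}},
        {("*.*.*.*", "msg"): {"madisonsquaregarden.com": 0.3}},
    ]
    lambdas = [0.1, 0.2, 0.3, 0.2, 0.2]
    print(p_url_backoff("madisonsquaregarden.com", "156.111.188.243", "msg", models, lambdas))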
Social Network as Correlated Contexts

Linked contexts are similar to each other. E.g., in a citation network, linked papers such as “Predicting query performance”, “A Language Modeling Approach to Information Retrieval”, “Optimization of Relevance Feedback Weights”, and “Parallel Architecture in IR ...” form correlated contexts.
Social Network Context for Topic Modeling

(e.g., a coauthor network)
• Context = author
• Coauthors = similar contexts
• Intuition: I work on topics similar to my neighbors' topics
→ Smoothed topic distributions over contexts
Topic Modeling with Network Regularization (NetPLSA)

• Basic assumption (e.g., on a co-author graph): related authors work on similar topics.

Objective (minimized): a tradeoff, controlled by λ, between the PLSA likelihood and topic smoothness over the graph:

O(C, G) = (1 - λ) ( -Σ_d Σ_w c(w, d) log Σ_{j=1}^{k} p(θ_j | d) p(w | θ_j) ) + λ (1/2) Σ_{(u,v)∈E} w(u, v) Σ_{j=1}^{k} ( p(θ_j | u) - p(θ_j | v) )²

– p(θ_j | d): topic distribution of a document
– w(u, v): importance (weight) of an edge
– ( p(θ_j | u) - p(θ_j | v) )²: difference of topic distributions on neighboring vertices
– The second term is a graph harmonic regularizer (a generalization of [Zhu ’03]); it can be written as (1/2) Σ_{j=1…k} f_jᵀ Δ f_j, where f_{j,u} = p(θ_j | u) and Δ is the graph Laplacian.
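
A minimal sketch of evaluating this regularized objective (my own illustration; array shapes and names are assumptions, and a full NetPLSA implementation would alternate EM-style likelihood steps with graph smoothing steps):

    import numpy as np

    def netplsa_objective(counts, p_z_d, p_w_z, edges, weights, lam):
        # (1 - lam) * negative PLSA log-likelihood
        #   + lam * (1/2) * sum_{(u,v) in E} w(u,v) * ||p(.|u) - p(.|v)||^2
        # counts: (D, V) term counts; p_z_d: (D, K) doc-topic; p_w_z: (K, V) topic-word
        mix = p_z_d @ p_w_z                          # (D, V): p(w|d) under the mixture
        loglik = np.sum(counts * np.log(mix + 1e-12))
        reg = 0.0
        for (u, v), w_uv in zip(edges, weights):
            diff = p_z_d[u] - p_z_d[v]               # difference on neighboring vertices
            reg += w_uv * np.dot(diff, diff)
        return -(1 - lam) * loglik + lam * 0.5 * reg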
Topical Communities with PLSA

Topic 1: term 0.02, question 0.02, protein 0.01, training 0.01, weighting 0.01, multiple 0.01, recognition 0.01, relations 0.01, library 0.01
Topic 2: peer 0.02, patterns 0.01, mining 0.01, clusters 0.01, stream 0.01, frequent 0.01, e 0.01, page 0.01, gene 0.01
Topic 3: visual 0.02, analog 0.02, neurons 0.02, vlsi 0.01, motion 0.01, chip 0.01, natural 0.01, cortex 0.01, spike 0.01
Topic 4: interface 0.02, towards 0.02, browsing 0.02, xml 0.01, generation 0.01, design 0.01, engine 0.01, service 0.01, social 0.01

→ Noisy community assignment: it is unclear which community each topic corresponds to.
Topical Communities with NetPLSA

Topic 1 (Information Retrieval): retrieval 0.13, information 0.05, document 0.03, query 0.03, text 0.03, search 0.03, evaluation 0.02, user 0.02, relevance 0.02
Topic 2 (Data mining): mining 0.11, data 0.06, discovery 0.03, databases 0.02, rules 0.02, association 0.02, patterns 0.02, frequent 0.01, streams 0.01
Topic 3 (Machine learning): neural 0.06, learning 0.02, networks 0.02, recognition 0.02, analog 0.01, vlsi 0.01, neurons 0.01, gaussian 0.01, network 0.01
Topic 4 (Web): web 0.05, services 0.03, semantic 0.03, services 0.03, peer 0.02, ontologies 0.02, rdf 0.02, management 0.01, ontology 0.01

→ Coherent community assignment.
Smoothed Topic Map

Map a topic on the network (e.g., using p(θ|a)). Topic: “information retrieval”.

[Figure: coauthor network under PLSA vs. NetPLSA, with authors marked as core contributors, intermediate, or irrelevant.]
Smoothed Topic Map

The Windy States:
– Blog articles: “weather”
– US states network (adjacent states)
– Topic: “windy”

[Figure: topic map over US states under PLSA vs. NetPLSA, compared with a real reference map.]
Related Work

• Specific contextual text mining problems
– Multi-collection comparative mining (e.g., [Zhai et al. 04])
– Temporal theme patterns (e.g., [Mei et al. 05], [Blei et al. 06], [Wang et al. 06])
– Spatiotemporal theme analysis (e.g., [Mei et al. 06], [Wang et al. 07])
– Author-topic analysis (e.g., [Steyvers et al. 04], [Zhou et al. 06])
– …
• Probabilistic topic models
– Probabilistic latent semantic analysis (PLSA) (e.g., [Hofmann 99])
– Latent Dirichlet allocation (LDA) (e.g., [Blei et al. 03])
– Many extensions (e.g., [Blei et al. 05], [Li and McCallum 06])
Conclusions
• Context analysis in text mining and search
• General methodology to model context in text
– A unified generative model for observations in the same context
– Different models for different contexts
– Similar models for similar contexts
– Generation → discrimination → smoothing
• Many applications
Discussion: Context in Search

• Not all contexts are useful
– E.g., personalized search vs. search by time of day
– How can we know which contexts are more useful?
• Many contexts are useful
– E.g., personalized search; task-based search; localized search
– How can we combine them?
• Can we do better than market segmentations?
– Back off to users who search like me: collaborative search
– But who searches like you?
References

• CPLSA
– Q. Mei, C. Zhai. A Mixture Model for Contextual Text Mining. In Proceedings of KDD'06.
• NetPLSA
– Q. Mei, D. Cai, D. Zhang, C. Zhai. Topic Modeling with Network Regularization. In Proceedings of WWW'08.
• Labeling
– Q. Mei, X. Shen, C. Zhai. Automatic Labeling of Multinomial Topic Models. In Proceedings of KDD'07.
• Personalization
– Q. Mei, K. Church. Entropy of Search Logs: How Hard is Search? With Personalization? With Backoff? In Proceedings of WSDM'08.
• Applications
– Q. Mei, C. Zhai. Discovering Evolutionary Theme Patterns from Text: An Exploration of Temporal Text Mining. In Proceedings of KDD'05.
– Q. Mei, C. Liu, H. Su, C. Zhai. A Probabilistic Approach to Spatiotemporal Theme Pattern Mining on Weblogs. In Proceedings of WWW'06.
– Q. Mei, X. Ling, M. Wondra, H. Su, C. Zhai. Topic Sentiment Mixture: Modeling Facets and Opinions in Weblogs. In Proceedings of WWW'07.
The End
Thank You!
Experiments

• Bibliography data and coauthor networks
– DBLP: text = titles; network = coauthors
– Four conferences (expect 4 topics): SIGIR, KDD, NIPS, WWW
• Blog articles and a geographic network
– Blogs from spaces.live.com containing topical words, e.g., “weather”
– Network: US states (adjacent states)
Coherent Topical Communities

“Machine learning (NIPS)” community:
  PLSA: visual 0.02, analog 0.02, neurons 0.02, vlsi 0.01, motion 0.01, chip 0.01, natural 0.01, cortex 0.01, spike 0.01
  NetPLSA: neural 0.06, learning 0.02, networks 0.02, recognition 0.02, analog 0.01, vlsi 0.01, neurons 0.01, gaussian 0.01, network 0.01

“Data Mining (KDD)” community:
  PLSA: peer 0.02, patterns 0.01, mining 0.01, clusters 0.01, stream 0.01, frequent 0.01, e 0.01, page 0.01, gene 0.01
  NetPLSA: mining 0.11, data 0.06, discovery 0.03, databases 0.02, rules 0.02, association 0.02, patterns 0.02, frequent 0.01, streams 0.01