Contextual Text Mining
Qiaozhu Mei
[email protected]
University of Illinois at Urbana-Champaign
Knowledge Discovery from Text
Text Mining System
Trend of Text Content

Content type                Amount / day
Published content           3-4 GB
Professional web content    ~2 GB
User-generated content      8-10 GB
Private text content        ~3 TB

- Ramakrishnan and Tomkins 2007
Text on the Web (Unconfirmed)
[Figure: estimated volumes of text on the Web: ~100B, 10B, ~3M /day, ~750k /day, ~150k /day, 1M, 6M. Gold?]
Where to Start? Where to Go?
Context Information in Text
A single post carries rich context:
- Author (occupation: self-described designer, publisher, editor ...)
- Time (3:53 AM Jan 28th)
- Location (Chek Lap Kok, HK)
- Source (from Ping.fm)
- Sentiment
- Language
- Social Network
Rich Context in Text
- ~150k bookmarks /day; 5M users
- ~3M msgs /day; 500M URLs; ~2M users
- 73 years; ~400k authors; ~4k sources
- ~300M words /month; 8M contributors; 100+ languages
- 750K posts /day; 102M blogs
- 100M users; > 1M groups
- 1B queries? Per hour? Per IP?
Text + Context = ?
Context = Guidance ("I have a guide!")
Query + User = Personalized Search
Wikipedia definitions of "MSR":
- Microsoft Research
- Metropolis Street Racer
- Magnetic Stripe Reader
- Molten salt reactor
- Modern System Research
- Mars sample return
- Medical simulation
- Montessori School of Raleigh
- MSR Racing
- Mountain Safety Research
If you know me, you should give me Microsoft Research...
How much can personalization help?
Customer Review + Brand = Comparative Product Summary
IBM laptop reviews + APPLE laptop reviews + DELL laptop reviews →

Common Themes   IBM                 APPLE                DELL
Battery Life    Long, 4-3 hrs       Medium, 3-2 hrs      Short, 2-1 hrs
Hard disk       Large, 80-100 GB    Small, 5-10 GB       Medium, 20-50 GB
Speed           Slow, 100-200 MHz   Very fast, 3-4 GHz   Moderate, 1-2 GHz

Can we compare products?
Literature + Time = Topic Trends
[Figure: hot topics in SIGMOD over time: Sensor Networks; Structured data, XML; Web data; Data Streams; Ranking, Top-K]
What's hot in literature?
Blogs + Time & Location = Spatiotemporal Topic Diffusion
[Figure: the discussion spreads across the map; one week later the pattern has shifted]
How does discussion spread?
Blogs + Sentiment = Faceted Opinion Summary
The Da Vinci Code
[Figure: counts of positive vs. negative opinions]
Positive: "Tom Hanks, who is my favorite movie star act the leading role." / "a good book to past time."
Negative: "protesting... will lose your faith by watching the movie." / "... so sick of people making such a big deal about a fiction book"
What is good and what is bad?
Publications + Social Network = Topical Community
[Figure: coauthor network partitioned into topical communities: information retrieval, machine learning, data mining]
Who works together on what?
A General Solution for All
Query log + User = Personalized Search
Literature + Time = Topic Trends
Review + Brand = Comparative Opinion
Blog + Time & Location = Spatiotemporal Topic Diffusion
Blog + Sentiment = Faceted Opinion Summary
Publications + Social Network = Topical Community
...
Text + Context = Contextual Text Mining
Contextual Text Mining
- Generative Model of Text
- Modeling Simple Context
- Modeling Implicit Context
- Modeling Complex Context
- Applications of Contextual Text Mining
Generative Model of Text

A model = a word distribution:
the      0.1
is       0.07
harry    0.05
potter   0.04
movie    0.04
plot     0.02
time     0.01
rowling  0.01

Generation: the model generates a document ("the.. movie.. harry.. potter is.. based.. on.. j..k..rowling") word by word, with probability P(word | Model).
Inference / Estimation: recover the model from the observed text.
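To make the generation/estimation loop concrete, here is a minimal sketch (not from the talk; the toy distribution is the one above, and the variable names are illustrative) of sampling words from a unigram model and recovering it by maximum likelihood:

```python
import random
from collections import Counter

model = {"the": 0.1, "is": 0.07, "harry": 0.05, "potter": 0.04,
         "movie": 0.04, "plot": 0.02, "time": 0.01, "rowling": 0.01}
# renormalize the toy probabilities so they sum to 1
total = sum(model.values())
words, probs = zip(*[(w, p / total) for w, p in model.items()])

# Generation: draw each word independently with P(word | Model)
doc = random.choices(words, weights=probs, k=100)

# Inference / Estimation: maximum likelihood sets P(w) = count(w) / N
counts = Counter(doc)
estimated = {w: counts[w] / len(doc) for w in counts}
print(estimated)
```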
Contextualized Models

Year = 1998:            Year = 2008:
book     0.15           movie     0.18
harry    0.10           harry     0.09
potter   0.08           potter    0.08
rowling  0.05           director  0.04
                        book      ...

Other contexts: Sentiment = +; Source = official; Location = China; Location = US

Generation:
- How to select contexts?
- How to model the relations of contexts?

Inference:
- How to estimate contextual models?
- How to reveal contextual patterns?

P(word | Model, Context)
Topics in Text
- Topic (theme) = the subject of a discourse
- A topic covers multiple documents
- A document has multiple topics
- Topic = a soft cluster of documents
- Topic = a multinomial distribution over words, e.g. a "Web Search" topic: search 0.2, engine 0.15, query 0.08, user 0.07, ranking 0.06, ...
- Many text mining tasks: extracting topics from text; revealing contextual topic patterns
(Other example topics: Data Mining, Machine Learning)
Probabilistic Topic Models

Topic 1 "Apple iPod":    Topic 2 "Harry Potter":
ipod      0.15           movie    0.10
nano      0.08           harry    0.09
music     0.05           potter   0.05
download  0.02           actress  0.04
apple     0.01           music    0.02

A document ("I downloaded the music of the harry potter movie to my ipod nano") is generated from a mixture of topics:

P(w) = Σ_{i=1..K} P(z = i) P(w | Topic_i)
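As a quick illustration (a sketch, not the talk's code; the mixing weights are made up, the topic tables are the toy values above), the mixture probability of a word is a weighted sum over topics:

```python
topics = {
    "ipod":   {"ipod": 0.15, "nano": 0.08, "music": 0.05,
               "download": 0.02, "apple": 0.01},
    "potter": {"movie": 0.10, "harry": 0.09, "potter": 0.05,
               "actress": 0.04, "music": 0.02},
}
p_z = {"ipod": 0.5, "potter": 0.5}  # mixing weights P(z = i)

def p_word(w):
    # P(w) = Σ_i P(z = i) * P(w | Topic_i); unseen words get probability 0 here
    return sum(p_z[z] * topics[z].get(w, 0.0) for z in topics)

print(p_word("music"))  # 0.5*0.05 + 0.5*0.02 = 0.035
```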
Parameter Estimation
- Maximize the data likelihood:  Λ* = arg max_Λ log P(Data | Model)
- Parameters are estimated with the EM algorithm, alternating two steps:
  - Guess the affiliation: infer, for each word occurrence in the documents, which topic generated it (this yields pseudo-counts)
  - Estimate the params: re-estimate each topic's word distribution from the pseudo-counts
- The unknown word probabilities (the "?" entries in the topic tables) are filled in by this iteration.
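A compact sketch of this EM loop for a two-topic unigram mixture (illustrative only: toy documents, fixed mixing weights, random symmetry-breaking initialization):

```python
import random
from collections import defaultdict

docs = ["i downloaded the music to my ipod nano".split(),
        "the harry potter movie music".split()]
vocab = sorted({w for d in docs for w in d})

# near-uniform initialization, perturbed so the two topics can diverge
theta = []
for _ in range(2):
    raw = {w: 1.0 + random.uniform(0, 0.01) for w in vocab}
    s = sum(raw.values())
    theta.append({w: v / s for w, v in raw.items()})
pi = [0.5, 0.5]  # mixing weights P(z = i), kept fixed for simplicity

for _ in range(50):
    counts = [defaultdict(float) for _ in range(2)]
    for d in docs:
        for w in d:
            # E-step ("guess the affiliation"): P(z=i | w) ∝ pi_i * P(w | topic_i)
            post = [pi[i] * theta[i][w] for i in range(2)]
            s = sum(post)
            for i in range(2):
                counts[i][w] += post[i] / s  # pseudo-counts
    # M-step ("estimate the params"): normalize the pseudo-counts
    for i in range(2):
        total = sum(counts[i].values())
        theta[i] = {w: counts[i][w] / total for w in vocab}
print(theta[0])
```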
How Context Affects Topics
- Topics in science literature: 16th century vs. 21st century
- When do a computer scientist and a gardener use "tree, root, prune" in text?
- What does "tree" mean in "algorithm"?
- In Europe, "football" appears a lot in a soccer report. What about in the US?
Text is generated according to the context!
Simple Contextual Topic Model

Context 1: 2004                  Context 2: 2007
Topic 1 "Apple iPod":            Topic 1 "Apple iPod":
ipod, mini, 4gb                  ipod, iphone, nano
Topic 2 "Harry Potter":          Topic 2 "Harry Potter":
harry, prisoner, azkaban         potter, order, phoenix

P(w) = Σ_{j=1..C} P(c = j) Σ_{i=1..K} P(z = i | Context_j) P(w | Topic_i, Context_j)
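Extending the earlier sketch, the contextual mixture conditions both the topic weights and the topic content on the context (again illustrative; all numbers here are stand-ins, not the talk's):

```python
# P(z = i | context) and P(w | topic_i, context) for two time contexts
p_z_given_c = {"2004": {"ipod": 0.6, "potter": 0.4},
               "2007": {"ipod": 0.5, "potter": 0.5}}
p_w = {("ipod", "2004"):   {"ipod": 0.2, "mini": 0.1, "4gb": 0.05},
       ("ipod", "2007"):   {"ipod": 0.2, "iphone": 0.1, "nano": 0.08},
       ("potter", "2004"): {"harry": 0.1, "prisoner": 0.06, "azkaban": 0.05},
       ("potter", "2007"): {"potter": 0.1, "order": 0.06, "phoenix": 0.05}}
p_c = {"2004": 0.5, "2007": 0.5}

def p_word(w):
    # P(w) = Σ_j P(c=j) Σ_i P(z=i | c_j) P(w | topic_i, c_j)
    return sum(p_c[c] * sum(p_z_given_c[c][z] * p_w[(z, c)].get(w, 0.0)
                            for z in p_z_given_c[c])
               for c in p_c)

print(p_word("iphone"))  # only the (ipod, 2007) component contributes
```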
Contextual Topic Patterns
- Comparing contextualized versions of topics yields contextual topic patterns
- Contextual topic patterns = conditional distributions (z: topic; c: context; w: word):
  - P(z | c = i) (or P(c | z = j)): the strength of topics in a context
  - P(w | z, c = i): the content variation of a topic across contexts
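Both patterns fall out of simple normalizations of the model's expected counts; a sketch (the function names and the toy count table are mine):

```python
# expected counts n[(topic, context)] aggregated from a fitted model
n = {("ipod", "2004"): 120.0, ("ipod", "2007"): 300.0,
     ("potter", "2004"): 200.0, ("potter", "2007"): 180.0}

def p_z_given_c(c):
    # strength of topics in context c: normalize the counts over topics
    total = sum(v for (z, ctx), v in n.items() if ctx == c)
    return {z: v / total for (z, ctx), v in n.items() if ctx == c}

def p_c_given_z(z):
    # spread of topic z over contexts (its life cycle, when context = time)
    total = sum(v for (top, c), v in n.items() if top == z)
    return {c: v / total for (top, c), v in n.items() if top == z}

print(p_z_given_c("2007"))   # {'ipod': 0.625, 'potter': 0.375}
print(p_c_given_z("potter")) # {'2004': 0.526..., '2007': 0.473...}
```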
Example: Topic Life Cycles (Mei and Zhai KDD'05)
[Figure: normalized strength of themes in literature over time (1999-2004): Biology Data, Web Information, Time Series, Classification, Association Rule, Clustering, Business]
Comparing P(c | z); context = time
Example: Spatiotemporal Theme Pattern (Mei et al. WWW'06)
Theme: government response in Hurricane Katrina
- Week 1: the theme is the strongest along the Gulf of Mexico
- Week 2: the discussion moves towards the north and west
- Week 3: the theme distributes more uniformly over the states
- Week 4: the theme is again strong along the east coast and the Gulf of Mexico
- Week 5: the theme fades out in most states
Comparing P(z | c); context = time & location
Example: Evolutionary Topic Graph (Mei and Zhai KDD'05)
[Figure: themes in KDD papers evolving from 1999 to 2004; comparing P(w | z, c), context = time. Sample theme snapshots:]
- SVM 0.007, criteria 0.007, classification 0.006, linear 0.005, ...
- decision 0.006, tree 0.006, classifier 0.005, class 0.005, Bayes 0.005, ..., mixture 0.005
- random 0.006, cluster 0.006, clustering 0.005, variables 0.005, ...
- web 0.009, classification 0.007, features 0.006, topic 0.005, ...
- classification 0.015, text 0.013, unlabeled 0.012, document 0.008, labeled 0.008, learning 0.007, ...
- information 0.012, web 0.010, social 0.008, retrieval 0.007, distance 0.005, networks 0.004, ...
- topic 0.010, mixture 0.008, LDA 0.006, semantic 0.005, ...
Example: Event Impact Analysis (Mei and Zhai KDD'06)
Theme: retrieval models (in SIGIR papers); comparing P(w | z, c), context = event

Overall theme:
term 0.1599, relevance 0.0752, weight 0.0660, feedback 0.0372, model 0.0310, probabilistic 0.0188, document 0.0173, ...

Snapshot around 1992:
vector 0.0678, concept 0.0197, model 0.0191, space 0.0187, boolean 0.0102, function 0.0097, ...
→ after the start of the TREC conferences:
xml 0.0514, email 0.0298, model 0.0291, collect 0.0236, judgment 0.0151, rank 0.0123, ...

Snapshot around 1998:
probabilist 0.0778, model 0.0432, logic 0.0404, boolean 0.0281, algebra 0.0200, estimate 0.0119, weight 0.0111, ...
→ after the publication of the paper "A language modeling approach to information retrieval":
model 0.1687, language 0.0753, estimate 0.0520, parameter 0.0281, distribution 0.0268, smooth 0.0198, likelihood 0.0059, ...
Implicit Context in Text
- Some contexts are hidden: sentiments, intents, impact, etc.
- The document-to-context affiliation is not known for sure; it must be inferred from the data
- Train a model M for each implicit context
- Provide M to the topic model as guidance
Modeling Implicit Context

Sentiment models as guidance:
Positive: good 0.10, like 0.05, perfect 0.02
Negative: hate 0.21, awful 0.03, disgust 0.01

Topic 1 "Apple iPod" facet words: color, size, quality, price, scratch, problem
Topic 2 "Harry Potter" facet words: actress, music, visual, director, accent, plot

Example: "I like the song of the movie on my ... perfect ipod but hate the accent" — which topic and which sentiment does each word come from?
Semi-supervised Topic Model (Mei et al. WWW'07)
- Add Dirichlet priors to the topics: guidance from the user comes as prior models (e.g., r1 = {love, great}, r2 = {hate, awful}) attached to topics θ1, θ2, ..., θk
- Maximum likelihood estimation (MLE):  Λ* = arg max_Λ log P(D | Λ)
- Maximum a posteriori (MAP) estimation:  Λ* = arg max_Λ log( P(D | Λ) P(Λ) )
- With conjugate Dirichlet priors, MAP estimation is similar to adding pseudo-counts to the observations
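In the M-step, such a prior shows up as pseudo-counts added to the expected counts before normalizing; a sketch (the seed words and the prior strength mu are illustrative):

```python
from collections import defaultdict

def m_step_with_prior(expected_counts, prior_words, mu=10.0):
    """MAP re-estimate of P(w | topic): add mu pseudo-counts for each
    user-supplied seed word (the Dirichlet prior), then normalize."""
    counts = defaultdict(float, expected_counts)
    for w in prior_words:
        counts[w] += mu
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# guidance: the user says this topic should look like a "positive" topic
theta_pos = m_step_with_prior({"good": 3.2, "movie": 5.1, "plot": 1.7},
                              prior_words=["love", "great"], mu=10.0)
print(theta_pos["love"])  # boosted by the prior despite a zero observed count
```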
Example: Faceted Opinion Summarization (Mei et al. WWW'07)

Topic 1: Movie
- Neutral: "... Ron Howards selection of Tom Hanks to play Robert Langdon." / "Directed by: Ron Howard. Writing credits: Akiva Goldsman ..."
- Positive: "Tom Hanks stars in the movie, who can be mad at that?" / "Tom Hanks, who is my favorite movie star act the leading role."
- Negative: "But the movie might get delayed, and even killed off if he loses." / "protesting ... will lose your faith by ... watching the movie."

Topic 2: Book
- Neutral: "After watching the movie I went online and some research on ..." / "I remembered when i first read the book, I finished the book in two days." / "I'm reading 'Da Vinci Code' now." / "Anybody is interested in it?"
- Positive: "Awesome book." / "So still a good book to past time."
- Negative: "... so sick of people making such a big deal about a FICTION book and movie." / "This controversy book cause lots conflict in west society."

Context = topic & sentiment
Results: Sentiment Dynamics
- Facet: the book "The Da Vinci Code" (bursts during the movie; Pos > Neg)
- Facet: the impact on religious beliefs (bursts during the movie; Neg > Pos)
Results: Topic with User's Guidance
Guidance from the user: "I know what two of the topics should look like."
Topics for iPod:

No Prior:
- "Battery, nano" (mixed): battery, shuffle, charge, nano, dock, itune, usb, hour
- Marketing: apple, microsoft, market, zune, device, company, consumer, sale
- Ads, spam: free, sign, offer, freepay, complete, virus, freeipod, trial

With Prior:
- Nano: nano, color, thin, hold, model, 4gb, dock, inch
- Battery: battery, shuffle, charge, usb, hour, mini, life, rechargable
Complex Context in Text
- Complex context = the structure of contexts
- Many contexts have a latent structure: time, location, social network
- Why model context structure?
  - Reveal novel contextual patterns
  - Regularize contextual models
  - Alleviate data sparseness (smoothing)
Modeling Complex Context
Contexts A and B are closely related in the structure. Two intuitions:
- Regularization: Model(A) and Model(B) should be similar
- Smoothing: look at B if A doesn't have enough data

O(C) = Likelihood + Regularization
Applications of Contextual Text Mining
- Personalized search: personalization with backoff
- Social network analysis (for schools): finding topical communities
- Information retrieval (for industry labs): smoothing language models
Application I: Personalized Search
Personalization with Backoff (Mei and Church WSDM'08)
- Ambiguous query: MSG — Madison Square Garden or monosodium glutamate?
- Disambiguate based on the user's prior clicks
- But we don't have enough data for everyone! Back off to classes of users
- Proof of concept: context = segments defined by IP addresses
- Other market segmentations (demographics) are possible
Apply Contextual Text Mining to Personalized Search
- The text data: query logs
- The generative model: P(Url | Query)
- The context: users (IP addresses)
- The contextual model: P(Url | Query, IP)
- The structure of context: the hierarchical structure of IP addresses
Evaluation Metric: Entropy (H)

H(X) = -Σ_{x∈X} p(x) log p(x)

- The difficulty of encoding information (a distribution): the size of a search space; the difficulty of a task
- H = 20 bits ↔ 1 million items distributed uniformly (2^20 ≈ 1M)
- A powerful tool for sizing challenges and opportunities: How hard is search? How much does personalization help?
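The metric itself is one line; a sketch:

```python
import math

def entropy(dist):
    # H(X) = -Σ_x p(x) log2 p(x), measured in bits
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# a uniform distribution over 2^20 items would give exactly 20 bits
print(entropy({"a": 0.5, "b": 0.25, "c": 0.25}))  # 1.5 bits
```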
How Hard Is Search?

Entropy (H):
Query           21.1
URL             22.1
IP              22.1
All But IP      23.9
All But URL     26.0
All But Query   27.1
All Three       27.2

- Traditional search: H(URL | Query) = 2.8 (= 23.9 - 21.1)
- Personalized search: H(URL | Query, IP) = 1.2 (= 27.2 - 26.0)

Personalization cuts H in half!
Context = First k Bytes of IP

Full personalization: every context has a different model → sparse data!
No personalization: all contexts share the same model.
Personalization with backoff: similar contexts have similar models.

P(Url | IP, Q) = λ4 P(Url | IP4, Q)    156.111.188.243
              + λ3 P(Url | IP3, Q)    156.111.188.*
              + λ2 P(Url | IP2, Q)    156.111.*.*
              + λ1 P(Url | IP1, Q)    156.*.*.*
              + λ0 P(Url | IP0, Q)    *.*.*.*
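A sketch of the backoff estimate (illustrative only: the per-level click tables and λ values are stand-ins, and in the actual work the λs are trained with EM):

```python
def ip_prefixes(ip):
    """'156.111.188.243' -> ['*', '156', '156.111', '156.111.188', full IP]"""
    parts = ip.split(".")
    return ["*"] + [".".join(parts[:k]) for k in range(1, 5)]

def p_url_backoff(url, query, ip, tables, lambdas):
    # P(Url | IP, Q) = Σ_i λ_i P(Url | IP_i, Q); tables[i] maps
    # (prefix, query) -> {url: prob}, estimated from clicks at level i
    total = 0.0
    for lam, prefix, table in zip(lambdas, ip_prefixes(ip), tables):
        total += lam * table.get((prefix, query), {}).get(url, 0.0)
    return total

lambdas = [0.1, 0.1, 0.2, 0.3, 0.3]   # λ0 .. λ4, summing to 1
tables = [dict() for _ in range(5)]   # filled from click logs
tables[0][("*", "msg")] = {"msg.com": 0.6, "thegarden.com": 0.2}
print(p_url_backoff("msg.com", "msg", "156.111.188.243", tables, lambdas))
```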
Lambda
[Figure: bar chart of the estimated weights λ4 ... λ0; too much weight on the full IP means sparse data, too little means missed opportunity when backing off by IP]

P(Url | IP, Q) = Σ_{i=0..4} λ_i P(Url | IP_i, Q)
- λ4: weight for the first 4 bytes of the IP; λ3: first 3 bytes; λ2: first 2 bytes; ...
- The λs are estimated with EM
- A little bit of personalization is better than too much, or too little
Context Market Segmentation
• Traditional Goal of Marketing:
– Segment Customers (e.g., Business v. Consumer)
– By Need & Value Proposition
• Need: Segments ask different questions at different times
• Value: Different advertising opportunities
• Segmentation Variables
– Queries, URL Clicks, IP Addresses
– Geography & Demographics (Age, Gender, Income)
– Time of day & Day of Week
2009 © Qiaozhu Mei
University of Illinois at Urbana-Champaign
44
Business Days vs. Weekends: More Clicks and Easier Queries
[Figure: total clicks and H(Url | IP, Q) per day over Jan 2006 (the 1st is a Sunday); business days show more clicks and lower entropy, i.e., easier queries]
Harder Queries at TV Time
[Figure: H(Url | IP, Q) by hour of day; queries get harder at TV time]
Application II: Information Retrieval
Application: Text Retrieval

Document d (a text mining paper) → document language model (LM) θd, p(w|d):
text        4/100 = 0.04
mining      3/100 = 0.03
clustering  1/100 = 0.01
...
data      = 0
computing = 0

Smoothed doc LM θd', p(w|d'):
text       = 0.039
mining     = 0.028
clustering = 0.01
...
data      = 0.001
computing = 0.0005

Query q ("data mining") → query LM θq, p(w|q): data 1/2 = 0.5, mining 1/2 = 0.5
(optionally smoothed, p(w|q'): data 0.4, mining 0.4, clustering 0.1, ...)

Similarity function:
-D(θq || θd) = -Σ_{w∈V} p(w | θq) log [ p(w | θq) / p(w | θd) ]
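Ranking by negative KL divergence reduces to a short loop over query words; a sketch using the toy LMs above (the function name is mine):

```python
import math

def kl_score(query_lm, doc_lm):
    # score(q, d) = -D(θq || θd) = -Σ_w p(w|θq) log( p(w|θq) / p(w|θd) )
    # doc_lm must be smoothed so p(w|θd) > 0 for every query word
    return -sum(pq * math.log(pq / doc_lm[w]) for w, pq in query_lm.items())

query_lm = {"data": 0.5, "mining": 0.5}
doc_lm = {"text": 0.039, "mining": 0.028, "clustering": 0.01,
          "data": 0.001, "computing": 0.0005}
print(kl_score(query_lm, doc_lm))  # higher (less negative) = better match
```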
Smoothing a Document Language Model
Retrieval performance ← estimating the LM ← smoothing the LM

P_MLE(w | θd):                Smoothed, e.g.:         Smoothed more, e.g.:
text       4/100 = 0.04       text       = 0.039      text       = 0.038
mining     3/100 = 0.03       mining     = 0.028      mining     = 0.026
Assoc.     1/100 = 0.01       Assoc.     = 0.009      Assoc.     = 0.008
clustering 1/100 = 0.01       clustering = 0.01       clustering = 0.01
...                           ...                     ...
data      = 0                 data      = 0.001       data      = 0.002
computing = 0                 computing = 0.0005      computing = 0.001

- Estimate a more accurate distribution from sparse data
- Assign non-zero probability to unseen words, by interpolating with the collection model P(w | collection)

P(w | d) = (1 - λ) P_MLE(w | θd) + λ P(w | collection)
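A sketch of this interpolation (Jelinek-Mercer style; λ and the toy counts are illustrative):

```python
def smooth_lm(doc_counts, collection_lm, lam=0.1):
    """P(w|d) = (1 - lam) * P_MLE(w | d) + lam * P(w | collection).
    Iterates over the collection vocabulary, so unseen words get mass."""
    doc_len = sum(doc_counts.values())
    return {w: (1 - lam) * doc_counts.get(w, 0) / doc_len + lam * p_c
            for w, p_c in collection_lm.items()}

doc_counts = {"text": 4, "mining": 3, "clustering": 1}
collection_lm = {"text": 0.01, "mining": 0.005, "clustering": 0.002,
                 "data": 0.01, "computing": 0.005}
lm = smooth_lm(doc_counts, collection_lm)
print(lm["data"])  # non-zero even though "data" never occurs in d
```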
Apply Contextual Text Mining to Smoothing Language Models
- The text data: a collection of documents
- The generative model: P(word)
- The context: the document
- The contextual model: P(w | d)
- The structure of context: the graph structure of documents
- Goal: use the graph of documents to estimate a good P(w | d)
Traditional Document Smoothing in Information Retrieval
Interpolate the MLE with a reference language model θref, estimated from:
- the whole collection (corpus) [Ponte & Croft 98]
- the cluster containing d [Liu & Croft 04]
- the nearest neighbors of d [Kurland & Lee 04]

P(w | d) = (1 - λ) P_MLE(w | θd) + λ P(w | θref)
Graph-based Smoothing for Language Models in Retrieval (Mei et al. SIGIR 2008)
A novel and general view of smoothing (the graph can also be a word graph):
- Collection = a graph (of documents)
- P(w|d) = a surface on top of the graph (its projection on a plane)
- The MLE P(w|d) is a rugged surface; a smoothed LM = a smoothed surface!
The General Objective of Smoothing

O(C) = (1 - λ) Σ_{u∈V} w(u) (f_u - f̃_u)²  +  λ Σ_{(u,v)∈E} w(u,v) (f_u - f_v)²

- The first term measures fidelity to the MLE f̃_u; w(u) is the importance of vertex u
- The second term measures the smoothness of the surface; w(u,v) is the weight of an edge (1/distance)
Smoothing Language Models using a Document Graph
- Construct a kNN graph of documents; w(u) = Deg(u); w(u,v) = cosine similarity; f_u = p(w | d_u)
- Document language model (with additional Dirichlet smoothing on top):

P(w | d_u) = (1 - λ) P_MLE(w | d_u) + λ Σ_{v∈V} [ w(u,v) / Deg(u) ] P(w | d_v)
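One way to realize this is to iterate the neighbor average until the surface stabilizes; a sketch (the graph, weights, and λ are illustrative, not the paper's setup):

```python
def graph_smooth(p_mle, neighbors, lam=0.5, iters=20):
    """p_mle[u]: {word: P_MLE(word | d_u)}; neighbors[u]: {v: w(u, v)}.
    Iterates P(w|d_u) = (1-lam) P_MLE(w|d_u)
                      + lam * Σ_v [w(u,v)/Deg(u)] P(w|d_v)."""
    p = {u: dict(lm) for u, lm in p_mle.items()}
    for _ in range(iters):
        new_p = {}
        for u in p_mle:
            deg = sum(neighbors[u].values())  # weighted degree as Deg(u)
            words = set(p_mle[u])
            for v in neighbors[u]:
                words |= set(p[v])
            new_p[u] = {w: (1 - lam) * p_mle[u].get(w, 0.0)
                           + lam * sum(wt * p[v].get(w, 0.0)
                                       for v, wt in neighbors[u].items()) / deg
                        for w in words}
        p = new_p
    return p

p_mle = {"d1": {"text": 0.6, "mining": 0.4}, "d2": {"text": 0.5, "data": 0.5}}
neighbors = {"d1": {"d2": 1.0}, "d2": {"d1": 1.0}}
print(graph_smooth(p_mle, neighbors)["d1"])  # "data" leaks in from neighbor d2
```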
Effectiveness of the Framework

Data Sets  Dirichlet  DMDG               DMWG †             DSDG               QMWG
AP88-90    0.217      0.254*** (+17.1%)  0.252*** (+16.1%)  0.239*** (+10.1%)  0.239 (+10.1%)
LA         0.247      0.258** (+4.5%)    0.257** (+4.0%)    0.251** (+1.6%)    0.247
SJMN       0.204      0.231*** (+13.2%)  0.229*** (+12.3%)  0.225*** (+10.3%)  0.219 (+7.4%)
TREC8      0.257      0.271*** (+5.4%)   0.271** (+5.4%)    0.261 (+1.6%)      0.260 (+1.2%)

Wilcoxon test: *, **, *** mark significance levels 0.1, 0.05, 0.01
† DMWG: reranking the top 3000 results, which usually yields lower performance than ranking all documents

- Graph-based smoothing >> baseline
- Smoothing the document LM >> smoothing the relevance score >> smoothing the query LM
Intuitive Interpretation - Smoothing using a Document Graph

P(w | d_u) = (1 - λ) [ P_ML(w | d_u) · 1 + (1 - P_ML(w | d_u)) · 0 ]
           + λ Σ_{v∈V} [ w(u,v) / Deg(u) ] P(w | d_v)

- This is the absorption probability into the "1" state: act as your neighbors do
- Writing a word w in a document = a random walk on the document Markov chain; write down w if the walk reaches state "1"
Application III: Social Network Analysis
Topical Community Analysis
- Topic modeling to help community extraction: label communities with topics (e.g., "physicist, physics, scientist, theory, gravitation ..." vs. "writer, novel, best-sell, book, language, film ...")
- Network analysis to help topic extraction: is "Computer Science Literature" = Information Retrieval + Data Mining + Machine Learning + ..., or Domain Review + Algorithm + Evaluation + ...?
Apply Contextual Text Mining to Topical Community Analysis
- The text data: publications of researchers
- The generative model: a topic model
- The context: the author
- The contextual model: an author-topic model
- The structure of context: a social network (the coauthor network of researchers)
Intuitions
- People working on the same topic belong to the same "topical community"
- A good community = a coherent topic + well-connected members
- A topic is semantically coherent if people working on it also collaborate a lot
- Intuition: my topics are similar to my neighbors'. If most of my coauthors work on IR, am I more likely to be an IR person or a compiler person?
Social Network Context for Topic Modeling
- Context = author (e.g., in a coauthor network)
- Coauthors = closely related contexts
- Intuition: I work on topics similar to my neighbors'
- Result: smoothed topic distributions P(θj | author)
Topic Modeling with Network Regularization (NetPLSA)
- Basic assumption (e.g., on a coauthor graph): related authors work on similar topics

O(C, G) = (1 - λ) Σ_d Σ_w c(w, d) log Σ_{j=1..k} p(j | d) p(w | θ_j)
        - λ (1/2) Σ_{(u,v)∈E} w(u,v) Σ_{j=1..k} ( p(j | u) - p(j | v) )²

- The first term is the PLSA log-likelihood; p(j | d) is the topic distribution of a document
- The second term is a graph harmonic regularizer, a generalization of [Zhu '03]; it can be written as (1/2) Σ_{j=1..k} f_j^T Δ f_j with f_{j,u} = p(j | u), penalizing the difference of topic distributions on neighboring vertices, weighted by the importance w(u,v) of each edge
- λ trades off between topic likelihood and smoothness
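The regularizer is cheap to compute from the author-topic distributions; a sketch (the toy graph and distributions are mine):

```python
def network_regularizer(p_topic, edges):
    """(1/2) Σ_{(u,v)} w(u,v) Σ_j (p(j|u) - p(j|v))^2.
    p_topic[u] = [p(1|u), ..., p(k|u)]; edges = [(u, v, weight), ...]."""
    return 0.5 * sum(w * sum((pu - pv) ** 2
                             for pu, pv in zip(p_topic[u], p_topic[v]))
                     for u, v, w in edges)

p_topic = {"alice": [0.9, 0.1], "bob": [0.8, 0.2], "carol": [0.1, 0.9]}
edges = [("alice", "bob", 3.0), ("bob", "carol", 1.0)]
# coauthors whose topic distributions diverge are penalized more heavily
print(network_regularizer(p_topic, edges))  # 0.5 * (3*0.02 + 1*0.98) = 0.52
```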
Topics & Communities without Regularization

Topic 1            Topic 2          Topic 3          Topic 4
term 0.02          peer 0.02        visual 0.02      interface 0.02
question 0.02      patterns 0.01    analog 0.02      towards 0.02
protein 0.01       mining 0.01      neurons 0.02     browsing 0.02
training 0.01      clusters 0.01    vlsi 0.01        xml 0.01
weighting 0.01     stream 0.01      motion 0.01      generation 0.01
multiple 0.01      frequent 0.01    chip 0.01        design 0.01
recognition 0.01   e 0.01           natural 0.01     engine 0.01
relations 0.01     page 0.01        cortex 0.01      service 0.01
library 0.01       gene 0.01        spike 0.01       social 0.01

Noisy community assignment: the topics are hard to interpret (?? ?)
Topics & Communities with Regularization

Topic 1            Topic 2           Topic 3            Topic 4
retrieval 0.13     mining 0.11       neural 0.06        web 0.05
information 0.05   data 0.06         learning 0.02      services 0.03
document 0.03      discovery 0.03    networks 0.02      semantic 0.03
query 0.03         databases 0.02    recognition 0.02   services 0.03
text 0.03          rules 0.02        analog 0.01        peer 0.02
search 0.03        association 0.02  vlsi 0.01          ontologies 0.02
evaluation 0.02    patterns 0.02     neurons 0.01       rdf 0.02
user 0.02          frequent 0.01     gaussian 0.01      management 0.01
relevance 0.02     streams 0.01      network 0.01       ontology 0.01

Coherent community assignment: Information Retrieval, Data mining, Machine learning, Web
Topic Modeling and SNA Improve Each Other

Methods   Cut Edge Weights   Ratio Cut / Norm. Cut   Community Sizes (1-4)
PLSA      4831               2.14 / 1.25             2280, 2178, 2326, 2257
NetPLSA   662                0.29 / 0.13             2636, 1989, 3069, 1347
NCut      855                0.23 / 0.12             2699, 6323, 8, 11

(the smaller the cut metrics, the better)
- NCut: spectral clustering with normalized cut (J. Shi et al. 2000), a purely network-based community finding method
- Topic modeling helps balance communities (text implicitly bridges authors)
- Network regularization helps extract coherent communities (the network assures the focus of topics)
Smoothed Topic Map
Map a topic onto the network (e.g., using p(θ | a)), distinguishing core contributors, intermediate authors, and irrelevant authors.
[Figure: PLSA vs. NetPLSA maps for the topic "information retrieval"]
Summary of My Talk
- Text + Context = Contextual Text Mining: a new paradigm of text mining
- A novel framework for contextual text mining: probabilistic topic models, contextualized by simple context, implicit context, and complex context
- Applications of contextual text mining
A Roadmap of My Work
[Figure: roadmap of publications. Contextual Topic Models: KDD 05, KDD 06a, WWW 06. Contextual Text Mining: KDD 06b, WWW 07, KDD 07, WWW 08, WSDM 08, SIGIR 08, ACL 08, CIKM 08. Information Retrieval & Web Search: SIGIR 07.]
Research Discipline
[Figure: text mining at the intersection of applied statistics, machine learning, data mining, databases, social networks, information retrieval, information science, natural language processing, bioinformatics, and text information management]
End Note
[Figure: Text + Context = Contextual Text Mining]
Thank You