Towards Contextual Text Mining
Qiaozhu Mei
[email protected]
University of Illinois at Urbana-Champaign
2009 © Qiaozhu Mei
Knowledge Discovery from Text
Text Mining System
Overload of Text Content

Content type and amount per day:
• Published content: 3-4 GB
• Professional web content: ~2 GB
• User generated content: 8-10 GB
• Private text content: ~3 TB

- Ramakrishnan and Tomkins 2007
Challenge of Mining Text

[chart: scale of text collections and growth rates - ~100B, 10B, 1M, 6M items; ~3M /day, ~750k /day, ~150k /day]

Gold?
Where to Start? Where to Go?
Context - “Situation of Text”

[example tweet: posted 3:53 AM Jan 28th from Ping.fm, Chek Lap Kok, HK; author’s self-described occupation: designer, publisher, editor …]

Contexts: Author, Time, Location, Author’s occupation, Source, Sentiment, Language, Social Network
Rich Context Information

[statistics across platforms: ~1B queries per hour(?); ~1B users; 100M users; >1M groups; ~3M msgs /day; ~5M users; 102M blogs; 5M users; 500M URLs; 73 years of archives; ~400k authors; 8M contributors; 100+ languages; ~4k sources]
Text + Context = ?

Context = Guidance (“I Have A Guide!”)
Query Log + User = Personalized Search

Wikipedia definitions of “MSR”: Metropolis Street Racer, Magnetic Stripe Reader, Molten salt reactor, Modern System Research, Mars sample return, Medical simulation, Montessori School of Raleigh, MSR Racing, Mountain Safety Research, …

If you know me, you should give me Microsoft Research…

How much can personalization help?
Customer Reviews + Brand = Comparative Product Summary

Input: IBM Laptop Reviews, APPLE Laptop Reviews, DELL Laptop Reviews

Common Themes  | IBM              | APPLE              | DELL
Battery Life   | Long, 4-3 hrs    | Medium, 3-2 hrs    | Short, 2-1 hrs
Hard disk      | Large, 80-100 GB | Small, 5-10 GB     | Medium, 20-50 GB
Speed          | Slow, 100-200 MHz| Very Fast, 3-4 GHz | Moderate, 1-2 GHz

Can we compare products?
Scientific Literature + Time = Topic Trends

[chart: Hot Topics in SIGMOD over time - Sensor Networks; Structured data, XML; Web data; Data Streams; Ranking, Top-K]

What’s hot in literature?
Blogs + Time & Location = Spatiotemporal Topic Diffusion

[maps: topic strength, one week later]

How does discussion spread?
Blogs + Sentiment = Faceted Opinion Summary

The Da Vinci Code
Positive: “Tom Hanks, who is my favorite movie star act the leading role.” / “a good book to past time.”
Negative: “protesting... will lose your faith by watching the movie.” / “... so sick of people making such a big deal about a fiction book”

[chart: positive vs. negative opinion counts over time]

What is good and what is bad?
Publications + Social Network = Topical Community

Coauthor network + topics: Information retrieval, Machine learning, Data mining

Who works together on what?
A General Solution for All?

• Query log + User = Personalized Search
• Scientific Literature + Time = Topic Trends
• Review + Brand = Comparative Opinion
• Blog + Time & Location = Spatiotemporal Topic Diffusion
• Blog + Sentiment = Faceted Opinion Summary
• Publications + Social Network = Topical Community
• …

Text + Context = Contextual Text Mining
Roadmap
• Generative Model of Text
• Integrating Contexts in Text Models
– Modeling Simple Context
– Modeling Implicit Context
– Modeling Complex Context
• Applications of Contextual Text Mining
Generative Model of Text

P( word | Model ):
the 0.1, is 0.07, harry 0.05, potter 0.04, movie 0.04, plot 0.02, time 0.01, rowling 0.01

Generation: model → text (“the.. movie.. harry.. potter is.. based.. on.. j..k..rowling”)
Inference / Estimation: text → model
Text as a Mixture of Topics

Topic (Theme) = the subject of a discourse

• “Data Mining”: mining 0.21, data 0.13, pattern 0.10, clustering 0.05, network 0.04, …
• “Machine Learning”: learning 0.18, model 0.14, training 0.10, kernel 0.09, inference 0.07, …
• “Web Search”: search 0.2, engine 0.15, query 0.08, user 0.07, ranking 0.06, …
• … K topics in total (e.g. “Database”)

Example: “Using machine learning for web search”
Probabilistic Topic Models
(Hofmann ’99, Blei et al. ’03, …)

P(w) = Σ_{i=1..K} P(z = i) · P(w | Topic_i)

Topic 1 “Apple iPod”: ipod 0.15, nano 0.08, music 0.05, download 0.02, apple 0.01
Topic 2 “Harry Potter”: movie 0.10, harry 0.09, potter 0.05, actress 0.04, music 0.02

Example text: “I downloaded the music of the movie harry potter to my ipod nano”
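The mixture above can be sketched in a few lines of Python. The topic word distributions are the illustrative values from the slide; the uniform topic prior P(z) is an assumption for the sketch:

```python
# Sketch (not the talk's implementation): word probability under a
# two-topic mixture, P(w) = sum_i P(z = i) * P(w | Topic_i).

topics = {
    "ipod":  {"ipod": 0.15, "nano": 0.08, "music": 0.05, "download": 0.02, "apple": 0.01},
    "harry": {"movie": 0.10, "harry": 0.09, "potter": 0.05, "actress": 0.04, "music": 0.02},
}
prior = {"ipod": 0.5, "harry": 0.5}  # P(z = i), assumed uniform here

def p_word(w):
    """P(w) = sum over topics of P(z) * P(w | topic)."""
    return sum(prior[z] * topics[z].get(w, 0.0) for z in topics)

print(round(p_word("music"), 3))  # "music" is generated by both topics
```

A word like “music” gets probability mass from both topics, which is exactly why the affiliation of each occurrence is hidden and must be inferred.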
Parameter Estimation

• Maximum Likelihood Estimation (MLE):
Θ* = arg max_Θ P( D | Θ )
• Parameter estimation using the EM algorithm
– Gibbs sampling, variational inference, expectation propagation

With the word probabilities of both topics unknown (“?”), EM alternates between guessing the topic affiliation of each word in the text (“I downloaded the music of the movie harry potter to my ipod nano”) and re-estimating the parameters from the resulting pseudo-counts.
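A minimal EM sketch for a two-topic unigram mixture (PLSA-style), showing the E-step (“guess the affiliation”) and M-step (“estimate the params”) from the slide. The toy corpus and the initialization are assumptions, not the talk’s data:

```python
# Minimal EM sketch for a two-topic unigram mixture. E-step: posterior
# topic affiliation per word; M-step: pseudo-counts -> new parameters.
from collections import Counter

words = "ipod music download ipod movie harry potter movie".split()
counts = Counter(words)
vocab = sorted(counts)

# fixed, slightly asymmetric initialization of P(w | z) for two topics
p_wz = [{w: 1.0 / len(vocab) + (0.01 if i == (j % 2) else -0.01)
         for j, w in enumerate(vocab)} for i in range(2)]
p_z = [0.5, 0.5]

for _ in range(50):
    # E-step: posterior P(z | w) for each word
    post = {w: [p_z[i] * p_wz[i][w] for i in range(2)] for w in vocab}
    for w in vocab:
        s = sum(post[w])
        post[w] = [x / s for x in post[w]]
    # M-step: expected (pseudo) counts, then renormalize
    pc = [[counts[w] * post[w][i] for w in vocab] for i in range(2)]
    totals = [sum(row) for row in pc]
    p_z = [t / sum(totals) for t in totals]
    p_wz = [{w: pc[i][j] / totals[i] for j, w in enumerate(vocab)}
            for i in range(2)]

assert abs(sum(p_wz[0].values()) - 1.0) < 1e-9  # each topic stays a distribution
```

Gibbs sampling, variational inference, and expectation propagation replace this exact E-step with approximate posterior inference, but the alternation is the same idea.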
How Context Affects Topics
“Context of Situation” - B. Malinowski 1923

• Topics in science literature: 16th century vs. 21st century
• When do a computer scientist and a gardener both write about “tree, root, prune”?
• In Europe, “football” appears a lot in a soccer report. What about in the US?

Text is generated according to the context!
Existing Work

• PLSA (Hofmann ’99), LDA (Blei et al. ’03), CTM (Blei et al. ’06), PAM (Li and McCallum ’06)
– Don’t incorporate contexts
• Author: Author-Topic Model (Steyvers et al. ’04)
• Time: Topics over Time (Wang et al. ’06), Dynamic Topic Model (Blei et al. ’06)

Can we capture the context in a general way?
Contextualized Models

P( word | Model, Context )

P(w | M, Year = 1998): book 0.15, harry 0.10, potter 0.08, rowling 0.05
P(w | M, Year = 2008): movie 0.18, harry 0.09, potter 0.08, director 0.04
Other contexts: Sentiment = +, Source = official, Location = China, Location = US

Inference: how to reveal contextual patterns?
Generation: how to select contexts? How to model context structure?
Roadmap: Modeling Simple Context

Simple contexts: Author, Time, Location, Author’s occupation, Source, Language
Simple Contextual Topic Model
(Mei and Zhai KDD’06)

P(w) = Σ_{j=1..C} P(c = j) · Σ_{i=1..K} P(z = i | c_j) · P(w | Topic_i, c_j)

Contextual topic patterns:
• Topic 1 “Apple iPod” - Context 1 (2004): ipod, mini, 4gb; Context 2 (2007): ipod, iphone, nano
• Topic 2 “Harry Potter” - Context 1 (2004): harry, prisoner, azkaban; Context 2 (2007): potter, order, phoenix

Example text: “I downloaded the music of the movie harry potter to my iphone”
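The contextualized mixture can be sketched directly from the formula: sum over contexts, then over topics, with context-specific word distributions. All numbers below are illustrative assumptions, not estimates from real data:

```python
# Sketch of the contextual mixture from the slide:
# P(w) = sum_j P(c = j) * sum_i P(z = i | c_j) * P(w | Topic_i, c_j).

p_c = {2004: 0.5, 2007: 0.5}                       # context prior P(c)
p_z_given_c = {2004: {"ipod": 0.6, "harry": 0.4},  # topic coverage per context
               2007: {"ipod": 0.5, "harry": 0.5}}
p_w = {  # context-specific topic word distributions P(w | topic, context)
    (2004, "ipod"):  {"ipod": 0.2, "mini": 0.1, "4gb": 0.05},
    (2007, "ipod"):  {"ipod": 0.2, "iphone": 0.1, "nano": 0.05},
    (2004, "harry"): {"harry": 0.2, "prisoner": 0.1, "azkaban": 0.05},
    (2007, "harry"): {"potter": 0.2, "order": 0.1, "phoenix": 0.05},
}

def p_word(w):
    return sum(p_c[c] * sum(p_z_given_c[c][z] * p_w[(c, z)].get(w, 0.0)
                            for z in p_z_given_c[c])
               for c in p_c)

print(round(p_word("iphone"), 3))  # only generated under context 2007
```

Fitting p_z_given_c and p_w from data is what yields the contextual topic patterns on the slide, e.g. “iphone” only appearing under the 2007 context.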
Example: Topic Life Cycles
(Mei and Zhai KDD’05)

Context = Time
Contextual topic pattern → P(z | time)

[chart: Hot Topics in SIGMOD over time - Sensor Networks; Structured data, XML; Web data; Data Streams; Ranking, Top-K]
Example: Spatiotemporal Theme Pattern
(Mei et al. WWW’06)

Context = Time & Location
Contextual topic pattern → P(z | time, location)

Topic: Government Response in Hurricane Katrina
[maps: theme strength over time and location, Hurricane Katrina and Hurricane Rita]
Example: Event Impact Analysis
(Mei and Zhai KDD’06)

Context = Event
Contextual pattern → P(w | z, event)

Topic “retrieval models” (SIGIR): term 0.16, relevance 0.08, weight 0.07, feedback 0.04, model 0.03, probabilistic 0.02, document 0.02, …

Before the events:
• Traditional models: vector 0.05, concept 0.03, model 0.03, space 0.02, boolean 0.02, function 0.01, …
• Probabilistic models: probabilist 0.08, model 0.04, logic 0.04, boolean 0.03, algebra 0.02, weight 0.01, …

1992 - starting of TREC:
• Evaluation & applications: xml 0.07, email 0.02, model 0.02, collect 0.02, judgment 0.01, rank 0.01, …

1998 - [Ponte and Croft 98]:
• Language models: model 0.17, language 0.08, estimate 0.05, parameter 0.03, distribution 0.03, smooth 0.02, likelihood 0.01, …
Instantiation: Personalized Search
(Mei and Church WSDM’08)
Personalization with Backoff
• Ambiguous query: MSR
– Microsoft Research
– Mountain Safety Research
• Disambiguate based on user’s prior clicks
• We don’t have enough data for everyone!
– Backoff to classes of users
• Proof of Concept:
– Context = Classes of Users defined by IP address
Personalized Search as Contextual Text Mining

Text: query (click) logs - (IP, Query, URL) triples
Context → users (IP) and groups of users:
156.111.188.243
156.111.188.*
156.111.*.*
156.*.*.*
*.*.*.*

Text model: P(URL | Query)
Contextual model: P(URL | Query, User)
Goal: estimate a better P(URL | Query, User)
Evaluation Metric: Entropy (H)

H(URL) = − Σ_URL p(URL) log p(URL)

• Difficulty of encoding information (a distribution)
– Size of search space; difficulty of a task
• Powerful tool for sizing challenges and opportunities
– How hard is web search?
– How much does personalization help?
• Predict the future → cross entropy H(Future | History)
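The metric is a one-liner. The click distributions below are illustrative (a skewed “easy” query vs. a uniform 8-way “hard” query), not measured data:

```python
# Entropy of a URL (click) distribution, H = -sum p * log2(p),
# used on the slide to size the difficulty of a search task.
import math

def entropy(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

easy = {"google.com": 0.80, "google.cn": 0.10, "maps.google": 0.08, "other": 0.02}
hard = {u: 1.0 / 8 for u in range(8)}  # 8 equally likely URLs

print(round(entropy(easy), 2))  # low entropy: easy query
print(round(entropy(hard), 2))  # -> 3.0 bits for 8 equal outcomes
```

A uniform distribution over 8 URLs costs log2(8) = 3 bits; the skewed distribution costs about 1 bit, which is the sense in which the easy query has a smaller search space.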
Difficulty of Queries

• Easy queries (low H(URL | Q)): google, yahoo, myspace, ebay, …
• Hard queries (high H(URL | Q)): dictionary, yellow pages, movies, “what is may day?”

Hard query “MSR” - high entropy: msrgear.com 0.12, msracing.com 0.10, research....com 0.09, msrwheels.com 0.08, msr.com 0.07, msr.org 0.07, msrdev.com 0.06, … 0.05

Easy query “Google” - low entropy: google.com 0.80, google.cn 0.10, maps.google 0.08, … ~0
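Per-query difficulty is the conditional entropy H(URL | Q), computable from a click log via the chain rule H(URL | Q) = H(Q, URL) − H(Q). The click counts below are made up for illustration:

```python
# Conditional entropy H(URL | Query) from a toy click log.
import math
from collections import Counter

clicks = [("google", "google.com")] * 8 + [("google", "maps.google")] * 2 \
       + [("msr", "msr.com")] * 3 + [("msr", "msracing.com")] * 4 \
       + [("msr", "msrgear.com")] * 3

def H(counter):
    """Entropy (bits) of the empirical distribution in a Counter."""
    n = sum(counter.values())
    return -sum(c / n * math.log2(c / n) for c in counter.values())

h_joint = H(Counter(clicks))            # H(Q, URL)
h_query = H(Counter(q for q, _ in clicks))  # H(Q)
print(round(h_joint - h_query, 2))      # H(URL | Q)
```

In this toy log the ambiguous “msr” query spreads its clicks, so it dominates the conditional entropy, mirroring the hard/easy contrast on the slide.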
How Hard Is Search?

• Traditional search: H(URL | Query) = 2.8 (= 23.9 − 21.1)
• Personalized search: H(URL | Query, IP) = 1.2 (= 27.2 − 26.0)

Personalization cuts H in half!

Measured entropies (H):
Query: 21.1
URL: 22.1
IP: 22.1
Query, URL: 23.9
Query, IP: 26.0
IP, URL: 27.1
All three: 27.2
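The headline numbers follow mechanically from the chain rule H(A | B) = H(A, B) − H(B) applied to the measured joint entropies in the table:

```python
# Reproducing the slide's arithmetic from its measured joint entropies.
H = {("q",): 21.1, ("url",): 22.1, ("ip",): 22.1,
     ("q", "url"): 23.9, ("q", "ip"): 26.0, ("ip", "url"): 27.1,
     ("q", "ip", "url"): 27.2}

h_url_given_q = H[("q", "url")] - H[("q",)]                # traditional search
h_url_given_q_ip = H[("q", "ip", "url")] - H[("q", "ip")]  # personalized search

print(round(h_url_given_q, 1))     # -> 2.8
print(round(h_url_given_q_ip, 1))  # -> 1.2, roughly half
```

Conditioning on IP can only lower the entropy of the URL; the table shows the drop is large, which is the argument that personalization is worth the sparsity it introduces.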
Context = First k Bytes of IP

Full personalization - every user has a different model: sparse data!
No personalization - all users share the same model: missed opportunity.

Personalization with backoff - smooth by similar users:

P(URL | User, Q) ≈ λ4 P(URL | IP4, Q) + λ3 P(URL | IP3, Q) + λ2 P(URL | IP2, Q) + λ1 P(URL | IP1, Q) + λ0 P(URL | IP0, Q)

IP4 = 156.111.188.243, IP3 = 156.111.188.*, IP2 = 156.111.*.*, IP1 = 156.*.*.*, IP0 = *.*.*.*
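The backoff interpolation can be sketched as a weighted sum over IP-prefix granularities. The λ weights and the per-prefix click models here are assumptions for illustration, not the talk’s trained values:

```python
# Sketch of personalization with backoff:
# P(URL | user, q) ~= sum_k lambda_k * P(URL | IP_k, q).

def ip_prefixes(ip):
    """'156.111.188.243' -> ['*', '156.*', '156.111.*', '156.111.188.*', full ip]."""
    parts = ip.split(".")
    return ["*"] + [".".join(parts[:k]) + ".*" for k in range(1, 4)] + [ip]

# toy P(URL | IP_k, q) for the query "msr", keyed by IP prefix
prefix_model = {
    "*":               {"msr.com": 0.3, "msracing.com": 0.4, "research": 0.3},
    "156.*":           {"msr.com": 0.2, "research": 0.8},
    "156.111.*":       {"research": 1.0},
    "156.111.188.*":   {"research": 1.0},
    "156.111.188.243": {"research": 1.0},
}
lambdas = [0.1, 0.1, 0.2, 0.3, 0.3]  # lambda_0..lambda_4, sum to 1

def p_url(url, ip):
    return sum(lam * prefix_model.get(pre, {}).get(url, 0.0)
               for lam, pre in zip(lambdas, ip_prefixes(ip)))

print(round(p_url("research", "156.111.188.243"), 2))
```

When a full IP is unseen, its finest prefix models are empty and the estimate falls back on the coarser classes, which is exactly the smoothing-by-similar-users idea.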
Context Market Segmentation
• Can we do better than IP address?
• Potential Context Variables
– ID, QueryType, Click, Intent, …
– Demographics (Age, Gender, Income, …)
– Time of day & Day of Week
Roadmap: Modeling Implicit Context

Implicit contexts: Sentiment
Implicit Context of Text

Sentiments, intents, impact, trust, …

Need to infer these situations/conditions from the data (with prior knowledge).
Modeling Implicit Context

Model trained from training data, or guidance from user - added as a prior:

Θ* = arg max_Θ P( D | Θ ) P( Θ )

Positive prior: good 0.10, like 0.05, perfect 0.02
Negative prior: hate 0.21, awful 0.03, disgust 0.01

Topic 1 “Apple iPod” facets: color, size, quality, price, scratch, problem
Topic 2 “Harry Potter” facets: actress, music, visual, director, accent, plot

Example text: “I like the song of the movie on my perfect ipod but hate the accent”
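One common way to realize the MAP estimate Θ* = arg max P(D | Θ) P(Θ) is a conjugate (Dirichlet) prior, so the prior words act as pseudo-counts added to the observed counts. The prior strengths below are assumptions for the sketch:

```python
# MAP estimate sketch: a positive-sentiment prior enters as Dirichlet
# pseudo-counts, so theta* = (count + pseudo-count) / total.
from collections import Counter

doc = "i like the perfect ipod but hate the accent".split()
prior_pos = {"good": 5.0, "like": 5.0, "perfect": 5.0}  # assumed strengths

counts = Counter(doc)
vocab = set(doc) | set(prior_pos)
total = sum(counts.values()) + sum(prior_pos.values())

p_map = {w: (counts[w] + prior_pos.get(w, 0.0)) / total for w in vocab}
print(round(p_map["like"], 3))  # boosted by the positive-sentiment prior
```

Words named in the prior (“like”, “perfect”) get extra mass even from a short document, which is how user guidance pulls a topic toward a sentiment facet.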
Example: Faceted Opinion Summarization
(Mei et al. WWW’07)

Context = Sentiment

The Da Vinci Code
• Topic 1 (Movie) - positive: “Tom Hanks, who is my favorite movie star act the leading role.”; negative: “Protesting.. you will lose your faith by watching the movie.”
• Topic 2 (Book) - positive: “a good book to past time.”; negative: “... so sick of people making such a big deal about a fiction book”
Roadmap: Modeling Complex Context

Complex contexts: Social Network
Complex Context of Text

Structures of contexts:
• Find novel contextual patterns;
• Regularize contextual models;
• Alleviate data sparseness.
Modeling Complex Context

O( D ) = Likelihood + Regularization

Intuition: if contexts A and B are closely related, Model(A) and Model(B) should be similar.
• Users in the same building issue similar queries
• Collaborating researchers work on similar things
• Topics in SIGMOD are like topics in VLDB

Example - Topic 1: ipod, nano, 4gb (A) vs. ipod, nano, 8gb (B); Topic 2: harry, potter, actor (A) vs. harry, potter, actress (B)
Graph-based Regularization

Structure of contexts → a graph. Intuition: for an edge (u, v), Model(u) and Model(v) should be similar.

O( D, G ) = Likelihood + Regularization

Θ → surface(s) on top of the graph (its projection on a plane): Θ_u, Θ_v estimated by MLE vs. smoothed Θ.

Intuition = regularized model = smoothed surfaces!
Instantiation: Topical Community
Extraction (Mei et al. WWW’08)
Social Network Analysis

• Generation, evolution - e.g., [Leskovec 05]
• Community extraction - e.g., [Kleinberg 00]; Kleinberg and Backstrom 2006, New York Times
• Diffusion - [Gruhl 04]; [Backstrom 06]
• Search - e.g., [Adamic 05]
• Ranking - e.g., [Brin and Page 98]; [Kleinberg 98]; Jeong et al. 2001, Nature 411

These usually don’t model topics in text.
Topical Community Analysis

Topics in text help community extraction; Text + Network → topical communities.

Example communities: {physicist, physics, scientist, theory, gravitation, …} vs. {writer, novel, best-sell, book, language, film, …}

Computer Science = Information Retrieval + Data Mining + Machine Learning + …
Topical Community Extraction as Contextual Text Mining

Text: scientific publications
Context → authors
Context structure: social network (coauthorship)
Text model: topic model
Contextual model: topic model + author - regularized using the social network
Goal: assign authors into topical communities using P(z | author)
Topic Modeling with Network Regularization

Intuition 1: know my research topics from my publications (data likelihood).
Intuition 2: I work on similar topics with my coauthors (smoothness of Θ between neighbors).

O( D, G ) = (1 − λ) Σ_c Σ_w c(w, c) log Σ_{j=1..k} p(j | c) p(w | θ_j)
            − λ · (1/2) Σ_{(u,v)∈E} w(u, v) Σ_{j=1..k} ( p(j | u) − p(j | v) )²

λ trades off between MLE and smoothness; the second term is a graph harmonic regularizer (a generalization of [Zhu ’03]).
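Evaluating the regularized objective is straightforward once its two parts are written out: a weighted log-likelihood plus a graph-smoothness penalty over coauthor edges. The authors, topics, counts, and λ below are toy assumptions; the real model optimizes this objective rather than just evaluating it:

```python
# Sketch: (1 - lam) * log-likelihood minus lam * smoothness penalty.
import math

lam = 0.5
# p(j | c): topic mixtures for three author contexts
p_topic = {"a": [0.9, 0.1], "b": [0.8, 0.2], "c": [0.1, 0.9]}
p_w = [{"retrieval": 0.7, "mining": 0.3}, {"retrieval": 0.2, "mining": 0.8}]
counts = {("a", "retrieval"): 3, ("b", "retrieval"): 2, ("c", "mining"): 4}
edges = [("a", "b", 1.0), ("b", "c", 1.0)]  # weighted coauthor graph

likelihood = sum(n * math.log(sum(p_topic[c][j] * p_w[j][w]
                                  for j in range(2)))
                 for (c, w), n in counts.items())
smoothness = 0.5 * sum(w * sum((p_topic[u][j] - p_topic[v][j]) ** 2
                               for j in range(2))
                       for u, v, w in edges)

objective = (1 - lam) * likelihood - lam * smoothness
print(round(objective, 3))
```

Here the b-c edge links two authors with very different mixtures, so it dominates the penalty; maximizing the objective would pull their topic mixtures together unless the text evidence resists.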
Topics & Communities without Network Regularization

Fuzzy topics (four conferences: SIGIR, KDD, NIPS, WWW):

Topic 1: term 0.02, question 0.02, protein 0.01, training 0.01, weighting 0.01, multiple 0.01, recognition 0.01, relations 0.01, library 0.01
Topic 2: peer 0.02, patterns 0.01, mining 0.01, clusters 0.01, stream 0.01, frequent 0.01, e 0.01, page 0.01, gene 0.01
Topic 3: visual 0.02, analog 0.02, neurons 0.02, vlsi 0.01, motion 0.01, chip 0.01, natural 0.01, cortex 0.01, spike 0.01
Topic 4: interface 0.02, towards 0.02, browsing 0.02, xml 0.01, generation 0.01, design 0.01, engine 0.01, service 0.01, social 0.01

Noisy community assignment (?)
Topics & Communities with Network Regularization

Clear topics:

Topic 1 (Information Retrieval): retrieval 0.13, information 0.05, document 0.03, query 0.03, text 0.03, search 0.03, evaluation 0.02, user 0.02, relevance 0.02
Topic 2 (Data mining): mining 0.11, data 0.06, discovery 0.03, databases 0.02, rules 0.02, association 0.02, patterns 0.02, frequent 0.01, streams 0.01
Topic 3 (Machine learning): neural 0.06, learning 0.02, networks 0.02, recognition 0.02, analog 0.01, vlsi 0.01, neurons 0.01, gaussian 0.01, network 0.01
Topic 4 (Web): web 0.05, services 0.03, semantic 0.03, services 0.03, peer 0.02, ontologies 0.02, rdf 0.02, management 0.01, ontology 0.01

Coherent community assignment
Topic Modeling and SNA Improve Each Other

(the smaller the better for cut measures)
• PLSA (text only): cut edge weights 4831; ratio cut / norm. cut 2.14 / 1.25; community sizes 2280, 2178, 2326, 2257
• NetPLSA: cut edge weights 662; ratio cut / norm. cut 0.29 / 0.13; community sizes 2636, 1989, 3069, 1347
• NCut (network only): cut edge weights 855; ratio cut / norm. cut 0.23 / 0.12; community sizes 2699, 6323, 8, 11

- NCut: spectral clustering with normalized cut (Shi et al. ’00)

Topic modeling helps balance communities (text implicitly bridges authors).
Network regularization helps extract coherent communities (ensures tight connection of authors).
Summary of My Talk

• Text + Context = Contextual Text Mining
– A new paradigm of text mining
• General methodology for contextual text mining
– Generative models of text (e.g., topic models)
– Contextualized models with simple, implicit, and complex contexts
• Applications of contextual text mining
Take Away Message

Text + Context = Contextual Text Mining
A Roadmap of My Work

• Contextual Topic Models: KDD 05, WWW 06, KDD 06b, WWW 07, WWW 08, KDD 08, WSDM 08
• Text Mining: annotating frequent patterns (KDD 06a); labeling topic models (KDD 07)
• Application to Bioinfo.: bio. literature mining (PSB 06, IP&M 07)
• Information Retrieval & Web Search: Poisson language models (SIGIR 07); graph-based smoothing (SIGIR 08); impact-based summarization (ACL 08); query suggestion using hitting time (CIKM 08)
A Roadmap to the Future

• Theoretical framework: computational challenges; structure of contexts
• Task support systems: web users, scientists, business users
• Core: text mining, text information management, information retrieval & web search
• Interdisciplinary applications: bioinformatics, health informatics, business informatics
• Integrative analysis of heterogeneous data: Web 2.0 data, science data, information networks
Thanks!
Predict the Future

Cross entropy: H(future | history)
• An IP in the future might not be seen in the history

[chart: cross entropy vs. k, where at least the first k bytes of the IP are seen in history (k = 0..4) - no personalization, complete personalization (knows every byte: enough data), and personalization with backoff (knows at least two bytes)]