Contextual Text Mining
Qiaozhu Mei ([email protected])
University of Illinois at Urbana-Champaign
2009 © Qiaozhu Mei, University of Illinois at Urbana-Champaign

Knowledge Discovery from Text
[Figure: a text mining system turns raw text into discovered knowledge]

Trend of Text Content (Ramakrishnan and Tomkins 2007)
• Published content: 3-4 GB / day
• Professional web content: ~2 GB / day
• User-generated content: 8-10 GB / day
• Private text content: ~3 TB / day

Text on the Web (Unconfirmed)
[Figure: estimated sizes and daily posting rates of major web text sources (~100B pages; ~10B; ~3M /day; ~750k /day; ~150k /day; 1M; 6M). Where to start? Where to go?]

Context Information in Text
A single post carries rich context:
• Author (and the author's self-described occupation: designer, publisher, editor, ...)
• Time (3:53 AM Jan 28th)
• Location (Chek Lap Kok, HK)
• Source (from Ping.fm)
• Sentiment
• Language
• Social network

Rich Context in Text
[Figure: context statistics of major web services, e.g. ~150k bookmarks/day and 5M users; ~3M msgs/day, 500M URLs, ~2M users; 73 years of content, ~400k authors, ~4k sources; ~300M words/month; 8M contributors, 100+ languages; 102M blogs, 750K posts/day; 100M users, >1M groups; 1B queries (per hour? per IP?)]

Text + Context = ?
Text + Context = Guidance. ("I have a guide!")

Query + User = Personalized Search
"MSR" has many interpretations: Metropolis Street Racer, Magnetic Stripe Reader, Molten Salt Reactor, Modern System Research, Mars Sample Return, Medical Simulation, Montessori School of Raleigh, MSR Racing, Mountain Safety Research, Wikipedia definitions, ... If you know me, you should give me Microsoft Research.
• How much can personalization help?
Customer Review + Brand = Comparative Product Summary
From IBM / APPLE / DELL laptop reviews, summarized over common themes:

Common Themes | IBM               | APPLE              | DELL
Battery Life  | Long, 4-3 hrs     | Medium, 3-2 hrs    | Short, 2-1 hrs
Hard disk     | Large, 80-100 GB  | Small, 5-10 GB     | Medium, 20-50 GB
Speed         | Slow, 100-200 MHz | Very Fast, 3-4 GHz | Moderate, 1-2 GHz

• Can we compare products?

Literature + Time = Topic Trends
[Figure: hot topics in SIGMOD over time: Sensor Networks; Structured data, XML; Web data; Data Streams; Ranking, Top-K]
• What's hot in literature?

Blogs + Time & Location = Spatiotemporal Topic Diffusion
[Figure: topic intensity maps taken one week apart]
• How does discussion spread?

Blogs + Sentiment = Faceted Opinion Summary
The Da Vinci Code:
• Positive: "Tom Hanks, who is my favorite movie star, acts the leading role." "a good book to pass time"
• Negative: "protesting ... will lose your faith by watching the movie." "... so sick of people making such a big deal about a fiction book"
• What is good and what is bad?

Publications + Social Network = Topical Community
[Figure: coauthor network with communities for information retrieval, machine learning, and data mining]
• Who works together on what?

A General Solution for All
• Query log + User = Personalized Search
• Literature + Time = Topic Trends
• Review + Brand = Comparative Opinion
• Blog + Time & Location = Spatiotemporal Topic Diffusion
• Blog + Sentiment = Faceted Opinion Summary
• Publications + Social Network = Topical Community
• ...
Text + Context = Contextual Text Mining

Contextual Text Mining
• Generative Model of Text
• Modeling Simple Context
• Modeling Implicit Context
• Modeling Complex Context
• Applications of Contextual Text Mining

Generative Model of Text
A model is a word distribution; documents are generated by sampling words from it: P(word | Model).
• Example model: the 0.1, is 0.07, harry 0.05, potter 0.04, movie 0.04, plot 0.02, time 0.01, rowling 0.01, ...
• Generation: Model -> "the movie harry potter is based on j. k. rowling ..."
• Inference, estimation: text -> Model

Contextualized Models
P(word | Model, Context)
• Year = 1998: book 0.15, harry 0.10, potter 0.08, rowling 0.05, ...
• Year = 2008: movie 0.18, harry 0.09, potter 0.08, director, book, ...
• Other contexts: Sentiment = +, Source = official, Location = China, Location = US
• Inference: How to estimate contextual models? How to reveal contextual patterns?
• Generation: How to select contexts? How to model the relations of contexts?

Topics in Text
• Topic (theme) = the subject of a discourse
• A topic covers multiple documents; a document has multiple topics
• Topic = a soft cluster of documents
• Topic = a multinomial distribution over words, e.g. search 0.2, engine 0.15, query 0.08, user 0.07, ranking 0.06 (a web search topic, next to data mining and machine learning)
• Many text mining tasks: extracting topics from text; revealing contextual topic patterns

Probabilistic Topic Models
A document mixes k topics:
P(w) = Σ_{i=1..k} P(z = i) P(w | θ_i)
• Topic 1 (Apple iPod): ipod 0.15, nano 0.08, music 0.05, download 0.02, apple 0.01, ...
• Topic 2 (Harry Potter): movie 0.10, harry 0.09, potter 0.09, actress 0.05, music 0.04, ...
• Example: "I downloaded the music of the movie harry potter to my ipod nano", with each word drawn from one of the topics

Parameter Estimation
• Maximize the data likelihood: θ* = arg max_θ log P(Data | Model)
• Parameter estimation using the EM algorithm:
  - E-step: guess the topic affiliation of each word occurrence (pseudo-counts)
  - M-step: re-estimate the parameters from the pseudo-counts

How Context Affects Topics
• Topics in science literature: 16th century vs. 21st century
• When do a computer scientist and a gardener use "tree, root, prune" in text?
• What does "tree" mean in "algorithm"?
• In Europe, "football" appears a lot in a soccer report. What about in the US?
Text is generated according to the context!

Simple Contextual Topic Model
P(w) = Σ_{j=1..C} P(c_j) Σ_{i=1..k} P(z = i | c_j) P(w | θ_i, c_j)
• Topic 1 (Apple iPod): ipod, mini, 4gb in Context 1 (2004); ipod, iphone, nano in Context 2 (2007)
• Topic 2 (Harry Potter): harry, prisoner, azkaban in 2004; potter, order, phoenix in 2007

Contextual Topic Patterns
• Compare contextualized versions of topics: contextual topic patterns
• Contextual topic patterns = conditional distributions (z: topic; c: context; w: word)
• P(z | c_i) (or P(c | z_j)): the strength of topics in a context
• P(w | z, c_i): the content variation of topics across contexts

Example: Topic Life Cycles (Mei and Zhai KDD'05)
Context = time; comparing P(c | z).
[Figure: normalized strength of themes over 1999-2004: Biology Data, Web Information, Time Series, Classification, Association Rule, Clustering, Business]

Example: Spatiotemporal Theme Pattern (Mei et al. WWW'06)
Context = time & location; comparing P(z | c). Theme: government response in Hurricane Katrina.
• Week 1: the theme is the strongest along the Gulf of Mexico
• Week 2: the discussion moves towards the north and west
• Week 3: the theme distributes more uniformly over the states
• Week 4: the theme is again strong along the east coast and the Gulf of Mexico
• Week 5: the theme fades out in most states

Example: Evolutionary Topic Graph (Mei and Zhai KDD'05)
Context = time; comparing P(w | z, c). Topics in KDD papers, 1999-2004, e.g.:
• SVM 0.007, criteria 0.007, classification 0.006, linear 0.005, ...
• decision 0.006, tree 0.006, classifier 0.005, class 0.005, Bayes 0.005, ...
• mixture 0.005, random 0.006, cluster 0.006, clustering 0.005, variables 0.005, ...
• web 0.009, classification 0.007, features 0.006, topic 0.005, ...
• Classification: text 0.015, unlabeled 0.013, document 0.012, labeled 0.008, learning 0.008, ...
• Information: web 0.010, social 0.008, retrieval 0.007, distance 0.005, networks 0.004, ...
• 2004: topic 0.010, mixture 0.008,
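The E-step / M-step loop from the Parameter Estimation slide can be sketched as a two-topic unigram mixture. Everything below (the toy corpus, the topic count K, the initialization) is an illustrative assumption, not the talk's actual setup:

```python
from collections import Counter

# Toy corpus: each document is a bag of words (illustrative only).
docs = [
    "ipod nano music download apple ipod".split(),
    "movie harry potter actress movie music".split(),
    "downloaded the music of the movie harry potter to my ipod nano".split(),
]
vocab = sorted({w for d in docs for w in d})
K = 2  # number of topics

# Initialize P(w | topic) unevenly so EM can break symmetry.
p_wz = [{w: float(i + k + 1) for i, w in enumerate(vocab)} for k in range(K)]
for k in range(K):
    z = sum(p_wz[k].values())
    p_wz[k] = {w: v / z for w, v in p_wz[k].items()}
# P(topic | doc), initialized uniformly.
p_zd = [[1.0 / K] * K for _ in docs]

for _ in range(50):
    new_wz = [Counter() for _ in range(K)]
    new_zd = []
    for d, doc in enumerate(docs):
        zcount = [0.0] * K
        for w in doc:
            # E-step: posterior topic affiliation of this word occurrence.
            post = [p_zd[d][k] * p_wz[k][w] for k in range(K)]
            s = sum(post)
            for k in range(K):
                new_wz[k][w] += post[k] / s  # pseudo-counts
                zcount[k] += post[k] / s
        new_zd.append([c / len(doc) for c in zcount])
    # M-step: re-estimate the parameters from the pseudo-counts.
    for k in range(K):
        z = sum(new_wz[k].values())
        p_wz[k] = {w: new_wz[k].get(w, 0.0) / z for w in vocab}
    p_zd = new_zd
```

After a few iterations the two word distributions drift toward the "iPod" and "Harry Potter" vocabularies, mirroring the slide's example.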
LDA 0.006, semantic 0.005, ...

Example: Event Impact Analysis (Mei and Zhai KDD'06)
Context = event; comparing P(w | z, c). Theme in SIGIR papers: retrieval models (term 0.1599, relevance 0.0752, weight 0.0660, feedback 0.0372, model 0.0310, probabilistic 0.0188, document 0.0173, ...).
• Before the start of the TREC conferences (1992): vector 0.0678, concept 0.0197, model 0.0191, space 0.0187, boolean 0.0102, function 0.0097, ...; afterwards: xml 0.0514, email 0.0298, model 0.0291, collect 0.0236, judgment 0.0151, rank 0.0123, ...
• Before the publication of the paper "A language modeling approach to information retrieval" (1998): probabilist 0.0778, model 0.0432, logic 0.0404, boolean 0.0281, algebra 0.0200, estimate 0.0119, weight 0.0111, ...; afterwards: model 0.1687, language 0.0753, estimate 0.0520, parameter 0.0281, distribution 0.0268, smooth 0.0198, likelihood 0.0059, ...

Implicit Context in Text
• Some contexts are hidden: sentiments, intents, impact, etc.
• Document contexts: we don't know the affiliation for sure; it must be inferred from the data
• Train a model M for each implicit context
• Provide M to the topic model as guidance

Modeling Implicit Context
• Sentiment models serve as guidance, e.g. positive: good 0.10, like 0.05, perfect 0.02; negative: hate 0.21, awful 0.03, disgust 0.01
• Topic words (Apple iPod: color, size, quality, price, scratch, problem; Harry Potter: actress, music, visual, director, accent, plot) mix with sentiment words whose affiliation must be inferred, as in "perfect ipod but hate the accent"

Semi-supervised Topic Model (Mei et al. WWW'07)
• Add Dirichlet priors to the topics, encoding guidance from the user (e.g., seed words "love, great" for one topic and "hate, awful" for another)
• Maximum likelihood estimation (MLE): θ* = arg max_θ log P(D | θ)
• Maximum a posteriori (MAP) estimation: θ* = arg max_θ log( P(D | θ) P(θ) )
• The conjugate prior is similar to adding pseudo-counts to the observations

Example: Faceted Opinion Summarization (Mei et al. WWW'07)
Context = topic & sentiment. The Da Vinci Code, summarized by facet:
• Topic 1 (Movie). Neutral: "... Ron Howard's selection of Tom Hanks to play Robert Langdon." Positive: "Tom Hanks stars in the movie, who can be mad at that?" Negative: "But the movie might get delayed, and even killed off if he loses."
• Topic 2 (Book). Neutral: "I remembered when I first read the book, I finished the book in two days." Positive: "Awesome book." Negative: "... so sick of people making such a big deal about a FICTION book and movie."
• Other posts: "After watching the movie I went online and did some research on ..."; "I'm reading 'Da Vinci Code' now. Anybody is interested in it?"; "So still a good book to pass time."; "This controversial book causes lots of conflict in western society."

Results: Sentiment Dynamics
• Facet "the book The Da Vinci Code": bursts during the movie, Pos > Neg
• Facet "the impact on religious beliefs": bursts during the movie, Neg > Pos

Results: Topic with User's Guidance
Guidance from the user: "I know two topics should look like this." Topics for iPod:
• No prior: mixed topics such as battery/nano, marketing, and ads/spam (battery, nano, shuffle, charge, usb, hour mingled with free, microsoft, market, offer, zune, freepay, freeipod, sale, trial, virus)
• With prior: clean topics such as Nano (nano, color, thin, hold, 4gb, mini, inch) and Battery (battery, charge, usb, dock, hour, itune, life, rechargeable)

Complex Context in Text
• Complex context = structure among contexts
• Many contexts have latent structure: time, location, social network
• Why model context structure?
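The MAP estimate with a Dirichlet prior behaves exactly like adding pseudo-counts before normalizing. A minimal sketch, with made-up counts and hypothetical seed words standing in for the user's guidance:

```python
def map_estimate(word_counts, prior_words, prior_strength):
    """MAP estimate of a topic's word distribution.

    A conjugate Dirichlet prior concentrated on user-supplied seed words
    acts exactly like adding pseudo-counts before normalizing; with
    prior_strength = 0 this reduces to the MLE.
    """
    pseudo = {w: prior_strength for w in prior_words}
    vocab = set(word_counts) | set(pseudo)
    total = sum(word_counts.values()) + sum(pseudo.values())
    return {w: (word_counts.get(w, 0) + pseudo.get(w, 0)) / total for w in vocab}

# Observed counts for one topic (illustrative numbers) plus seed words
# "love", "great" expressing the user's guidance for a positive facet.
counts = {"movie": 10, "book": 7, "love": 1}
theta = map_estimate(counts, prior_words=["love", "great"], prior_strength=5)
```

Increasing prior_strength pulls the estimate further toward the guidance, which is exactly how the slide's seed-word priors steer the topics.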
• Reveal novel contextual patterns
• Regularize contextual models
• Alleviate data sparseness (smoothing)

Modeling Complex Context
When contexts A and B are closely related, two intuitions apply:
• Regularization: Model(A) and Model(B) should be similar
• Smoothing: look at B if A does not have enough data
O(C) = Likelihood + Regularization

Applications of Contextual Text Mining
• Personalized search: personalization with backoff
• Social network analysis (for schools): finding topical communities
• Information retrieval (for industry labs): smoothing language models

Application I: Personalized Search

Personalization with Backoff (Mei and Church WSDM'08)
• Ambiguous query: MSG (Madison Square Garden or monosodium glutamate?)
• Disambiguate based on the user's prior clicks
• We don't have enough data for everyone! Back off to classes of users
• Proof of concept: context = segments defined by IP addresses
• Other market segmentations (demographics) are possible

Apply Contextual Text Mining to Personalized Search
• The text data: query logs
• The generative model: P(Url | Query)
• The context: users (IP addresses)
• The contextual model: P(Url | Query, IP)
• The structure of context: the hierarchical structure of IP addresses

Evaluation Metric: Entropy (H)
• H(X) = -Σ_{x∈X} p(x) log p(x)
• The difficulty of encoding information (a distribution): the size of the search space, the difficulty of a task
• H = 20 corresponds to 1 million items distributed uniformly
• A powerful tool for sizing challenges and opportunities: How hard is search? How much does personalization help?
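The entropy metric is easy to sanity-check. Assuming base-2 logarithms (so H is measured in bits), one million equally likely items indeed give H ≈ 20, matching the slide's rule of thumb:

```python
import math

def entropy(probs):
    """H(X) = -sum p(x) * log2 p(x): the average number of bits to encode X."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# One million equally likely items: H = log2(10^6) ~ 19.93, i.e. roughly
# 20 bits, as in the slide's "H = 20 <-> 1 million uniform items".
h_uniform = entropy([1e-6] * 1_000_000)
```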
How Hard Is Search?
• Traditional search: H(URL | Query) = 2.8 (= 23.9 - 21.1)
• Personalized search: H(URL | Query, IP) = 1.2 (= 27.2 - 26.0)
• Personalization cuts H in half!

Entropy (H):
Query 21.1 | URL 22.1 | IP 22.1
All But IP 23.9 | All But URL 26.0 | All But Query 27.1 | All Three 27.2

Context = First k Bytes of IP
• Full personalization (every context has a different model: sparse data!): 156.111.188.243 -> P(Url | IP_4, Q)
• Backing off: 156.111.188.* -> P(Url | IP_3, Q); 156.111.*.* -> P(Url | IP_2, Q); 156.*.*.* -> P(Url | IP_1, Q); *.*.*.* -> P(Url | IP_0, Q)
• No personalization: all contexts share the same model
• Personalization with backoff: similar contexts have similar models

Backing Off by IP
P(Url | IP, Q) = Σ_{i=0..4} λ_i P(Url | IP_i, Q)
• λ_i: the weight for the model built from the first i bytes of the IP address
• The λs are estimated with EM
• A little bit of personalization: better than too much, or too little
[Figure: estimated λ values; a large λ_4 indicates sparse data, a large λ_0 a missed opportunity]

Context = Market Segmentation
• The traditional goal of marketing: segment customers (e.g., business vs. consumer) by need and value proposition
• Need: segments ask different questions at different times
• Value: different advertising opportunities
• Segmentation variables: queries, URL clicks, IP addresses; geography and demographics (age, gender, income); time of day and day of week

Business Days v. Weekends: More Clicks and Easier Queries
[Figure: total clicks and H(Url | IP, Q) over January 2006 (the 1st is a Sunday): business days show more clicks and lower entropy]

Harder Queries at TV Time
[Figure: query entropy peaks during evening TV hours]

Application II: Information Retrieval

Application: Text Retrieval
• Document d (a text mining paper) -> document language model θ_d, p(w|d): text 4/100 = 0.04, mining 3/100 = 0.03, clustering 1/100 = 0.01, ..., data = 0, computing = 0
• Smoothed document LM θ_d', p(w|d'): text 0.039, mining 0.028, clustering 0.01, ..., data 0.001, computing 0.0005
• Query q ("data mining") -> query language model θ_q, p(w|q): data 0.5, mining 0.5 (smoothed: data 0.4, mining 0.4, clustering 0.1, ...)
• Similarity function: D(θ_q || θ_d) = Σ_{w∈V} p(w | θ_q) log [ p(w | θ_q) / p(w | θ_d) ]

Smoothing a Document Language Model
Retrieval performance depends on the smoothed LM: estimate a more accurate distribution from sparse data, e.g. P_MLE(w|d): text 0.04, mining 0.03, Assoc. 0.01, clustering 0.01, data 0, computing 0 -> smoothed with P(w | collection): text 0.039, mining 0.028, Assoc. 0.009, data 0.001, computing 0.0005 (or text 0.038, mining 0.026, Assoc. 0.008, data 0.002, computing 0.001).
• Assign non-zero prob.
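The backoff scheme above can be sketched as interpolating click models built from IP prefixes of increasing length. The λ weights and click probabilities below are invented for illustration; in the paper the λs are estimated with EM:

```python
def ip_prefixes(ip):
    """Contexts from the first k bytes of an IP, coarsest first:
    '156.111.188.243' -> ['*.*.*.*', '156.*.*.*', '156.111.*.*',
                          '156.111.188.*', '156.111.188.243']"""
    parts = ip.split(".")
    return [".".join(parts[:k] + ["*"] * (4 - k)) for k in range(5)]

def backoff_prob(url, query, ip, click_models, lambdas):
    """P(Url | IP, Q) = sum_i lambda_i * P(Url | IP_i, Q).

    click_models[context] maps (query, url) -> P(url | query) within that
    context; lambdas[i] weights the model built from the first i IP bytes.
    Both are illustrative stand-ins here.
    """
    contexts = ip_prefixes(ip)
    return sum(
        lam * click_models.get(ctx, {}).get((query, url), 0.0)
        for lam, ctx in zip(lambdas, contexts)
    )

# Hypothetical click models: globally "msg" mostly means monosodium
# glutamate, but clicks from this /24 favor Madison Square Garden.
models = {
    "*.*.*.*": {("msg", "madison-square-garden"): 0.2, ("msg", "glutamate"): 0.8},
    "156.111.188.*": {("msg", "madison-square-garden"): 0.9, ("msg", "glutamate"): 0.1},
}
lambdas = [0.3, 0.1, 0.1, 0.4, 0.1]  # made-up weights; EM-estimated in the paper
p_garden = backoff_prob("madison-square-garden", "msg", "156.111.188.243", models, lambdas)
p_glut = backoff_prob("glutamate", "msg", "156.111.188.243", models, lambdas)
```

With these numbers the /24 evidence outweighs the global model (0.42 vs. 0.28), which is the "little bit of personalization" effect the slide describes.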
to unseen words:
P(w | d) = (1 - λ) P_MLE(w | d) + λ P(w | collection)

Apply Contextual Text Mining to Smoothing Language Models
• The text data: a collection of documents
• The generative model: P(word)
• The context: the document
• The contextual model: P(w | d)
• The structure of context: a graph structure over documents
• Goal: use the graph of documents to estimate a good P(w | d)

Traditional Document Smoothing in Information Retrieval
• Collection: interpolate the MLE with a reference LM θ_ref estimated from the whole corpus: P(w | d) = (1 - λ) P_MLE(w | d) + λ P(w | θ_ref) [Ponte & Croft 98]
• Clusters: use the cluster containing d as the reference [Liu & Croft 04]
• Nearest neighbors: use the neighbors of d as the reference [Kurland & Lee 04]

Graph-based Smoothing for Language Models in Retrieval (Mei et al. SIGIR 2008)
• A novel and general view of smoothing (the graph can also be a word graph)
• Collection = a graph of documents; P(w|d) = a surface on top of the graph
• The MLE P(w|d) is a rough surface; the smoothed LM is a smoothed surface

The General Objective of Smoothing
O(C) = (1 - λ) Σ_{u∈V} w(u) (f_u - f̃_u)² + λ Σ_{(u,v)∈E} w(u,v) (f_u - f_v)²
• The first term is fidelity to the MLE f̃_u, weighted by the importance w(u) of each vertex
• The second term is the smoothness of the surface, weighted by the edge weights w(u,v) (e.g., 1/distance)

Smoothing Language Models using a Document Graph
• Construct a kNN graph of documents; w(u) = Deg(u); w(u,v) = cosine similarity; f_u = p(w | d_u)
• Document language model: P(w | d_u) = (1 - λ) P_MLE(w | d_u) + λ Σ_{v∈V} [ w(u,v) / Deg(u) ] P(w | d_v)
• Plus additional Dirichlet smoothing

Effectiveness of the Framework
(Wilcoxon test: *, **, *** mean significance levels 0.1, 0.05, 0.01)

Data Sets | Dirichlet | DMDG | DMWG† | DSDG | QMWG
AP88-90 | 0.217 | 0.254*** (+17.1%) | 0.252*** (+16.1%) | 0.239*** (+10.1%) | 0.239 (+10.1%)
LA | 0.247 | 0.258** (+4.5%) | 0.257** (+4.5%) | 0.251** (+1.6%) | 0.247
SJMN | 0.204 | 0.231*** (+13.2%) | 0.229*** (+12.3%) | 0.225*** (+10.3%) | 0.219 (+7.4%)
TREC8 | 0.257 | 0.271*** (+5.4%) | 0.271** (+5.4%) | 0.261 (+1.6%) | 0.260 (+1.2%)

† DMWG reranks the top 3000 results, which usually yields reduced performance compared with ranking all documents.
• Graph-based smoothing >> the baseline
• Smoothing the doc LM >> smoothing the relevance score >> smoothing the query LM

Intuitive Interpretation: Smoothing using a Document Graph
• Writing a word w in a document = a random walk on the document Markov chain; write down w if the walk reaches the absorbing state "1"
• From document u, the walk reaches "1" with probability (1 - λ) P_ML(w | d_u), reaches "0" with probability (1 - λ)(1 - P_ML(w | d_u)), and moves to neighbor v with probability λ w(u,v) / Deg(u)
• P(w | d_u) is the absorption probability into the "1" state: each document acts as its neighbors do

Application III: Social Network Analysis

Topical Community Analysis
• Topic modeling helps community extraction; network analysis helps topic extraction
• Example communities: {physicist, physics, scientist, theory, gravitation, ...}, {writer, novel, best-sell, book, language, film, ...}
• Computer science literature = Information Retrieval + Data Mining + Machine Learning + ..., or: Domain Review + Algorithm +
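One update of the graph-based smoothing formula can be sketched on a toy three-document graph; the edge weights stand in for cosine similarities and λ is an arbitrary choice:

```python
def smooth_on_graph(p_mle, weights, lam):
    """One graph-smoothing step:
    P(w|d_u) = (1-lam) * P_MLE(w|d_u) + lam * sum_v w(u,v)/Deg(u) * P(w|d_v),
    with Deg(u) = sum_v w(u,v). Neighbors lend probability mass to words a
    document never contains, so unseen words get non-zero probability.
    """
    vocab = {w for p in p_mle.values() for w in p}
    smoothed = {}
    for u, p_u in p_mle.items():
        deg = sum(weights.get((u, v), 0.0) for v in p_mle if v != u)
        smoothed[u] = {}
        for w in vocab:
            nbr = sum(
                weights.get((u, v), 0.0) * p_mle[v].get(w, 0.0)
                for v in p_mle if v != u
            )
            smoothed[u][w] = (1 - lam) * p_u.get(w, 0.0) + lam * nbr / deg
    return smoothed

# Toy kNN graph over three documents; weights mimic cosine similarity.
p_mle = {
    "d1": {"text": 0.5, "mining": 0.5},
    "d2": {"text": 0.4, "clustering": 0.6},
    "d3": {"mining": 0.7, "data": 0.3},
}
w = {("d1", "d2"): 0.8, ("d2", "d1"): 0.8, ("d1", "d3"): 0.2, ("d3", "d1"): 0.2,
     ("d2", "d3"): 0.5, ("d3", "d2"): 0.5}
sm = smooth_on_graph(p_mle, w, lam=0.3)
```

Note that "clustering", unseen in d1, receives non-zero probability there because d1's close neighbor d2 contains it; this is the "act as neighbors do" intuition.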
Evaluation, ...

Apply Contextual Text Mining to Topical Community Analysis
• The text data: publications of researchers
• The generative model: a topic model
• The context: the author
• The contextual model: an author-topic model
• The structure of context: the social network (coauthor network) of researchers

Intuitions
• People working on the same topic belong to the same "topical community"
• A good community = a coherent topic + well-connected people
• A topic is semantically coherent if the people working on it also collaborate a lot
• Intuition: my topics are similar to my neighbors' (surrounded by IR people, am I more likely an IR person or a compiler person?)

Social Network Context for Topic Modeling
• Context = author; coauthors = similar contexts (e.g., the coauthor network)
• Intuition: I work on topics similar to my neighbors'
• Smooth the topic distributions P(θ_j | author) over the network

Topic Modeling with Network Regularization (NetPLSA)
• Basic assumption (e.g., on a coauthor graph): related authors work on similar topics
O(C, G) = (1 - λ) Σ_d Σ_w c(w, d) log Σ_{j=1..k} p(j | d) p(w | j) + λ (1/2) Σ_{j=1..k} Σ_{(u,v)∈E} w(u,v) (p(j | u) - p(j | v))²
• The first term is the PLSA log-likelihood; p(j | d) is the topic distribution of a document; λ trades off topic fit against smoothness
• The second term is a graph harmonic regularizer, a generalization of [Zhu '03]: (1/2) Σ_{j=1..k} f_j^T Δ f_j with f_{j,u} = p(j | u); it penalizes the difference of topic distributions on neighboring vertices, weighted by the importance w(u,v) of each edge

Topics & Communities without Regularization
• Topic 1: term 0.02, question 0.02, protein 0.01, training 0.01, weighting 0.01, multiple 0.01, recognition 0.01, relations 0.01, library 0.01, ...
• Topic 2: peer 0.02, patterns 0.01, mining 0.01, clusters 0.01, stream 0.01, frequent 0.01, e 0.01, page 0.01, gene 0.01, ...
• Topic 3: visual 0.02, analog 0.02, neurons 0.02, vlsi 0.01, motion 0.01, chip 0.01, natural 0.01, cortex 0.01, spike 0.01, ...
• Topic 4: interface 0.02, towards 0.02, browsing 0.02, xml 0.01, generation 0.01, design 0.01, engine 0.01, service 0.01, social 0.01, ...
• Noisy community assignment

Topics & Communities with Regularization
• Topic 1 (Information Retrieval): retrieval 0.13, information 0.05, document 0.03, query 0.03, text 0.03, search 0.03, evaluation 0.02, user 0.02, relevance 0.02, ...
• Topic 2 (Data Mining): mining 0.11, data 0.06, discovery 0.03, databases 0.02, rules 0.02, association 0.02, patterns 0.02, frequent 0.01, streams 0.01, ...
• Topic 3 (Machine Learning): neural 0.06, learning 0.02, networks 0.02, recognition 0.02, analog 0.01, vlsi 0.01, neurons 0.01, gaussian 0.01, network 0.01, ...
• Topic 4 (Web): web 0.05, services 0.03, semantic 0.03, services 0.03, peer 0.02, ontologies 0.02, rdf 0.02, management 0.01, ontology 0.01, ...
• Coherent community assignment

Topic Modeling and SNA Improve Each Other
(For the cut measures, the smaller the better.)

Methods | Cut Edge Weights | Ratio Cut / Norm. Cut | Community sizes (1-4)
PLSA | 4831 | 2.14 / 1.25 | 2280, 2178, 2326, 2257
NetPLSA | 662 | 0.29 / 0.13 | 2636, 1989, 3069, 1347
NCut | 855 | 0.23 / 0.12 | 2699, 6323, 8, 11

• Topic modeling helps balance communities (text implicitly bridges authors)
• Network regularization helps extract coherent communities (the network keeps topics focused)
• NCut: spectral clustering with normalized cut (J. Shi et al.
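The graph harmonic regularizer in the NetPLSA objective can be computed directly. A minimal sketch on a hypothetical three-author coauthor graph, showing that topic assignments agreeing with the network incur a smaller penalty:

```python
def harmonic_regularizer(p_topic, edges):
    """Graph harmonic penalty from the NetPLSA objective:
    R = 1/2 * sum_j sum_{(u,v) in E} w(u,v) * (p(j|u) - p(j|v))^2.

    p_topic[a] is author a's topic distribution; edges maps coauthor
    pairs (u, v) to edge weights w(u,v), e.g. numbers of joint papers.
    """
    k = len(next(iter(p_topic.values())))
    return 0.5 * sum(
        w * sum((p_topic[u][j] - p_topic[v][j]) ** 2 for j in range(k))
        for (u, v), w in edges.items()
    )

# Hypothetical coauthor graph: a and b collaborate heavily, b and c rarely.
edges = {("a", "b"): 3.0, ("b", "c"): 1.0}
smooth = {"a": [0.9, 0.1], "b": [0.8, 0.2], "c": [0.1, 0.9]}  # neighbors agree
rough = {"a": [0.9, 0.1], "b": [0.1, 0.9], "c": [0.9, 0.1]}   # a and b disagree
r_smooth = harmonic_regularizer(smooth, edges)
r_rough = harmonic_regularizer(rough, edges)
```

Minimizing the full objective trades this penalty (times λ) against the PLSA log-likelihood, which is why heavy coauthors end up with similar topic distributions.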
2000), a purely network-based community finding method.

Smoothed Topic Map
• Map a topic onto the network, e.g. using p(θ | a): core contributors, intermediate, irrelevant
[Figure: PLSA vs. NetPLSA maps for the topic "information retrieval"]

Summary of My Talk
• Text + Context = Contextual Text Mining: a new paradigm of text mining
• A novel framework for contextual text mining: probabilistic topic models, contextualized by simple context, implicit context, and complex context
• Applications of contextual text mining

A Roadmap of My Work
• Contextual topic models: KDD 05, KDD 06a, WWW 06
• Contextual text mining: KDD 06b, WWW 07, KDD 07, WWW 08
• Information retrieval & web search: SIGIR 07, WSDM 08, SIGIR 08, ACL 08, CIKM 08

Research Discipline
Text mining sits among applied statistics, data mining, machine learning, databases, social networks, information retrieval, information science, information management, natural language processing, and bioinformatics.

End Note
[Figure: text + context = a guided journey]

Thank You