Towards Contextual Text Mining
Qiaozhu Mei ([email protected])
University of Illinois at Urbana-Champaign, 2009
© Qiaozhu Mei

Slide 2: Knowledge Discovery from Text
• [Diagram: text flows through a text mining system to produce knowledge]

Slide 3: Overload of Text Content (Ramakrishnan and Tomkins 2007)

Content type             | Amount / day
Published content        | 3-4 GB
Professional web content | ~2 GB
User-generated content   | 8-10 GB
Private text content     | ~3 TB

Slide 4: Challenge of Mining Text
• [Figure: scale of text collections (~100B, 10B, 6M, 1M documents; ~150k to ~3M new items per day). Gold?]
• Where to start? Where to go?

Slide 5: Context, the "Situation of Text"
• Author (and the author's occupation: self designer, publisher, editor ...)
• Time (3:53 AM Jan 28th)
• Location (Check Lap Kok, HK)
• Source (from Ping.fm)
• Sentiment
• Language
• Social network

Slide 6: Rich Context Information
• [Figure: context statistics across services: ~1B queries per hour?, ~1B users; 100M users, >1M groups; ~3M msgs/day, ~5M users; 102M blogs; 5M users, 500M URLs; 73 years, ~400k authors; 8M contributors, 100+ languages; ~4k sources]

Slide 7: Text + Context = ?
• Context = Guidance ("I have a guide!")

Slide 8: Query Log + User = Personalized Search
• The ambiguous query "MSR" has many Wikipedia definitions: Microsoft Research, Metropolis Street Racer, Magnetic Stripe Reader, Molten Salt Reactor, Modern System Research, Mars Sample Return, Medical Simulation, Montessori School of Raleigh, MSR Racing, Mountain Safety Research, ...
• "If you know me, you should give me Microsoft Research..."
• How much can personalization help?
Slide 9: Customer Reviews + Brand = Comparative Product Summary
• IBM, APPLE, and DELL laptop reviews, summarized over common themes:

Common themes | IBM               | APPLE              | DELL
Battery life  | Long, 3-4 hrs     | Medium, 2-3 hrs    | Short, 1-2 hrs
Hard disk     | Large, 80-100 GB  | Small, 5-10 GB     | Medium, 20-50 GB
Speed         | Slow, 100-200 MHz | Very fast, 3-4 GHz | Moderate, 1-2 GHz

• Can we compare products?

Slide 10: Scientific Literature + Time = Topic Trends
• [Chart: hot topics in SIGMOD over time: sensor networks; structured data, XML; web data; data streams; ranking, top-k]
• What's hot in literature?

Slide 11: Blogs + Time & Location = Spatiotemporal Topic Diffusion
• [Maps: topic intensity by location, one week later]
• How does discussion spread?

Slide 12: Blogs + Sentiment = Faceted Opinion Summary
• Example: The Da Vinci Code
– Positive: "Tom Hanks, who is my favorite movie star act the leading role."; "a good book to past time."
– Negative: "protesting... will lose your faith by watching the movie."; "... so sick of people making such a big deal about a fiction book"
• [Chart: positive vs. negative opinion counts over time]
• What is good and what is bad?

Slide 13: Publications + Social Network = Topical Community
• Coauthor network; communities around information retrieval, machine learning, data mining
• Who works together on what?

Slide 14: A General Solution for All?
• Query log + User = Personalized Search
• Scientific literature + Time = Topic Trends
• Review + Brand = Comparative Opinion
• Blog + Time & Location = Spatiotemporal Topic Diffusion
• Blog + Sentiment = Faceted Opinion Summary
• Publications + Social Network = Topical Community
• ...
• Text + Context = Contextual Text Mining

Slide 15: Roadmap
• Generative Model of Text
• Integrating Contexts in Text Models
– Modeling Simple Context
– Modeling Implicit Context
– Modeling Complex Context
• Applications of Contextual Text Mining

Slide 16: Generative Model of Text
• P(word | Model): the 0.1, is 0.07, harry 0.05, potter 0.04, movie 0.04, plot 0.02, time 0.01, rowling 0.01, ...
• Generation: the model generates text such as "the movie harry potter is based on j. k. rowling"
• Inference, estimation: recover the model from the text

Slide 17: Text as a Mixture of Topics
• Topic (theme) = the subject of a discourse
• K topics, e.g.:
– Data Mining: mining 0.21, data 0.13, pattern 0.10, clustering 0.05, network 0.04, ...
– Machine Learning: learning 0.18, model 0.14, training 0.10, kernel 0.09, inference 0.07, ...
– Web Search: search 0.2, engine 0.15, query 0.08, user 0.07, ranking 0.06, ...
– Database, ...
• Example: "Using machine learning for web search" mixes topics

Slide 18: Probabilistic Topic Models (Hofmann '99, Blei et al. '03, ...)
• P(w) = Σ_{i=1..K} P(z = i) · P(w | Topic_i)
• Topic 1 (Apple iPod): ipod 0.15, nano 0.08, music 0.05, download 0.02, apple 0.01, ...
• Topic 2 (Harry Potter): movie 0.10, harry 0.09, potter 0.09, actress 0.05, music 0.04, ...
• Example generation: "I downloaded the music of the movie harry potter to my ipod nano"

Slide 19: Parameter Estimation
• Maximum likelihood estimation (MLE): λ* = arg max_λ P(D | λ)
• Parameter estimation using the EM algorithm (pseudo-counts)
– Alternatives: Gibbs sampling, variational inference, expectation propagation
• The topic word probabilities start as unknowns ("?") and are re-estimated: ipod ?, nano ?, music ?, download ?, apple ? / movie ?, harry ?, potter ?, actress ?, music ?
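The mixture model and EM estimation sketched on the slides above can be made concrete in a few lines of Python. This is a minimal toy illustration with an invented vocabulary, counts, and initialization, not the talk's implementation:

```python
# Toy document: word counts (hypothetical data for illustration)
counts = {"ipod": 4, "music": 3, "movie": 2, "harry": 2, "potter": 2, "download": 1}
vocab = list(counts)

# Initialize two topic word distributions P(w | Topic_k) and mixing weights P(z = k)
p_w = [
    {w: 1.0 / len(vocab) for w in vocab},                                        # Topic 1: uniform
    {w: (i + 1.0) / sum(range(1, len(vocab) + 1)) for i, w in enumerate(vocab)},  # Topic 2: skewed
]
p_z = [0.5, 0.5]

for _ in range(50):  # EM iterations
    # E-step: posterior P(z | w) for each word (the "pseudo-counts" of the slide)
    post = {w: [p_z[k] * p_w[k][w] for k in range(2)] for w in vocab}
    for w in vocab:
        s = sum(post[w])
        post[w] = [p / s for p in post[w]]
    # M-step: re-estimate P(w | z) and P(z) from expected counts
    for k in range(2):
        expected = {w: counts[w] * post[w][k] for w in vocab}
        total = sum(expected.values())
        p_w[k] = {w: expected[w] / total for w in vocab}
        p_z[k] = total / sum(counts.values())

# P(w) = sum_k P(z = k) * P(w | Topic_k): a proper distribution over the vocabulary
p_word = {w: sum(p_z[k] * p_w[k][w] for k in range(2)) for w in vocab}
```

The same E/M updates scale to many topics; Gibbs sampling or variational inference, mentioned on the slide, are drop-in alternatives for the estimation step.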
• Slide 19 (continued): given "I downloaded the music of the movie harry potter to my ipod nano", guess each word's topic affiliation, then estimate the parameters

Slide 20: How Context Affects Topics
• "Context of situation" (B. Malinowski, 1923)
• Topics in science literature: 16th century vs. 21st century
• When do a computer scientist and a gardener both write about "tree, root, prune"?
• In Europe, "football" appears a lot in a soccer report. What about in the US?
• Text is generated according to the context!

Slide 21: Existing Work
• PLSA (Hofmann '99), LDA (Blei et al. '03), CTM (Blei et al. '06), PAM (Li and McCallum '06)
– Don't incorporate contexts
• Author: author-topic model (Steyvers et al. '04)
• Time: topics-over-time (Wang et al. '06), dynamic topic model (Blei et al. '06)
• Can we capture the context in a general way?

Slide 22: Contextualized Models
• P(word | Model, Context)
– P(w | M, Year = 1998): book 0.15, harry 0.10, potter 0.08, rowling 0.05, ...
– P(w | M, Year = 2008): movie 0.18, harry 0.09, potter 0.08, director 0.04, ...
– Other contexts: Sentiment = +, Source = official, Location = China / US
• Generation: How to select contexts? How to model context structure?
• Inference: How to reveal contextual patterns?

Slide 23: Roadmap: Modeling Simple Context
• Simple contexts: author, time, location, author's occupation, source, language

Slide 24: Simple Contextual Topic Model (Mei and Zhai KDD'06)
• P(w) = Σ_{j=1..C} P(c_j) · Σ_{i=1..K} P(z_i | c_j) · P(w | Topic_i, c_j)
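The double sum in the simple contextual topic model above can be sketched directly. All contexts, topics, and probability values below are invented toy numbers chosen only to make the formula concrete:

```python
def p_word(w, p_c, p_z_given_c, p_w_given_zc):
    """P(w) = sum_j P(c_j) * sum_i P(z_i | c_j) * P(w | Topic_i, c_j)."""
    return sum(
        p_c[c] * sum(p_z_given_c[c][z] * p_w_given_zc[(z, c)].get(w, 0.0)
                     for z in p_z_given_c[c])
        for c in p_c
    )

# Toy instantiation: two contexts (years) and two topics; all numbers invented
p_c = {"2004": 0.5, "2007": 0.5}                       # P(c_j)
p_z_given_c = {"2004": {"ipod": 0.7, "potter": 0.3},   # P(z_i | c_j)
               "2007": {"ipod": 0.4, "potter": 0.6}}
p_w_given_zc = {                                       # P(w | Topic_i, c_j)
    ("ipod", "2004"):   {"ipod": 0.5, "mini": 0.5},
    ("ipod", "2007"):   {"ipod": 0.5, "iphone": 0.5},
    ("potter", "2004"): {"harry": 0.5, "azkaban": 0.5},
    ("potter", "2007"): {"harry": 0.5, "phoenix": 0.5},
}

# "mini" is generated only by the iPod topic in the 2004 context:
# 0.5 * 0.7 * 0.5 = 0.175
print(p_word("mini", p_c, p_z_given_c, p_w_given_zc))
```

The contextual patterns of the following slides (topic life cycles, spatiotemporal themes) are read off the estimated P(z | context) and P(w | Topic, context) tables of exactly this form.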
• Contextual topic patterns, e.g. Topic 1 (Apple iPod): "ipod mini 4gb" under Context 1 (2004) vs. "ipod iphone nano" under Context 2 (2007); Topic 2 (Harry Potter): "harry prisoner azkaban" (2004) vs. "potter order phoenix" (2007)
• Example generation: "I downloaded the music of the movie harry potter to my iphone"

Slide 25: Example: Topic Life Cycles (Mei and Zhai KDD'05)
• Context = time; contextual topic pattern: P(z | time)
• [Chart: hot topics in SIGMOD over time: sensor networks; structured data, XML; web data; data streams; ranking, top-k]

Slide 26: Example: Spatiotemporal Theme Pattern (Mei et al. WWW'06)
• Context = time & location; contextual topic pattern: P(z | time, location)
• [Maps: the topic "Government Response in Hurricane Katrina" spreading over locations during Hurricanes Katrina and Rita]

Slide 27: Example: Event Impact Analysis (Mei and Zhai KDD'06)
• Context = event; contextual pattern: P(w | z, event)
• Topic: retrieval models, in SIGIR: term 0.16, relevance 0.08, weight 0.07, feedback 0.04, model 0.03, probabilistic 0.02, document 0.02, ...
• Branches of the topic around the events 1992 (start of TREC) and 1998 [Ponte and Croft 98]:
– Traditional models: vector 0.05, concept 0.03, model 0.03, space 0.02, boolean 0.02, function 0.01, ...
– Evaluation & applications: xml 0.07, email 0.02, model 0.02, collect 0.02, judgment 0.01, rank 0.01, ...
– Probabilistic models: probabilist 0.08, model 0.04, logic 0.04, boolean 0.03, algebra 0.02, weight 0.01, ...
– Language models: model 0.17, language 0.08, estimate 0.05, parameter 0.03, distribution 0.03, smooth 0.02, likelihood 0.01, ...

Slide 28: Instantiation: Personalized Search (Mei and Church WSDM'08)

Slide 29: Personalization with Backoff
• Ambiguous query: MSR
– Microsoft Research
– Mountain Safety Research
• Disambiguate based on the user's prior clicks
• We don't have enough data for everyone!
– Back off to classes of users
• Proof of concept:
– Context = classes of users defined by IP address

Slide 30: Personalized Search as Contextual Text Mining
• Context: users (IP), groups of users (156.111.188.243 → 156.111.188.* → 156.111.*.* → 156.*.*.* → *.*.*.*)
• Text: query (click) logs, i.e., (IP, Query, URL) triples
• Text model: P(URL | Query)
• Contextual model: P(URL | Query, User)
• Goal: estimate a better P(URL | Query, User)

Slide 31: Evaluation Metric: Entropy (H)
• H(URL) = −Σ_URL p(URL) log p(URL)
• Difficulty of encoding information (a distribution)
– Size of the search space; difficulty of a task
• Powerful tool for sizing challenges and opportunities
– How hard is web search?
– How much does personalization help?
• Predict the future: cross entropy H(Future | History)

Slide 32: Difficulty of Queries
• Easy queries (low H(URL | Q)): google, yahoo, myspace, ebay, ...
• Hard queries (high H(URL | Q)): dictionary, yellow pages, movies, "what is may day?"
• Hard query "MSR" (high entropy): msrgear.com 0.12, msracing.com 0.10, research....com 0.09, msrwheels.com 0.08, msr.com 0.07, msr.org 0.07, msrdev.com 0.06, ... 0.05, ...
• Easy query "Google" (low entropy): google.com 0.80, google.cn 0.10, maps.google 0.08, rest ~0

Slide 33: How Hard Is Search?
• Traditional search: H(URL | Query) = 2.8 (= 23.9 − 21.1)
• Personalized search: H(URL | Query, IP) = 1.2 (= 27.2 − 26.0)
• Personalization cuts H in half!

Variable(s) | Entropy (H)
Query       | 21.1
URL         | 22.1
IP          | 22.1
Query, URL  | 23.9
Query, IP   | 26.0
IP, URL     | 27.1
All three   | 27.2

Slide 34: Context = First k Bytes of IP
• Full personalization: every user has a different model → sparse data!
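The entropy metric of the evaluation slides is easy to make concrete. The click distributions below paraphrase the slide's "MSR" vs. "Google" examples, with the unlisted tail collapsed into a single "(other)" bucket, so the absolute numbers are illustrative only:

```python
import math

def entropy(dist):
    """H = -sum p * log2(p), in bits, over a click distribution P(URL | query)."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Approximate click distributions; "(other)" lumps the long tail into one outcome,
# which understates the true entropy of the hard query
msr = {"msrgear.com": 0.12, "msracing.com": 0.10, "msrwheels.com": 0.08,
       "msr.com": 0.07, "msr.org": 0.07, "msrdev.com": 0.06, "(other)": 0.50}
google = {"google.com": 0.80, "google.cn": 0.10, "maps.google": 0.08, "(other)": 0.02}

print(entropy(msr))     # high entropy: an ambiguous query
print(entropy(google))  # low entropy: a navigational query
```

Conditional entropies like H(URL | Query) on the next slide are obtained the same way, by averaging H of the per-query distributions weighted by query frequency.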
• Personalization with backoff: smooth by similar users
– P(URL | User, Q) = λ_4 P(URL | IP4, Q) + λ_3 P(URL | IP3, Q) + λ_2 P(URL | IP2, Q) + λ_1 P(URL | IP1, Q) + λ_0 P(URL | IP0, Q)
– where IPk keeps the first k bytes of the IP: 156.111.188.243 → 156.111.188.* → 156.111.*.* → 156.*.*.* → *.*.*.*
• No personalization: all users share the same model → missed opportunity

Slide 35: Context Market Segmentation
• Can we do better than IP address?
• Potential context variables:
– ID, QueryType, Click, Intent, ...
– Demographics (age, gender, income, ...)
– Time of day & day of week

Slide 36: Roadmap: Modeling Implicit Context
• Implicit contexts: sentiment

Slide 37: Implicit Context of Text
• Sentiments, intents, impact, trust, ...?
• Need to infer these situations/conditions from the data (with prior knowledge)

Slide 38: Modeling Implicit Context
• λ* = arg max_λ P(D | λ) P(λ)
• A model trained from training data, or guidance from the user, is added as a prior, e.g.:
– Positive: good 0.10, like 0.05, perfect 0.02, ...
– Negative: hate 0.21, awful 0.03, disgust 0.01, ...
• Topic 1 (Apple iPod) facets: color, size, quality, price, scratch, problem
• Topic 2 (Harry Potter) facets: actress, music, visual, director, accent, plot
• Example: "I like the song of the movie on my perfect ipod but hate the accent"

Slide 39: Example: Faceted Opinion Summarization (Mei et al. WWW'07)
• Context = sentiment; example: The Da Vinci Code
• Topic 1 (movie):
– Positive: "Tom Hanks, who is my favorite movie star act the leading role."
– Negative: "Protesting.. you will lose your faith by watching the movie."
• Topic 2 (book):
– Positive: "a good book to past time."
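The backoff interpolation on the personalization slide can be sketched as follows. The prefix models, the URL, and the mixture weights here are all invented for illustration; in practice the weights would be estimated from held-out data:

```python
def backoff_prob(url, query, ip, models, lambdas):
    """P(URL | User, Q) as a weighted mixture over IP-prefix models:
    sum_k lambda_k * P(URL | IPk, Q), where IPk keeps the first k bytes."""
    total = 0.0
    for k, lam in enumerate(lambdas):          # k = 0..4 bytes of the IP kept
        parts = ip.split(".")[:k]
        prefix = ".".join(parts) if parts else "*"
        total += lam * models.get((prefix, query), {}).get(url, 0.0)
    return total

# Invented toy models keyed by (IP prefix, query); "research.example" is a
# hypothetical URL standing in for the user's preferred result
models = {
    ("*", "msr"):               {"msr.com": 0.3, "research.example": 0.2},
    ("156", "msr"):             {"msr.com": 0.2, "research.example": 0.4},
    ("156.111", "msr"):         {"research.example": 0.7},
    ("156.111.188", "msr"):     {"research.example": 0.9},
    ("156.111.188.243", "msr"): {"research.example": 1.0},
}
lambdas = [0.1, 0.1, 0.2, 0.3, 0.3]  # lambda_0 .. lambda_4; must sum to 1

p = backoff_prob("research.example", "msr", "156.111.188.243", models, lambdas)
```

A user with sparse history effectively relies on the shorter prefixes; a heavy user's own full-IP model dominates.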
– Negative: "... so sick of people making such a big deal about a fiction book"

Slide 40: Roadmap: Modeling Complex Context
• Complex contexts: social network

Slide 41: Complex Context of Text
• Structures of contexts help to:
– Find novel contextual patterns
– Regularize contextual models
– Alleviate data sparseness

Slide 42: Modeling Complex Context
• O(D) = Likelihood + Regularization
• Intuition: if contexts A and B are closely related, Model(A) and Model(B) should be similar
– Users in the same building issue similar queries
– Collaborating researchers work on similar things
– Topics in SIGMOD are like topics in VLDB
• E.g., Topic 1: "ipod nano 4gb" vs. "ipod nano 8gb"; Topic 2: "harry potter actor" vs. "harry potter actress"

Slide 43: Graph-based Regularization
• Structure of contexts = a graph; intuition: for an edge (u, v), Model(u) and Model(v) should be similar
• O(D, G) = Likelihood + Regularization
• [Figure: each model as a surface on top of the graph, projected on a plane; MLE gives a rough surface, the regularized model a smoothed one. Intuition: regularized model = smoothed surfaces!]

Slide 44: Instantiation: Topical Community Extraction (Mei et al. WWW'08)

Slide 45: Social Network Analysis
• Generation, evolution: e.g., [Leskovec 05]
• Community extraction: e.g., [Kleinberg 00]; Kleinberg and Backstrom 2006, New York Times
• Diffusion: [Gruhl 04]; [Backstrom 06]
• Search: e.g., [Adamic 05]
• Ranking: e.g., [Brin and Page 98]; [Kleinberg 98]
• Jeong et al.
2001, Nature 411
• These approaches usually don't model topics in the text

Slide 46: Topical Community Analysis
• E.g., "physicist, physics, scientist, theory, gravitation ..." vs. "writer, novel, best-sell, book, language, film ..."
• Topics in text help community extraction: Text + Network → topical communities
• Computer science literature = Information Retrieval + Data Mining + Machine Learning + ...

Slide 47: Topical Community Extraction as Contextual Text Mining
• Context: authors
• Context structure: social network (coauthorship)
• Text: scientific publications
• Text model: topic model
• Contextual model: topic model + author, regularized using the social network
• Goal: assign authors to topical communities using P(z | author)

Slide 48: Topic Modeling with Network Regularization
• Intuition 1: know my research topics from my publications (data likelihood)
• Intuition 2: I work on similar topics with my coauthors (graph regularization)
• O(D, G) = (1 − λ) · Σ_c Σ_w c(w, c) log Σ_{j=1..k} p(j | c) p(w | j)  −  λ · (1/2) · Σ_{(u,v)∈E} w(u, v) Σ_{j=1..k} (p(j | u) − p(j | v))²
– λ trades off between MLE and smoothness between neighbors
– The second term is a graph harmonic regularizer (a generalization of [Zhu '03])

Slide 49: Topics & Communities without Network Regularization
• Fuzzy topics (PLSA):

Topic 1: term 0.02, question 0.02, protein 0.01, training 0.01, weighting 0.01, multiple 0.01, recognition 0.01, relations 0.01, library 0.01
Topic 2: peer 0.02, patterns 0.01, mining 0.01, clusters 0.01, stream 0.01, frequent 0.01, e 0.01, page 0.01, gene 0.01
Topic 3: visual 0.02, analog 0.02, neurons 0.02, vlsi 0.01, motion 0.01, chip 0.01, natural 0.01, cortex 0.01, spike 0.01
Topic 4: interface 0.02, towards 0.02, browsing 0.02, xml 0.01, generation 0.01, design 0.01, engine 0.01, service 0.01, social 0.01
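The regularized objective of the network-regularization slide can be written down directly. The tiny author/word counts and parameter values below are invented to exercise the formula, not taken from the experiments:

```python
import math

def objective(counts, p_topic_given_doc, p_word_given_topic, edges, lam):
    """O(D, G) = (1 - lam) * log-likelihood
                 - lam * (1/2) * sum_{(u,v) in E} w(u,v) * sum_j (p(j|u) - p(j|v))^2."""
    ll = sum(c * math.log(sum(p_topic_given_doc[d][j] * p_word_given_topic[j][w]
                              for j in p_word_given_topic))
             for (d, w), c in counts.items())
    reg = 0.5 * sum(wt * sum((p_topic_given_doc[u][j] - p_topic_given_doc[v][j]) ** 2
                             for j in p_word_given_topic)
                    for (u, v), wt in edges.items())
    return (1 - lam) * ll - lam * reg

# Two coauthors u, v and two topics; all values invented
counts = {("u", "data"): 2, ("u", "mining"): 1, ("v", "data"): 1, ("v", "retrieval"): 2}
p_topic_given_doc = {"u": {0: 0.9, 1: 0.1}, "v": {0: 0.2, 1: 0.8}}
p_word_given_topic = {0: {"data": 0.5, "mining": 0.5, "retrieval": 0.0},
                      1: {"data": 0.2, "mining": 0.0, "retrieval": 0.8}}
edges = {("u", "v"): 1.0}  # coauthorship edge with weight w(u, v) = 1

# lam = 0 reduces to plain PLSA likelihood; lam > 0 penalizes coauthors
# whose topic distributions p(j | author) disagree
print(objective(counts, p_topic_given_doc, p_word_given_topic, edges, 0.5))
```

Maximizing this objective is what pushes the coauthors u and v toward similar P(topic | author) vectors, which is exactly the smoothing effect shown in the next slide's tables.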
• Noisy community assignment (four conferences: SIGIR, KDD, NIPS, WWW)

Slide 50: Topics & Communities with Network Regularization
• Clear topics (NetPLSA on the same four conferences):

Topic 1 (Information Retrieval): retrieval 0.13, information 0.05, document 0.03, query 0.03, text 0.03, search 0.03, evaluation 0.02, user 0.02, relevance 0.02
Topic 2 (Data Mining): mining 0.11, data 0.06, discovery 0.03, databases 0.02, rules 0.02, association 0.02, patterns 0.02, frequent 0.01, streams 0.01
Topic 3 (Machine Learning): neural 0.06, learning 0.02, networks 0.02, recognition 0.02, analog 0.01, vlsi 0.01, neurons 0.01, gaussian 0.01, network 0.01
Topic 4 (Web): web 0.05, services 0.03, semantic 0.03, services 0.03, peer 0.02, ontologies 0.02, rdf 0.02, management 0.01, ontology 0.01

• Coherent community assignment

Slide 51: Topic Modeling and SNA Improve Each Other
• Comparison (the smaller the better for cut edge weights and ratio/normalized cut):

Method  | Cut edge weights | Ratio cut / Norm. cut | Community sizes 1-4
PLSA    | 4831             | 2.14 / 1.25           | 2280, 2178, 2326, 2257
NetPLSA | 662              | 0.29 / 0.13           | 2636, 1989, 3069, 1347
NCut    | 855              | 0.23 / 0.12           | 2699, 6323, 8, 11

• PLSA uses text only; NCut is network only: spectral clustering with normalized cut (Shi et al. '00)
• Topic modeling helps balance communities (text implicitly bridges authors)
• Network regularization helps extract coherent communities (ensures tight connection of authors)

Slide 52: Summary of My Talk
• Text + Context = Contextual Text Mining
– A new paradigm of text mining
• General methodology for contextual text mining
– Generative models of text (e.g., topic models)
– Contextualized models with simple context, implicit context, complex context
• Applications of contextual text mining

Slide 53: Take-Away Message
• Text + Context = [guided text mining]

Slide 54: A Roadmap of My Work
• [Diagram connecting publications to areas: KDD 05, KDD 06a, KDD 06b, KDD 07, WWW 06, WWW 07, WWW 08, KDD 08, WSDM 08, SIGIR 07, SIGIR 08, CIKM 08, ACL 08, PSB 06, IP&M 07]
• Themes: contextual topic models; labeling topic models; annotating frequent patterns; graph-based smoothing; biological literature mining; text mining applications to bioinformatics; Poisson language models; impact-based summarization; query suggestion using hitting time
• Areas: text mining; information retrieval & web search

Slide 55: A Roadmap to the Future
• Theoretical framework: computational challenges; structure of contexts
• Task support systems: web users, scientists, business users
• Core areas: text mining; text information management; information retrieval & web search
• Interdisciplinary applications: bioinformatics, health informatics, business informatics
• Integrative analysis of heterogeneous data: web 2.0 data, science data, information networks

Slide 56: Thanks!
Slide 57 (backup): Predict the Future
• Cross entropy: H(future | history)
• An IP in the future might not be seen in the history
• [Figure: the spectrum from no personalization (k = 0) through backoff levels (k = 1..3, e.g., knowing at least two bytes of the IP) to complete personalization (k = 4, knows every byte, given enough data)]
• Evaluate on the condition that at least the first k bytes of the IP are seen in the history
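Evaluating H(future | history) with prefix backoff for unseen IPs can be sketched as follows; all counts, models, and IPs below are invented for illustration:

```python
import math

def cross_entropy(future, p_history):
    """H(future | history): average -log2 of the history model on future clicks."""
    n = sum(future.values())
    return -sum(c * math.log2(p_history(event)) for event, c in future.items()) / n

# History model with backoff: if the full IP was never seen in the history,
# fall back to its three-byte prefix, then to the global model
seen = {
    "156.111.188.243": {"a.example": 0.9, "b.example": 0.1},
    "156.111.188":     {"a.example": 0.6, "b.example": 0.4},
    "*":               {"a.example": 0.5, "b.example": 0.5},
}

def p_history(event):
    ip, url = event
    for key in (ip, ".".join(ip.split(".")[:3]), "*"):
        if key in seen:
            return seen[key][url]

# A future user from a new IP inside a known three-byte block: the full IP is
# unseen, so the model backs off to the "156.111.188" prefix
future = {("156.111.188.77", "a.example"): 3, ("156.111.188.77", "b.example"): 1}
h = cross_entropy(future, p_history)
```

Without backoff, an unseen IP would force the model all the way down to the global distribution, giving a higher cross entropy; the prefix levels are what make the slide's "at least the first k bytes seen" evaluation possible.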