Context Analysis in Text Mining and Search

Qiaozhu Mei
Department of Computer Science, University of Illinois at Urbana-Champaign
http://sifaka.cs.uiuc.edu/~qmei2, [email protected]
Joint work with ChengXiang Zhai

Motivating Example: Personalized Search
Query: MSR. Possible interpretations:
• Metropolis Street Racer
• Magnetic Stripe Reader
• Molten Salt Reactor
• Mars Sample Return
• Mountain Safety Research
• …
The user is actually looking for Microsoft Research: the right interpretation depends on who is asking.

Motivating Example: Comparing Product Reviews
Given IBM, APPLE, and DELL laptop reviews, discover the common themes and their brand-specific variations:

  Common Themes | "IBM" specific    | "APPLE" specific   | "DELL" specific
  Battery Life  | Long, 4-3 hrs     | Medium, 3-2 hrs    | Short, 2-1 hrs
  Hard disk     | Large, 80-100 GB  | Small, 5-10 GB     | Medium, 20-50 GB
  Speed         | Slow, 100-200 MHz | Very fast, 3-4 GHz | Moderate, 1-2 GHz

Unsupervised discovery of common topics and their variations.

Motivating Example: Discovering Topical Trends in Literature
[Figure: strength of SIGIR topics (TF-IDF retrieval, language-model IR, IR applications, text categorization) over time, 1980–2003.]
Unsupervised discovery of topics and their temporal variations.

Motivating Example: Analyzing Spatial Topic Patterns
• How do bloggers in different states respond to topics such as the "oil price increase during Hurricane Katrina"?
• Unsupervised discovery of topics and their variations in different locations.

Motivating Example: Summarizing Sentiments
Query: Dell Laptops. A topic-sentiment summary organizes opinion sentences by facet and polarity:
• Facet 1 (Price) — positive: "it is the best site and they show Dell coupon code as early as possible"; "mac pro vs. dell precision: a price comparis.."; negative: "Even though Dell's price is cheaper, we still don't want it."; neutral: "DELL is trading at $24.66"
• Facet 2 (Battery) — positive: "One thing I really like about this Dell battery is the Express Charge feature."; negative: "my Dell battery sucks"; "Stupid Dell laptop battery"; neutral: "i still want a free battery from dell.."
Topic-sentiment dynamics then track the strength of each polarity over time (e.g., for the Price facet).
Unsupervised/semi-supervised discovery of topics and the different sentiments about them.

Motivating Example: Analyzing Topics on a Social Network
Given the publications of authors such as Bruce Croft and Gerard Salton, together with their coauthor network, discover topics (information retrieval, machine learning, data mining) and the research communities correlated with them.
Unsupervised discovery of topics and correlated research communities.

Research Questions
• What do these problems have in common?
• Can we model all of these problems in a general way?
• Can we solve them with a unified approach?
• How can we bring humans into the loop?
Rest of Talk
• Background: language models in text mining and retrieval
• Definition of context
• General methodology for modeling context
  – Models, example applications, results
• Conclusion and discussion

Generative Models of Text
• Text as observations: words, tags, links, etc.
• Use a unified probabilistic model to explain the appearance (generation) of the observations
• Documents are generated by sampling every observation from such a generative model
• Different generation assumptions yield different models:
  – Document language models
  – Probabilistic topic models: PLSA, LDA, etc.
  – Hidden Markov models, …

Multinomial Language Models
A multinomial distribution over words as a text representation, e.g.:
  retrieval 0.2, information 0.15, model 0.08, query 0.07, language 0.06, feedback 0.03, …
Known as a topic model when there are k of them in the text, e.g., "semi-supervised learning", "boosting", "spectral clustering", etc.

Language Models in Information Retrieval (e.g., the KL-Divergence Method)
• A document d (a text mining paper) has a document language model θd with p(w|d), e.g.:
  text 4/100 = 0.04; mining 3/100 = 0.03; clustering 1/100 = 0.01; …; data = 0; computing = 0
• Smoothing gives θd' with p(w|d'), e.g.:
  text = 0.039; mining = 0.028; clustering = 0.01; …; data = 0.001; computing = 0.0005
• A query q ("data mining") has a query language model θq with p(w|q):
  data 1/2 = 0.5; mining 1/2 = 0.5
  and, after feedback, p(w|q'): data = 0.4; mining = 0.4; clustering = 0.1; …
• Documents are ranked by the (negated) KL divergence between the two models:
  D(θq || θd) = Σ_{w∈V} p(w|θq) log ( p(w|θq) / p(w|θd) )
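To make the KL-divergence ranking above concrete, here is a minimal Python sketch (not from the talk): the Jelinek-Mercer smoothing scheme, the smoothing constant, and the toy counts are illustrative assumptions.

```python
import math

def smooth_lm(doc_counts, collection_lm, lam=0.1):
    """Jelinek-Mercer smoothing: mix the document MLE with a collection model."""
    total = sum(doc_counts.values())
    vocab = set(doc_counts) | set(collection_lm)
    return {w: (1 - lam) * doc_counts.get(w, 0) / total
               + lam * collection_lm.get(w, 0.0)
            for w in vocab}

def kl_divergence_score(query_lm, doc_lm):
    """Score = -D(theta_q || theta_d); higher means a better match."""
    return -sum(p_q * math.log(p_q / doc_lm[w])
                for w, p_q in query_lm.items() if p_q > 0)

# Toy example mirroring the slide: a text mining paper vs. the query "data mining".
collection_lm = {"text": 0.02, "mining": 0.02, "clustering": 0.01,
                 "data": 0.01, "computing": 0.005}
doc_lm = smooth_lm({"text": 4, "mining": 3, "clustering": 1}, collection_lm)
query_lm = {"data": 0.5, "mining": 0.5}
print(kl_divergence_score(query_lm, doc_lm))
```

Smoothing is what makes the score well defined: without it, p(data|d) = 0 and the divergence would be infinite.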
Probabilistic Topic Models for Text Mining
Probabilistic topic modeling extracts topic models (multinomial distributions) from text collections:
• Models: PLSA [Hofmann 99], LDA [Blei et al. 03], Author-Topic [Steyvers et al. 04], Pachinko allocation [Li & McCallum 06], CPLSA [Mei & Zhai 06], CTM [Blei et al. 06], …
• Example topics: {term 0.16, relevance 0.08, weight 0.07, feedback 0.04, independence 0.03, model 0.03, …}; {web 0.21, search 0.10, link 0.08, graph 0.05, …}
• Applications: subtopic discovery, topical pattern analysis, summarization, opinion comparison, passage segmentation, …

Importance of Context
• Science in the year 2000 vs. Science in the year 1500: are we still working on the same topics?
• For a computer scientist and a gardener: does "tree, root, prune" mean the same thing?
• "Football" means soccer in Europe. What about in the US?
Context affects topics!

Context Features of Text (Metadata)
A weblog article, for example, carries context features such as communities, author, source, the author's occupation, time, and location.

Context = Partitioning of Text
A corpus can be partitioned by any context feature: papers written in 1998, 1999, …, 2005, 2006; papers about the Web; papers written by authors in the US; papers published in WWW, SIGIR, ACL, KDD, SIGMOD.

Rich Context Information in Text
• News articles: time, publisher, etc.
• Blogs: time, location, author, …
• Scientific literature: author, publication year, conference, citations, …
• Query logs: time, IP address, user, clicks, …
• Customer reviews: product, source, time, sentiments, …
• Emails: sender, receiver, time, thread, …
• Web pages: domain, time, click rate, etc.
• More? Entity-relations, social networks, …

Categories of Context
• Some partitions of text are explicit ("explicit context")
  – Time, location, author, conference, user, IP, etc.
  – Similar to metadata
• Some partitions are implicit ("implicit context")
  – Sentiments, missions, goals, intents
• Some partitions are at the document level; some are at a finer granularity
  – Context of a word, an entity, a pattern, a query, etc.
  – Sentences, sliding windows, adjacent words, etc.

Context Analysis
• Use context to infer semantics
  – Annotating frequent patterns; labeling of topic models
• Use context to provide targeted service
  – Personalized search, intent-based search, etc.
• Compare contextual patterns of topics
  – Evolutionary topic patterns, spatiotemporal topic patterns, topic-sentiment patterns, etc.
• Use context to help other tasks
  – Social network analysis, impact summarization, etc.

General Methodology to Model Context
• Context-aware generative models:
  – Observations in the same context are generated with a unified model
  – Observations in different contexts are generated with different models
  – Observations in similar contexts are generated with similar models
• Text is generated by a mixture of such generative models
• For each case: example task, model, sample results

Model a Unique Context with a Unified Model (Generation)

Probabilistic Latent Semantic Analysis (Hofmann '99)
Documents about "Hurricane Katrina" (e.g., "Criticism of government response to the hurricane primarily consisted of criticism of its response to …", "The total shut-in oil production from the Gulf of Mexico … approximately 24% of the annual production …", "Over seventy countries pledged monetary donations or other assistance …") are generated from k topics θ1…θk, e.g.:
• government response: government 0.3, response 0.2, …
• donation: donate 0.1, relief 0.05, help 0.02, …
• New Orleans: city 0.2, new 0.1, orleans 0.05, …
To generate each word of a document d: choose a topic θk according to the document's topic weights πd = P(θi|d), then draw a word from θk.
[Plate diagram: for each of the N words in each of the D documents, draw z_{d,n} from πd, then w_{d,n} from θ_{z_{d,n}}.]
The likelihood of a document-word pair is:
  p(d, w_{d,n}) = p(d) Σ_k p(w_{d,n} | z_k, θ_k) p(z_k | d)

Example: Topics in Science (D. Blei 05)
[Figure: example topics discovered from Science articles.]
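For reference, here is a minimal PLSA EM sketch in Python (not the talk's implementation; the toy counts, random initialization, and iteration count are illustrative assumptions):

```python
import numpy as np

def plsa(counts, k, n_iter=50, seed=0):
    """PLSA via EM. counts: (n_docs, n_words) term-count matrix.
    Returns p(w|z) of shape (k, n_words) and p(z|d) of shape (n_docs, k)."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    p_w_z = rng.random((k, n_words)); p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    p_z_d = rng.random((n_docs, k)); p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # E-step: p(z|d,w) proportional to p(w|z) p(z|d)
        post = p_z_d[:, :, None] * p_w_z[None, :, :]   # (n_docs, k, n_words)
        post /= post.sum(axis=1, keepdims=True) + 1e-12
        # M-step: re-estimate both distributions from expected counts
        weighted = counts[:, None, :] * post           # c(w,d) * p(z|d,w)
        p_w_z = weighted.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
        p_z_d = weighted.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12
    return p_w_z, p_z_d

# Toy corpus: 4 documents over a 6-word vocabulary, 2 latent topics.
counts = np.array([[5, 3, 0, 0, 1, 0],
                   [4, 2, 1, 0, 0, 0],
                   [0, 0, 4, 5, 0, 2],
                   [0, 1, 3, 4, 0, 3]], dtype=float)
p_w_z, p_z_d = plsa(counts, k=2)
print(np.round(p_z_d, 2))  # topic mixing weights pi_d for each document
```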
Label a Multinomial Topic Model
Example (Mei and Zhai 06): a topic in SIGIR, {term 0.1599, relevance 0.0752, weight 0.0660, feedback 0.0372, independence 0.0311, model 0.0310, frequent 0.0233, probabilistic 0.0188, document 0.0173, …}. A good label (here, "Retrieval models") should be:
• semantically close to the topic (relevant)
• understandable — e.g., a phrase
• high-coverage inside the topic
• discriminative across topics
Candidate labels range from good to bad: "Information Retrieval", "Pseudo-feedback", "じょうほうけんさく" (Japanese for "information retrieval" — understandable only to some users), "iPod Nano".

Automatic Labeling of Topics
1. Extract a candidate label pool from the collection (e.g., SIGIR) with an NLP chunker and n-gram statistics: information retrieval, retrieval model, index structure, relevance feedback, …
2. Score each candidate's relevance to the topic {term 0.16, relevance 0.07, weight 0.07, feedback 0.04, independence 0.03, model 0.03, …}, e.g.: information retrieval 0.26, retrieval models 0.19, IR models 0.17, pseudo feedback 0.06, …
3. Discriminate against other topics (e.g., {filtering 0.21, collaborative 0.15, …}, {trec 0.18, evaluation 0.10, …}): "information retrieval" drops from 0.26 to 0.01, leaving retrieval models 0.20, IR models 0.18, pseudo feedback 0.09, …
4. Adjust for coverage inside the topic: "IR models" drops from 0.18 to 0.02; final ranking: retrieval models 0.20, pseudo feedback 0.09, …, information retrieval 0.01.

Label Relevance: Context Comparison
• Intuition: a good label has a context distribution similar to the topic's. For the topic {clustering, dimension, partition, algorithm, hash, key, …}, the good label l1 = "clustering algorithm" has a context distribution p(w | clustering algorithm) close to the topic θ, while the bad label l2 = "hash join" appears in contexts like "…hash join…code…hash table…search…" and gives a large divergence D(θ || l2).
• The relevance score can be computed as the expected pointwise mutual information between the label and the topic's words in the collection C:
  score(l, θ) = Σ_w p(w|θ) PMI(w, l | C)

Results: Sample Topic Labels
• Topic {clustering 0.02, time 0.01, clusters 0.01, databases 0.01, large 0.01, performance 0.01, quality 0.005, …} → labels: "clustering algorithm", "clustering structure", "large data", "data quality"
• Topic {north 0.02, case 0.01, trial 0.01, iran 0.01, documents 0.01, walsh 0.009, reagan 0.009, charges 0.007, …} → label: "iran contra"
• Topic {tree 0.09, trees 0.08, spatial 0.08, b 0.05, r 0.04, disk 0.02, array 0.01, cache 0.01, …} → labels: "r tree", "b tree", "indexing methods"
(A background topic {the, of, a, and, to, data, …}, all with p(w) > 0.02, is shown for contrast.)
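Here is a sketch of the PMI-based label relevance score above, assuming co-occurrence is counted over small context units such as sentences (the counting scheme and toy data are illustrative, not the paper's exact estimator):

```python
import math
from collections import Counter

def label_relevance(topic, label, contexts):
    """score(l, theta) = sum_w p(w|theta) * PMI(w, l | C),
    with co-occurrence counted over context units (e.g., sentences)."""
    n = len(contexts)
    df = Counter()        # number of contexts containing word w
    co = Counter()        # number of contexts containing both w and the label
    label_df = sum(1 for c in contexts if label in c)
    for c in contexts:
        for w in set(c.split()):
            df[w] += 1
            if label in c:
                co[w] += 1
    score = 0.0
    for w, p_w in topic.items():
        if co[w] == 0 or label_df == 0:
            continue  # skip words that never co-occur with the label
        pmi = math.log((co[w] / n) / ((df[w] / n) * (label_df / n)))
        score += p_w * pmi
    return score

topic = {"clustering": 0.2, "algorithm": 0.1, "hash": 0.05, "key": 0.05}
contexts = ["a clustering algorithm partitions data",
            "hash join uses a hash table on the join key",
            "the clustering algorithm uses a distance function"]
print(label_relevance(topic, "clustering algorithm", contexts))
print(label_relevance(topic, "hash join", contexts))
```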
Model Different Contexts with Different Models (Discrimination, Comparison)

Example: Finding Evolutionary Patterns of Topics
Topics in KDD show content variations over contexts (years), e.g.:
• 1999: {web 0.009, classification 0.007, features 0.006, topic 0.005, …}
• 2000: {SVM 0.007, criteria 0.007, classification 0.006, linear 0.005, …}
• 2001: {decision 0.006, tree 0.006, classifier 0.005, class 0.005, Bayes 0.005, …}
• 2002: {mixture 0.005, random 0.006, cluster 0.006, clustering 0.005, variables 0.005, …}
• 2003: {classification 0.015, text 0.013, unlabeled 0.012, document 0.008, labeled 0.008, learning 0.007, …}; {information 0.012, web 0.010, social 0.008, retrieval 0.007, distance 0.005, networks 0.004, …}
• 2004: {topic 0.010, mixture 0.008, LDA 0.006, semantic 0.005, …}

Example: Finding Evolutionary Patterns of Topics (II)
[Figure from (Mei '05): normalized strength of themes (Biology Data, Web Information, Time Series, Classification, Association Rule, Clustering, Business) in KDD over 1999–2004 — strength variations over contexts.]

Views of Topics: Context-Specific Versions of Topics
A "view" is a context-specific version of a topic: one context yields one view, and a document selects from a mix of views. For two IR topics:
• Topic 1 (retrieval models) — Context 1, 1998–2006 (after "language modeling"): query, language, model, smoothing, estimate, EM; Context 2, 1977–1998 (before "language modeling"): vector space, TF-IDF, LSI, retrieval, Rocchio, weighting
• Topic 2 (feedback) — Context 1: mixture, feedback, pseudo, generation; Context 2: Okapi, feedback, judge, expansion, pseudo, term

Coverage of Topics: Distribution over Topics
A document about Hurricane Katrina mixes topics such as Oil Price, Government Response, Aid and Donation, and Background.
• A coverage of topics is a (strength) distribution over the topics.
• One context yields one coverage: e.g., the Texas context emphasizes Oil Price, while the Louisiana context emphasizes Government Response.
• A document selects from a mix of multiple coverages.

A General Solution: CPLSA
• CPLSA = Contextual Probabilistic Latent Semantic Analysis
• An extension of the PLSA model [Hofmann 99] that
  – introduces context variables
  – models views of topics
  – models coverage variations of topics
• Process of contextual text mining:
  – Instantiate CPLSA (context, views, coverages)
  – Fit the model to the text data (EM algorithm)
  – Compare a topic across different views
  – Compute strength dynamics of topics from the coverages
  – Compute other probabilistic topic patterns

The "Generation" Process
A document comes with context features, e.g., time = July 2005, location = Texas, author = Eric Brill, occupation = sociologist, age = 45+. To generate each word:
• choose a view of the topics according to the context (e.g., the Texas view, the July 2005 view, or the sociologist view);
• choose a topic coverage according to the context and document (e.g., the Texas, July 2005, or sociologist coverage);
• choose a theme (e.g., government response: government 0.3, response 0.2, …; donation: donate 0.1, relief 0.05, help 0.02, …; New Orleans: city 0.2, new 0.1, orleans 0.05, …) from the coverage;
• draw a word from that theme.

An Intuitive Example
• Two topics: web search and machine learning.
• "I am writing a WWW paper, so I will cover more of 'web search' than 'machine learning' — but of course I have my own taste." That is the coverage.
• "I am from a search engine company, so when I write about 'web search' I will focus on 'search engines' and 'online advertisements'." That is the view.

The Probabilistic Model
A probabilistic model explaining the generation of a document D and its context features C: an author who wants to write such a document will
• choose a view vi according to the view distribution p(vi | D, C);
• choose a coverage κj according to the coverage distribution p(κj | D, C);
• choose a theme θil according to the coverage κj;
• generate a word using θil.
The log-likelihood of the document collection D is:
  log p(D) = Σ_{(D,C)∈D} Σ_{w∈V} c(w, D) log Σ_{i=1}^{n} p(vi | D, C) Σ_{j=1}^{m} p(κj | D, C) Σ_{l=1}^{k} p(θl | κj) p(w | θil)
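A minimal sketch of this generative process (not the authors' implementation; the toy views, coverages, and their probability tables are illustrative assumptions):

```python
import random

def generate_word(views, coverages, p_view, p_coverage, rng=random):
    """One step of the CPLSA generative process:
    pick a view, pick a coverage, pick a theme, then draw a word."""
    view = rng.choices(list(views), weights=[p_view[v] for v in views])[0]
    cov = rng.choices(list(coverages), weights=[p_coverage[c] for c in coverages])[0]
    theme = rng.choices(list(coverages[cov]), weights=list(coverages[cov].values()))[0]
    word_dist = views[view][theme]   # theta_il: theme l under view i
    return rng.choices(list(word_dist), weights=list(word_dist.values()))[0]

# Toy instantiation: two views (Texas, Louisiana), two themes.
views = {
    "texas":     {"oil":  {"oil": 0.6, "price": 0.4},
                  "govt": {"response": 0.5, "criticism": 0.5}},
    "louisiana": {"oil":  {"shut-in": 0.5, "production": 0.5},
                  "govt": {"fema": 0.6, "response": 0.4}},
}
coverages = {"texas_cov": {"oil": 0.7, "govt": 0.3},
             "louisiana_cov": {"oil": 0.2, "govt": 0.8}}
p_view = {"texas": 0.5, "louisiana": 0.5}
p_coverage = {"texas_cov": 0.5, "louisiana_cov": 0.5}
print([generate_word(views, coverages, p_view, p_coverage) for _ in range(5)])
```

Fitting the model reverses this process: EM estimates the view, coverage, and theme distributions that maximize the log-likelihood above.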
Example Results: Query Log Analysis (Context = Days of the Week)
[Figure: total clicks and search difficulty H(Url | IP, Q) over January 2006; Jan. 1st is a Sunday.]
• Queries and clicks: more queries and clicks on weekdays.
• Search difficulty, measured as the conditional entropy H(Url | IP, Q): clicks are more difficult to predict on weekends.

Query Log Analysis (Context = Type of Query)
[Figure: query frequency over January 2006 for business queries (yahoo, mapquest, cnn) and consumer queries (sex, movie, mp3).]
• Business queries: a clear day-of-week pattern; weekdays are more frequent than weekends.
• Consumer queries: no clear day-of-week pattern; weekends are comparable to, or even more frequent than, weekdays.
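The search-difficulty measure above is the conditional entropy H(Url | IP, Q); here is a minimal sketch over toy click records (the record format is an illustrative assumption):

```python
import math
from collections import Counter

def conditional_entropy(records):
    """H(Url | IP, Q) over (ip, query, url) click records, in bits."""
    joint = Counter(records)                          # counts of (ip, q, url)
    cond = Counter((ip, q) for ip, q, _ in records)   # counts of (ip, q)
    n = len(records)
    return -sum(c / n * math.log2(c / cond[(ip, q)])
                for (ip, q, url), c in joint.items())

logs = [("1.2.3.4", "msr", "research.microsoft.com"),
        ("1.2.3.4", "msr", "research.microsoft.com"),
        ("5.6.7.8", "msr", "en.wikipedia.org/wiki/Molten_salt_reactor"),
        ("5.6.7.8", "msg", "msg.com")]
print(conditional_entropy(logs))  # 0.0: each (ip, query) pair fully determines the url
```

Lower entropy means the clicked URL is easier to predict from the user and query; the weekend spike in H(Url | IP, Q) is what the slide calls higher search difficulty.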
Bursting Topics in SIGMOD (Context = Time, in Years)
[Figure: topic strength in SIGMOD, 1975–2005, with bursts of Sensor Data, XML Data, Web Data, Data Streams, and Ranking/Top-K.]

Spatiotemporal Text Mining (Context = Time & Location)
Tracking the topic "government response" in Hurricane Katrina blogs:
• Week 1: the theme is strongest along the Gulf of Mexico.
• Week 2: the discussion moves towards the north and west.
• Week 3: the theme distributes more uniformly over the states.
• Week 4: the theme is again strong along the east coast and the Gulf of Mexico.
• Week 5: the theme fades out in most states.

Faceted Opinions (Context = Sentiments)
For "The Da Vinci Code" discussions, each topic's sentences are split by sentiment:
• Topic 1 (Movie) — neutral: "… Ron Howard's selection of Tom Hanks to play Robert Langdon."; "Directed by: Ron Howard. Writing credits: Akiva Goldsman …"; "After watching the movie I went online and did some research on …"; positive: "Tom Hanks stars in the movie — who can be mad at that?"; "Tom Hanks, who is my favorite movie star, acts the leading role."; negative: "But the movie might get delayed, and even killed off if he loses."; "protesting … will lose your faith by watching the movie."; "… so sick of people making such a big deal about a FICTION book and movie."
• Topic 2 (Book) — neutral: "I remembered when I first read the book, I finished the book in two days."; "I'm reading 'Da Vinci Code' now. … Anybody interested in it?"; positive: "Awesome book."; "So still a good book to pass time."; negative: "… so sick of people making such a big deal about a FICTION book and movie."; "This controversial book causes lots of conflict in western society."

Sentiment Dynamics (Context = Time & Sentiments)
For the query "the da vinci code":
• Facet "the book": bursts during the movie release, with positive > negative.
• Facet "the impact on religious beliefs": bursts during the movie release, with negative > positive.

Event Impact Analysis: IR Research (Context = Time, Relative to Events)
Theme "retrieval models" in SIGIR papers: {term 0.1599, relevance 0.0752, weight 0.0660, feedback 0.0372, independence 0.0311, model 0.0310, frequent 0.0233, probabilistic 0.0188, document 0.0173, …}
• Before 1992 (start of the TREC conferences): {vector 0.0514, concept 0.0298, extend 0.0297, model 0.0291, space 0.0236, boolean 0.0151, function 0.0123, feedback 0.0077, …}
• After 1992: {xml 0.0678, email 0.0197, model 0.0191, collect 0.0187, judgment 0.0102, rank 0.0097, subtopic 0.0079, …}
• Before 1998 (publication of "A language modeling approach to information retrieval"): {probabilist 0.0778, model 0.0432, logic 0.0404, ir 0.0338, boolean 0.0281, algebra 0.0200, estimate 0.0119, weight 0.0111, …}
• After 1998: {model 0.1687, language 0.0753, estimate 0.0520, parameter 0.0281, distribution 0.0268, probable 0.0205, smooth 0.0198, markov 0.0137, likelihood 0.0059, …}

Model Similar Contexts with Similar Models (Smoothing, Regularization)

Personalization with Backoff
• Ambiguous query: MSG
  – Madison Square Garden
  – Monosodium Glutamate
• Disambiguate based on the user's prior clicks
• We don't have enough data for everyone!
  – Back off to classes of users
• Proof of concept: classes defined by IP addresses
• Better: market segmentation (demographics); collaborative filtering (other users who click like me)

Context = IP
• Full personalization — every context has its own model, P(Url | IP4, Q) for the full IP (e.g., 156.111.188.243): sparse data!
• Personalization with backoff — similar contexts have similar models, backing off through IP prefixes: 156.111.188.* → P(Url | IP3, Q); 156.111.*.* → P(Url | IP2, Q); 156.*.*.* → P(Url | IP1, Q)
• No personalization — all contexts share the same model, P(Url | IP0, Q) for *.*.*.*

Backing Off by IP
The backoff model interpolates the five levels:
  P(Url | IP, Q) = Σ_{i=0}^{4} λi P(Url | IPi, Q)
[Figure: estimated λ weights λ4 … λ0.]
• The λs are estimated with EM and cross-validation (λ4: weight for the first 4 bytes of the IP; λ3: first 3 bytes; λ2: first 2 bytes; …).
• A little bit of personalization is better than too much (sparse data) or too little (missed opportunity).
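A minimal sketch of the interpolated backoff estimate (not the paper's implementation: the λ values are fixed here for illustration, whereas the talk estimates them with EM and cross-validation, and the toy logs are assumptions):

```python
from collections import Counter

def ip_prefixes(ip):
    """All backoff levels of an IP: 4 bytes down to the empty (global) prefix."""
    parts = ip.split(".")
    return [".".join(parts[:i]) for i in range(4, -1, -1)]  # IP4 ... IP0

def backoff_prob(url, ip, query, logs, lambdas):
    """P(Url | IP, Q) = sum_i lambda_i * P(Url | IP_i, Q)."""
    score = 0.0
    for lam, prefix in zip(lambdas, ip_prefixes(ip)):
        cond = [u for (lip, q, u) in logs
                if q == query and lip.startswith(prefix)]
        if cond:
            score += lam * Counter(cond)[url] / len(cond)
    return score

logs = [("156.111.188.243", "msg", "thegarden.com"),
        ("156.111.188.10",  "msg", "thegarden.com"),
        ("20.30.40.50",     "msg", "en.wikipedia.org/wiki/Monosodium_glutamate")]
lambdas = [0.4, 0.2, 0.2, 0.1, 0.1]  # lambda_4 ... lambda_0, assumed fixed here
print(backoff_prob("thegarden.com", "156.111.188.243", "msg", logs, lambdas))
```

Users in the 156.111.* block keep their "Madison Square Garden" interpretation of MSG, while the global level still contributes the other sense.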
Social Network as Correlated Contexts
Linked contexts are similar to each other: in a coauthor/citation graph, documents such as "A Language Modeling Approach to Information Retrieval", "Predicting Query Performance", "Optimization of Relevance Feedback Weights", and "Parallel Architecture in IR" are connected, and connected nodes should have similar models.

Social Network Context for Topic Modeling (e.g., a Coauthor Network)
• Context = author
• Coauthors = similar contexts
• Intuition: I work on topics similar to my neighbors'
• Result: topic distributions smoothed over the context network

Topic Modeling with Network Regularization (NetPLSA)
• Basic assumption (e.g., on a coauthor graph): related authors work on similar topics.
• Combine the PLSA log-likelihood with a graph harmonic regularizer (a generalization of [Zhu '03]):
  O(C, G) = (1 − λ) Σ_d Σ_w c(w, d) log Σ_{j=1}^{k} p(θj | d) p(w | θj)
            − λ · (1/2) Σ_{(u,v)∈E} w(u, v) Σ_{j=1}^{k} ( p(θj | u) − p(θj | v) )²
  where p(θj | d) is the topic distribution of a document, λ trades off topic fit against smoothness, w(u, v) is the importance (weight) of an edge, and the squared term penalizes differences in topic distribution between neighboring vertices.
• The regularizer can equivalently be written as (1/2) Σ_{j=1}^{k} fjᵀ Δ fj, where fj,u = p(θj | u) and Δ is the graph Laplacian.
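A minimal sketch of evaluating this regularized objective (not the authors' EM solver; the toy documents, graph, and parameter tables are illustrative assumptions):

```python
import numpy as np

def netplsa_objective(counts, p_z_d, p_w_z, edges, lam):
    """O(C,G) = (1-lam) * PLSA log-likelihood
              - lam * 0.5 * sum_{(u,v) in E} w(u,v) * ||p(z|u) - p(z|v)||^2."""
    mix = p_z_d @ p_w_z                      # p(w|d) = sum_j p(z_j|d) p(w|z_j)
    loglik = np.sum(counts * np.log(mix + 1e-12))
    reg = 0.5 * sum(w * np.sum((p_z_d[u] - p_z_d[v]) ** 2)
                    for u, v, w in edges)
    return (1 - lam) * loglik - lam * reg

# Toy setting: 3 documents (graph nodes), 2 topics, 4-word vocabulary.
counts = np.array([[3, 1, 0, 0], [2, 2, 1, 0], [0, 0, 2, 3]], float)
p_z_d = np.array([[0.9, 0.1], [0.7, 0.3], [0.2, 0.8]])          # p(z|d)
p_w_z = np.array([[0.5, 0.3, 0.1, 0.1], [0.1, 0.1, 0.4, 0.4]])  # p(w|z)
edges = [(0, 1, 1.0), (1, 2, 1.0)]           # (u, v, w(u,v))
print(netplsa_objective(counts, p_z_d, p_w_z, edges, lam=0.5))
```

Fitting maximizes this objective over p(z|d) and p(w|z); the regularizer pulls the topic distributions of linked nodes toward each other.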
Topical Communities with PLSA
Topics discovered by plain PLSA over the coauthor data give a noisy community assignment:
• Topic 1: term 0.02, question 0.02, protein 0.01, training 0.01, weighting 0.01, multiple 0.01, recognition 0.01, relations 0.01, library 0.01, …
• Topic 2: peer 0.02, patterns 0.01, mining 0.01, clusters 0.01, stream 0.01, frequent 0.01, e 0.01, page 0.01, gene 0.01, …
• Topic 3: visual 0.02, analog 0.02, neurons 0.02, vlsi 0.01, motion 0.01, chip 0.01, natural 0.01, cortex 0.01, spike 0.01, …
• Topic 4: interface 0.02, towards 0.02, browsing 0.02, xml 0.01, generation 0.01, design 0.01, engine 0.01, service 0.01, social 0.01, …

Topical Communities with NetPLSA
With network regularization, the same four topics become a coherent community assignment:
• Topic 1 (Information Retrieval): retrieval 0.13, information 0.05, document 0.03, query 0.03, text 0.03, search 0.03, evaluation 0.02, user 0.02, relevance 0.02, …
• Topic 2 (Data Mining): mining 0.11, data 0.06, discovery 0.03, databases 0.02, rules 0.02, association 0.02, patterns 0.02, frequent 0.01, streams 0.01, …
• Topic 3 (Machine Learning): neural 0.06, learning 0.02, networks 0.02, recognition 0.02, analog 0.01, vlsi 0.01, neurons 0.01, gaussian 0.01, network 0.01, …
• Topic 4 (Web): web 0.05, services 0.03, semantic 0.03, peer 0.02, ontologies 0.02, rdf, management 0.01, ontology, …

Smoothed Topic Map
Map a topic onto the network (e.g., using p(θ|a)), distinguishing core contributors, intermediate authors, and irrelevant authors. For the topic "information retrieval", NetPLSA yields a much smoother map than PLSA.

Smoothed Topic Map (II): The Windy States
• Blog articles containing "weather"; network = US states (adjacent states); topic = "windy".
• The NetPLSA map matches the real reference much more closely than the PLSA map.

Related Work
• Specific contextual text mining problems:
  – Multi-collection comparative mining (e.g., [Zhai et al. 04])
  – Temporal theme patterns (e.g., [Mei et al. 05], [Blei et al. 06], [Wang et al. 06])
  – Spatiotemporal theme analysis (e.g., [Mei et al. 06], [Wang et al. 07])
  – Author-topic analysis (e.g., [Steyvers et al. 04], [Zhou et al. 06])
  – …
• Probabilistic topic models:
  – Probabilistic latent semantic analysis (PLSA) (e.g., [Hofmann 99])
  – Latent Dirichlet allocation (LDA) (e.g., [Blei et al. 03])
  – Many extensions (e.g., [Blei et al. 05], [Li and McCallum 06])

Conclusions
• Context analysis matters in text mining and search.
• General methodology for modeling context in text:
  – A unified generative model for observations in the same context
  – Different models for different contexts
  – Similar models for similar contexts
  – Generation, discrimination, smoothing
• Many applications.

Discussion: Context in Search
• Not all contexts are useful
  – E.g., personalized search vs. search by time of day
  – How can we know which contexts are more useful?
• Many contexts are useful
  – E.g., personalized search, task-based search, localized search
  – How can we combine them?
• Can we do better than market segmentation?
  – Back off to users who search like me: collaborative search
  – But who searches like you?

References
• CPLSA: Q. Mei, C. Zhai. A Mixture Model for Contextual Text Mining. In Proceedings of KDD '06.
• NetPLSA: Q. Mei, D. Cai, D. Zhang, C. Zhai. Topic Modeling with Network Regularization. In Proceedings of WWW '08.
• Labeling: Q. Mei, X. Shen, C. Zhai. Automatic Labeling of Multinomial Topic Models. In Proceedings of KDD '07.
• Personalization: Q. Mei, K. Church. Entropy of Search Logs: How Hard is Search? With Personalization? With Backoff? In Proceedings of WSDM '08.
• Applications:
  – Q. Mei, C. Zhai. Discovering Evolutionary Theme Patterns from Text: An Exploration of Temporal Text Mining. In Proceedings of KDD '05.
  – Q. Mei, C. Liu, H. Su, C. Zhai. A Probabilistic Approach to Spatiotemporal Theme Pattern Mining on Weblogs. In Proceedings of WWW '06.
  – Q. Mei, X. Ling, M. Wondra, H. Su, C. Zhai. Topic Sentiment Mixture: Modeling Facets and Opinions in Weblogs. In Proceedings of WWW '07.

The End. Thank You!

Experiments
• Bibliography data and coauthor networks
  – DBLP: text = titles; network = coauthors
  – Four conferences (expect 4 topics): SIGIR, KDD, NIPS, WWW
• Blog articles and a geographic network
  – Blogs from spaces.live.com containing topical words, e.g., "weather"
  – Network: US states (adjacent states)

Coherent Topical Communities
PLSA vs. NetPLSA word distributions for two communities:
• "Machine learning (NIPS)": PLSA {visual 0.02, analog 0.02, neurons 0.02, vlsi 0.01, motion 0.01, chip 0.01, natural 0.01, cortex 0.01, spike 0.01} vs. NetPLSA {neural 0.06, learning 0.02, networks 0.02, recognition 0.02, analog 0.01, vlsi 0.01, neurons 0.01, gaussian 0.01, network 0.01}
• "Data mining (KDD)": PLSA {peer 0.02, patterns 0.01, mining 0.01, clusters 0.01, stream 0.01, frequent 0.01, e 0.01, page 0.01, gene 0.01} vs. NetPLSA {mining 0.11, data 0.06, discovery 0.03, databases 0.02, rules 0.02, association 0.02, patterns 0.02, frequent 0.01, streams 0.01}