Automatic Labeling of Multinomial Topic Models
Qiaozhu Mei, Xuehua Shen, ChengXiang Zhai
University of Illinois at Urbana-Champaign

Outline
• Background: statistical topic models
• Labeling a topic model – criteria and challenges
• Our approach: a probabilistic framework
• Experiments
• Summary

Statistical Topic Models for Text Mining
• Probabilistic topic modeling turns text collections into topic models (multinomial word distributions): PLSA [Hofmann 99], LDA [Blei et al. 03], Author-Topic [Steyvers et al. 04], Pachinko allocation [Li & McCallum 06], CPLSA [Mei & Zhai 06], Topics over time [Wang et al. 06], ...
• Example topics: (term 0.16, relevance 0.08, weight 0.07, feedback 0.04, independence 0.03, model 0.03, ...) and (web 0.21, search 0.10, link 0.08, graph 0.05, ...)
• Applications: subtopic discovery, topical pattern analysis, summarization, opinion comparison, ...

Topic Models: Hard to Interpret
• Using top words (term 0.16, relevance 0.08, weight 0.07, feedback 0.04, independence 0.03, model 0.03, frequent 0.02, probabilistic 0.02, document 0.02, ...) is automatic, but "term, relevance, independence, weight, feedback" is hard to make sense of.
• Human-generated labels (e.g., "Retrieval Models"?) make sense, but cannot scale up – consider labeling a topic with top words insulin, foraging, foragers, collected, grains, loads, collection, nectar, ...
• Question: can we automatically generate understandable labels for topics?

What is a Good Label?
• Example: a topic learned from SIGIR [Mei & Zhai 06]: term 0.1599, relevance 0.0752, weight 0.0660, feedback 0.0372, independence 0.0311, model 0.0310, frequent 0.0233, probabilistic 0.0188, document 0.0173, ...
• Candidate labels vary in quality: "Retrieval models" vs. "iPod Nano", "じょうほうけんさく" (Japanese for "information retrieval"), "Pseudo-feedback", "Information Retrieval"
• Criteria:
  – semantically close to the topic (relevance)
  – understandable – phrases?
  – high coverage inside the topic
  – discriminative across topics

Our Method
1. Candidate label pool: extract phrases from the collection (e.g., SIGIR) with an NLP chunker or ngram statistics: information retrieval, retrieval model, index structure, relevance feedback, ...
2. Relevance score: rank candidates by relevance to the topic (term 0.16, relevance 0.07, weight 0.07, feedback 0.04, independence 0.03, model 0.03, ...): information retrieval 0.26, retrieval models 0.19, IR models 0.17, pseudo feedback 0.06, ... (other topics rank, e.g., filtering 0.21, collaborative 0.15, ... or trec 0.18, evaluation 0.10, ...)
3. Discrimination: penalize labels that are also relevant to other topics: information retrieval 0.26 → 0.01, leaving retrieval models 0.20, IR models 0.18, pseudo feedback 0.09, ...
4. Coverage: re-rank so the selected labels cover the topic: retrieval models 0.20, pseudo feedback 0.09, ..., while the redundant IR models 0.18 → 0.02 and information retrieval stays at 0.01

Relevance (Task 2): the Zero-Order Score
• Intuition: prefer phrases that cover the top words of the topic well.
• Example: for a latent topic with p("clustering"|θ) = 0.4, p("dimensional"|θ) = 0.3, ..., p("shape"|θ) = 0.01, p("body"|θ) = 0.001, the good label l1 = "clustering algorithm" beats the bad label l2 = "body shape" because
  p("clustering algorithm" | θ) / p("clustering algorithm")  >  p("body shape" | θ) / p("body shape").
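Not part of the original slides: a minimal Python sketch of the zero-order score described above, assuming the label is scored by the log-ratio log p(l|θ) − log p(l) with the label's words treated as independent. The function name, smoothing constant, and toy probabilities are illustrative only.

```python
import math

def zero_order_score(label_words, p_w_topic, p_w_background, eps=1e-12):
    """Zero-order relevance: log p(l | theta) - log p(l), assuming the
    label's words are generated independently, so the score decomposes
    into a sum of per-word log-ratios."""
    score = 0.0
    for w in label_words:
        p_topic = p_w_topic.get(w, eps)      # p(w | theta)
        p_bg = p_w_background.get(w, eps)    # p(w | collection)
        score += math.log(p_topic) - math.log(p_bg)
    return score

# Toy numbers echoing the slide: "clustering algorithm" should beat "body shape".
p_w_topic = {"clustering": 0.4, "dimensional": 0.3, "algorithm": 0.05,
             "shape": 0.01, "body": 0.001}
p_w_background = {"clustering": 0.01, "dimensional": 0.01, "algorithm": 0.02,
                  "shape": 0.02, "body": 0.02}

print(zero_order_score(["clustering", "algorithm"], p_w_topic, p_w_background))
print(zero_order_score(["body", "shape"], p_w_topic, p_w_background))
```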
Relevance (Task 2): the First-Order Score
• Intuition: prefer phrases whose context distribution is similar to the topic distribution.
• Score(l, θ) = −D(θ ‖ l) ≈ Σ_w p(w|θ) · PMI(w, l | C)
• Example: for the clustering topic, the contexts of the good label l1 = "clustering algorithm" (clustering, dimension, partition, algorithm, ...) match p(w|θ) much better than the contexts of l2 = "hash join" (hash, join, key, table, code, search, map, ...); that is, p(w | "clustering algorithm") is closer to p(w|θ) than p(w | "hash join").

Discrimination and Coverage (Tasks 3 & 4)
• Discriminative across topics: high relevance to the target topic, low relevance to the other topics:
  Score'(l, θ_i) = Score(l, θ_i) − μ · Score(l, θ_1, ..., θ_{i−1}, θ_{i+1}, ..., θ_k)
• High coverage inside the topic: select labels with an MMR strategy:
  l̂ = argmax_{l ∈ L∖S} [ λ · Score(l, θ) − (1 − λ) · max_{l'∈S} Sim(l', l) ]
  (a code sketch of these scoring steps appears after the last slide)

Variations and Applications
• Labeling document clusters
  – document cluster → unigram language model
  – applicable to any task that produces a unigram language model
• Context-sensitive labels
  – the label of a topic is sensitive to the context
  – an alternative way to approach contextual text mining
  – e.g., a topic with tree, prune, root, branch: "tree algorithms" in CS? in horticulture? in marketing?

Experiments
• Datasets: SIGMOD abstracts; SIGIR abstracts; AP news data
  – candidate labels: significant bigrams; NLP chunks
• Topic models: PLSA, LDA
• Evaluation: human annotators compare labels generated by anonymized systems
  – the order of systems is randomly perturbed; scores are averaged over all sample topics

Result Summary
• Automatic phrase labels >> top words
• First-order relevance >> zero-order relevance
• Bigrams > NLP chunks
  – bigrams work better on literature; NLP chunks work better on news
• System labels << human labels
  – scientific literature is the easier task

Results: Sample Topic Labels
• SIGMOD topic (the, of, a, and, to, data, each > 0.02; clustering 0.02; time 0.01; clusters 0.01; databases 0.01; large 0.01; performance 0.01; quality 0.005) → labels: clustering algorithm; clustering structure; large data, data quality; quality, high data, data application
• AP news topic (north 0.02; case 0.01; trial 0.01; iran 0.01; documents 0.01; walsh 0.009; reagan 0.009; charges 0.007) → label: iran contra
• SIGMOD topic (tree 0.09; trees 0.08; spatial 0.08; b 0.05; r 0.04; disk 0.02; array 0.01; cache 0.01) → labels: r tree; b tree; indexing methods

Results: Context-Sensitive Labeling
• Topic: sampling, estimation, approximation, histogram, selectivity, histograms, ...
  – Context: databases (SIGMOD proceedings) → selectivity estimation; random sampling; approximate answers
  – Context: IR (SIGIR proceedings) → distributed retrieval; parameter estimation; mixture models
• Explores the different meanings of a topic in different contexts (content switch)
• An alternative approach to contextual text mining

Summary
• Labeling is a post-processing step for all multinomial topic models
• A probabilistic approach to generating good labels: understandable, relevant, high-coverage, discriminative
• Broadly applicable to mining tasks involving multinomial word distributions; context-sensitive
• Future work:
  – labeling hierarchical topic models
  – incorporating priors

Thanks! Please come to our poster tonight (#40)
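As a footnote to the scoring slides above (not part of the original deck): a minimal, self-contained Python sketch of the first-order relevance score, the discrimination penalty, and the MMR-based coverage step. It assumes PMI values and topic distributions have already been estimated from the collection; all function names, the weights mu and lambda, the choice to feed the discriminative score into MMR, and the toy data are illustrative, not the authors' implementation.

```python
def first_order_score(label, p_w_topic, pmi):
    """First-order relevance: sum_w p(w | theta) * PMI(w, label | C).
    `pmi` maps (word, label) pairs to pointwise mutual information
    estimated from the collection C."""
    return sum(p * pmi.get((w, label), 0.0) for w, p in p_w_topic.items())

def discriminative_score(label, topics, target, pmi, mu=0.7):
    """Score'(l, theta_i) = Score(l, theta_i) - mu * avg_j!=i Score(l, theta_j)."""
    own = first_order_score(label, topics[target], pmi)
    others = [first_order_score(label, t, pmi)
              for j, t in enumerate(topics) if j != target]
    return own - mu * sum(others) / max(len(others), 1)

def mmr_select(candidates, topics, target, pmi, sim, k=3, lam=0.8):
    """Pick k labels, trading off relevance to the target topic against
    redundancy with labels already selected (maximal marginal relevance)."""
    selected, pool = [], list(candidates)
    while pool and len(selected) < k:
        best = max(pool, key=lambda l:
                   lam * discriminative_score(l, topics, target, pmi)
                   - (1 - lam) * max((sim(l, s) for s in selected), default=0.0))
        selected.append(best)
        pool.remove(best)
    return selected

# Toy usage with two topics and word-overlap similarity between labels.
topics = [{"term": 0.16, "relevance": 0.08, "feedback": 0.04},
          {"web": 0.21, "search": 0.10, "link": 0.08}]
pmi = {("term", "retrieval models"): 2.0, ("relevance", "retrieval models"): 1.5,
       ("feedback", "pseudo feedback"): 2.5, ("term", "information retrieval"): 1.0,
       ("web", "information retrieval"): 1.2}
sim = lambda a, b: len(set(a.split()) & set(b.split())) / len(set(a.split()) | set(b.split()))
candidates = ["retrieval models", "pseudo feedback", "information retrieval"]
print(mmr_select(candidates, topics, target=0, pmi=pmi, sim=sim, k=2))
```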