Automatic Labeling of Multinomial Topic Models
Qiaozhu Mei, Xuehua Shen, ChengXiang Zhai
University of Illinois at Urbana-Champaign

Outline
• Background: statistical topic models
• Labeling a topic model – criteria and challenges
• Our approach: a probabilistic framework
• Experiments
• Summary

Statistical Topic Models for Text Mining
• Probabilistic topic modeling turns text collections into topic models (multinomial word distributions): PLSA [Hofmann 99], LDA [Blei et al. 03], Author-Topic [Steyvers et al. 04], Pachinko allocation [Li & McCallum 06], CPLSA [Mei & Zhai 06], Topics over time [Wang et al. 06], ...
• Example topics: (term 0.16, relevance 0.08, weight 0.07, feedback 0.04, independence 0.03, model 0.03, ...) and (web 0.21, search 0.10, link 0.08, graph 0.05, ...)
• Applications: subtopic discovery, topical pattern analysis, summarization, opinion comparison, ...

Topic Models: Hard to Interpret
• Using top words (term 0.16, relevance 0.08, weight 0.07, feedback 0.04, independence 0.03, model 0.03, frequent 0.02, probabilistic 0.02, document 0.02, ...) is automatic, but "term, relevance, independence, weight, feedback" is hard to make sense of.
• Human-generated labels (e.g., "Retrieval Models"?) make sense, but cannot scale up – consider labeling a topic with top words insulin, foraging, foragers, collected, grains, loads, collection, nectar, ...
• Question: can we automatically generate understandable labels for topics?

What is a Good Label?
• Example: a topic learned from SIGIR [Mei & Zhai 06]: term 0.1599, relevance 0.0752, weight 0.0660, feedback 0.0372, independence 0.0311, model 0.0310, frequent 0.0233, probabilistic 0.0188, document 0.0173, ...
• Candidate labels vary in quality: "Retrieval models" vs. "iPod Nano", "じょうほうけんさく" (Japanese for "information retrieval"), "Pseudo-feedback", "Information Retrieval"
• Criteria:
  – semantically close to the topic (relevance)
  – understandable – phrases?
  – high coverage inside the topic
  – discriminative across topics

Our Method
1. Candidate label pool: extract phrases from the collection (e.g., SIGIR) with an NLP chunker or ngram statistics: information retrieval, retrieval model, index structure, relevance feedback, ...
2. Relevance score: rank candidates by relevance to the topic (term 0.16, relevance 0.07, weight 0.07, feedback 0.04, independence 0.03, model 0.03, ...): information retrieval 0.26, retrieval models 0.19, IR models 0.17, pseudo feedback 0.06, ... (other topics rank, e.g., filtering 0.21, collaborative 0.15, ... or trec 0.18, evaluation 0.10, ...)
3. Discrimination: penalize labels that are also relevant to other topics: information retrieval 0.26 → 0.01, leaving retrieval models 0.20, IR models 0.18, pseudo feedback 0.09, ...
4. Coverage: re-rank so the selected labels cover the topic: retrieval models 0.20, pseudo feedback 0.09, ..., while the redundant IR models 0.18 → 0.02 and information retrieval stays at 0.01

Relevance (Task 2): the Zero-Order Score
• Intuition: prefer phrases that cover the top words of the topic well.
• Example: for a latent topic with p("clustering"|θ) = 0.4, p("dimensional"|θ) = 0.3, ..., p("shape"|θ) = 0.01, p("body"|θ) = 0.001, the good label l1 = "clustering algorithm" beats the bad label l2 = "body shape" because
  p("clustering algorithm" | θ) / p("clustering algorithm")  >  p("body shape" | θ) / p("body shape").
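Not part of the original slides: a minimal Python sketch of the zero-order score described above, assuming the label is scored by the log-ratio log p(l|θ) − log p(l) with the label's words treated as independent. The function name, smoothing constant, and toy probabilities are illustrative only.

```python
import math

def zero_order_score(label_words, p_w_topic, p_w_background, eps=1e-12):
    """Zero-order relevance: log p(l | theta) - log p(l), assuming the
    label's words are generated independently, so the score decomposes
    into a sum of per-word log-ratios."""
    score = 0.0
    for w in label_words:
        p_topic = p_w_topic.get(w, eps)      # p(w | theta)
        p_bg = p_w_background.get(w, eps)    # p(w | collection)
        score += math.log(p_topic) - math.log(p_bg)
    return score

# Toy numbers echoing the slide: "clustering algorithm" should beat "body shape".
p_w_topic = {"clustering": 0.4, "dimensional": 0.3, "algorithm": 0.05,
             "shape": 0.01, "body": 0.001}
p_w_background = {"clustering": 0.01, "dimensional": 0.01, "algorithm": 0.02,
                  "shape": 0.02, "body": 0.02}

print(zero_order_score(["clustering", "algorithm"], p_w_topic, p_w_background))
print(zero_order_score(["body", "shape"], p_w_topic, p_w_background))
```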
Relevance (Task 2): the First-Order Score
• Intuition: prefer phrases whose context distribution is similar to the topic distribution.
• Score(l, θ) = −D(θ ‖ l) ≈ Σ_w p(w|θ) · PMI(w, l | C)
• Example: for the clustering topic, the contexts of the good label l1 = "clustering algorithm" (clustering, dimension, partition, algorithm, ...) match p(w|θ) much better than the contexts of l2 = "hash join" (hash, join, key, table, code, search, map, ...); that is, p(w | "clustering algorithm") is closer to p(w|θ) than p(w | "hash join").

Discrimination and Coverage (Tasks 3 & 4)
• Discriminative across topics: high relevance to the target topic, low relevance to the other topics:
  Score'(l, θ_i) = Score(l, θ_i) − μ · Score(l, θ_1, ..., θ_{i−1}, θ_{i+1}, ..., θ_k)
• High coverage inside the topic: select labels with an MMR strategy:
  l̂ = argmax_{l ∈ L∖S} [ λ · Score(l, θ) − (1 − λ) · max_{l'∈S} Sim(l', l) ]
  (a code sketch of these scoring steps appears after the last slide)

Variations and Applications
• Labeling document clusters
  – document cluster → unigram language model
  – applicable to any task that produces a unigram language model
• Context-sensitive labels
  – the label of a topic is sensitive to the context
  – an alternative way to approach contextual text mining
  – e.g., a topic with tree, prune, root, branch: "tree algorithms" in CS? in horticulture? in marketing?

Experiments
• Datasets: SIGMOD abstracts; SIGIR abstracts; AP news data
  – candidate labels: significant bigrams; NLP chunks
• Topic models: PLSA, LDA
• Evaluation: human annotators compare labels generated by anonymized systems
  – the order of systems is randomly perturbed; scores are averaged over all sample topics

Result Summary
• Automatic phrase labels >> top words
• First-order relevance >> zero-order relevance
• Bigrams > NLP chunks
  – bigrams work better on literature; NLP chunks work better on news
• System labels << human labels
  – scientific literature is the easier task

Results: Sample Topic Labels
• SIGMOD topic (the, of, a, and, to, data, each > 0.02; clustering 0.02; time 0.01; clusters 0.01; databases 0.01; large 0.01; performance 0.01; quality 0.005) → labels: clustering algorithm; clustering structure; large data, data quality; quality, high data, data application
• AP news topic (north 0.02; case 0.01; trial 0.01; iran 0.01; documents 0.01; walsh 0.009; reagan 0.009; charges 0.007) → label: iran contra
• SIGMOD topic (tree 0.09; trees 0.08; spatial 0.08; b 0.05; r 0.04; disk 0.02; array 0.01; cache 0.01) → labels: r tree; b tree; indexing methods

Results: Context-Sensitive Labeling
• Topic: sampling, estimation, approximation, histogram, selectivity, histograms, ...
  – Context: databases (SIGMOD proceedings) → selectivity estimation; random sampling; approximate answers
  – Context: IR (SIGIR proceedings) → distributed retrieval; parameter estimation; mixture models
• Explores the different meanings of a topic in different contexts (content switch)
• An alternative approach to contextual text mining

Summary
• Labeling is a post-processing step for all multinomial topic models
• A probabilistic approach to generating good labels: understandable, relevant, high-coverage, discriminative
• Broadly applicable to mining tasks involving multinomial word distributions; context-sensitive
• Future work:
  – labeling hierarchical topic models
  – incorporating priors

Thanks! Please come to our poster tonight (#40)
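As a footnote to the scoring slides above (not part of the original deck): a minimal, self-contained Python sketch of the first-order relevance score, the discrimination penalty, and the MMR-based coverage step. It assumes PMI values and topic distributions have already been estimated from the collection; all function names, the weights mu and lambda, the choice to feed the discriminative score into MMR, and the toy data are illustrative, not the authors' implementation.

```python
def first_order_score(label, p_w_topic, pmi):
    """First-order relevance: sum_w p(w | theta) * PMI(w, label | C).
    `pmi` maps (word, label) pairs to pointwise mutual information
    estimated from the collection C."""
    return sum(p * pmi.get((w, label), 0.0) for w, p in p_w_topic.items())

def discriminative_score(label, topics, target, pmi, mu=0.7):
    """Score'(l, theta_i) = Score(l, theta_i) - mu * avg_j!=i Score(l, theta_j)."""
    own = first_order_score(label, topics[target], pmi)
    others = [first_order_score(label, t, pmi)
              for j, t in enumerate(topics) if j != target]
    return own - mu * sum(others) / max(len(others), 1)

def mmr_select(candidates, topics, target, pmi, sim, k=3, lam=0.8):
    """Pick k labels, trading off relevance to the target topic against
    redundancy with labels already selected (maximal marginal relevance)."""
    selected, pool = [], list(candidates)
    while pool and len(selected) < k:
        best = max(pool, key=lambda l:
                   lam * discriminative_score(l, topics, target, pmi)
                   - (1 - lam) * max((sim(l, s) for s in selected), default=0.0))
        selected.append(best)
        pool.remove(best)
    return selected

# Toy usage with two topics and word-overlap similarity between labels.
topics = [{"term": 0.16, "relevance": 0.08, "feedback": 0.04},
          {"web": 0.21, "search": 0.10, "link": 0.08}]
pmi = {("term", "retrieval models"): 2.0, ("relevance", "retrieval models"): 1.5,
       ("feedback", "pseudo feedback"): 2.5, ("term", "information retrieval"): 1.0,
       ("web", "information retrieval"): 1.2}
sim = lambda a, b: len(set(a.split()) & set(b.split())) / len(set(a.split()) | set(b.split()))
candidates = ["retrieval models", "pseudo feedback", "information retrieval"]
print(mmr_select(candidates, topics, target=0, pmi=pmi, sim=sim, k=2))
```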