Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Discovering Evolutionary Theme Patterns from Text - An Exploration of Temporal Text Mining Qiaozhu Mei and ChengXiang Zhai Department of Computer Science University of Illinois at Urbana Champaign SIGKDD’05 Introduction • Temporal Text Mining (TTM): discovering temporal patterns in text information collected over time. • In this paper – Discovering and summarizing the evolutionary patterns of themes in a text stream – Revealing the life cycle of a theme Introduction (cont.) • We solve this problem through – 1. Discovering latent themes from text – 2. Constructing an evolution graph of themes – 3. Analyzing life cycles of themes. • Evaluation – News articles – The abstracts of the ACM KDD conference papers Definitions • Time indexed documents C = {d1, d2, …, dT} vocabulary set V = {w1, …, w|V|} • Theme θ – A unigram language model {p(w|θ)} • Theme Span γ = 〈θ, s(γ), t(γ)〉 – If s = 1 and t = T, then γ is a trans-collection theme • Evolutionary Transition – If t(γ1) ≦ s(γ2) and similarity(γ1,γ2) ﹥threshold, we say that there is an evolutionary transition from γ1 to γ2, γ1 γ2 Definitions (cont.) • Theme Evolution Thread – A sequence of theme spans γ0, γ1, …, γn such that γi γi+1 • Theme Evolution Graph – Weighted directed graph G = (N,E), where N is the set of all theme spans, E is the set of evolutionary transition. Example of Theme Evolution Graph Theme Span Theme Evolution Thread Evolutionary Transition Evolution Graph Discovery • Partition the documents into sub-collections C = C1∪ C2∪…∪Cn • Extract the most salient themes i = { i ,1, …, i,ki } from each sub-collection Ci • For any themes 1 1 and 2 2 where i < j, decide whether there is an evolutionary transition. Theme Extraction • Let θ1, …, θk be k themes and θB be a background model for the whole collection C. • A document d is regarded as a sample of the following mixture model: w: a word in d, πd,j : mixing weight for d choosing θj, λ: mixing weight for θB • The log-likelyhood of Ci, • Using EM algorithm to train j , d , j | d Ci , 1 j k Parameter Estimation • {zd,w} is a hidden variable • p(zd,w = j) indicates that the word w in document d is generated using theme j given that w is not generated from the background mode. Evolutionary Transition Discovery • For every pair of theme spans γ1 = 〈θ1, s(γ1), t(γ1)〉and γ2 = 〈θ2, s(γ2), t(γ2)〉 where t(γ1) ≦s(γ2) • Kullback-Leibler divergence • If D(θ2 || θ1) ﹥ξ, then γ1 γ2 Analysis of Theme Life Cycles • Theme Life Cycle : the strength distribution of the trans-collection theme over the entire time line. • Assume the collection is generated from HMM – States → Themes – Output symbol set → V – Output probability distribution → the multinomial distribution of words of that state • Obtain state sequence with Viterbi algorithm Analysis of Theme Life Cycles (cont.) • The absolute and relative strengths of theme i at time t 1 if word = 0 otherwise is labeled as theme i Experimental Data Sets • News about Asia Tsunami – Dec. 19 2004 to Feb. 8 2005 (50 days) – Downloaded with query “tsunami” • The abstracts in KDD conference proceeding from 1999 to 2004 Theme Spans from Tsunami Theme Evolution Graph for Tsunami c: Theme Life Cycle in CNN Absolute life cycle in CNN data Theme Life Cycle in XINHUA • A Absolute life cycle in XINHUA data Normalized life cycle in XINHUA data Theme Spans from KDD Theme Evolution Graph for KDD a: classification Typical classification tech. Web classification Clustering & random variables Theme Life Cycle for KDD Business Web Info. Classification Association Rule Biology Data Time series Clustering Conclusions • We propose methods to discover evolutionary theme patterns and analyze the life cycle of each theme • The proposed methods can generate meaningful temporal theme structures on the two experimental data sets. • Our methods are generally applicable to any text stream data. • Future works – Hierarchical theme clustering – Temporal theme mining system