A Clustering Method Based on Nonnegative Matrix Factorization for Text Mining
Farial Shahnaz

Topics
• Introduction
• Algorithm
• Performance
• Observations
• Conclusion and Future Work

Introduction: Basic Concepts
• Text mining: detection of trends or patterns in text data
• Clustering: grouping or classifying documents based on similarity of content

Clustering
• Manual vs. automated
• Supervised vs. unsupervised
• Hierarchical vs. partitional
• Objective: automated, unsupervised, partitional clustering of text data (documents)
• Method: Nonnegative Matrix Factorization (NMF)

Vector Space Model of Text Data
• Each document is represented as an n-dimensional vector
  – n: number of terms in the dictionary
  – Each vector component: importance of the corresponding term
• A document collection is represented as a term-by-document matrix

Term-by-Document Matrix
• Terms in the dictionary, n = 9: a, brown, dog, fox, jumped, lazy, over, quick, the
• Document 1: "a quick brown fox"
• Document 2: "jumped over the lazy dog"

Clustering Method: NMF
• Low-rank approximation of large sparse matrices
• Preserves the nonnegativity of the data
• Introduced the concept of parts-based representation (Lee and Seung, Nature, 1999)

Other Methods
• Other rank-reduction methods:
  – Principal Component Analysis (PCA)
  – Vector Quantization (VQ)
• Both produce basis vectors with negative entries
• Additive and subtractive combinations of the basis vectors yield the original document vectors

NMF
• Produces nonnegative basis vectors
• Additive combinations of the basis vectors yield the original document vectors
• The term-by-document matrix has all entries nonnegative

NMF
• Basis vectors are interpreted as semantic features or topics
• Documents are clustered on the basis of shared features
• Demonstrated by Xu et al. (2003):
  – Outperforms Singular Value Decomposition (SVD)
  – Comparable to graph-partitioning methods

Algorithm

NMF: Definition
Given:
• S: document collection
• V (m × n): term-by-document matrix
• m: number of terms in the dictionary
• n: number of documents in S

NMF is defined as a low-rank approximation of V in terms of some metric:
• Factor V into the product WH
  – W (m × k): contains the basis vectors
  – H (k × n): contains the linear combinations
  – k: selected number of topics or basis vectors, k << min(m, n)

NMF: Common Approach
• Minimize the objective function ||V − WH||², subject to W ≥ 0 and H ≥ 0

NMF: Existing Methods
Multiplicative Method (MM) [Lee and Seung]
• Based on multiplicative update rules
• ||V − WH|| is monotonically nonincreasing, and constant if and only if W and H are at a stationary point
• A version of the gradient descent (GD) optimization scheme

NMF: Existing Methods
Sparse Encoding [Hoyer]
• Based on a study of neural networks
• Enforces statistical sparsity of H
  – Minimizes the sum of the nonzero entries of H

NMF: Existing Methods
Sparse Encoding [Mu, Plemmons and Santago]
• Similar to Hoyer's method
• Enforces statistical sparsity of H using a regularization parameter
  – Minimizes the number of nonzeros in H

NMF: Proposed Algorithm
Hybrid method, called GD-CLS:
• W is approximated using the Multiplicative Method
• H is calculated using a Constrained Least Squares (CLS) model as the metric
  – Penalizes the number of nonzeros in H
  – Similar to the method of Mu, Plemmons and Santago

GD-CLS Performance

Text Collections Used
• Two benchmark topic-detection text collections:
  – Reuters: collection of documents on assorted topics
  – TDT2: transcripts from news media

Accuracy Metric
• Defined by AC = ( Σ i=1..n δ(d_i) ) / n
  – d_i: document number i
  – δ(d_i) = 1 if the computed and given topic labels match
  – δ(d_i) = 0 otherwise
• Parameters tested: k = 2, 4, 6, 8, 10, 15, 20 and λ = 0.1, 0.01, 0.001

Results for Reuters
(results figure)

Results for TDT2
(results figure)

Observations

Observations: AC
• AC is inversely proportional to k
• The nature of the collection affects AC
  – Reuters topics: earn, interest, cocoa
  – TDT2 topics: Asian economic crisis, Oprah lawsuit

Observations: λ Parameter
• AC declines as λ increases (λ is mostly effective for homogeneous text collections)
• CPU time declines as λ increases

Observations: Cluster Size
• Imbalance in cluster sizes has an adverse effect on AC

Conclusion & Future Work
GD-CLS can be used to cluster text data effectively. Further development involves:
• Smart updating
• Use in bioinformatics
• Developing a user interface
• Converting the implementation to C++
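The term-by-document matrix example above ("a quick brown fox" / "jumped over the lazy dog") can be reproduced with a short sketch. Plain term counts are assumed here; practical systems often use weighted frequencies such as tf-idf instead.

```python
# Build the 9-term-by-2-document matrix from the slides' toy example.
# Entries are raw term counts (an assumption; weighting schemes vary).
docs = ["a quick brown fox", "jumped over the lazy dog"]
dictionary = sorted({term for doc in docs for term in doc.split()})

matrix = [[doc.split().count(term) for doc in docs] for term in dictionary]
for term, row in zip(dictionary, matrix):
    print(f"{term:>6}  {row}")
# The dictionary comes out as the 9 terms listed on the slide:
# a, brown, dog, fox, jumped, lazy, over, quick, the
```

Each column of the matrix is one document's vector; each row corresponds to one dictionary term, and every entry is nonnegative, which is what makes NMF applicable.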
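A minimal NumPy sketch of the hybrid GD-CLS iteration described above: W is updated with the Lee-Seung multiplicative rule, and H is obtained from a regularized least-squares solve with negative entries clipped to zero. The initialization, iteration count, and the tiny test matrix are illustrative assumptions, not the paper's exact experimental setup.

```python
import numpy as np

def gd_cls(V, k, lam=0.01, n_iter=200, seed=0):
    """Sketch of the hybrid scheme: multiplicative update for W,
    constrained least squares for H (negative entries zeroed)."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, k))
    H = rng.random((k, n))
    eps = 1e-9  # guards against division by zero in the W update
    for _ in range(n_iter):
        # Lee-Seung multiplicative update for the basis vectors W
        W *= (V @ H.T) / (W @ H @ H.T + eps)
        # CLS step: solve (W'W + lam*I) H = W'V, then enforce H >= 0
        G = W.T @ W + lam * np.eye(k)
        H = np.linalg.solve(G, W.T @ V)
        H[H < 0] = 0.0
    return W, H

# Toy 4-terms-by-4-documents matrix with two obvious topics.
V = np.array([[1.0, 1.0, 0.0, 0.0],
              [1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0, 1.0]])
W, H = gd_cls(V, k=2)
# Each document is assigned to the topic with the largest coefficient in H.
labels = H.argmax(axis=0)
print(labels)  # documents 0 and 1 should share one label, 2 and 3 the other
```

The regularization parameter lam plays the role of λ from the slides: larger values push H toward sparsity at some cost in approximation accuracy.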
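The accuracy metric AC from the performance section follows directly from its definition (δ(d_i) = 1 on a topic-label match, 0 otherwise). The label lists below are made-up illustrations; note that in practice computed cluster indices must first be mapped onto the reference topic labels before this comparison is meaningful.

```python
def accuracy(computed, reference):
    """AC = (1/n) * sum over documents of delta(d_i),
    where delta(d_i) is 1 when the topic labels match."""
    n = len(reference)
    return sum(1 for c, r in zip(computed, reference) if c == r) / n

# Hypothetical labels for five documents: four of the five match.
print(accuracy([0, 0, 1, 1, 1], [0, 0, 1, 1, 0]))  # -> 0.8
```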