Shallow Text Clustering Does Not Mean Weak Topics:
How Topic Identification Can Leverage Bigram Features
Julien Velcin 1, Mathieu Roche 2, and Pascal Poncelet 3

1 Université de Lyon (ERIC, Lyon 2), France. [email protected]
2 Cirad (TETIS), Montpellier, France. [email protected]
3 Université de Montpellier (LIRMM), France. [email protected]
Abstract. Text clustering and topic learning are two closely related tasks. In this paper, we show that topics can be learnt without requiring an exact categorization. In particular, experiments performed on two real case studies with a vocabulary based on bigram features lead to readable topics that cover most of the documents. Precision at 10 reaches 74% for a dataset of scientific abstracts with 10,000 features, which is 4% less than when using unigrams only but provides more interpretable topics.
1 Introduction
Text clustering is a huge research area with many applications, such as corpus visualization [12] and document indexing for information retrieval [25]. In addition to the
classical task of categorizing similar texts, people are usually interested in characterizing the clusters by means of concise descriptions called topics, so that they can easily
interpret categories and browse the document collection [24]. Topic extraction (or topic
learning) has been widely popularized by the success of Latent Semantic Analysis [4]
and Non-negative Matrix Factorization [18]. More recently, probabilistic topic models,
such as probabilistic Latent Semantic Analysis [8] and Latent Dirichlet Allocation [1],
have emerged as an efficient alternative implemented by many communities, from data
mining [19] to natural language processing [7] and social sciences and humanities [21].
They are now used as a routine in many systems dedicated to text analytics [2].
Despite these numerous works, some confusion often persists between the task of text clustering (grouping similar texts, i.e. working on a category's extension) and the task of topic identification (extracting within-category commonalities, i.e. their intension), as highlighted by [26]. We show here that not-so-good (shallow) clustering does not always mean weak topics. Another observation is related to the vocabulary used by the algorithms: most of the time, groups and topics are estimated from unigram tokens (words) [17], whose number is often arbitrarily fixed, or not fully justified [7]. When considering perplexity-based measures only, that is the goodness-of-fit of the probabilistic model on held-out data, words seem to play the main role [9]. However, it has been shown that n-grams (n ≥ 2) can be really useful, whether for constructing interpretable topics [23] or for improving topic consistency [16,28].
In: P. Cellier, T. Charnois, A. Hotho, S. Matwin, M.-F. Moens, Y. Toussaint (Eds.): Proceedings of
DMNLP, Workshop at ECML/PKDD, Riva del Garda, Italy, 2016.
Copyright © by the paper's authors. Copying only for private and academic purposes.
Based on these two observations, our contribution is twofold.
First, we show that a minimum number of features is necessary, but also sufficient, to achieve good accuracy, both in terms of clustering purity and topic description. This is not as obvious as it seems, since too many features might add noise and reduce the generalization ability of the model, which actually happens in supervised settings [13]. Provided we select enough features, it is therefore possible to choose phrases (here, bigrams) instead of single words. To the best of our knowledge, this is the first time that this result is clearly highlighted and quantified.
Second, we show that the bigram-based vocabulary provides really useful topic descriptions at the cost of a reasonable decrease in accuracy, with a drop of about 10%. Our results highlight that a careful choice of features allows a much better interpretation of the topics given by topic learning techniques (here, LDA). This work is closely related to the task of topic labeling but, here, the descriptive features are defined before the topic learning step. Therefore the extracted topics are characterized by the very terms that constitute their backbone, and not labeled by one of the many heuristics proposed in the literature [15,29]. Besides, post-processing can be applied afterwards to improve the output, such as selecting a single term among “data set” and “data sets” (see Section 3).
The paper is organized as follows. Section 2 defines the two complementary tasks
of text clustering and topic identification, highlighting their close connection but also
their differences. Section 3 shows the impact of changing the vocabulary in terms of
of both size and nature (unigrams versus bigrams). Finally, we conclude and suggest
future work in Section 4.
2 Text Clustering and Topic Identification: Two Related Tasks

2.1 Definition of Tasks
We first clarify the distinction between the two tasks. As illustrated in Fig. 1 (left), text clustering mainly aims at categorizing objects into clearly separated clusters. Even though the membership can be gradual, as in fuzzy clustering [5], or an object can be associated with several clusters, as in overlapping clustering [3], the common aim is to associate each object with one category so that subsequent decisions
can be made. In Fig. 1, we can observe that some texts are central to the categories (e.g., dc for cluster 1 and dd for cluster 2) whereas other texts lie between clusters (e.g., da, db and de). It is a natural feature of text clustering to assume that texts can be related to
several topics at the same time, which is at the basis of most topic models.
By adopting a different viewpoint, topic identification is more dedicated to extracting a set of topics that structure the dataset as shown in Fig. 1 (right). Topics can be
viewed as weighted lists of keywords (e.g., with LSA) or distributions over words (e.g.,
with LDA). In order to give an overview of the whole corpus to the final users, the usual
solution is to keep the top words (option 1 in Fig. 1) or the top n-grams (option 2 in
Fig. 1, here with n=2).
Obviously, the two tasks are related but not fully aligned. Hence, the documents da, db and de can be misclassified as long as we find the expected topics, which are more identifiable
from the colored groups of documents in Fig. 1. Let us note that we could easily obtain the
top frequent terms for each cluster as a post-processing stage. However, most of the
current state-of-the-art methods such as LSA, NMF and LDA address both tasks at the
same time, which explains the confusion that may arise.
[Figure 1 depicts documents da to de grouped into clusters 1-3 (text clustering, left) versus structured by topics 1-3 (topic identification, right). Topic 2 is described either by (1) its top unigrams: data, mining, algorithm, clustering, learning, paper, approach, classification, results, method, or by (2) its top bigrams: data mining, time series, experimental results, knowledge discovery, machine learning, nearest neighbor, support vector, feature selection, decision tree, association rule.]
Fig. 1. Distinction between the tasks of text clustering and topic identification (here, topic 2 has
been extracted from scientific publications clearly related to the “data mining” field).
Several previous works have used n-grams either during the topic learning process
[22,23] or as a post-processing step in order to find automatic labels [10,15,29]. However, they did not study the impact of both the vocabulary size (number of terms) and
term nature (unigrams versus bigrams), as we do in this paper.
2.2 Evaluation Measures
In the following sections, we run LDA on two datasets in order to address the two tasks simultaneously. For the sake of evaluation, we compare the output given by LDA to a gold standard that provides the real class label of each object. In order to get a partition, we associate each text d with its most likely topic ẑ = arg max_z p(z|d). This way, we reduce the expressive power of topic models, but we can leverage the usual Adjusted Rand Index (ARI) for assessing the clustering quality. We will see in the next section that the two datasets have been chosen precisely because they fit this crisp clustering assumption. The maximum ARI value of 1 is achieved with a perfect match between the partition and the gold standard.
Establishing a ground truth for the topic identification task is much more challenging. To begin with, we choose to restrict the evaluation to the quality of the top-10 terms associated with each cluster, since 10 is the number of terms usually shown to end users. Although we can imagine various ways to extract those lists from the gold standard partition (e.g., selecting the most discriminant terms), we have chosen to restrict ourselves to the most frequent terms for this study. In addition to the simplicity of this solution, we will see that the probability p(t|z) of the term t given the topic z output by LDA clearly favors frequent terms. We then compute the usual precision over these top-10 terms, noted pre@10. Let us remark that this way of challenging the list of top-K terms is
quite uncommon in the literature, where such lists are usually evaluated manually. The mapping between the true category and the topic z is simply derived by taking the category with the highest number of texts related to z. Obviously, pre@10 ranges from 0 (no common term) to 1 (perfect match between the two lists).
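To make the two measures concrete, the following sketch (our own illustration using NumPy and scikit-learn, not part of the original experiments) shows how the hard assignment, ARI and pre@10 could be computed from the doc-topic matrix output by LDA; all variable and function names are ours.

from collections import Counter

import numpy as np
from sklearn.metrics import adjusted_rand_score


def evaluate(doc_topic, gold_labels, topic_top10, class_top10):
    """doc_topic: (n_docs, n_topics) matrix of p(z|d); gold_labels: true class of
    each document; topic_top10: topic id -> its 10 top terms output by LDA;
    class_top10: class id -> the 10 most frequent terms of that class."""
    doc_topic = np.asarray(doc_topic)
    # Crisp clustering: each text goes to its most likely topic z_hat = argmax_z p(z|d)
    z_hat = doc_topic.argmax(axis=1)
    ari = adjusted_rand_score(gold_labels, z_hat)

    # Map each topic to the class holding the largest number of its documents,
    # then compute pre@10 as the overlap between the two top-10 term lists
    precisions = []
    for z in range(doc_topic.shape[1]):
        classes_in_z = [gold_labels[d] for d in np.where(z_hat == z)[0]]
        if not classes_in_z:
            continue  # empty topic: no mapping possible
        mapped_class = Counter(classes_in_z).most_common(1)[0][0]
        overlap = set(topic_top10[z]) & set(class_top10[mapped_class])
        precisions.append(len(overlap) / 10.0)
    return ari, float(np.mean(precisions))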
2.3 Datasets and Feature Extraction
dataset   #c   #docs    #unigrams   #bigrams
ART        5   18,465   13,778      30,522
20NG      20   18,828   40,142      54,741

Fig. 2. Basic statistics for the two datasets.
The two datasets are the set of scientific abstracts gathered by Tang et al. [20],
noted ART, and 20 Newsgroups, noted 20NG. Both datasets are available online4 . For
both datasets, we perform minimal preprocessing: lowercasing, removing punctuation
and English stopwords, removing the terms that occur in only one document. We set
the number of expected topics to be the true number of classes #c in the gold standard.
Basic statistics can be found in Fig. 2.
In our context, we extract bigrams based on classical patterns from the terminology extraction domain (i.e., noun-noun, adjective-noun, and so forth)5. The terms extracted from our corpora are then ranked according to their relative frequency. Other weightings have been tried (e.g., TF-IDF, Okapi, C-value), but it turns out that frequency is the most suitable ranking function for both tasks addressed in this study6.
3 Vocabulary Impact for Both Tasks
We here focus our attention on the importance of vocabulary size and term nature (unigrams versus bigrams). We used the parallel LDA implemented in the MALLET package7. The priors α and β are not automatically estimated (default configuration); we set them both to 0.1 after a preliminary grid search8. We set the maximum number of iterations for Gibbs sampling to 2,000, as suggested for this implementation, and run the algorithm ten times. Only the mean over the runs is reported, since the observed standard deviation does not exceed 0.01; we decided not to overload the figures.
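For readers who want to reproduce a comparable setup, here is a rough sketch in Python with gensim. The paper itself uses MALLET's parallel, Gibbs-sampling implementation, whereas gensim's LdaModel relies on variational inference, so this only approximates the configuration (symmetric priors fixed at 0.1, 2,000 iterations, K set to the true number of classes) and is not the exact tool used by the authors.

from gensim.corpora import Dictionary
from gensim.models import LdaModel


def train_lda(tokenized_docs, num_topics, vocab_size=10000, seed=0):
    dictionary = Dictionary(tokenized_docs)
    # keep only the vocab_size most frequent terms occurring in at least 2 documents
    dictionary.filter_extremes(no_below=2, no_above=1.0, keep_n=vocab_size)
    corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
    model = LdaModel(
        corpus=corpus,
        id2word=dictionary,
        num_topics=num_topics,  # set to the true number of classes (#c)
        alpha=0.1, eta=0.1,     # fixed symmetric priors, no automatic estimation
        iterations=2000,        # cap on inference iterations
        random_state=seed,
    )
    return model, dictionary, corpus

# Results reported in the paper are averaged over ten runs (e.g., ten different seeds).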
We keep the most frequent K words, K ranging from 500 to 30,000. We then compute the quality of LDA topics both for clustering (Fig. 3) and topic identification (Fig. 4).
4 http://arnetminer.org/collaboration and http://qwone.com/~jason/20Newsgroups/
5 To this end, we used the BioTex tool [11], freely available online: http://tubo.lirmm.fr/biotex/.
6 For instance, frequency-based ranking raises the ARI from 0.25 to 0.31 for ART with a vocabulary of 10,000 features.
7 Homepage of the MALLET package: http://mallet.cs.umass.edu
8 It turns out that, with this amount of data, the priors had a limited effect on the final results (±0.015 on ARI). We are aware that an automatic, dynamic estimation is possible [14], but we believe that a constant setting of the hyperparameters guarantees a fair comparison.
Fig. 3. Evolution of ARI (log scale for x axis).
We first observe that ARI increases rapidly with the vocabulary size below some threshold before converging9. This means that a fraction of the features is sufficient to obtain an important gain in ARI (5,000 unigrams for ART achieve 0.434 out of a maximum of 0.4346 reached with 9,000 unigrams; 10,000 unigrams for 20NG achieve 0.3961 out of a maximum of 0.4285 reached with 30,000 unigrams). For reference, recent works [6,27] focusing on text clustering reported ARI values of 0.397 and 0.425 on 20NG, respectively.
In addition, we observe that, with three times as many features, the bigram-based vocabulary achieves a good ARI score for ART, not very far from the maximum obtained with unigrams (0.3865 against 0.4346). This is clearly not the case for 20NG, where the ARI difference is about 0.26.
Fig. 4. Evolution of pre@10 (log scale for x axis).
We now take a closer look at the top terms returned by LDA, in comparison to
the reference terms extracted from the true classes. The precision achieved by keeping
the top-10 terms is shown in Fig. 4.
9 We capped the vocabulary size for ART-1g at the number of words occurring at least twice in the whole corpus (13,778 words).
datamining (ART)
1-grams          2-grams
data             data mining*
mining           data sets*
algorithm        association rules*
clustering       time series*
paper            data streams (13)
approach         experimental results*
learning         knowledge discovery*
classification   data set*
algorithms       machine learning (12)
results          support vector*

sci.space (20NG)
1-grams   2-grams
space     solar system*
earth     henry spencer*
launch    physical universe (30)
writes    night sky*
shuttle   space shuttle*
nasa      toronto zoology*
mission   oort cloud (12)
orbit     jet propulsion*
system    dick dunn (25)
solar     high-speed collision (26)

rec.sport.baseball (20NG)
1-grams    2-grams
writes     red sox*
game       san francisco (15)
article    los angeles (18)
year       st louis*
team       world series*
games      major league*
good       blue jays*
players    power play
baseball   mark singer*
time       san diego (12)

Fig. 5. Top-10 terms of selected topics for ART (left) and 20NG (middle and right). A * means that the bigram belongs to the top-10 bigrams extracted from the ground truth; otherwise its rank in the true list is given. Henry Spencer posted over 34,000 messages to the sci.space.* newsgroups (source: Wikipedia).
We were genuinely surprised to observe that the top-10 bigrams are highly competitive with the top-10 words. For ART, the precision achieved is only 4% to 7% lower, with a score of about 70% (up to 74% for 10,000 terms). We believe that this is a crucial observation: even though we get, on average, one correct term fewer than with single words, the topics are much more readable with seven bigrams than with eight unigrams. The bigram “data mining” is more informative than the two words “data” and “mining”, even if they appear in the same list.
This advantage is obvious if we take a look at the top terms given in the table
of Fig. 5 (top bigrams next to top unigrams). For 20NG, it is even more interesting:
despite the ARI collapse, four of the top bigrams are still accurate (six for the unigrams).
However, it does not mean that the other terms are unrelated. We have highlighted the
terms related to the ground truth with a * in Fig. 5, and noted their rank in the true
list otherwise. Let us note that a vocabulary of 500 terms is sufficient to achieve such performance in terms of precision. It seems that LDA easily finds the core of the topics, regardless of the outcome of the text clustering task.
Finally, we have run a last series of experiments in order to see the impact of a mixed
unigram-bigram vocabulary. To this end, we fixed the number of features to 10,000
and changed the proportion in steps of 5% (e.g., 80% unigrams with 20% bigrams).
The results confirm that bigrams might help increase the overall clustering accuracy, but the bonus is limited and not significant. The best proportion seems to be highly dependent on the dataset (e.g., we got +0.01 ARI for ART with 5% of bigrams and +0.024 for 20NG with 30%). However, we observed no consistent improvement in precision. When we take a closer look at the top terms, bigrams are overwhelmed by unigrams, which explains the unchanged score.
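For clarity, here is a minimal sketch of how such a mixed vocabulary can be assembled under a fixed feature budget; the helper name and frequency dictionaries are hypothetical, not taken from the paper.

def mixed_vocabulary(unigram_freq, bigram_freq, size=10000, bigram_ratio=0.20):
    """unigram_freq / bigram_freq: dicts mapping term -> corpus frequency."""
    n_bi = int(size * bigram_ratio)   # e.g. 20% bigrams
    n_uni = size - n_bi               # e.g. 80% unigrams
    top_uni = sorted(unigram_freq, key=unigram_freq.get, reverse=True)[:n_uni]
    top_bi = sorted(bigram_freq, key=bigram_freq.get, reverse=True)[:n_bi]
    return top_uni + top_bi

# In the experiments the proportion is varied in steps of 5% with a fixed budget of 10,000 features.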
4 Discussion and Future Work
Despite all the work done so far on integrating phrases into topic learning, we believe that this study is the first to highlight the potential of bigrams, not only for improving topic homogeneity (in addition to unigrams) or topic labeling, but for the whole task of
topic identification. Even though we have observed a clear gap between unigram and bigram frequencies, the bigram frequency seems to be sufficient to cover most of a topic's aspects while getting rid of the ambiguity carried by unigrams. Hence, it is easy to provide readable topics to end users with limited effort spent on term creation (actually, any bigram extraction library is expected to provide interesting features). Our preliminary experiments have shown that this reasoning can be transposed to trigrams, provided their cumulative frequency is sufficient. We observed a decrease of about 10% in pre@10 with trigrams (60% for ART and 30% for 20NG).
Interesting work lies ahead. One immediate follow-up is to design a new method
that directly focuses on topic identification. By further relaxing our expectations on text clustering, we may find a way to improve the top-K terms by favoring the topical core of each category (colored areas in Fig. 1). Improving the input representation, for instance by adding pseudo-counts for complex terms, can be a way to explore this idea. Another exciting, more theoretical direction is to investigate the trade-off between term frequency, term co-occurrences and performance. During our experiments, we observed a clear logarithmic correlation between the total number of tokens and the performance we can achieve (from R² = 0.94 for 20NG to R² = 0.98 for ART described with bigrams). This tells us that we cannot expect much from overly rare terms, since they lead to really sparse matrices. However, it seems that a combination of complementary rare terms can compete with more frequent words. Information theory might be used to study such issues further.
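As a side note, the logarithmic correlation mentioned above can be checked with a simple least-squares fit of performance against log(token count); the sketch below is our own illustration with placeholder inputs, not the paper's measurements.

import numpy as np


def log_fit(total_tokens, scores):
    """Fit score = a * log(tokens) + b and return the slope, intercept and R^2."""
    x = np.log(np.asarray(total_tokens, dtype=float))
    y = np.asarray(scores, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)
    pred = slope * x + intercept
    r2 = 1.0 - np.sum((y - pred) ** 2) / np.sum((y - np.mean(y)) ** 2)
    return slope, intercept, r2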
References
1. David M Blei, Andrew Y Ng, and Michael I Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
2. Richard T Carback III, Brad D Gaynor, Nathan R Shnidman, and Sang Hoon Chin. Systems
and methods for software analytics, December 17 2015. US Patent 20,150,363,197.
3. Guillaume Cleuziou. An extended version of the k-means method for overlapping clustering.
In Proceedings of the 19th International Conference on Pattern Recognition, 2008.
4. Scott C. Deerwester, Susan T Dumais, Thomas K. Landauer, George W. Furnas, and
Richard A. Harshman. Indexing by latent semantic analysis. JASIS, 41(6):391–407, 1990.
5. Joseph C. Dunn. Well-separated clusters and optimal fuzzy partitions. Journal of cybernetics,
4(1):95–104, 1974.
6. Siddharth Gopal and Yiming Yang. Von Mises-Fisher clustering models. In Proceedings of the International Conference on Machine Learning (ICML), pages 154–162, 2014.
7. David Hall, Daniel Jurafsky, and Christopher D Manning. Studying the history of ideas using
topic models. In Proceedings of the conference on empirical methods in natural language
processing, pages 363–371. Association for Computational Linguistics, 2008.
8. Thomas Hofmann. Probabilistic latent semantic analysis. In Proceedings of the 15th conference on Uncertainty in Artificial Intelligence, pages 289–296. Morgan Kaufmann, 1999.
9. Jey Han Lau, Timothy Baldwin, and David Newman. On collocations and topic models.
ACM Trans. Speech Lang. Process., 10(3):10:1–10:14, July 2013.
10. Jey Han Lau, Karl Grieser, David Newman, and Timothy Baldwin. Automatic labelling of
topic models. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1, pages 1536–1545. Association
for Computational Linguistics, 2011.
11. Juan Antonio Lossio-Ventura, Clement Jonquet, Mathieu Roche, and Maguelonne Teisseire.
BIOTEX: A system for biomedical terminology extraction, ranking, and validation. In Proceedings of the 13th International Semantic Web Conference (ISWC 2014), Riva del Garda, Italy, pages 157–160, 2014.
12. Dongning Luo, Jing Yang, Milos Krstajic, William Ribarsky, and Daniel Keim. Eventriver:
Visually exploring text collections with temporal references. IEEE Transactions on Visualization and Computer Graphics, 18(1):93–105, 2012.
13. Christopher D Manning, Prabhakar Raghavan, Hinrich Schütze, et al. Introduction to information retrieval, volume 1. Cambridge university press Cambridge, 2008.
14. Andrew McCallum, David M Mimno, and Hanna M Wallach. Rethinking LDA: Why priors matter. In Advances in Neural Information Processing Systems, pages 1973–1981, 2009.
15. Qiaozhu Mei, Xuehua Shen, and ChengXiang Zhai. Automatic labeling of multinomial topic
models. In 13th ACM SIGKDD international conference on Knowledge Discovery and Data
mining, pages 490–499, 2007.
16. Michael Nokel and Natalia Loukachevitch. A Method of Accounting Bigrams in Topic
Models. In Proceedings of the 11th Workshop on Multiword Expressions, pages 1–9, Denver,
Colorado, 2015. Association for Computational Linguistics.
17. Derek O’Callaghan, Derek Greene, Joe Carthy, and Pádraig Cunningham. An analysis of the
coherence of descriptors in topic modeling. Expert Systems with Applications, 42(13), 2015.
18. Pentti Paatero and Unto Tapper. Positive matrix factorization: A non-negative factor model
with optimal utilization of error estimates of data values. Environmetrics, 5(2), 1994.
19. Mark Steyvers, Padhraic Smyth, Michal Rosen-Zvi, and Thomas Griffiths. Probabilistic
author-topic models for information discovery. In Proceedings of the tenth ACM SIGKDD
international conference on Knowledge Discovery and Data mining, pages 306–315, 2004.
20. Jie Tang, Sen Wu, Jimeng Sun, and Hang Su. Cross-domain collaboration recommendation.
In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery
and data mining, pages 1285–1293, 2012.
21. Christoph Wagner, Vera Liao, Peter Pirolli, Lynn Nelson, and Markus Strohmaier. It’s not
in their tweets: Modeling topical expertise of twitter users. In Privacy, Security, Risk and
Trust (PASSAT), collocated with the IEEE international conference on Social Computing
(SocialCom), pages 91–100, 2012.
22. Hanna M Wallach. Topic modeling: beyond bag-of-words. In Proceedings of the 23rd
International Conference on Machine Learning, pages 977–984. ACM, 2006.
23. Xuerui Wang, Andrew McCallum, and Xing Wei. Topical n-grams: Phrase and topic discovery, with an application to information retrieval. In Proceedings of the 7th IEEE International
Conference on Data Mining (ICDM), 2007, pages 697–702, 2007.
24. Furu Wei, Shixia Liu, Yangqiu Song, Shimei Pan, Michelle X. Zhou, Weihong Qian, Lei Shi,
Li Tan, and Qiang Zhang. TIARA: a visual exploratory text analytic system. In Proceedings
of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining, Washington, DC, USA, July 25-28, 2010, pages 153–162, 2010.
25. Weili Wu, Hui Xiong, and Shashi Shekhar. Clustering and information retrieval, volume 11.
Springer Science & Business Media, 2013.
26. Pengtao Xie and Eric P Xing. Integrating document clustering and topic modeling. In
Proceedings of the 29th Conference on Uncertainty in Artificial Intelligence (UAI), 2013.
27. Xiaohui Yan, Jiafeng Guo, Yanyan Lan, and Xueqi Cheng. A biterm topic model for short
texts. In Proceedings of the 22nd international conference on World Wide Web (WWW),
pages 1445–1456. ACM, 2013.
28. Yi Zhang, Guangquan Zhang, Hongshu Chen, Alan L. Porter, Donghua Zhu, and Jie Lu.
Topic analysis and forecasting for science, technology and innovation: Methodology with
a case study focusing on big data research. Technological Forecasting and Social Change,
2016.
29. Wayne Xin Zhao, Jing Jiang, Jing He, Yang Song, Palakorn Achananuparp, Ee-Peng Lim,
and Xiaoming Li. Topical keyphrase extraction from twitter. In Proceedings of the 49th
Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT), pages 379–388, 2011.