Incremental Document Clustering Using Cluster Similarity Histograms
Khaled M. Hammouda
Mohamed S. Kamel
Department of Systems Design Engineering
University of Waterloo
Waterloo, Ontario, Canada N2L 3G1
E-mail: {hammouda,mkamel}@pami.uwaterloo.ca
Abstract
Clustering of large collections of text documents is a key
process in providing a higher level of knowledge about the
underlying inherent classification of the documents. Web
documents, in particular, are of great interest since managing, accessing, searching, and browsing large repositories of web content requires efficient organization. Incremental clustering algorithms are often preferred to traditional clustering techniques, since they can be applied in
a dynamic environment such as the Web. An incremental
document clustering algorithm is introduced in this paper,
which relies only on pair-wise document similarity information. Clusters are represented using a Cluster Similarity
Histogram, a concise statistical representation of the distribution of similarities within each cluster, which provides
a measure of cohesiveness. The measure guides the incremental clustering process. Complexity analysis and experimental results are discussed and show that the algorithm requires less computational time than standard methods while
achieving a comparable or better clustering quality.
1. Introduction
In dynamic information environments, such as the World
Wide Web, it is usually desirable to apply adaptive methods for document organization, such as clustering. Incremental clustering methods are of particular interest for clustering web content because they can cope with its scale and constant change. Applications of document clustering include: clustering of retrieved documents to present organized and understandable results to the user, clustering documents in a collection (e.g. digital libraries), automated (or semi-automated) creation of document taxonomies (e.g. Yahoo and Open Directory styles), and efficient information retrieval by focusing on relevant subsets (clusters) rather than the whole collection.
What distinguishes incremental clustering from traditional clustering methods [4] is the ability to process new data as they are added to the collection. This allows dynamic tracking of the ever-growing volume of information placed on the web every day, without having to perform complete re-clustering. Various approaches have been proposed, including Suffix Tree Clustering (STC) [8], DC-tree clustering [7], incremental hierarchical clustering [2], and others.
In this paper we present an incremental clustering algorithm based on maintaining high cluster cohesiveness,
represented as a Cluster Similarity Histogram, which is a
concise statistical representation of the pair-wise document
similarities within each cluster. Clusters are required to
maintain high cohesiveness, expressed in terms of the histogram, while new documents are being added. The process
allows for document reassignment to clusters which were
created after the document was introduced.
The algorithm is compared with other standard clustering techniques, namely Hierarchical Agglomerative Clustering (HAC), Single-Pass Clustering, and k-Nearest Neighbor Clustering. These clustering techniques were chosen
because they require pair-wise document similarity information only. Other classes of clustering techniques that rely
on having information about the original document vectors
(such as k-means) were not considered for comparison since
such comparison would not be accurate due to differences
in the input to each technique. Each clustering algorithm
accepts as its input a document similarity matrix without
having to rely on the original feature vectors.
2. Document clustering
In this section we present a brief overview of incremental clustering algorithms. We also briefly discuss one
non-incremental clustering algorithm to provide a balanced
view. A comparison of some document clustering techniques can be found in [6].
Non-incremental clustering methods mainly rely on having the whole document set ready before the algorithm is applied. This is typical of offline processing scenarios. One of the most widely used non-incremental clustering algorithms is Hierarchical Agglomerative Clustering (HAC) [4]. It is a straightforward greedy algorithm that produces a hierarchical grouping of the data. It starts with each instance in its own cluster, then repeatedly merges the two most similar clusters at each iteration. There are different approaches for computing the similarity between two clusters. The complexity of HAC is O(n²), which can become infeasible for very large document sets.
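For illustration only, the following is a minimal Python sketch of complete-linkage HAC applied to a precomputed similarity matrix via SciPy; the toy matrix and the dendrogram cut threshold are illustrative assumptions, not values from the paper.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Toy pairwise similarity matrix for three documents (illustrative values).
sim = np.array([[1.0, 0.8, 0.1],
                [0.8, 1.0, 0.2],
                [0.1, 0.2, 1.0]])
dist = 1.0 - sim                 # convert similarities to distances
np.fill_diagonal(dist, 0.0)      # squareform expects a zero diagonal

Z = linkage(squareform(dist), method='complete')   # complete-linkage HAC
labels = fcluster(Z, t=0.5, criterion='distance')  # cut the dendrogram at distance 0.5
print(labels)  # e.g. [1 1 2]: the two highly similar documents are merged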
Incremental clustering algorithms work by assigning objects to their respective clusters as they arrive. Problems
faced by such algorithms include how to find the appropriate cluster to assign the next object to, how to deal with insertion order problems, and how to reassign objects to other clusters (ones that were not present when the object was first introduced). We very briefly review four incremental clustering algorithms.
Single-Pass Clustering. This algorithm processes documents sequentially and compares each document to all existing clusters. If the similarity between the document and the closest cluster is above a certain threshold, the document is added to that cluster; otherwise it forms its own cluster. The similarity between a document and a cluster is usually computed as the average similarity of the document to all documents in that cluster.
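As a rough sketch (not the authors' implementation), single-pass clustering over a pairwise similarity matrix might look as follows in Python; the threshold value and the use of average document-to-cluster similarity follow the description above, but the details are assumptions.

import numpy as np

def single_pass(sim, threshold=0.25):
    """Assign documents (rows of the similarity matrix) to clusters in one pass."""
    clusters = []  # each cluster is a list of document indices
    for d in range(len(sim)):
        best, best_avg = None, -1.0
        for c, members in enumerate(clusters):
            # average similarity of document d to the cluster's members
            avg = float(np.mean([sim[d][m] for m in members]))
            if avg > best_avg:
                best, best_avg = c, avg
        if best is not None and best_avg >= threshold:
            clusters[best].append(d)   # join the closest (most similar) cluster
        else:
            clusters.append([d])       # otherwise start a new cluster
    return clusters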
K-Nearest Neighbor Clustering. For each new document, the algorithm computes its similarity to every other
document, and chooses the top k documents. The new document is assigned to the cluster where the majority of the
top k documents are assigned.
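A minimal sketch of this assignment rule follows, assuming previously seen documents already carry cluster labels; the function name, the choice of k, and the tie handling (Counter.most_common) are illustrative assumptions.

from collections import Counter

def knn_assign(d, sim, assignment, clusters, k=5):
    """Assign document d to the cluster holding the majority of its k most
    similar neighbours. `assignment` maps seen document indices to cluster ids."""
    seen = list(assignment)
    if not seen:
        clusters.append([d])           # first document starts its own cluster
        assignment[d] = len(clusters) - 1
        return assignment[d]
    # the k previously seen documents most similar to d
    neighbours = sorted(seen, key=lambda m: sim[d][m], reverse=True)[:k]
    cid = Counter(assignment[m] for m in neighbours).most_common(1)[0][0]
    clusters[cid].append(d)
    assignment[d] = cid
    return cid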
Other algorithms, which are not compared here because they rely on the original document vectors rather than pair-wise document similarities, include Suffix Tree Clustering (STC) [8] and DC-tree Clustering [7].
3. Similarity histogram-based incremental clustering
The clustering approach proposed here is an incremental, dynamic method of building clusters. We adopt an overlapped cluster model. The key concept of the similarity
histogram-based clustering method (SHC) is to keep each
cluster at a high degree of coherency at any time. We represent the coherency of a cluster with a new concept called
Cluster Similarity Histogram.
Cluster Similarity Histogram: a concise statistical representation of the distribution of pair-wise document similarities within the cluster. The histogram's bins correspond to fixed similarity value intervals, and each bin contains the count of pair-wise document similarities falling in the corresponding interval.
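For illustration, a cluster similarity histogram can be sketched in Python as follows; the bin count and the similarity values are illustrative assumptions, and NumPy is used only for convenience.

import numpy as np

# Pairwise similarities among the documents of one cluster (illustrative values).
pairwise_sims = [0.82, 0.75, 0.91, 0.40, 0.66, 0.78]
B = 10  # number of fixed-width bins over the similarity range [0, 1]

hist, edges = np.histogram(pairwise_sims, bins=B, range=(0.0, 1.0))
for count, lo, hi in zip(hist, edges[:-1], edges[1:]):
    print(f"[{lo:.1f}, {hi:.1f}): {count}")  # count of similarities per interval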
3.1. Incremental creation of coherent clusters
Coherent clusters are built by maintaining high pair-wise similarity within each cluster. In terms of the similarity histogram, this means keeping the distribution of similarities skewed to the right of the histogram as much as possible. Each new document that needs to be assigned to a cluster is compared against that cluster's histogram: if adding it would degrade the distribution significantly, it is not added; otherwise it is added. A much stricter strategy would be to add only documents that enhance the similarity distribution. However, this could create a problem with perfect clusters, due to their tendency to reject any document unless it keeps the cluster perfect.
We judge the quality of a similarity histogram (cluster cohesiveness) by calculating the ratio of the count of similarities above a certain similarity threshold S_T to the total count of similarities. The higher this ratio, the more cohesive the cluster. Let n_c be the number of documents in a cluster; the number of pair-wise similarities in the cluster is then m_c = n_c (n_c - 1)/2. Let S = {s_i : i = 1, ..., m_c} be the set of similarities in the cluster. The histogram of the similarities in the cluster is represented as:
\[
H = \{ h_i : i = 1, \ldots, B \} \tag{1a}
\]
\[
h_i = \operatorname{count}(s_k), \quad s_{l_i} \le s_k < s_{u_i} \tag{1b}
\]
where
B: the number of histogram bins,
h_i: the count of similarities in bin i,
s_{l_i}: the lower similarity bound of bin i, and
s_{u_i}: the upper similarity bound of bin i.
The histogram ratio (HR) of a cluster is the measure of
cohesiveness of the cluster as described above, and is calculated as:
\[
HR_c = \frac{\sum_{i=T}^{B} h_i}{\sum_{j=1}^{B} h_j} \tag{2a}
\]
\[
T = \lfloor S_T \cdot B \rfloor \tag{2b}
\]
where
HR_c: the histogram ratio of cluster c,
S_T: the similarity threshold, and
T: the bin number corresponding to the similarity threshold.
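Continuing the sketch above, the histogram ratio of Eq. (2) can be computed as follows; the threshold S_T = 0.5 and the similarity values are illustrative assumptions.

import numpy as np

pairwise_sims = [0.82, 0.75, 0.91, 0.40, 0.66, 0.78]  # similarities within one cluster
B, S_T = 10, 0.5
T = int(S_T * B)  # T = floor(S_T * B), the bin index of the similarity threshold

hist, _ = np.histogram(pairwise_sims, bins=B, range=(0.0, 1.0))
hr = hist[T:].sum() / hist.sum()  # fraction of similarities at or above the threshold
print(hr)  # 5 of the 6 similarities are >= 0.5, so HR = 0.8333...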
Basically we would like to keep the histogram ratio of each cluster high. However, since we allow documents that degrade the histogram ratio to be added, this could result in a chain effect that eventually degrades the ratio toward zero. To prevent this, we set a minimum histogram ratio HR_min that clusters must maintain. We also do not allow adding a single document that would bring down the histogram ratio significantly (even if it stays above HR_min).
We now present the incremental clustering algorithm
based on the above framework (algorithm SHC). The algorithm works incrementally by receiving a new document,
and for each cluster calculates the cluster histogram before
and after simulating the addition of the document (lines 4-6). The old and new histogram ratios are compared, and if
the new ratio is greater than or equal to the old one, the
document is added to the cluster. If the new ratio is less
than the old one by no more than ε and still above HR_min,
it is added (lines 7-9). Otherwise it is not added. If after
checking all clusters the document was not assigned to any
cluster, a new cluster is created and the document is added
to it (lines 11-15).
Algorithm SHC
 1: L ← Empty List {Cluster List}
 2: for each document D do
 3:   for each cluster C in L do
 4:     HR_old = HR_C
 5:     Simulate adding D to C
 6:     HR_new = HR_C
 7:     if (HR_new ≥ HR_old) OR ((HR_new > HR_min) AND (HR_old − HR_new < ε)) then
 8:       Add D to C
 9:     end if
10:   end for
11:   if D was not added to any cluster then
12:     Create a new cluster C
13:     Add D to C
14:     Add C to L
15:   end if
16: end for
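For illustration, a compact Python sketch of Algorithm SHC over a precomputed pairwise similarity matrix is given below. The parameter values (B, S_T, HR_min, ε) are illustrative assumptions, and treating singleton clusters as perfectly coherent is an implementation choice not specified in the pseudocode.

import numpy as np

B = 20            # number of histogram bins
S_T = 0.3         # similarity threshold
T = int(S_T * B)  # bin index corresponding to the threshold
HR_MIN = 0.1      # minimum histogram ratio a cluster must maintain
EPSILON = 0.05    # maximum allowed drop in histogram ratio per addition

def histogram_ratio(members, sim):
    """Histogram ratio of a cluster: fraction of its pairwise similarities
    falling in bins at or above the threshold bin T."""
    sims = [sim[a][b] for i, a in enumerate(members) for b in members[i + 1:]]
    if not sims:
        return 1.0  # a singleton cluster is treated as perfectly coherent (assumption)
    hist, _ = np.histogram(sims, bins=B, range=(0.0, 1.0))
    return hist[T:].sum() / hist.sum()

def shc(sim):
    clusters = []  # list of clusters, each a list of document indices (overlap allowed)
    for d in range(len(sim)):
        added = False
        for members in clusters:
            hr_old = histogram_ratio(members, sim)
            hr_new = histogram_ratio(members + [d], sim)  # simulate adding d
            if hr_new >= hr_old or (hr_new > HR_MIN and hr_old - hr_new < EPSILON):
                members.append(d)
                added = True
        if not added:
            clusters.append([d])  # no cluster accepted d: create a new one
    return clusters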
In comparison with the criteria of single-pass clustering and k-NN clustering, the similarity histogram ratio as a coherency measure provides a more representative measure of the tightness of the documents in the cluster, and of how an external document would affect that tightness. Single-pass, on the other hand, compares the external document only to the average of the similarities in the cluster, while the k-NN method takes into consideration only a few similarities that might be outliers, which is why the parameter k sometimes needs to be increased to obtain better results from k-NN.
By definition, the time complexity of the similarity histogram-based clustering algorithm is O(n²), since for each new document we must compute its similarity to all previously seen documents. This is a property of all algorithms that work based on a document similarity matrix.
However, the similarity histogram representation gives us
an advantage with typical document corpora. Typically, a
document similarity vector (containing its similarity to every other document) will be sufficiently sparse. This is
usually because quite a large percentage of documents do
not share any terms (especially documents from different
classes), so their similarity is zero. This saves a lot of computation and makes the algorithm sub-quadratic in typical
situations.
The space requirements of the algorithm depend on whether similarities are pre-computed or each new document's similarities to previous documents are computed as it arrives. In the former case, the space requirement is O(n²), while in the latter it is O(n), since we only need to store information about the current document's similarity vector.
3.2. Dealing with insertion order problems
Our approach to the insertion order problem is a document re-assignment strategy. The strategy does not completely eliminate the problem (the process remains non-deterministic: different insertion orders can still result in different partitionings of the documents), but it helps reduce its effect.
Documents are considered “bad” for a cluster if they are
contributing negatively to the HR of that cluster; i.e. if they
are removed from the cluster, the HR will increase. We
keep with each document a value indicating the HR value if
the document was not in the cluster. If this value is greater
than the current HR, then the document is a candidate for
leaving the cluster. Upon adding a new document to any
cluster, we consult the documents that are candidates for
leaving the cluster. If any such document can be added to
other clusters, we move it to those clusters, thus benefiting
both clusters.
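A minimal sketch of the leaving-candidate check is given below; it recomputes the histogram ratio without each document rather than caching it, and the bin parameters are illustrative assumptions.

import numpy as np

B, T = 20, 6  # histogram bins and threshold bin index (illustrative)

def histogram_ratio(members, sim):
    sims = [sim[a][b] for i, a in enumerate(members) for b in members[i + 1:]]
    if not sims:
        return 1.0
    hist, _ = np.histogram(sims, bins=B, range=(0.0, 1.0))
    return hist[T:].sum() / hist.sum()

def leaving_candidates(members, sim):
    """Documents whose removal would increase the cluster's histogram ratio."""
    hr_current = histogram_ratio(members, sim)
    candidates = []
    for d in members:
        rest = [m for m in members if m != d]
        if len(rest) >= 2 and histogram_ratio(rest, sim) > hr_current:
            candidates.append(d)  # removing d would raise the cluster's HR
    return candidates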
4. Experimental results
4.1. Experimental Setup
The availability of web document data sets suitable for
clustering is limited. However, we used two web document data sets. Table 1 describes the data sets. The first
data set (DS1) is a collection of 314 web documents manually collected and labelled from various web sites, including the University of Waterloo and some Canadian web sites¹, and was used in [3]. The second data set (DS2) is a collection of 2340 Reuters news articles posted on Yahoo! news, and was used by Boley et al. in [1].
¹ The document collection can be downloaded from: http://pami.uwaterloo.ca/~hammouda/webdata/
Table 1. Data Sets Descriptions

Data Set   Name          Type   Docs   Categories   Avg. words/doc
DS1        UW-CAN        HTML   314    10           469
DS2        Yahoo! news   HTML   2340   20           289
The similarity histogram-based incremental clustering algorithm was compared against HAC, Single-Pass, and k-NN. The document similarity matrix was provided as input to each of the algorithms. We used the cosine correlation similarity measure [5], with TF-IDF (Term Frequency–Inverse Document Frequency) term weights.
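For illustration, the document similarity matrix can be built as follows; scikit-learn is used here as an assumption (the paper does not name an implementation), and the toy documents are illustrative.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["web document clustering",
        "incremental clustering of web pages",
        "vector space model for automatic indexing"]

tfidf = TfidfVectorizer().fit_transform(docs)  # rows are TF-IDF document vectors
sim = cosine_similarity(tfidf)                 # n x n pairwise cosine similarity matrix
print(sim.round(2))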
4.2. Evaluation Measures
We adopted two quality measures widely used in the text
mining literature for the purpose of document clustering [6].
The first is the F-measure, which combines the Precision
and Recall ideas. The precision (P ) and recall (R) of a
cluster j with respect to a class i are defined as:
\[
P(i, j) = \frac{N_{ij}}{N_j}, \qquad R(i, j) = \frac{N_{ij}}{N_i} \tag{3}
\]
where
N_{ij}: the number of members of class i in cluster j,
N_j: the number of members of cluster j, and
N_i: the number of members of class i.
The F-measure of class i is F(i) = 2PR/(P + R). The overall F-measure for the clustering result C is the weighted average of the F-measure for each class i:
\[
F_C = \frac{\sum_i |i| \times F(i)}{\sum_i |i|} \tag{4}
\]
where |i| is the number of objects in class i. The higher the overall F-measure, the better the clustering, since it indicates a more accurate mapping of clusters to the original classes.
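A short sketch of Eqs. (3)-(4) follows; taking F(i) as the best F-measure over all clusters for class i is a common convention (e.g. in [6]) and is an assumption here, as the equations above leave the choice of cluster implicit.

from collections import Counter

def overall_f_measure(classes, clusters):
    """classes[d] and clusters[d] are the class and cluster labels of document d."""
    class_sizes = Counter(classes)
    cluster_sizes = Counter(clusters)
    total = 0.0
    for i, n_i in class_sizes.items():
        best_f = 0.0
        for j, n_j in cluster_sizes.items():
            n_ij = sum(1 for c, k in zip(classes, clusters) if c == i and k == j)
            if n_ij == 0:
                continue
            p, r = n_ij / n_j, n_ij / n_i          # precision and recall, Eq. (3)
            best_f = max(best_f, 2 * p * r / (p + r))
        total += n_i * best_f                      # weight F(i) by the class size |i|
    return total / len(classes)                    # Eq. (4)

print(overall_f_measure(["a", "a", "b", "b"], [1, 1, 1, 2]))  # approx. 0.733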
The second measure is the Entropy, which provides a measure of the homogeneity of un-nested clusters, or of the clusters at one level of a hierarchical clustering. The higher the homogeneity of a cluster, the lower its entropy, and vice versa.
For every cluster j in the clustering result C we compute p_{ij}, the probability that a member of cluster j belongs to class i. The entropy of each cluster j is then calculated using the standard formula
\[
E_j = -\sum_i p_{ij} \log(p_{ij}),
\]
where the sum is taken over all classes.
clusters is calculated as the sum of entropies for each cluster
weighted by the size of each cluster:
\[
E_C = \sum_{j=1}^{m} \frac{N_j}{N} \times E_j \tag{5}
\]
where N_j is the size of cluster j, and N is the total number of data objects.
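A short sketch of Eq. (5) follows; the natural logarithm is used here as an assumption, since the paper does not specify the log base.

import math
from collections import Counter

def total_entropy(classes, clusters):
    """Cluster-size-weighted entropy of the clustering (lower is better)."""
    n = len(classes)
    e_total = 0.0
    for j, n_j in Counter(clusters).items():
        members = [c for c, k in zip(classes, clusters) if k == j]
        # E_j = -sum_i p_ij log(p_ij) over the classes present in cluster j
        e_j = -sum((cnt / n_j) * math.log(cnt / n_j)
                   for cnt in Counter(members).values())
        e_total += (n_j / n) * e_j  # weight by cluster size, Eq. (5)
    return e_total

print(total_entropy(["a", "a", "b", "b"], [1, 1, 1, 2]))  # 0 would mean perfectly pure clusters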
Table 2. SHC Improvement

                  DS1                     DS2
              F-measure   Entropy     F-measure   Entropy
SHC             0.931      0.119        0.682      0.156
HAC (a)         0.901      0.211        0.584      0.281
Single-Pass (b) 0.641      0.313        0.502      0.250
k-NN (c)        0.727      0.173        0.522      0.161

(a) Complete Linkage was used as the cluster distance measure.
(b) A document-to-cluster similarity threshold of 0.25 was used.
(c) A k of 5 and a cluster similarity threshold of 0.25 were used.
Basically we would like to maximize the F-measure and
minimize the Entropy of clusters to achieve high quality
clustering.
4.3. Results and Discussion
Table 2 shows the results of SHC against HAC, Single-Pass, and k-NN clustering. For the first data set, the improvement was very significant, reaching over 20% improvement over k-NN (in terms of F-measure), 3% improvement over HAC, and 29% improvement over Single-Pass. For the second data set, an improvement of between 10% and 18% over the other methods was achieved. However, the absolute F-measure was considerably lower than for the first data set. The parameters chosen for the different algorithms were the ones that produced the best results.
After investigating the second data set, we found that it exhibits very low pair-wise document similarity within each class. This had the effect of creating a high number of clusters, hence the lower F-measure, but still good entropy (since entropy favors small clusters).
Figure 1 presents the above-mentioned results graphically, illustrating the improvement achieved over the other methods. The re-assignment strategy showed only a slight
improvement. This could be attributed to the distance between the original clusters; if the clusters are sufficiently
distant from each other, then the probability of a document
getting assigned to a cluster it does not belong to in the first
place is very low.
The time performance of the different clustering algorithms is illustrated in Figure 2 for both data sets. The performance of SHC is comparable to Single-Pass and k-NN, while being much better than HAC. HAC spends considerable time recalculating the similarities between the newly merged cluster and all other clusters at every iteration. In contrast, SHC, Single-Pass, and k-NN share the same general strategy for processing documents, without having to recalculate similarities at each step.
[Figure 1. Quality of Clustering Comparison: (a) F-measure and (b) Entropy on DS1 and DS2 for SHC (with and without re-assignment), HAC, Single-Pass, and k-NN.]
[Figure 2. Clustering Performance: clustering time (seconds) versus number of documents on (a) DS1 and (b) DS2 for SHC, Single-Pass, k-NN, and HAC.]
5. Conclusion
We presented an incremental document clustering algorithm based on maintaining highly coherent clusters at
all times. The similarity histogram representation of the
pair-wise document similarity distribution inside clusters
showed an improvement in judging the similarity between a
new document and each cluster. The method shows good clustering quality and time performance compared to standard document clustering techniques, as judged by two evaluation measures. We believe that this approach works well because the similarity histogram representation captures the essential statistical distribution of similarities in each cluster, and by guiding the algorithm in the direction of enhancing this distribution we are able to maintain high-quality clusters.
References
[1] D. Boley, M. Gini, R. Gross, S. Han, K. Hastings, G. Karypis, V. Kumar, B. Mobasher, and J. Moore. Partitioning-based clustering for web document categorization. Decision Support Systems, 27:329–341, 1999.
[2] M. Charikar, C. Chekuri, T. Feder, and R. Motwani. Incremental clustering and dynamic information retrieval. In Proc. of the 29th Annual ACM Symposium on Theory of Computing, pages 626–635, 1997.
[3] K. Hammouda and M. Kamel. Phrase-based document similarity based on an index graph model. In 2002 IEEE Int. Conf. on Data Mining (ICDM'02), pages 203–210, Japan, 2002.
[4] A. Jain and R. Dubes. Algorithms for Clustering Data. Prentice Hall, Englewood Cliffs, N.J., 1988.
[5] G. Salton, A. Wong, and C. Yang. A vector space model for automatic indexing. Communications of the ACM, 18(11):613–620, November 1975.
[6] M. Steinbach, G. Karypis, and V. Kumar. A comparison of document clustering techniques. In KDD-2000 Workshop on Text Mining, August 2000.
[7] W. Wong and A. Fu. Incremental document clustering for web page classification. In 2000 Int. Conf. on Information Society (IS2000), Japan, 2000.
[8] O. Zamir and O. Etzioni. Web document clustering: A feasibility demonstration. In Proc. of the 21st Annual Int. ACM SIGIR Conf., pages 46–54, Melbourne, Australia, 1998.