Incremental Document Clustering Using Cluster Similarity Histograms
Khaled M. Hammouda
Mohamed S. Kamel
Department of Systems Design Engineering
University of Waterloo
Waterloo, Ontario, Canada N2L 3G1
E-mail: {hammouda,mkamel}@pami.uwaterloo.ca
Abstract
Clustering of large collections of text documents is a key
process in providing a higher level of knowledge about the
underlying inherent classification of the documents. Web
documents, in particular, are of great interest since managing, accessing, searching, and browsing large repositories of web content requires efficient organization. Incremental clustering algorithms are often preferred to traditional clustering techniques, since they can be applied in
a dynamic environment such as the Web. An incremental
document clustering algorithm is introduced in this paper,
which relies only on pair-wise document similarity information. Clusters are represented using a Cluster Similarity
Histogram, a concise statistical representation of the distribution of similarities within each cluster, which provides
a measure of cohesiveness. The measure guides the incremental clustering process. Complexity analysis and experimental results are discussed and show that the algorithm requires less computational time than standard methods while
achieving a comparable or better clustering quality.
1. Introduction
In dynamic information environments, such as the World
Wide Web, it is usually desirable to apply adaptive methods for document organization, such as clustering. Incremental clustering methods are of particular interest for clustering web content because they can cope with its scale and constant change. Applications of document clustering include: clustering of retrieved documents to present organized and understandable results to the user, clustering documents in a collection (e.g. digital libraries), automated (or semi-automated) creation of document taxonomies (e.g. Yahoo and Open Directory styles), and efficient information retrieval by focusing on relevant subsets (clusters) rather than the whole collection.
What distinguishes incremental clustering from traditional clustering methods [4] is the ability to process new data as they are added to the collection. This allows dynamic tracking of the ever-growing volume of information placed on the web every day, without having to perform complete re-clustering. Various approaches have been proposed, including Suffix Tree Clustering (STC) [8], DC-tree clustering [7], incremental hierarchical clustering [2], and others.
In this paper we present an incremental clustering algorithm based on maintaining high cluster cohesiveness,
represented as a Cluster Similarity Histogram, which is a
concise statistical representation of the pair-wise document
similarities within each cluster. Clusters are required to
maintain high cohesiveness, expressed in terms of the histogram, while new documents are being added. The process
allows for document reassignment to clusters which were
created after the document was introduced.
The algorithm is compared with other standard clustering techniques, namely Hierarchical Agglomerative Clustering (HAC), Single-Pass Clustering, and k-Nearest Neighbor Clustering. These clustering techniques were chosen
because they require pair-wise document similarity information only. Other classes of clustering techniques that rely
on having information about the original document vectors
(such as k-means) were not considered for comparison since
such comparison would not be accurate due to differences
in the input to each technique. Each clustering algorithm
accepts as its input a document similarity matrix without
having to rely on the original feature vectors.
2. Document clustering
In this section we present a brief overview of incremental clustering algorithms. We also briefly discuss one
non-incremental clustering algorithm to provide a balanced
view. A comparison of some document clustering techniques can be found in [6].
Non-incremental clustering methods mainly rely on having the whole document set ready before the algorithm is applied. This is typical of offline processing scenarios. One of the most widely used non-incremental clustering algorithms is Hierarchical Agglomerative Clustering (HAC) [4]. It is a straightforward greedy algorithm that produces a hierarchical grouping of the data. It starts with each instance in its own cluster, then repeatedly merges the two most similar clusters at each iteration. There are different approaches for computing the similarity between two clusters. The complexity of HAC is O(n²), which can become infeasible for very large document sets.
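For illustration only, the following is a minimal Python sketch of complete-linkage HAC applied to a precomputed similarity matrix via SciPy; the toy matrix and the dendrogram cut threshold are illustrative assumptions, not values from the paper.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Toy pairwise similarity matrix for three documents (illustrative values).
sim = np.array([[1.0, 0.8, 0.1],
                [0.8, 1.0, 0.2],
                [0.1, 0.2, 1.0]])
dist = 1.0 - sim                 # convert similarities to distances
np.fill_diagonal(dist, 0.0)      # squareform expects a zero diagonal

Z = linkage(squareform(dist), method='complete')   # complete-linkage HAC
labels = fcluster(Z, t=0.5, criterion='distance')  # cut the dendrogram at distance 0.5
print(labels)  # e.g. [1 1 2]: the two highly similar documents are merged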
Incremental clustering algorithms work by assigning objects to their respective clusters as they arrive. Problems
faced by such algorithms include how to find the appropriate cluster to assign the next object to, how to deal with insertion order problems, and how to reassign objects to other clusters (ones that were not present when the object was first introduced). We very briefly review four incremental clustering algorithms.
Single-Pass Clustering. This algorithm processes documents sequentially and compares each document to all existing clusters. If the similarity between the document and the closest cluster is above a certain threshold, the document is added to that cluster; otherwise it forms its own cluster. The similarity between a document and a cluster is usually computed as the average similarity of the document to all documents in that cluster.
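As a rough sketch (not the authors' implementation), single-pass clustering over a pairwise similarity matrix might look as follows in Python; the threshold value and the use of average document-to-cluster similarity follow the description above, but the details are assumptions.

import numpy as np

def single_pass(sim, threshold=0.25):
    """Assign documents (rows of the similarity matrix) to clusters in one pass."""
    clusters = []  # each cluster is a list of document indices
    for d in range(len(sim)):
        best, best_avg = None, -1.0
        for c, members in enumerate(clusters):
            # average similarity of document d to the cluster's members
            avg = float(np.mean([sim[d][m] for m in members]))
            if avg > best_avg:
                best, best_avg = c, avg
        if best is not None and best_avg >= threshold:
            clusters[best].append(d)   # join the closest (most similar) cluster
        else:
            clusters.append([d])       # otherwise start a new cluster
    return clusters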
K-Nearest Neighbor Clustering. For each new document, the algorithm computes its similarity to every other
document, and chooses the top k documents. The new document is assigned to the cluster where the majority of the
top k documents are assigned.
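A minimal sketch of this assignment rule follows, assuming previously seen documents already carry cluster labels; the function name, the choice of k, and the tie handling (Counter.most_common) are illustrative assumptions.

from collections import Counter

def knn_assign(d, sim, assignment, clusters, k=5):
    """Assign document d to the cluster holding the majority of its k most
    similar neighbours. `assignment` maps seen document indices to cluster ids."""
    seen = list(assignment)
    if not seen:
        clusters.append([d])           # first document starts its own cluster
        assignment[d] = len(clusters) - 1
        return assignment[d]
    # the k previously seen documents most similar to d
    neighbours = sorted(seen, key=lambda m: sim[d][m], reverse=True)[:k]
    cid = Counter(assignment[m] for m in neighbours).most_common(1)[0][0]
    clusters[cid].append(d)
    assignment[d] = cid
    return cid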
Other algorithms, which are not compared here because they rely on the original document vectors rather than pair-wise document similarities, include Suffix Tree Clustering (STC) [8] and DC-tree Clustering [7].
3. Similarity histogram-based incremental clustering
The clustering approach proposed here is an incremental, dynamic method of building clusters. We adopt an overlapped cluster model. The key concept of the similarity
histogram-based clustering method (SHC) is to keep each
cluster at a high degree of coherency at any time. We represent the coherency of a cluster with a new concept called
Cluster Similarity Histogram.
Cluster Similarity Histogram: a concise statistical representation of the distribution of pair-wise document similarities within the cluster. The histogram's bins correspond to fixed similarity value intervals, and each bin contains the count of pair-wise document similarities falling in the corresponding interval.
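For illustration, a cluster similarity histogram can be sketched in Python as follows; the bin count and the similarity values are illustrative assumptions, and NumPy is used only for convenience.

import numpy as np

# Pairwise similarities among the documents of one cluster (illustrative values).
pairwise_sims = [0.82, 0.75, 0.91, 0.40, 0.66, 0.78]
B = 10  # number of fixed-width bins over the similarity range [0, 1]

hist, edges = np.histogram(pairwise_sims, bins=B, range=(0.0, 1.0))
for count, lo, hi in zip(hist, edges[:-1], edges[1:]):
    print(f"[{lo:.1f}, {hi:.1f}): {count}")  # count of similarities per interval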
3.1. Incremental creation of coherent clusters
Coherent clusters are built by maintaining high pair-wise similarity within each cluster. In terms of the similarity histogram, this means keeping the distribution of similarities skewed to the right of the histogram as much as possible. Each new document that needs to be assigned to a cluster is compared against that cluster's histogram: if adding it would degrade the distribution significantly, it is not added; otherwise it is added. A much stricter strategy would be to add only documents that enhance the similarity distribution. However, this could create a problem with perfect clusters, due to their tendency to reject any document unless it keeps the cluster perfect.
We judge the quality of a similarity histogram (cluster cohesiveness) by calculating the ratio of the count of similarities above a certain similarity threshold S_T to the total count of similarities. The higher this ratio, the more cohesive the cluster. Let n_c be the number of documents in a cluster; the number of pair-wise similarities in the cluster is then m_c = n_c (n_c - 1)/2. Let S = {s_i : i = 1, ..., m_c} be the set of similarities in the cluster. The histogram of the similarities in the cluster is represented as:
\[
H = \{ h_i : i = 1, \ldots, B \} \tag{1a}
\]
\[
h_i = \operatorname{count}(s_k), \quad s_{l_i} \le s_k < s_{u_i} \tag{1b}
\]
where
B: the number of histogram bins,
h_i: the count of similarities in bin i,
s_{l_i}: the lower similarity bound of bin i, and
s_{u_i}: the upper similarity bound of bin i.
The histogram ratio (HR) of a cluster is the measure of
cohesiveness of the cluster as described above, and is calculated as:
\[
HR_c = \frac{\sum_{i=T}^{B} h_i}{\sum_{j=1}^{B} h_j} \tag{2a}
\]
\[
T = \lfloor S_T \cdot B \rfloor \tag{2b}
\]
where
HR_c: the histogram ratio of cluster c,
S_T: the similarity threshold, and
T: the bin number corresponding to the similarity threshold.
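Continuing the sketch above, the histogram ratio of Eq. (2) can be computed as follows; the threshold S_T = 0.5 and the similarity values are illustrative assumptions.

import numpy as np

pairwise_sims = [0.82, 0.75, 0.91, 0.40, 0.66, 0.78]  # similarities within one cluster
B, S_T = 10, 0.5
T = int(S_T * B)  # T = floor(S_T * B), the bin index of the similarity threshold

hist, _ = np.histogram(pairwise_sims, bins=B, range=(0.0, 1.0))
hr = hist[T:].sum() / hist.sum()  # fraction of similarities at or above the threshold
print(hr)  # 5 of the 6 similarities are >= 0.5, so HR = 0.8333...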
Basically we would like to keep the histogram ratio of each cluster high. However, since we allow documents that degrade the histogram ratio to be added, this could result in a chain effect that eventually degrades the ratio toward zero. To prevent this, we set a minimum histogram ratio HR_min that clusters must maintain. We also do not allow adding a single document that would bring down the histogram ratio significantly (even if it stays above HR_min).
We now present the incremental clustering algorithm
based on the above framework (algorithm SHC). The algorithm works incrementally by receiving a new document,
and for each cluster calculates the cluster histogram before
and after simulating the addition of the document (lines 4-6). The old and new histogram ratios are compared, and if
the new ratio is greater than or equal to the old one, the
document is added to the cluster. If the new ratio is less
than the old one by no more than ε and still above HR_min,
it is added (lines 7-9). Otherwise it is not added. If after
checking all clusters the document was not assigned to any
cluster, a new cluster is created and the document is added
to it (lines 11-15).
Algorithm SHC
 1: L ← Empty List {Cluster List}
 2: for each document D do
 3:   for each cluster C in L do
 4:     HR_old = HR_C
 5:     Simulate adding D to C
 6:     HR_new = HR_C
 7:     if (HR_new ≥ HR_old) OR ((HR_new > HR_min) AND (HR_old − HR_new < ε)) then
 8:       Add D to C
 9:     end if
10:   end for
11:   if D was not added to any cluster then
12:     Create a new cluster C
13:     Add D to C
14:     Add C to L
15:   end if
16: end for
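For illustration, a compact Python sketch of Algorithm SHC over a precomputed pairwise similarity matrix is given below. The parameter values (B, S_T, HR_min, ε) are illustrative assumptions, and treating singleton clusters as perfectly coherent is an implementation choice not specified in the pseudocode.

import numpy as np

B = 20            # number of histogram bins
S_T = 0.3         # similarity threshold
T = int(S_T * B)  # bin index corresponding to the threshold
HR_MIN = 0.1      # minimum histogram ratio a cluster must maintain
EPSILON = 0.05    # maximum allowed drop in histogram ratio per addition

def histogram_ratio(members, sim):
    """Histogram ratio of a cluster: fraction of its pairwise similarities
    falling in bins at or above the threshold bin T."""
    sims = [sim[a][b] for i, a in enumerate(members) for b in members[i + 1:]]
    if not sims:
        return 1.0  # a singleton cluster is treated as perfectly coherent (assumption)
    hist, _ = np.histogram(sims, bins=B, range=(0.0, 1.0))
    return hist[T:].sum() / hist.sum()

def shc(sim):
    clusters = []  # list of clusters, each a list of document indices (overlap allowed)
    for d in range(len(sim)):
        added = False
        for members in clusters:
            hr_old = histogram_ratio(members, sim)
            hr_new = histogram_ratio(members + [d], sim)  # simulate adding d
            if hr_new >= hr_old or (hr_new > HR_MIN and hr_old - hr_new < EPSILON):
                members.append(d)
                added = True
        if not added:
            clusters.append([d])  # no cluster accepted d: create a new one
    return clusters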
In comparison with the criteria of single-pass clustering and k-NN clustering, the similarity histogram ratio as a coherency measure provides a more representative measure of the tightness of the documents in the cluster, and of how an external document would affect that tightness. Single-pass, on the other hand, compares the external document only to the average of the similarities in the cluster, while the k-NN method takes into consideration only a few similarities that might be outliers, which is why the parameter k sometimes needs to be increased to obtain better results from k-NN.
By definition, the time complexity of the similarity histogram-based clustering algorithm is O(n²), since for each new document we must compute its similarity to all previously seen documents. This is a property of all algorithms that work based on a document similarity matrix.
However, the similarity histogram representation gives us
an advantage with typical document corpora. Typically, a
document similarity vector (containing its similarity to every other document) will be sufficiently sparse. This is
usually because quite a large percentage of documents do
not share any terms (especially documents from different
classes), so their similarity is zero. This saves a lot of computation and makes the algorithm sub-quadratic in typical
situations.
The space requirements of the algorithm depend on whether similarities are pre-computed or each new document's similarities to previous documents are computed as it arrives. In the former case, the space requirement is O(n²), while in the latter it is O(n), since we only need to store information about the current document's similarity vector.
3.2. Dealing with insertion order problems
Our approach to the insertion order problem is a document re-assignment strategy. The strategy does not completely eliminate the problem (the process remains non-deterministic: different insertion orders can still result in different partitionings of the documents), but it helps reduce its effect.
Documents are considered “bad” for a cluster if they are
contributing negatively to the HR of that cluster; i.e. if they
are removed from the cluster, the HR will increase. We
keep with each document a value indicating the HR value if
the document was not in the cluster. If this value is greater
than the current HR, then the document is a candidate for
leaving the cluster. Upon adding a new document to any
cluster, we consult the documents that are candidates for
leaving the cluster. If any such document can be added to
other clusters, we move it to those clusters, thus benefiting
both clusters.
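A minimal sketch of the leaving-candidate check is given below; it recomputes the histogram ratio without each document rather than caching it, and the bin parameters are illustrative assumptions.

import numpy as np

B, T = 20, 6  # histogram bins and threshold bin index (illustrative)

def histogram_ratio(members, sim):
    sims = [sim[a][b] for i, a in enumerate(members) for b in members[i + 1:]]
    if not sims:
        return 1.0
    hist, _ = np.histogram(sims, bins=B, range=(0.0, 1.0))
    return hist[T:].sum() / hist.sum()

def leaving_candidates(members, sim):
    """Documents whose removal would increase the cluster's histogram ratio."""
    hr_current = histogram_ratio(members, sim)
    candidates = []
    for d in members:
        rest = [m for m in members if m != d]
        if len(rest) >= 2 and histogram_ratio(rest, sim) > hr_current:
            candidates.append(d)  # removing d would raise the cluster's HR
    return candidates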
4. Experimental results
4.1. Experimental Setup
The availability of web document data sets suitable for
clustering is limited. However, we used two web document data sets. Table 1 describes the data sets. The first
data set (DS1) is a collection of 314 web documents manually collected and labelled from various web sites, including the University of Waterloo and some Canadian web sites¹, and was used in [3]. The second data set (DS2) is a collection of 2340 Reuters news articles posted on Yahoo! news, and was used by Boley et al. in [1].
¹ The document collection can be downloaded from: http://pami.uwaterloo.ca/~hammouda/webdata/
Table 1. Data Sets Descriptions

Data Set   Name          Type   Docs   Categories   Avg. words/doc
DS1        UW-CAN        HTML   314    10           469
DS2        Yahoo! news   HTML   2340   20           289
The similarity histogram-based incremental clustering algorithm was compared against HAC, Single-Pass, and k-NN. The document similarity matrix was provided as input to each of the algorithms. We used the cosine correlation similarity measure [5], with TF-IDF (Term Frequency–Inverse Document Frequency) term weights.
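For illustration, the document similarity matrix can be built as follows; scikit-learn is used here as an assumption (the paper does not name an implementation), and the toy documents are illustrative.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["web document clustering",
        "incremental clustering of web pages",
        "vector space model for automatic indexing"]

tfidf = TfidfVectorizer().fit_transform(docs)  # rows are TF-IDF document vectors
sim = cosine_similarity(tfidf)                 # n x n pairwise cosine similarity matrix
print(sim.round(2))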
4.2. Evaluation Measures
We adopted two quality measures widely used in the text
mining literature for the purpose of document clustering [6].
The first is the F-measure, which combines the Precision
and Recall ideas. The precision (P ) and recall (R) of a
cluster j with respect to a class i are defined as:
\[
P(i, j) = \frac{N_{ij}}{N_j}, \qquad R(i, j) = \frac{N_{ij}}{N_i} \tag{3}
\]
where
N_{ij}: the number of members of class i in cluster j,
N_j: the number of members of cluster j, and
N_i: the number of members of class i.
The F-measure of class i is F(i) = 2PR/(P + R). The overall F-measure for the clustering result C is the weighted average of the F-measure for each class i:
\[
F_C = \frac{\sum_i |i| \times F(i)}{\sum_i |i|} \tag{4}
\]
where |i| is the number of objects in class i. The higher the overall F-measure, the better the clustering, since it indicates a more accurate mapping of clusters to the original classes.
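A short sketch of Eqs. (3)-(4) follows; taking F(i) as the best F-measure over all clusters for class i is a common convention (e.g. in [6]) and is an assumption here, as the equations above leave the choice of cluster implicit.

from collections import Counter

def overall_f_measure(classes, clusters):
    """classes[d] and clusters[d] are the class and cluster labels of document d."""
    class_sizes = Counter(classes)
    cluster_sizes = Counter(clusters)
    total = 0.0
    for i, n_i in class_sizes.items():
        best_f = 0.0
        for j, n_j in cluster_sizes.items():
            n_ij = sum(1 for c, k in zip(classes, clusters) if c == i and k == j)
            if n_ij == 0:
                continue
            p, r = n_ij / n_j, n_ij / n_i          # precision and recall, Eq. (3)
            best_f = max(best_f, 2 * p * r / (p + r))
        total += n_i * best_f                      # weight F(i) by the class size |i|
    return total / len(classes)                    # Eq. (4)

print(overall_f_measure(["a", "a", "b", "b"], [1, 1, 1, 2]))  # approx. 0.733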
The second measure is the Entropy, which provides a measure of the homogeneity of un-nested clusters, or of the clusters at one level of a hierarchical clustering. The higher the homogeneity of a cluster, the lower its entropy, and vice versa.
For every cluster j in the clustering result C we compute p_{ij}, the probability that a member of cluster j belongs to class i. The entropy of each cluster j is then calculated using the standard formula
\[
E_j = -\sum_i p_{ij} \log(p_{ij}),
\]
where the sum is taken over all classes.
clusters is calculated as the sum of entropies for each cluster
weighted by the size of each cluster:
\[
E_C = \sum_{j=1}^{m} \frac{N_j}{N} \times E_j \tag{5}
\]
where N_j is the size of cluster j, and N is the total number of data objects.
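A short sketch of Eq. (5) follows; the natural logarithm is used here as an assumption, since the paper does not specify the log base.

import math
from collections import Counter

def total_entropy(classes, clusters):
    """Cluster-size-weighted entropy of the clustering (lower is better)."""
    n = len(classes)
    e_total = 0.0
    for j, n_j in Counter(clusters).items():
        members = [c for c, k in zip(classes, clusters) if k == j]
        # E_j = -sum_i p_ij log(p_ij) over the classes present in cluster j
        e_j = -sum((cnt / n_j) * math.log(cnt / n_j)
                   for cnt in Counter(members).values())
        e_total += (n_j / n) * e_j  # weight by cluster size, Eq. (5)
    return e_total

print(total_entropy(["a", "a", "b", "b"], [1, 1, 1, 2]))  # 0 would mean perfectly pure clusters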
Table 2. SHC Improvement

                  DS1                     DS2
              F-measure   Entropy     F-measure   Entropy
SHC             0.931      0.119        0.682      0.156
HAC (a)         0.901      0.211        0.584      0.281
Single-Pass (b) 0.641      0.313        0.502      0.250
k-NN (c)        0.727      0.173        0.522      0.161

(a) Complete Linkage was used as the cluster distance measure.
(b) A document-to-cluster similarity threshold of 0.25 was used.
(c) A k of 5 and a cluster similarity threshold of 0.25 were used.
Basically we would like to maximize the F-measure and
minimize the Entropy of clusters to achieve high quality
clustering.
4.3. Results and Discussion
Table 2 shows the results of SHC against HAC, Single-Pass, and k-NN clustering. For the first data set, the improvement was very significant, reaching over 20% improvement over k-NN (in terms of F-measure), 3% improvement over HAC, and 29% improvement over Single-Pass. For the second data set, an improvement of between 10% and 18% over the other methods was achieved. However, the absolute F-measure was considerably lower than for the first data set. The parameters chosen for the different algorithms were the ones that produced the best results.
After investigating the second data set, we found that it exhibits very low pair-wise document similarity within each class. This had the effect of creating a high number of clusters, hence the lower F-measure, but still good entropy (since entropy favors small clusters).
Figure 1 presents the above-mentioned results graphically, illustrating the improvement achieved over the other methods. The re-assignment strategy showed only a slight
improvement. This could be attributed to the distance between the original clusters; if the clusters are sufficiently
distant from each other, then the probability of a document
getting assigned to a cluster it does not belong to in the first
place is very low.
The time performance of the different clustering algorithms is illustrated in Figure 2 for both data sets. The performance of SHC is comparable to Single-Pass and k-NN, while being much better than HAC. HAC spends considerable time recalculating the similarities between the newly merged cluster and all other clusters at every iteration. In contrast, SHC, Single-Pass, and k-NN share the same general strategy for processing documents, without having to recalculate similarities at each step.
[Figure 1. Quality of Clustering Comparison: (a) F-measure and (b) Entropy on DS1 and DS2 for SHC (with and without re-assignment), HAC, Single-Pass, and k-NN.]
[Figure 2. Clustering Performance: clustering time (seconds) versus number of documents on (a) DS1 and (b) DS2 for SHC, Single-Pass, k-NN, and HAC.]
5. Conclusion
We presented an incremental document clustering algorithm based on maintaining highly coherent clusters at
all times. The similarity histogram representation of the
pair-wise document similarity distribution inside clusters
showed an improvement in judging the similarity between a
new document and each cluster. The method shows good clustering quality and time performance compared to standard document clustering techniques, as judged by two evaluation measures. We believe that this approach works well because the similarity histogram representation captures the essential statistical distribution of similarities in each cluster, and by guiding the algorithm in the direction of enhancing this distribution we are able to maintain high-quality clusters.
References
[1] D. Boley, M. Gini, R. Gross, S. Han, K. Hastings, G. Karypis, V. Kumar, B. Mobasher, and J. Moore. Partitioning-based clustering for web document categorization. Decision Support Systems, 27:329–341, 1999.
[2] M. Charikar, C. Chekuri, T. Feder, and R. Motwani. Incremental clustering and dynamic information retrieval. In Proc. of the 29th Annual ACM Symposium on Theory of Computing, pages 626–635, 1997.
[3] K. Hammouda and M. Kamel. Phrase-based document similarity based on an index graph model. In 2002 IEEE Int. Conf. on Data Mining (ICDM'02), pages 203–210, Japan, 2002.
[4] A. Jain and R. Dubes. Algorithms for Clustering Data. Prentice Hall, Englewood Cliffs, N.J., 1988.
[5] G. Salton, A. Wong, and C. Yang. A vector space model for automatic indexing. Communications of the ACM, 18(11):613–620, November 1975.
[6] M. Steinbach, G. Karypis, and V. Kumar. A comparison of document clustering techniques. In KDD-2000 Workshop on Text Mining, August 2000.
[7] W. Wong and A. Fu. Incremental document clustering for web page classification. In 2000 Int. Conf. on Information Society (IS2000), Japan, 2000.
[8] O. Zamir and O. Etzioni. Web document clustering: A feasibility demonstration. In Proc. of the 21st Annual Int. ACM SIGIR Conf., pages 46–54, Melbourne, Australia, 1998.