2009 International Conference on Computer Technology and Development
Document Clustering using Concept Space
and Cosine Similarity Measurement
Lailil Muflikhah
Department of Computer and Information Science
Universiti Teknologi Petronas
Brawijaya University
Bandar Seri Iskandar, Tronoh, Perak, Malaysia
[email protected]

Baharum Baharudin
Department of Computer and Information Science
Universiti Teknologi Petronas
Bandar Seri Iskandar, Tronoh, Perak, Malaysia
[email protected]
978-0-7695-3892-1/09 $26.00 © 2009 IEEE
DOI 10.1109/ICCTD.2009.206

Abstract- Document clustering is a form of data clustering, which is one of the data mining tasks and a kind of unsupervised classification. It is often applied to large data collections in order to partition them by similarity. It was first used in Information Retrieval to improve the precision and recall of queries. Clustering is much easier when the data have a small number of attributes that capture the important items. Document clustering is also useful in information retrieval applications for reducing processing time while obtaining high precision and recall. Therefore, we propose to integrate an information retrieval method with document clustering as a concept space approach. The method is known as Latent Semantic Indexing (LSI), which uses Singular Value Decomposition (SVD) or Principal Component Analysis (PCA). The aim of this method is to reduce the matrix dimension by finding patterns in the document collection based on the co-occurrence of terms. Each method is applied to the term-document weights in the vector space model (VSM), and the documents are then clustered with the fuzzy c-means algorithm. Besides reducing the term-document matrix, this research also uses cosine similarity in place of Euclidean distance in fuzzy c-means. As a result, the proposed method performs better than the existing method, with an F-measure of around 0.91 and an entropy of around 0.51.

Keywords-data mining; clustering; LSI; SVD; PCA; fuzzy c-means; Euclidean distance; cosine similarity

I. INTRODUCTION

Data mining is a technique for extracting patterns from hidden information. It finds and describes structural patterns in a data collection as a tool for helping to explain that data and make predictions from it. Generally, data mining tasks are divided into two major categories: predictive tasks, which aim to predict the value of a particular attribute based on the values of other attributes, and descriptive tasks, which aim to derive patterns (correlations, trends, clusters, trajectories, and anomalies) [1]. Clustering is a method for automatically organizing a large data collection by partitioning the data set so that the objects in the same cluster are more similar to one another than to objects in other clusters. Document clustering applies this idea to organizing a large collection of text. In the field of Information Retrieval (IR), document clustering is used to automatically group documents that belong to the same topic in order to support the user's browsing of retrieval results [2]. Experimental evidence shows that IR applications can benefit from the use of document clustering [3]. Document clustering has long been used as a tool to improve retrieval performance and to navigate large data collections.

Clustering methods can be classified into hierarchical methods and partitioning methods. Partitioning clustering can be divided into hard clustering and overlapping (fuzzy) clustering. Any given document may cover multiple subjects or categories. This motivates the fuzzy clustering approach, which allows a document to appear in multiple clusters. It differs from hard clustering, in which a document belongs to exactly one cluster, which amounts to assuming well-defined boundaries among the clusters. Many researchers have studied document clustering with hard clustering methods. One study shows that the Bisecting K-Means algorithm performs better than the basic K-Means algorithm and Agglomerative Hierarchical Clustering when similarity is measured with the cosine formulation [4]. Other studies report better results when the grouping process is based on variance [5] or on Euclidean distance [6]. There are also several applications of the Fuzzy C-Means algorithm: it has been applied to clustering symbolic data, where it produced better-quality clusters than a hierarchical (hard clustering) method [7], and it has been applied in text mining [8].

When clustering documents, the main problem is that the full-text vector space contains a very large number of terms. Lexical matching at the term level is often inaccurate. A word may have several meanings, and several words may share the same meaning, so term matching can return irrelevant documents. It is difficult to judge which documents belong to the same cluster for a specific category without selecting the terms that are meaningful or accounting for the correlation between terms. Therefore, this research uses a concept-based Information Retrieval approach, Latent Semantic Indexing (LSI). In this method, the documents are projected onto a small subspace of the vector space and then clustered. This creates a new abstract vector space in which the important terms are captured in order [2].
This paper first describes information retrieval using the LSI concept. It then describes document similarity as the basic concept of clustering, together with Fuzzy C-Means, the clustering algorithm applied to the documents. After that, it presents the methodology used to implement document clustering with LSI and cosine similarity, followed by the experimental evaluation. Finally, the performance is analyzed and conclusions are drawn.

II. INFORMATION RETRIEVAL

Information retrieval is the task of finding the information that matches what the user wants. Unrelated documents may be retrieved simply because query terms occur in them accidentally, and, on the other hand, related documents may be missed because no term in the document occurs in the query. Therefore, retrieval can be based on concepts rather than on terms, by first mapping items to a "concept space" and then ranking by similarity, as shown in Fig. 1 [2].

Figure 1. Using concepts for information retrieval

Fig. 1 shows a middle layer of concepts (c1 and c2) to which both queries and documents are mapped, instead of relating documents and terms directly as in vector retrieval. With this mapping, a query on term t3, which maps to concept c2, returns d2, d3 and d4 in the answer set because they relate to concept c2, without requiring that the documents contain term t3.

A. Latent Semantic Index (LSI)

Latent semantic indexing (LSI) was originally developed to find patterns in a document collection in order to improve accuracy in Information Retrieval. It uses Singular Value Decomposition (SVD) or Principal Component Analysis (PCA) to decompose the original matrix A of the document representation and retains only the k largest singular values from the singular value matrix $\Sigma$, together with the corresponding columns of the other two matrices, $U$ and $V^T$. The choice of k determines how many of the "important concepts" the ranking is based on. The assumption is that concepts with small singular values in $\Sigma$ can be regarded as "noise" and thus ignored. LSI therefore reduces the sizes of the matrices involved, since only the first k singular values are kept for computing the ranking, and it also determines the positions of terms and documents in the matrix [9]. Fig. 2 depicts how LSI obtains the patterns in a document collection.

Figure 2. Description of LSI in data collection

B. Singular Value Decomposition (SVD)

Singular Value Decomposition (SVD) is a method that can find patterns in a matrix and identify which words and documents are similar to each other. From the term (t) x document (d) matrix A it creates three new matrices $U$, $\Sigma$ and $V$ such that $A = U \Sigma V^T$, as illustrated in Fig. 3 [10].

Figure 3. The relative sizes of the three matrices when t > d

In Fig. 3, $U$ has orthogonal, unit-length columns ($U^T U = I$), called the left singular vectors; $V$ has orthogonal, unit-length columns ($V^T V = I$), called the right singular vectors; and $\Sigma$ is a k x k diagonal matrix of singular values, where k is the rank of A (k ≤ min(t, d)). In general, the matrices in $A = U \Sigma V^T$ must all be of full rank. The amount of dimension reduction must be chosen correctly in order to represent the real structure in the data [2].

C. Principal Component Analysis (PCA)

Principal component analysis is a method for finding k "principal axes", which form an orthonormal coordinate system that captures most of the variance in the data. Basically, PCA is obtained by applying Singular Value Decomposition (SVD) to the covariance matrix, which yields the eigenvectors and eigenvalues of the covariance matrix [11, 12].
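As a minimal sketch of the decompositions described in this section, the following NumPy code keeps the k largest singular values of a small term-document matrix and also derives k principal axes from the covariance matrix. The matrix values, the choice k = 2 and the helper names `truncated_svd` and `pca_axes` are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def truncated_svd(A, k):
    """Return U_k, s_k, Vt_k for the k largest singular values of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)  # s is sorted in descending order
    return U[:, :k], s[:k], Vt[:k, :]

def pca_axes(X, k):
    """Return the k principal axes (as columns) of the rows of X and their variances."""
    Xc = X - X.mean(axis=0)                   # center each attribute
    cov = Xc.T @ Xc / (X.shape[0] - 1)        # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:k]     # indices of the k largest eigenvalues
    return eigvecs[:, order], eigvals[order]

# Hypothetical 5-term x 4-document matrix (rows = terms, columns = documents).
A = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 2.0],
    [0.0, 0.0, 2.0, 1.0],
    [1.0, 1.0, 1.0, 1.0],
])

k = 2
U_k, s_k, Vt_k = truncated_svd(A, k)
A_k = U_k @ np.diag(s_k) @ Vt_k      # rank-k approximation of A
docs_k = Vt_k.T                      # one k-dimensional row per document

axes, variances = pca_axes(A.T, k)   # documents as rows, terms as attributes
```

Here `docs_k` holds one k-dimensional vector per document, which is the kind of reduced representation that would be passed on to clustering.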
III. DOCUMENT DISSIMILARITY

The dissimilarity of data objects (documents) is given by the distance between the document acting as the cluster center and the other documents. The Euclidean distance, d, between two points, x and y, in one-, two-, three-, or higher-dimensional space is given by equation 1:

$$d(x, y) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2} \qquad (1)$$

where n is the number of dimensions and $x_k$ and $y_k$ are, respectively, the kth attributes (components) of x and y [1].
In contrast, similar data objects (documents) are separated by a small distance and fall in the same cluster. Documents are often represented as vectors, where each attribute represents the frequency with which a particular term (word) occurs in the document. A measure of similarity for document clustering is the cosine of the angle between two vectors, as in equation 2 [1]:

$$\cos(d_i, d_j) = \frac{d_i \cdot d_j}{\lVert d_i \rVert \, \lVert d_j \rVert} \qquad (2)$$

where $d_i$ and $d_j$ are two different documents.
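To make the contrast between equations 1 and 2 concrete, here is a small sketch in plain Python. The three term-frequency vectors are made up for illustration, and the function names are hypothetical. The second vector is the first one scaled by three, which mimics a longer document on the same topic.

```python
import math

def euclidean_distance(x, y):
    """Equation 1: square root of the summed squared attribute differences."""
    return math.sqrt(sum((xk - yk) ** 2 for xk, yk in zip(x, y)))

def cosine_similarity(di, dj):
    """Equation 2: dot product divided by the product of the vector lengths.

    Assumes neither vector is all zeros.
    """
    dot = sum(a * b for a, b in zip(di, dj))
    norm_i = math.sqrt(sum(a * a for a in di))
    norm_j = math.sqrt(sum(b * b for b in dj))
    return dot / (norm_i * norm_j)

# Hypothetical term-frequency vectors: d2 has the same term proportions as d1,
# d3 uses entirely different terms.
d1 = [1, 2, 0, 0]
d2 = [3, 6, 0, 0]
d3 = [0, 0, 2, 1]

print(euclidean_distance(d1, d2))   # grows with the difference in document length
print(cosine_similarity(d1, d2))    # depends only on the direction of the vectors
print(cosine_similarity(d1, d3))    # no shared terms
```

Because the cosine depends only on the direction of the vectors, documents of different lengths with the same term proportions are treated as similar, whereas the Euclidean distance between them is large.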
IV. FUZZY C-MEANS CLUSTERING

There are various fuzzy clustering algorithms, and one simple fuzzy clustering technique is the fuzzy c-means algorithm (FCM) by Duda and Hart [13], which marked the birth of the fuzzy method. FCM is known to produce a reasonable partitioning of the original data in many cases (see [14]) and is very fast compared with some other approaches. In addition, FCM converges to a minimum or saddle point from any initialization; the minimum may be either a local or the global minimum of the objective function [15].

In principle, this algorithm is based on minimization of the objective function J(X; U, V). Generally, the objective function is the sum of the dissimilarities weighted by the membership degree ($\mu_{ik}$), as shown in equation 3 [16]:

$$J(X; U, V) = \sum_{i=1}^{c} \sum_{k=1}^{N} (\mu_{ik})^m \, d(x_k, v_i) \qquad (3)$$

where d is the distance, V is the set of cluster centers $v_i$, X is the data (document-term matrix) with documents $x_k$, c is the number of clusters, N is the number of documents and m is the fuzziness parameter.

V. METHODOLOGY

The proposed method uses the LSI concept in order to obtain a small vector space. The detailed steps for document clustering are as follows:
1. Document preprocessing, which includes case folding, parsing, stop word removal and stemming.
2. Removing the terms that have a global frequency of less than 2 and the terms whose local frequency is more than half of the total number of documents.
3. Representing the full-text documents as a term-document matrix A (vector space model) using TF-IDF term weights.
4. Mapping the term-document matrix A to the document matrix V in concept space using the LSI approach. Since
$$A = U \Sigma V^T, \qquad U^T U = V^T V = I \ \text{(as a property of the SVD matrices)},$$
thus:
$$V = A^T U \Sigma^{-1}$$
5. Implementing the Fuzzy C-means clustering algorithm on the document matrix V as the representation of the document collection, using cosine similarity in place of Euclidean distance, with the dissimilarity defined as $[1 - \cos(x_k, v_i)]$. This is applied in the objective function of the Fuzzy C-means algorithm.

VI. PERFORMANCE EVALUATION

Two measurements are used to assess clustering quality: F-measure and entropy [17]. The basic idea comes from information retrieval. Each cluster is treated as if it were the result of a query, and each class as if it were the desired set of documents for that query. The F-measure is built from precision and recall, which for each cluster j and class i are as follows:

$$\text{Recall}(i, j) = \frac{n_{ij}}{n_i}, \qquad \text{Precision}(i, j) = \frac{n_{ij}}{n_j} \qquad (4)$$

where $n_{ij}$ is the number of documents with class label i in cluster j, $n_i$ is the number of documents with class label i and $n_j$ is the number of documents in cluster j.

Thus, the F-measure of cluster j and class i is obtained by the following equation:

$$F(i, j) = \frac{2 \cdot \text{Recall}(i, j) \cdot \text{Precision}(i, j)}{\text{Precision}(i, j) + \text{Recall}(i, j)} \qquad (5)$$

The higher the F-measure, the higher the accuracy of the clustering, since it combines precision and recall.

Another measurement, related to the internal quality of clustering, is the entropy ($E_j$), which can be formulated as:

$$E_j = -\sum_{i} P(i, j) \cdot \log P(i, j) \qquad (6)$$

where P(i, j) is the probability that a document has class label i and is assigned to cluster j.

Thus, the total entropy of the clusters is obtained by summing the entropies of each cluster weighted by the size of each cluster:

$$E = \sum_{j} \frac{n_j \cdot E_j}{n} \qquad (7)$$

where $n_j$ is the size of cluster j and n is the total number of documents in the corpus. The lower the entropy, the higher the internal quality of the clusters.

VII. EXPERIMENTAL RESULT

We evaluated the performance of the proposed method using data sets taken from 20 Newsgroups [18]. The data are organized into four datasets of increasing volume: Binary2, Multi5, Multi7 and Multi10. They consist of short news items on various topics, and the topics are used as the class reference in the clustering process. Each dataset has 100 documents per topic. The Binary2 dataset contains 200 documents (10215 terms), the Multi5 dataset contains 500 documents (18367 terms), the Multi7 dataset contains 700 documents (18661 terms), and the Multi10 dataset contains 1000 documents (25627 terms). Each dataset is clustered separately based on its number of topics. The first step is document preprocessing, which reduces the volume of data; the result after preprocessing is shown in Table I.
TABLE I. DESCRIPTION OF DATASET AFTER PREPROCESSING

Dataset   Docs   Terms   Topics
Binary2   200    7432    talk.politics.mideast, talk.politics.misc
Multi5    500    13646   comp.graphics, rec.motorcycles, rec.sport.baseball, sci.space, talk.politics.mideast
Multi7    700    13959   alt.atheism, comp.sys.mac.hardware, misc.forsale, rec.autos, rec.sport.hockey, sci.electronics, talk.politics.guns
Multi10   1000   18655   alt.atheism, comp.sys.mac.hardware, misc.forsale, rec.autos, rec.sport.hockey, sci.crypt, sci.electronics, sci.med, sci.space, talk.politics.guns

After that, the term weights are represented using the TF-IDF method in a matrix (vector space model). By applying the LSI method, the dimension of the term-document matrix is reduced as shown in Table II. The size of the reduction is based on the selected k-rank, with the optimum condition found in the interval [2...50].

TABLE II. REDUCTION DIMENSION OF TERM-DOCUMENT USING LSI

Data set   Total Documents   Total Pattern-terms (SVD)   Total Pattern-terms (PCA)
Binary2    200               18                          26
Multi5     500               22                          22
Multi7     700               28                          24
Multi10    1000              20                          24

The document collection is then clustered with the Fuzzy C-Means algorithm using the fuzziness parameter (m = 1.1), the error rate (ε = 0.001) and a specific number of clusters (c) based on the number of topics or classes. With the LSI method applied (SVD and PCA), the distribution of the documents of the binary2 dataset and the positions of the cluster centers at a given k-rank are illustrated in Fig. 4. The two methods distribute the dataset differently for clustering.

Figure 4. Dataset and cluster center distribution using LSI approach (panels: SVD method of binary2 dataset; PCA method of binary2 dataset)
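The clustering run described above could be sketched roughly as follows with NumPy, using the fuzziness m = 1.1 and threshold 0.001 from the text and 1 - cos as the dissimilarity. This is only an illustration under stated assumptions: the random initial partition, the weighted-mean center update, the convergence test on the change in memberships, the function names and the six two-dimensional "documents" are all choices made here and are not taken from the paper.

```python
import numpy as np

def cosine_dissimilarity(X, V):
    """1 - cos(x_k, v_i) for every center v_i (rows of V) and document x_k (rows of X)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Vn = V / np.linalg.norm(V, axis=1, keepdims=True)
    return 1.0 - Vn @ Xn.T                      # shape (c, n)

def memberships(D, m):
    """Membership update for an objective of the form sum_i sum_k u_ik^m * D_ik."""
    D = np.maximum(D, 1e-12)                    # avoid division by zero at a center
    # Dividing by the column minimum keeps the powers in a safe numeric range.
    w = (D / D.min(axis=0, keepdims=True)) ** (-1.0 / (m - 1.0))
    return w / w.sum(axis=0, keepdims=True)     # each column sums to 1

def fuzzy_c_means(X, c, m=1.1, eps=0.001, max_iter=100, seed=0):
    """Alternate center and membership updates until U changes by less than eps."""
    rng = np.random.default_rng(seed)
    U = rng.random((c, X.shape[0]))
    U /= U.sum(axis=0, keepdims=True)           # random initial fuzzy partition
    for _ in range(max_iter):
        W = U ** m
        V = (W @ X) / W.sum(axis=1, keepdims=True)   # weighted mean as cluster center
        U_new = memberships(cosine_dissimilarity(X, V), m)
        converged = np.abs(U_new - U).max() < eps
        U = U_new
        if converged:
            break
    return U, V

# Hypothetical documents already projected to a 2-dimensional concept space.
X = np.array([
    [1.0, 0.1], [0.9, 0.2], [1.0, 0.0],
    [0.1, 1.0], [0.2, 0.9], [0.0, 1.0],
])
U, V = fuzzy_c_means(X, c=2)
labels = U.argmax(axis=0)                       # hardened assignment for inspection
```

With m close to 1 the memberships become almost crisp, so `labels` is a reasonable summary here; for larger m the full matrix `U` carries the overlapping memberships that motivate the fuzzy approach.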
VIII. DISCUSSION

The k-rank means that there are k patterns in each document collection. To study the effect of the k-rank on cluster quality and to find the optimum condition (highest performance), we apply these methods with various k-ranks. They are applied to the binary2 dataset using SVD and PCA with k = 1, 2, 3, …, 50, and the results are shown in Fig. 5 and 6.

Figure 5. Performance of clustering for binary2 using SVD (x-axis: k-rank; series: prec, recall, f-measure, entropy)
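The quantities plotted against the k-rank follow the precision, recall, F-measure and entropy definitions of the performance evaluation section. A plain-Python sketch of those definitions is given below. The label lists are made up, the function names are hypothetical, P(i, j) is estimated here as n_ij / n_j, and the base-2 logarithm is an assumption, since the text does not state the base.

```python
import math
from collections import Counter

def contingency(classes, clusters):
    """Count n_ij: documents with class label i assigned to cluster j."""
    return Counter(zip(classes, clusters))

def f_measure(n_ij, n_i, n_j):
    """Recall, precision and F-measure for one (class, cluster) pair."""
    if n_ij == 0:
        return 0.0
    recall = n_ij / n_i
    precision = n_ij / n_j
    return 2 * recall * precision / (precision + recall)

def total_entropy(classes, clusters):
    """Size-weighted sum of the per-cluster entropies."""
    counts = contingency(classes, clusters)
    cluster_sizes = Counter(clusters)
    n = len(classes)
    total = 0.0
    for j, n_j in cluster_sizes.items():
        e_j = 0.0
        for (i, jj), n_ij in counts.items():
            if jj == j:
                p = n_ij / n_j                  # assumed estimate of P(i, j)
                e_j -= p * math.log2(p)         # base-2 logarithm is an assumption
        total += n_j * e_j / n
    return total

# Hypothetical labels: six documents, two classes, two clusters.
classes = ["a", "a", "a", "b", "b", "b"]
clusters = [0, 0, 1, 1, 1, 1]

counts = contingency(classes, clusters)
class_sizes = Counter(classes)
cluster_sizes = Counter(clusters)
f_a0 = f_measure(counts[("a", 0)], class_sizes["a"], cluster_sizes[0])
entropy = total_entropy(classes, clusters)
```

In this toy case cluster 0 is pure and contributes no entropy, while cluster 1 mixes both classes and accounts for the whole total.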
Figure 6. Performance of clustering for binary2 using PCA (x-axis: k-rank; series: prec, recall, f-measure, entropy)

The clustering performance of each LSI method reaches its optimum at a different k-rank. For binary2, the optimum is at k-rank = 18 for SVD and at k-rank = 6 for PCA. The same procedure is applied to the other data sets: multi5, multi7 and multi10. Furthermore, Fig. 7 compares, for all data sets, the performance of document clustering without LSI and with LSI, including cosine similarity as a replacement for Euclidean distance.

Figure 7. Performance comparison of document clustering (groups: no LSI, SVD and PCA for each of binary2, multi5, multi7 and multi10; series: precision, recall, F-measure and entropy, each for cosine and Euclidean)

Fig. 7 shows that the accuracy of document clustering without LSI is very low, especially for the large datasets (multi7 and multi10) when using Euclidean distance. For the multi7 dataset it gives precision = 0.464, recall = 0.324 and F-measure = 0.381, with a high entropy of 2.337. The multi10 dataset has the worst performance, with precision = 0.410, recall = 0.341, F-measure = 0.372 and entropy = 2.422. In contrast, when the LSI approach is used and cosine similarity replaces Euclidean distance in the Fuzzy C-Means algorithm, the performance in terms of both external and internal cluster quality is very high. The multi7 dataset obtains F-measure (svd: 0.906; pca: 0.903) and entropy (svd: 0.597; pca: 0.609), and multi10 obtains F-measure (svd: 0.888; pca: 0.887) and entropy (svd: 0.710; pca: 0.732). The performance on the remaining data sets also increases, although not significantly.

IX. CONCLUSION

Document clustering can be carried out using concept space and cosine similarity. The approach significantly reduces the dimension of the term-document matrix according to the k-rank (total number of patterns). The average performance is also very high, with an F-measure of about 0.91 and an entropy of about 0.51. The improvement is significant on large data volumes (the multi7 and multi10 datasets), with an increase of more than 50%.

REFERENCES

1. Tan, P.-N., Steinbach, M. and Kumar, V., Introduction to Data Mining. Pearson International ed. 2006: Pearson Education, Inc.
2. Hearst, M.A. and Pedersen, J.O., Reexamining the Cluster Hypothesis. In Proceedings of SIGIR '96. 1996.
3. Jardine, N. and van Rijsbergen, C.J., The Use of Hierarchical Clustering in Information Retrieval. Information Storage and Retrieval, 1971. 7.
4. Steinbach, M., Karypis, G. and Kumar, V., A Comparison of Document Clustering Techniques. 2000, University of Minnesota.
5. Savaresi, S.M., Boley, D.L., Bittanti, S. and Gazzaniga, G., Cluster Selection in Divisive Clustering Algorithms. 2002.
6. Larose, D.T., An Introduction to Data Mining. Discovering Knowledge in Data. 2005: Wiley & Sons, Inc.
7. El-Sonbaty, Y. and Ismail, M.A., Fuzzy Clustering for Symbolic Data. IEEE Transactions on Fuzzy Systems, 1998. 6.
8. Rodrigues, M.E.S.M. and Sacks, L., A Scalable Hierarchical Fuzzy Clustering Algorithm for Text Mining. In The 5th International Conference on Recent Advances in Soft Computing. 2004.
9. Aberer, K., EPFL-SSC, L.d.s.d.i. repartis, Editor. 2003.
10. Deerwester, S., et al., Indexing by latent semantic analysis. Journal of American Society for Information Science and Technology, 1990. 41: p. 391-407.
11. Smith, L., A Tutorial on Principal Component Analysis. 2002.
12. Shlens, J., A Tutorial on Principal Component Analysis. 2009.
13. Bezdek, J.C., Fuzzy Mathematics in Pattern Classification. 1973, Cornell University: Ithaca, New York.
14. Bezdek, J.C., Pattern Recognition with Fuzzy Objective Function Algorithm. 1981: Plenum Press.
15. Hathaway, R., Bezdek, J. and Tucker, W., An Improved Convergence Theory for the Fuzzy ISODATA Clustering Algorithms. Analysis of Fuzzy Information, 1987. 3 (Boca Raton: CRC Press): p. 123-132.
16. Miyamoto, S., Ichihashi, H. and Honda, K., Algorithm for Fuzzy Clustering. Methods in c-Means Clustering with Applications. Studies in Fuzziness and Soft Computing. Vol. 229. 2008, Osaka, Japan: Scientific Publishing Services Pvt. Ltd., Chennai, India.
17. Larsen, B. and Aone, C., Fast and Effective Text Mining Using Linear-time Document Clustering. In KDD-99. 1999: San Diego, California.
18. http://kdd.ics.uci.edu/databases/20newgroups.html