ISSN: 2277 – 9043
International Journal of Advanced Research in Computer Science and Electronics Engineering (IJARCSEE)
Volume 3, Issue 8, August 2014
CLUSTERING SENTENCE LEVEL TEXT USING A HIERARCHICAL FUZZY
RELATIONAL AND K-MEANS CLUSTERING ALGORITHM
S. Thenmozhi, Research Scholar, Department of Computer Science, S.N.R Sons College, Coimbatore.
N. Sumathi, Associate Professor, Department of MCA, S.N.R Sons College, Coimbatore.
Abstract---Clustering is one of the most widely used data mining techniques and the most important unsupervised learning technique. Most clustering methods group data based on distance, and few methods cluster data based on similarity. In this paper, similarity relationships among relational input data with similar expression patterns are considered, so that a consequential and simple analytical decision can be made with the proposed Hierarchical Fuzzy Relational K-means Clustering Algorithm (HFRECCA). This work is an extension of FRECCA, which is used for the clustering of text data. HFRECCA and the K-means clustering algorithm are compared with each other on the basis of various parameters such as entropy, purity and V-measure. Experimental results show that the proposed method performs better than the existing method.
Keywords: Entropy, Fuzzy Relational, Hierarchical Clustering, K-means Clustering, Purity, Sentence Clustering, Text Clustering, V-measure.
I. INTRODUCTION
In text mining, making use of the very large amount of available data is a difficult problem. Document clustering is one of the most important methods for helping users do so efficiently: it is a powerful method for organizing text documents and for extracting meaningful groups of documents from a large collection. Document clustering collects or browses the data and returns results with respect to the user's query [1, 2, 3, 4].
Agglomerative hierarchical clustering and partitional methods are the two basic clustering techniques, and both are well suited to sentence clustering. The basic idea in agglomerative hierarchical clustering (AHC) is to treat each sentence as a cluster and to merge the closest pair using some distance function, which is used to examine the similarity between sentences [5]; the merging is repeated until the desired number of clusters is obtained. The other family is the partitional methods, to which the k-means algorithm belongs. After selecting k initial centroids, this algorithm assigns each sentence to a centroid based on a distance measure, and the procedure is repeated until the k clusters stabilize [6, 7, 8].
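As an illustration of these two families (not part of the original paper), the following minimal Python sketch clusters a handful of invented sentences both agglomeratively and partitionally; scikit-learn and a TF-IDF representation are assumptions for illustration only.

    # Illustrative sketch (assumes scikit-learn); example sentences invented.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import AgglomerativeClustering, KMeans

    sentences = [
        "A journey of a thousand miles begins with a single step.",
        "Well begun is half done.",
        "Knowledge is power.",
        "The pen is mightier than the sword.",
    ]
    X = TfidfVectorizer().fit_transform(sentences).toarray()

    # Agglomerative: start from singleton clusters, repeatedly merge the
    # closest pair until the desired number of clusters remains.
    ahc = AgglomerativeClustering(n_clusters=2).fit_predict(X)

    # Partitional: choose k centroids, reassign sentences by distance,
    # repeat until the partition stabilizes.
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    print(ahc, km)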
The idea of using background knowledge, or statistical information gathered from large-scale text analysis, to compute text similarity has been well studied in the past [9, 10], and many research works introduce efficient similarity or relatedness measures between terms. With regard to works that employ such measures for document clustering, WordNet is one of the most widely used lexical thesauri [11, 12].
In traditional data clustering, the similarity of a cluster of objects is measured by the pairwise similarity of the objects. The authors of [13] argue that such measures are not appropriate for transactions that are sets of items. They proposed the notion of large items, i.e., items contained in some minimum fraction of the transactions in a cluster, to measure the similarity of a cluster of transactions. The intuition behind their clustering criterion is that there should be many large items within a cluster and little overlap of such items across clusters. They discussed the rationale behind their approach and its implications for providing a better solution to the clustering problem, presented a clustering algorithm based on the new criterion, and evaluated its effectiveness [13].
The rest of the paper is organized as follows: Section II discusses related work in the field of text clustering. Section III gives a detailed explanation of, and the algorithm for, the proposed method. Section IV explains the experimental setup and results, and Section V concludes the paper.
II. RELATED WORK
In [14], R.J. Hathaway describes a framework for key phrase extraction using sentence clustering and a mutual reinforcement principle. The sentences of a document are grouped into topical groups using spectral graph clustering, which is used to enhance document quality. Based on the mutual reinforcement principle, saliency scores for key phrases and sentences are generated for each topical group. The key phrases and sentences are then ranked according to their saliency scores and selected for inclusion in the top key phrase list and in summaries of the document.
In [15], Mario proposed a novel approach to fuzzy clustering based on a dissimilarity relation extracted from data using a Takagi-Sugeno (TS) system. The author proposed exploiting a very few pairs of patterns with known dissimilarity to build a TS system that models the dissimilarity relation. Among other things, the rules of the TS system provide an intuitive description of the dissimilarity relation itself. The TS system is then used to build a dissimilarity matrix, which is fed as input to an unsupervised fuzzy relational clustering algorithm, denoted ARCA (Any Relation Clustering Algorithm), which partitions
the data set based on the proximity of the vectors containing
the dissimilarity values between each pattern and all the other
patterns in the data set.
In sentence-level text clustering [16], most of the common techniques are based on the statistical analysis of a term, either a word or a phrase. A sentence may be related to more than one theme in a given document; hence a relational fuzzy clustering algorithm is used for text, since such an algorithm allows patterns to belong to more than one cluster. The PageRank algorithm is used as a general graph centrality measure within an expectation-maximization framework to construct a complete relational fuzzy clustering algorithm. This algorithm is able to identify overlapping clusters of semantically related sentences and is therefore useful in a variety of text mining tasks.
III. PROPOSED METHODOLOGY
Some of the basic properties of rough sets are:
- An object v can be part of at most one lower approximation.
- For a set X_i and an object v, if v belongs to the lower approximation of X_i, then v also belongs to the upper approximation of X_i.
- If an object v is not part of any lower approximation, then v belongs to two or more upper approximations.
A. K-MEANS CLUSTERING ALGORITHM
The K-means variant used here is based on rough sets, a mathematical tool for dealing with uncertainty. When there is insufficient knowledge to precisely define cluster membership, rough sets are used: a cluster is represented by a rough set with a lower approximation and an upper approximation. The K-means clustering algorithm is given below.
K-means clustering algorithm
Input: data set; number of clusters K; weights w_lower and w_upper; threshold constant ε
Output: K centroids and the lower and upper approximation of each cluster
Steps:
Step 1: Set initial centroids C = <C1, C2, ..., Ck>
Step 2: Partition the data into P subgroups
Step 3: For each subgroup P
Step 4: Create a new process to calculate distances
Step 5: Send the result back to the parent process
Step 6: Receive the distances and the initial members of the lower approximations of the K clusters from P
Step 7: Calculate the upper approximation of each member in the K clusters
Step 8: Recalculate the new centroids C'
Step 9: If difference(C, C') > ε
Step 10: Then set C to C' and go back to step 3
Step 11: Else stop and return C as well as the cluster members
Figure 1: K-means clustering algorithm
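The listing above is high-level. Below is a minimal NumPy sketch of a rough K-means of the kind it describes; the distance-ratio rule used to decide boundary membership, and the default values of w_lower, w_upper and ε, are illustrative assumptions, not the authors' implementation.

    import numpy as np

    def rough_kmeans(X, k, w_lower=0.7, w_upper=0.3, ratio=1.3, eps=1e-4,
                     iters=100, seed=0):
        """Minimal rough K-means sketch: every cluster keeps a lower
        approximation (certain members) and an upper approximation
        (certain members plus boundary members). w_lower + w_upper = 1."""
        rng = np.random.default_rng(seed)
        C = X[rng.choice(len(X), size=k, replace=False)].astype(float)
        for _ in range(iters):
            # distance of every point to every centroid, shape (n, k)
            d = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
            lower = [[] for _ in range(k)]
            upper = [[] for _ in range(k)]
            for i, di in enumerate(d):
                j = int(np.argmin(di))
                # clusters almost as close as the nearest one (assumption)
                near = [m for m in range(k) if m != j and di[m] <= ratio * di[j]]
                if near:
                    for m in [j] + near:    # ambiguous point: boundary only,
                        upper[m].append(i)  # member of several upper approx.
                else:
                    lower[j].append(i)      # unambiguous point: lower approx.
                    upper[j].append(i)      # (lower is a subset of upper)
            C_new = np.empty_like(C)
            for m in range(k):
                lo = lower[m]
                bd = [i for i in upper[m] if i not in lo]
                if lo and bd:
                    C_new[m] = w_lower * X[lo].mean(0) + w_upper * X[bd].mean(0)
                elif lo:
                    C_new[m] = X[lo].mean(0)
                elif bd:
                    C_new[m] = X[bd].mean(0)
                else:
                    C_new[m] = C[m]
            if np.abs(C - C_new).max() <= eps:  # difference(C, C') within ε
                C = C_new
                break
            C = C_new
        return C, lower, upper

Note that the lower and upper lists built here satisfy the three rough-set properties listed at the start of this section.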
B. HIERARCHICAL FRECCA
HFRECCA uses PageRank as a general graph centrality measure within a Gaussian mixture model approach. PageRank is used within an expectation-maximization framework to construct a complete relational fuzzy clustering algorithm. Since PageRank centrality can be viewed as a special case of HFRECCA centrality, PageRank centrality is applied to fuzzy relational K-means clustering, and a hierarchical structure is applied to obtain the HFRECCA algorithm.
HFRECCA algorithm
Input: a set of famous quotations (data set D)
Output: the quotations clustered using K-means clustering and re-ranked using the Hierarchical Fuzzy Relational Clustering Algorithm (HFRECCA)
Steps:
Step 1: Take the famous quotation data set D as input
Step 2: Initialize and normalize the membership values
Step 3: Assign random numbers in [0, 1] as membership values
Step 4: Apply expectation-maximization to the weighted affinity matrix
Step 5: Fetch the centroid point for the membership values
Step 6: Compute the equaling distance and form clusters
Step 7: Calculate PageRank scores for each cluster m
Step 8: Repeat from step 2 until all quotations are placed in the hierarchical cluster C
Figure 2: HFRECCA algorithm
Figure 2 shows that the famous quotation data set is the input to this algorithm: membership values in [0, 1] are assigned, expectation-maximization is applied to find the affinity matrix values, and finally PageRank values are calculated and used to arrange the quotations into a hierarchical cluster. K-means clustering is used to find the lower approximation and upper approximation of each cluster.
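As a rough sketch of this EM idea (a simplified reading of FRECCA, not the authors' code): in the E-step, each sentence's centrality within each cluster is obtained by running PageRank on the affinity matrix weighted by the current memberships; in the M-step, memberships and mixing weights are recomputed from those centralities. The toy affinity matrix S and the exact update rule below are illustrative assumptions.

    import numpy as np

    def pagerank(W, d=0.85, iters=50):
        """Power-iteration PageRank over a weighted affinity matrix W."""
        n = len(W)
        col = W.sum(axis=0)
        col[col == 0] = 1.0                 # avoid division by zero
        P = W / col                         # column-stochastic transitions
        r = np.full(n, 1.0 / n)
        for _ in range(iters):
            r = (1 - d) / n + d * (P @ r)
        return r / r.sum()

    def frecca_step(S, U, alpha):
        """One EM iteration sketch. S: n x n sentence affinity matrix,
        U: n x k fuzzy membership matrix, alpha: k mixing coefficients."""
        n, k = U.shape
        R = np.empty((n, k))
        for j in range(k):
            # E-step: centrality of each sentence inside cluster j, computed
            # on the affinity matrix weighted by current memberships
            Wj = S * np.outer(U[:, j], U[:, j])
            R[:, j] = pagerank(Wj)
        U_new = alpha * R                   # M-step: memberships from
        U_new /= U_new.sum(axis=1, keepdims=True)  # centralities and weights
        return U_new, U_new.mean(axis=0)    # new memberships, new alpha

    # toy usage: four sentences, two clusters
    S = np.array([[0.0, 0.9, 0.1, 0.0],
                  [0.9, 0.0, 0.1, 0.1],
                  [0.1, 0.1, 0.0, 0.8],
                  [0.0, 0.1, 0.8, 0.0]])
    U = np.random.default_rng(0).random((4, 2))
    U /= U.sum(axis=1, keepdims=True)
    alpha = np.full(2, 0.5)
    for _ in range(20):
        U, alpha = frecca_step(S, U, alpha)
    print(U.round(2))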
Figure 3 (pipeline): input (famous quotation data set) -> preprocessing the data -> similarity computation between documents -> entropy filtering of the data -> PageRank computation -> applying K-means clustering and the HFRECCA algorithm -> output.
Figure 3: Hierarchical FRECCA and K-means clustering
Figure 3 shows that the famous quotation data set is given as input to this process. Preprocessing then splits out the sentences. The similarity of each quotation to every other quotation is calculated, and the similarity value serves as a membership value; this value is used to find the equaling distance. The centroid point is then fetched for the membership values in order to calculate PageRank values. Finally, the HFRECCA algorithm and K-means clustering are applied to display the data in a structured format.
C. Pattern representation
Pattern representation is one of the basic problems in pattern recognition. Depending on the problem at hand, it is important to choose the most appropriate representation for patterns. In statistical learning, a pattern is mostly represented as a vector, so that the pattern can be viewed as a point in a d-dimensional space. Such a representation is indeed convenient in applications. However, when a pattern has spatial structure, for example a 2-D array or matrix of terms, the matrix needs to be vectorized, with a loss of information about the spatial relationships between similar words. No particular pattern representation is always better than another when no prior knowledge is injected, so it is not always effective to handle patterns as vectors. In particular, when dealing with words, the vector representation leads to a high-dimensional feature space and increases computational complexity.
D. Similarity computation
In order to cluster the items in a data set, some means of quantifying the degree of association between them is required. This may be a distance measure, or a measure of similarity or dissimilarity. Some clustering methods have a theoretical requirement for a specific measure (Euclidean distance for Ward's method, for example), but more commonly the choice of measure is at the discretion of the researcher. While a number of similarity measures are available, and the choice of similarity measure can
have an effect on the clustering results obtained, there have been only a few comparative studies. In cluster-based retrieval, the determination of inter-document similarity depends both on the document representation, in terms of the weights assigned to the indexing terms characterizing each document, and on the similarity coefficient that is chosen.
E. Similarity measures
Before clustering, a similarity/distance measure must be determined. The measure reflects the degree of closeness or separation of the target objects and should correspond to the characteristics that are believed to distinguish the clusters embedded in the data. In many cases, these characteristics are dependent on the data or the problem context at hand, and there is no measure that is universally best for all kinds of clustering problems. PageRank is used as a graph centrality measure.
Similarity between content is typically measured using distance functions such as the Euclidean distance or the Manhattan distance. The choice, driven by the requirements at hand, shapes the clusters obtained and determines the success of clustering for the particular application domain. Existing sentence clustering procedures generally represent content as a term-document matrix and carry out the clustering method on it. Although these procedures can group the documents satisfactorily, it is still difficult for them to capture the meanings of the documents, since there is no satisfactory interpretation for each document cluster. Clustering takes place according to the similarity or dissimilarity values.
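For concreteness, a short sketch of the measures mentioned above, computed on two invented term-frequency vectors (illustration only, not the paper's data):

    import numpy as np

    a = np.array([1.0, 0.0, 2.0, 1.0])   # term-frequency vector, sentence A
    b = np.array([0.0, 1.0, 2.0, 1.0])   # term-frequency vector, sentence B

    euclidean = np.linalg.norm(a - b)                 # straight-line distance
    manhattan = np.abs(a - b).sum()                   # city-block distance
    cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))  # similarity
    print(euclidean, manhattan, cosine)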
IV. EXPERIMENTAL RESULTS
The proposed method is developed using Java. In this paper, various parameters [16] are used to show the performance of the HFRECCA algorithm.

A. Partition Entropy Coefficient (PE)
Various unsupervised evaluation measures have been defined, but most are only applicable to clusters represented using prototypes. Two exceptions are the Partition Coefficient (PC) and the closely related Partition Entropy Coefficient, the latter of which is defined as

PE = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{L} u_{ij} \log_a u_{ij}    (1)

Here u_{ij} represents the membership of instance i to cluster j. The value of this index ranges from 0 to \log_a L; the closer the value is to 0, the crisper the clustering. The highest value is obtained when all of the u_{ij} are equal.

B. Purity
Two widely used external clustering evaluation criteria are purity and entropy. The purity of a cluster is defined as the fraction of the cluster occupied by its largest class of objects:

P_j = \frac{1}{|\omega_j|} \max_i |\omega_j \cap C_i|    (2)

Here C_i represents class i, \omega_j denotes the set of objects in cluster j, and P_j indicates the purity of cluster j. Overall purity is the weighted average of the individual cluster purities:

\mathrm{Overall\ Purity} = \frac{1}{N} \sum_{j=1}^{L} |\omega_j| \times P_j    (3)
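A minimal sketch of how equations (1)-(3) can be computed, with an invented membership matrix and toy cluster/class labels (not the paper's data):

    import numpy as np

    # Partition Entropy Coefficient from a fuzzy membership matrix U (N x L)
    U = np.array([[0.9, 0.1],
                  [0.8, 0.2],
                  [0.2, 0.8],
                  [0.1, 0.9]])
    PE = -(U * np.log(U)).sum() / len(U)              # eq. (1), with a = e

    # Purity from hard cluster labels and class labels
    clusters = np.array([0, 0, 1, 1])
    classes = np.array([0, 1, 1, 1])
    sizes, purities = [], []
    for j in np.unique(clusters):
        members = classes[clusters == j]
        purities.append(np.bincount(members).max() / len(members))  # eq. (2)
        sizes.append(len(members))
    overall_purity = np.dot(sizes, purities) / len(clusters)        # eq. (3)
    print(PE, purities, overall_purity)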
C. Entropy
The entropy of a cluster j is a measure of how mixed the objects within the cluster are, and is defined as

E_j = -\frac{1}{\log C} \sum_{i=1}^{C} \frac{|\omega_j \cap C_i|}{|\omega_j|} \log \frac{|\omega_j \cap C_i|}{|\omega_j|}    (4)

where C is the number of classes. Overall entropy is the weighted average of the individual cluster entropies:

\mathrm{Overall\ Entropy} = \frac{1}{N} \sum_{j=1}^{L} |\omega_j| \times E_j    (5)
D. V-Measure
The problems with purity and entropy are overcome by the V-measure, also known as Normalized Mutual Information (NMI), which is defined as the harmonic mean of homogeneity (h) and completeness (c); i.e.,

V = \frac{2hc}{h + c}    (6)
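Equations (4)-(6) on the same kind of toy labels; taking homogeneity and completeness from scikit-learn is an assumption about tooling, since the paper does not say how they were computed:

    import numpy as np
    from sklearn.metrics import completeness_score, homogeneity_score

    clusters = np.array([0, 0, 1, 1])      # cluster labels (toy)
    classes = np.array([0, 1, 1, 1])       # true class labels (toy)

    n_classes = len(np.unique(classes))
    sizes, entropies = [], []
    for j in np.unique(clusters):
        members = classes[clusters == j]
        p = np.bincount(members, minlength=n_classes) / len(members)
        p = p[p > 0]
        entropies.append(-(p * np.log(p)).sum() / np.log(n_classes))  # eq. (4)
        sizes.append(len(members))
    overall_entropy = np.dot(sizes, entropies) / len(clusters)        # eq. (5)

    h = homogeneity_score(classes, clusters)
    c = completeness_score(classes, clusters)
    V = 2 * h * c / (h + c)                                           # eq. (6)
    print(overall_entropy, V)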
The performance comparison table is given below.

Table 1: Performance Comparison
Parameter  | Existing method | Proposed method
Partition  | 0.01232         | 0.01154
Purity     | 0.02201         | 0.055075
Entropy    | 0               | 0.000944
V-Measure  | 0.00229         | 0.000322

Table 1 shows the performance comparison between the existing and the proposed method on four parameters: partition, purity, entropy and V-measure.
Figure 4: Partition entropy comparison between HFRECCA and FRECCA
Figure 4 shows the comparison of the partition entropy of HFRECCA and FRECCA. The partition entropy of HFRECCA is 0.012 and the partition entropy of FRECCA is 0.011; the HFRECCA value is higher than that of the existing FRECCA.
Figure 5: Purity comparison between HFRECCA and FRECCA
Figure 5 shows that HFRECCA performs better than FRECCA as measured by the external cluster evaluation criteria considered. The purity of HFRECCA is 0.0550 and the purity of FRECCA is 0.0250; HFRECCA purity is higher than that of FRECCA.
Figure 6: Entropy comparison between HFRECCA and FRECCA
Figure 6 shows that the entropy of HFRECCA is 0, while the entropy of FRECCA is 0.000500; FRECCA entropy is higher than that of HFRECCA.
Figure 7: V-measure comparison between HFRECCA and FRECCA
Figure 7 shows the performance comparison for the V-measure; the V-measure of HFRECCA is lower than that of the existing FRECCA.
Using this HFRECCA algorithm, the results are displayed according to the user's query. In the existing system, applying the algorithm to sentence clustering tasks demonstrated that it is capable of identifying overlapping clusters of semantically related sentences, and that it is therefore of potential use in a variety of text mining tasks. The proposed method is well suited to displaying results based on the famous quotation data set: using HFRECCA, each quotation is compared and clustered with related data from the entire quotation data set, and the result is provided as structured data. HFRECCA is used to improve the accuracy and effectiveness of the suggested query.
V. CONCLUSION
The famous quotation data set has a hierarchical structure; hence the extension of FRECCA, the Hierarchical Fuzzy Relational Clustering Algorithm (HFRECCA), is useful for clustering its sentences. Comparisons with the FRECCA algorithm on these data sets suggest that HFRECCA is capable of identifying softer clusters without sacrificing performance as evaluated by external measures. This paper supports these results through the Partition Entropy Coefficient, purity, entropy and the V-measure. In future work, fuzzy soft K-means and genetic K-means clustering algorithms can be applied.
REFERENCES
[1] J.D. Holt, S.M. Chung, Y. Li, Usage of mined word associations for text retrieval, in: Proc. of IEEE Int'l Conf. on Tools with Artificial Intelligence (ICTAI 2007), vol. 2, 2007, pp. 45-49.
[2] Y. Li, S.M. Chung, J.D. Holt, Text document clustering based on frequent word meaning sequences, Data and Knowledge Engineering 64 (1) (2008) 381-404.
[3] Y. Li, C. Luo, S.M. Chung, Text clustering with feature selection by using statistical data, IEEE Transactions on Knowledge and Data Engineering 20 (5) (2008) 641-652.
[4] C.J. van Rijsbergen, Information Retrieval, second ed., Butterworths, London, 1979.
[5] A.K. Jain, R.C. Dubes, Algorithms for Clustering Data, Prentice Hall, Englewood Cliffs, 1988.
[6] L. Kaufman, P.J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley & Sons, 1990.
[7] B. Larsen, C. Aone, Fast and effective text mining using linear-time document clustering, in: Proc. of ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining, 1999, pp. 16-22.
[8] C. Ordonez, E. Omiecinski, Efficient disk-based k-means clustering for relational databases, IEEE Transactions on Knowledge and Data Engineering 16 (8) (2004) 909-921.
[9] A. Budanitsky, G. Hirst, Evaluating WordNet-based measures of lexical semantic relatedness, Computational Linguistics 32 (1) (2006) 13-47.
[10] Z. Zhang, A. Gentile, F. Ciravegna, Recent advances in methods of lexical semantic relatedness - a survey, Natural Language Engineering, doi:10.1017/S1351324912000125.
[11] A. Hotho, S. Staab, G. Stumme, WordNet improves text document clustering, in: Proceedings of the SIGIR 2003 Semantic Web Workshop, 2003, pp. 541-544.
[12] J. Sedding, D. Kazakov, WordNet-based text document clustering, in: Proceedings of the 3rd Workshop on RObust Methods in Analysis of Natural Language Data (ROMAND '04), Association for Computational Linguistics, Stroudsburg, PA, USA, 2004, pp. 104-113.
[13] K. Wang, C. Xu, B. Liu, Clustering transactions using large items, School of Computing, National University of Singapore.
[14] R.J. Hathaway, Relational dual of the c-means clustering algorithms, Pattern Recognition 22 (2) (1989) 205-212.
[15] Mario G.C.A., A novel approach to fuzzy clustering based on a dissimilarity relation extracted from data using a TS system, Rev. Stat. Appliqué XIX (2) (1975) 19-34.
[16] A. Skabar, K. Abdalgader, Clustering sentence-level text using a novel fuzzy relational clustering algorithm, IEEE Transactions on Knowledge and Data Engineering 25 (1) (2013).