ISSN: 2277 – 9043
International Journal of Advanced Research in Computer Science and Electronics Engineering (IJARCSEE), Volume 3, Issue 8, August 2014

CLUSTERING SENTENCE-LEVEL TEXT USING A HIERARCHICAL FUZZY RELATIONAL AND K-MEANS CLUSTERING ALGORITHM

S. Thenmozhi, Research Scholar, Department of Computer Science, S.N.R Sons College, Coimbatore.
N. Sumathi, Associate Professor, Department of MCA, S.N.R Sons College, Coimbatore.

Abstract--- Clustering is one of the most widely used data mining techniques and the most important unsupervised learning technique. Most clustering methods group data based on distance; only a few cluster data based on similarity. In this paper, similarity relationships among relational input data with similar expression patterns are considered, so that a meaningful and simple analytical decision can be made with the proposed Hierarchical Fuzzy Relational K-Means Clustering Algorithm (HFRECCA). This work is an extension of FRECCA, which is used for clustering text data. HFRECCA and the K-means clustering algorithm are compared on various parameters such as entropy, purity, and V-measure. Experimental results show that the proposed method performs better than the existing method.

Keywords: Entropy, Fuzzy Relational, Hierarchical Clustering, K-means Clustering, Purity, Sentence Clustering, Text Clustering, V-measure.

I. INTRODUCTION

In text mining and document clustering, making use of very large amounts of data is a difficult problem. Document clustering is one of the most important methods for helping users do this efficiently: it is a powerful way to organize text documents and to extract meaningful groups from a large collection of documents. Document clustering collects or browses the data and returns results with respect to the user's query [1, 2, 3, 4]. Agglomerative hierarchical and partitional methods are the two basic clustering techniques.
These techniques are well suited to sentence clustering. The basic idea in agglomerative hierarchical clustering (AHC) is to treat each sentence as a cluster and repeatedly merge the closest pair, using one of several kinds of distance function to examine the similarity between sentences [5]. This step is repeated until the desired number of clusters is obtained. The other approach is the partitional method, to which the k-means algorithm belongs. After selecting k initial centroids, the algorithm assigns each sentence to a centroid based on a distance measure; this is repeated until the k clusters stabilize [6, 7, 8]. The idea of using background knowledge, or statistical information gathered from large-scale text analysis, to compute text similarity has been well studied in the past [9, 10], and many research works introduce efficient similarity or relatedness measures between terms. Among works that employ such measures for document clustering, WordNet is one of the most widely used lexical thesauri [11, 12]. In traditional data clustering, the similarity of a cluster of objects is measured by the pairwise similarity of its objects. Wang et al. [13] argue that such measures are not appropriate for transactions that are sets of items. They propose the notion of large items, i.e., items contained in some minimum fraction of the transactions in a cluster, to measure the similarity of a cluster of transactions. The intuition behind this clustering criterion is that there should be many large items within a cluster and little overlap of such items across clusters. They discuss the rationale behind their approach and its implications for providing a better solution to the clustering problem, present a clustering algorithm based on the new criterion, and evaluate its effectiveness [13]. The rest of the paper is organized as follows: Section II discusses related work in the field of text clustering.
Section III gives a detailed explanation of, and the algorithm for, the proposed method. Section IV explains the experimental setup; Section V concludes the paper.

II. RELATED WORK

In [14], R.J. Hathaway describes a framework for key-phrase extraction using sentence clustering and a reinforcement principle. The sentences of the document are partitioned into topical groups using spectral graph clustering, which is used to enhance the document quality. Based on the mutual reinforcement principle, saliency scores for key phrases and sentences are generated for each topical group. The key phrases and sentences are then ranked according to their saliency scores and selected for inclusion in the top key-phrase list and the summaries of the document. In [15], Mario et al. propose a novel approach to fuzzy clustering based on a dissimilarity relation extracted from data using a TS system. The authors exploit very few pairs of patterns with known dissimilarity to build a TS system that models the dissimilarity relation. Among other things, the rules of the TS system provide an intuitive description of the dissimilarity relation itself. The TS system is then used to build a dissimilarity matrix, which is fed as input to an unsupervised fuzzy relational clustering algorithm, denoted ARCA (Any Relation Clustering Algorithm), which partitions the data set based on the proximity of the vectors containing the dissimilarity values between each pattern and all the other patterns in the data set. In sentence-level text clustering [16], most common techniques are based on the statistical analysis of a term, either a word or a phrase. A sentence may be related to more than one theme in a given document; hence a relational fuzzy clustering algorithm is used for text.
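The idea that a sentence may belong to more than one theme can be sketched with fuzzy memberships. The snippet below is an illustrative sketch, not the paper's code: given a sentence's distances to a set of cluster centres, it computes fuzzy-c-means-style memberships (the function name, fuzzifier default, and toy numbers are assumptions of this sketch).

```python
# Hypothetical sketch: fuzzy memberships let one sentence belong to
# several clusters at once. Memberships are inversely related to
# distance and normalized to sum to 1 (fuzzifier m > 1).

def fuzzy_memberships(distances, m=2.0):
    """u_j proportional to 1 / d_j^(2/(m-1)), normalized to sum to 1."""
    weights = [1.0 / d ** (2.0 / (m - 1)) for d in distances]
    total = sum(weights)
    return [w / total for w in weights]

# A sentence equally distant from two themes belongs half to each:
print(fuzzy_memberships([2.0, 2.0]))  # → [0.5, 0.5]

# A sentence nearer theme 1 still keeps some membership in theme 2:
u = fuzzy_memberships([1.0, 3.0])
print(round(u[0], 3), round(u[1], 3))  # → 0.9 0.1
```

This soft assignment is what distinguishes relational fuzzy clustering from hard partitioning, where each sentence would be forced into exactly one theme.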
A relational fuzzy algorithm allows patterns to belong to more than one cluster. The PageRank algorithm is used as a general graph centrality measure, and the data are used within an Expectation-Maximization framework to construct a complete relational fuzzy clustering algorithm. This algorithm is able to identify overlapping clusters of semantically related sentences, and is therefore useful in a variety of text mining tasks.

III. PROPOSED METHODOLOGY

Some basic properties of rough sets are the following. An object v can be part of at most one lower approximation. For a set X and object v, if v belongs to the lower approximation of x_i, then v also belongs to the upper approximation of x_i. If an object v is not part of any lower approximation, then v belongs to two or more upper approximations.

B. HIERARCHICAL FRECCA

HFRECCA uses PageRank as a general graph centrality measure and builds on the Gaussian mixture model approach. PageRank is used within an Expectation-Maximization framework to construct a complete relational fuzzy clustering algorithm. Since PageRank centrality can be viewed as a special case of HFRECCA centrality, it is applied for fuzzy relational k-means clustering, and a hierarchical structure is applied on top, giving the HFRECCA algorithm.

A. K-MEANS CLUSTERING ALGORITHM

Rough sets are a mathematical tool for dealing with uncertainty. When there is insufficient knowledge to define clusters precisely, rough sets are used: a cluster is represented by a rough set with a lower approximation and an upper approximation. The K-means clustering algorithm is given below.
K-means clustering algorithm
Input: data set; number of clusters K; weights w_lower and w_upper; threshold constant ε
Output: K centroids, and the lower and upper approximation of each cluster
Steps:
Step 1: Set initial centroids C = <C1, C2, ..., Ck>
Step 2: Partition the data into P subgroups
Step 3: For each subgroup P
Step 4: Create a new process to calculate distances
Step 5: Send the result back to the parent process
Step 6: Receive the distances and the initial members of the lower approximations of the K clusters from P
Step 7: Calculate the upper approximation of each member in the K clusters
Step 8: Recalculate the new centroids C′
Step 9: If difference(C, C′) exceeds ε,
Step 10: then set C to C′ and go back to step 3;
Step 11: else stop and return C as well as the cluster members
Figure 1: K-means clustering algorithm

HFRECCA algorithm
Input: a famous-quotations data set D
Output: quotations clustered using k-means clustering and re-ranked using the Hierarchical Fuzzy Relational Clustering Algorithm (HFRECCA)
Steps:
Step 1: Take the famous-quotations data set D
Step 2: Initialize and normalize the membership values
Step 3: Draw random numbers in [0, 1] and assign them as membership values
Step 4: Apply Expectation-Maximization to the weighted affinity matrix
Step 5: Fetch the centroid point for the membership values
Step 6: Compute the equaling distance and form clusters
Step 7: Calculate PageRank scores for cluster m
Step 8: Repeat from step 2 until all quotations are placed in the hierarchical cluster C
Figure 2: HFRECCA algorithm

Figure 2 illustrates that the famous-quotations data set is the input of the algorithm; membership values in [0, 1] are assigned, Expectation-Maximization is applied to find the affinity matrix values, and finally PageRank values are calculated to arrange the quotations in a hierarchical cluster. K-means clustering is used to find the lower approximation and upper approximation of each cluster.
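The centroid/assignment loop of Figure 1 can be sketched as plain k-means. This is an illustrative sketch only: the rough lower/upper approximations, the parallel subgroup processes, and the paper's Java implementation are omitted, and the function names and toy data are assumptions of this sketch.

```python
# Minimal k-means sketch: assign each point to its nearest centroid,
# recompute centroids as cluster means, stop when centroids move < eps.
import random

def kmeans(points, k, eps=1e-6, max_iters=100):
    random.seed(0)  # deterministic initial centroids for the example
    centroids = random.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iters):
        # Assignment step (squared Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[c])))
            clusters[j].append(p)
        # Update step: centroid = mean of its cluster (keep old if empty).
        new = [tuple(sum(col) / len(c) for col in zip(*c)) if c else centroids[i]
               for i, c in enumerate(clusters)]
        if all(sum((a - b) ** 2 for a, b in zip(o, n)) < eps
               for o, n in zip(centroids, new)):
            break
        centroids = new
    return centroids, clusters

points = [(0.0, 0.0), (0.2, 0.1), (9.0, 9.0), (9.1, 8.8)]
centroids, clusters = kmeans(points, 2)
print(sorted(len(c) for c in clusters))  # → [2, 2]
```

In the rough-set variant described above, the assignment step would additionally place each point into the lower approximation of one cluster or the upper approximations of several, with centroids computed as a weighted (w_lower, w_upper) combination.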
Input famous-quotations data set → Preprocessing the data → Similarity computation between documents → Entropy filtering of the data → PageRank computation → Applying k-means clustering and the HFRECCA algorithm → Output
Figure 3: Hierarchical FRECCA and K-means clustering

Figure 3 illustrates that the famous-quotations data set is given as input to the process. Preprocessing splits out the sentences. Similarity is then calculated from one quotation to another, and the similarity value serves as a membership value; this value is used to find the equaling distance. The centroid point is then fetched for the membership values in order to calculate PageRank values. Finally, the HFRECCA algorithm and k-means clustering are applied to display the data in a structured format.

C. Pattern representation

Pattern representation is one of the basic problems in pattern recognition. Depending on the problem at hand, it is important to choose the most appropriate representation for patterns. In statistical learning, a pattern is mostly represented as a vector, so that the pattern can be viewed as a point in a d-dimensional space. Such a representation indeed brings convenience in applications. However, when a pattern has spatial structure, for instance a 2-D term array or matrix, the matrix needs to be vectorized, with a loss of information about the spatial relationships between neighboring words. No single pattern representation is always better than another when no prior knowledge is injected. Therefore, it is not always effective to handle patterns in the form of a vector.
In particular, when dealing with words, the vector representation can lead to a high-dimensional feature space and increased computational complexity.

D. Similarity computation

In order to cluster the items in a data set, some means of quantifying the degree of association between them is required. This may be a distance measure, or a measure of similarity or dissimilarity. Some clustering methods have a theoretical requirement for a specific measure (Euclidean distance for Ward's method, for example), but more commonly the choice of measure is at the discretion of the researcher. While a number of similarity measures are available, and the choice of similarity measure can affect the clustering results obtained, there have been only a few comparative studies. In cluster-based retrieval, the determination of inter-document similarity depends both on the document representation, in terms of the weights assigned to the indexing terms characterizing each document, and on the similarity coefficient that is chosen.

E. Similarity measures

Before clustering, a similarity/distance measure must be determined. The measure reflects the degree of closeness or separation of the target objects and should correspond to the characteristics believed to distinguish the clusters embedded in the data. In many cases, these characteristics depend on the data or the problem context at hand, and no measure is universally best for all kinds of clustering problems. PageRank is used as a graph centrality measure. Similarity between contents is measured using distance functions such as Euclidean distance or Manhattan distance.
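The distance and similarity functions named above can be sketched for sentence vectors (e.g., term-weight vectors); the cosine measure is added here for comparison, since it is common in text clustering. The helper names and toy vectors are assumptions of this sketch, not the paper's code.

```python
# Common distance/similarity measures over equal-length numeric vectors.
import math

def euclidean(a, b):
    """Straight-line (L2) distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """City-block (L1) distance."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between vectors; 1 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

u, v = (1.0, 2.0, 0.0), (2.0, 4.0, 0.0)
print(euclidean(u, v))          # → sqrt(5) ≈ 2.236
print(manhattan(u, v))          # → 3.0
print(cosine_similarity(u, v))  # ≈ 1.0 (same direction, different length)
```

Note the design trade-off this example exposes: v is u scaled by 2, so the distance measures call them far apart while cosine similarity calls them identical in direction — for term-frequency vectors of different sentence lengths, cosine is often the more meaningful choice.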
The choice of measure depends on the requirements of the application, since it influences cluster size and determines the success of clustering for the particular domain. Existing sentence clustering procedures generally represent content as a term-document matrix and carry out the clustering method on it. Although these clustering procedures can group the documents satisfactorily, it is still difficult for them to capture the meanings of the documents, since there is no adequate interpretation for each document cluster. Clustering takes place according to the similarity or dissimilarity values.

IV. EXPERIMENTAL RESULTS

The proposed method is developed using Java. In this paper, various parameters [16] are used to show the performance of the HFRECCA algorithm. Here $C_i$ represents class $i$, $w_j$ denotes cluster $j$, and $P_j$ indicates the purity of cluster $j$.

A. Partition Entropy Coefficient (PE)

Various unsupervised evaluation measures have been defined, but most are only applicable to clusters represented using prototypes. Two exceptions are the Partition Coefficient (PC) and the closely related Partition Entropy Coefficient, the latter of which is defined as

$PE = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{L} u_{ij}\,\log_a u_{ij}$    (1)

Here $u_{ij}$ represents the membership of instance $i$ in cluster $j$. The value of this index ranges from 0 to $\log_a L$; the closer the value is to 0, the crisper the clustering. The highest value is obtained when all of the $u_{ij}$ are equal.

B. Purity

Two widely used external clustering evaluation criteria are purity and entropy. The purity of a cluster is the fraction of the cluster size taken up by the largest class of objects assigned to it:

$P_j = \frac{1}{|w_j|}\,\max_i\,|w_j \cap C_i|$    (2)

Overall purity is the weighted average of the individual cluster purities:

$\text{Overall Purity} = \frac{1}{N}\sum_{j=1}^{L} |w_j| \times P_j$    (3)

C. Entropy

The entropy of a cluster $j$ is a measure of how mixed the objects within the cluster are, and is defined as

$E_j = -\frac{1}{\log C}\sum_{i=1}^{C} \frac{|w_j \cap c_i|}{|w_j|}\,\log\frac{|w_j \cap c_i|}{|w_j|}$    (4)

Overall entropy is the weighted average of the individual cluster entropies:

$\text{Overall Entropy} = \frac{1}{N}\sum_{j=1}^{L} |w_j| \times E_j$    (5)

D. V-Measure

The shortcomings of purity and entropy are overcome by the V-measure, also known as the Normalized Mutual Information (NMI), which is defined as the harmonic mean of homogeneity ($h$) and completeness ($c$):

$V = \frac{2hc}{h + c}$    (6)

The performance comparison table is given below.

Table 1: Performance comparison
Parameter          | Existing method | Proposed method
Partition entropy  | 0.01232         | 0.01154
Purity             | 0.02201         | 0.055075
Entropy            | 0               | 0.000944
V-measure          | 0.00229         | 0.000322

Table 1 shows the performance comparison between the existing and proposed methods on four parameters: partition entropy, purity, entropy, and V-measure.

Figure 4: Partition entropy comparison between HFRECCA and FRECCA
Figure 4 shows the comparison of the partition entropy of HFRECCA and FRECCA. The partition entropy of HFRECCA is 0.012 and that of FRECCA is 0.011; HFRECCA's partition entropy is higher than that of the existing FRECCA.

Figure 5: Purity comparison between HFRECCA and FRECCA
Figure 5 illustrates that HFRECCA performs better than FRECCA as measured by the external cluster evaluation criteria. The purity of HFRECCA is 0.0550 and the purity of FRECCA is 0.0250; HFRECCA's purity is higher than FRECCA's. Using the HFRECCA algorithm, the results are displayed according to the user's query.

Figure 7 shows the performance comparison for the V-measure; the performance of HFRECCA is lower than that of the existing FRECCA.
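The evaluation measures above can be sketched in a few lines. This is an illustrative sketch on toy data — the helper names and the example labels are assumptions of this sketch, not from the paper's Java implementation.

```python
# Sketches of purity (Eq. 2), per-cluster entropy (Eq. 4), and the
# partition entropy coefficient (Eq. 1) on toy data.
import math

def purity(labels):
    """Fraction of one cluster taken up by its largest class."""
    return max(labels.count(k) for k in set(labels)) / len(labels)

def cluster_entropy(labels, n_classes):
    """How mixed the class labels inside one cluster are, in [0, 1]."""
    e = 0.0
    for k in set(labels):
        frac = labels.count(k) / len(labels)
        e -= frac * math.log(frac)
    return e / math.log(n_classes)

def partition_entropy(memberships):
    """PE over a fuzzy membership matrix u[i][j]; 0 means crisp."""
    n = len(memberships)
    return -sum(u * math.log(u)
                for row in memberships for u in row if u > 0) / n

cluster = ["A", "A", "A", "B"]   # gold-standard labels of one cluster
print(purity(cluster))                         # → 0.75
print(round(cluster_entropy(cluster, 2), 3))   # → 0.811
print(partition_entropy([[1.0, 0.0], [0.0, 1.0]]) == 0)  # crisp → True
```

A pure, unmixed cluster would give purity 1 and entropy 0; the crisp membership matrix in the last line shows why a low PE indicates crisper (less fuzzy) clustering.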
Figure 6: Entropy comparison between HFRECCA and FRECCA
Figure 6 describes that the entropy of HFRECCA is 0, while the entropy of FRECCA, at 0.000500, is higher.

Figure 7: V-measure comparison between HFRECCA and FRECCA

In the existing system, applying the algorithm to sentence clustering tasks demonstrated that it is capable of identifying overlapping clusters of semantically related sentences, and that it is therefore of potential use in a variety of text mining tasks. The proposed method is well suited to displaying results for the famous-quotations data set. Using HFRECCA, each quotation is compared and clustered with the related data from the entire quotation data set, and the result is provided as structured data. HFRECCA is used to improve the accuracy and effectiveness of the suggested query.

V. CONCLUSION

The famous-quotations data set has a hierarchical structure; hence the extension of FRECCA, the Hierarchical Fuzzy Relational Clustering Algorithm (HFRECCA), is useful for clustering the sentences. Comparisons with the FRECCA algorithm on these data sets suggest that HFRECCA is capable of identifying softer clusters without sacrificing performance as evaluated by external measures. This paper supports these results through the Partition Entropy Coefficient, purity, entropy, and V-measure. In future work, fuzzy soft k-means and genetic k-means clustering algorithms can be applied.

REFERENCES
[1] J.D. Holt, S.M. Chung, Y. Li, "Usage of mined word associations for text retrieval," in Proc. of IEEE Int'l Conf. on Tools with Artificial Intelligence (ICTAI 2007), vol. 2, 2007, pp. 45–49.
[2] Y. Li, S.M. Chung, J.D. Holt, "Text document clustering based on frequent word meaning sequences," Data and Knowledge Engineering, vol. 64, no. 1, 2008, pp. 381–404.
[3] Y. Li, C. Luo, S.M. Chung, "Text clustering with feature selection by using statistical data," IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 5, 2008, pp. 641–652.
[4] C.J. van Rijsbergen, Information Retrieval, 2nd ed., Butterworths, London, 1979.
[5] A.K. Jain, R.C. Dubes, Algorithms for Clustering Data, Prentice Hall, Englewood Cliffs, 1988.
[6] L. Kaufman, P.J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley & Sons, 1990.
[7] B. Larsen, C. Aone, "Fast and effective text mining using linear-time document clustering," in Proc. of ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining, 1999, pp. 16–22.
[8] C. Ordonez, E. Omiecinski, "Efficient disk-based k-means clustering for relational databases," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 8, 2004, pp. 909–921.
[9] A. Budanitsky, G. Hirst, "Evaluating WordNet-based measures of lexical semantic relatedness," Computational Linguistics, vol. 32, no. 1, 2006, pp. 13–47.
[10] Z. Zhang, A. Gentile, F. Ciravegna, "Recent advances in methods of lexical semantic relatedness – a survey," Natural Language Engineering, doi:10.1017/S1351324912000125.
[11] A. Hotho, S. Staab, G. Stumme, "WordNet improves text document clustering," in Proceedings of the SIGIR 2003 Semantic Web Workshop, 2003, pp. 541–544.
[12] J. Sedding, D. Kazakov, "WordNet-based text document clustering," in Proceedings of the 3rd Workshop on Robust Methods in Analysis of Natural Language Data (ROMAND '04), Association for Computational Linguistics, Stroudsburg, PA, USA, 2004, pp. 104–113.
[13] K. Wang, C. Xu, B. Liu, "Clustering transactions using large items," School of Computing, National University of Singapore.
[14] R.J. Hathaway, "Relational dual of the c-means clustering algorithms," Pattern Recognition, vol. 22, no. 2, 1989, pp. 205–212.
[15] Mario G.C.A., "A novel approach to fuzzy clustering based on a dissimilarity relation extracted from data using a TS system," Rev. Stat. Appliqué, vol. XIX(2), pp. 19–34, 1975.
[16] A. Skabar, K. Abdalgader, "Clustering sentence-level text using a novel fuzzy relational clustering algorithm," IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 1, January 2013.