DOI 10.4010/2016.1898   ISSN 2321 3361   © 2016 IJESC   Research Article   Volume 6 Issue No. 7

PageRank Technique Along With Probability-Maximization in Sentence-Clustering
M. Lakshmi Tara (M.Tech Scholar), Dr. G.V.S.N.R.V. Prasad (Professor and Dean-A.A)
Department of CSE, Gudlavalleru Engineering College, Gudlavalleru, AP, India
[email protected]

Abstract: The cosine similarity coefficient, a measure commonly used in clustering, quantifies the similarity between clusters. FRECCA's use of the cosine similarity coefficient greatly increases time complexity, so the cosine similarity coefficient is replaced here with the Jaro-Winkler similarity measure to obtain the cluster similarity matching. Jaro-Winkler does a better job of determining the similarity of strings because it takes the order of characters into account, using positional indexes to estimate relevancy. It is presumed that Jaro-Winkler-driven FRECCA offers enhanced performance on one-to-many data linkages in contrast to cosine-driven FRECCA. Unlike hard clustering schemes, in which a pattern belongs to a single cluster, fuzzy clustering techniques permit patterns to belong to all clusters with differing degrees of membership. This is significant in the sentence-clustering domain, where a sentence may relate to more than one subject or topic in a document or a set of documents. Classical fuzzy clustering strategies based on mixtures of Gaussians are not applicable to sentence clustering, because most sentence similarity measures do not represent sentences in a common metric space. An improved fuzzy clustering technique is therefore proposed that works on relational input data, i.e., pairwise similarities among data objects in a square matrix. The technique uses a graph representation of the data and interprets the graph centrality of an object as a likelihood.
Finally, the results demonstrate the capability of identifying overlapping clusters of semantically related sentences. Hence the technique can serve text mining purposes and can be applied to benchmark data sets in various domains.

Keywords: Sentence clustering, graph representation, fuzzy clustering, graph centrality, text mining.

I. INTRODUCTION
Sentence clustering plays a vital role in many text-processing activities. Regardless of the specific task, whether document summarization or text mining, most documents contain interrelated topics or themes, and many sentences relate to several of these to some degree. This paper describes the application of such fuzzy relationships to sentence clustering, which increases the breadth and scope of problems that can be addressed. Clustering is easy when applied to larger segments of text such as documents; clustering text at the sentence level, however, is challenging, so these two levels of clustering need to be examined separately [1]. Earlier work in Information Retrieval established clustering at the document level, where documents are treated as data points. The information is represented in a high-dimensional vector space in which each dimension corresponds to a unique keyword, yielding a table whose rows are documents and whose columns are properties of those documents. This representation is called "properties data" or "attribute data" and is amenable to clustering with a large range of techniques. Prototype-based techniques such as k-Means, ISODATA [2], Fuzzy c-Means, and the closely related mixture-model strategy are applicable to attribute data because of the metric space; all of these techniques represent clusters in terms of parameters such as means and covariances.
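As a minimal illustration of this attribute-data representation, texts can be mapped to word-count vectors and compared with cosine similarity. The function below is a sketch under those assumptions, not code from the paper:

```python
from collections import Counter
from math import sqrt

def cosine_similarity(s1: str, s2: str) -> float:
    """Cosine similarity between two texts over word-count vectors."""
    a, b = Counter(s1.lower().split()), Counter(s2.lower().split())
    dot = sum(a[w] * b[w] for w in a)          # contribution of shared words
    na = sqrt(sum(v * v for v in a.values()))  # vector norms
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0
```

Identical texts score 1.0, while texts with no words in common score 0.0, which is why this word-co-occurrence measure works better for long documents than for short sentences.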
Using similarity measures such as cosine similarity, pairwise similarities or dissimilarities among data points can be calculated from the attribute data, after which relational clustering techniques such as Affinity Propagation and Spectral Clustering [3] can be applied. A large range of hierarchical clustering techniques has also been presented. For document-level text, the vector-space method has been successful in IR because semantically related documents are likely to have many words in common and are therefore found to be similar according to vector-space measures such as cosine similarity based on word co-occurrence. Measuring similarity in terms of word co-occurrence is thus valid at the document level, but it does not work for short text such as sentences. To overcome this difficulty, instead of representing sentences in a common vector space, such measures define sentence similarity as some function of inter-sentence word-to-word similarities [4]. Many researchers have proposed algorithms and programming techniques along these lines, but they have drawbacks, which motivates the proposal of an innovative fuzzy relational clustering technique. In the mixture-model strategy, the data is structured as a combination of components. Here, without using any external density model, a graph-based representation of the data is used in which nodes represent objects and weighted edges represent the similarities between objects. By applying the PageRank technique to each cluster and interpreting the PageRank score of an object within a cluster as a likelihood, the Probability-Maximization methodology [5] is used to identify the model attributes. The result is a fuzzy relational clustering technique that can be applied to any domain in which the relationships among objects are expressed as pairwise similarities.
II. EXISTING SYSTEM
This paper starts with a discussion of the PageRank technique as a general graph centrality measure and of the mixture-model strategy. We then study how the PageRank technique is used within a Probability-Maximization framework to build the proposed relational fuzzy clustering algorithm. We name it the Fuzzy Relational Eigenvector Centrality-based Clustering Technique, because PageRank centrality can be viewed as a special case of eigenvector centrality. Finally, problems related to convergence, duplicate clusters, and implementation issues are discussed.

A. Description of PageRank Centrality:
The fundamental idea of the PageRank technique is that the importance of a node within a graph is determined from global information recursively computed over the entire graph, with connections to high-scoring nodes contributing more to a node's score than connections to low-scoring nodes. Each node in a directed graph is assigned a numerical score between 0 and 1, called its PageRank score. For a directed weighted graph, the PageRank score can be calculated using the following equation, where v_i and v_j are vertices, w_jk are the weights of the affinity matrix, d is the damping factor, and N is the number of nodes:

    PR(v_i) = (1 - d)/N + d * Σ_j (w_ji / Σ_k w_jk) * PR(v_j)

B. Description of the Gaussian Mixture Method and the Probability-Maximization Technique:
The existing Fuzzy Relational Eigenvector Centrality-based Clustering Technique is motivated by the mixture-model strategy: a density is modeled as a linear combination of C component densities p(x|m) in the form

    p(x) = Σ_{m=1}^{C} π_m p(x|m),

where the π_m are mixing coefficients representing the prior probabilities of a data point x having been generated from component m of the mixture.
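The PageRank recurrence above can be computed by power iteration over a weighted affinity matrix. The sketch below (not the paper's code; the function name and matrix layout are assumptions) uses the (1 - d)/N form so that, on a graph without dangling nodes, the scores sum to 1 and can be read as likelihoods:

```python
def pagerank(W, d=0.85, tol=1e-9, max_iter=1000):
    """Power iteration for PageRank on a weighted graph.

    W[j][i] is the weight of the edge from node j to node i
    (for sentence graphs, W is the symmetric affinity matrix)."""
    n = len(W)
    pr = [1.0 / n] * n                      # uniform starting distribution
    out = [sum(row) for row in W]           # total out-weight of each node
    for _ in range(max_iter):
        new = [(1 - d) / n
               + d * sum(W[j][i] / out[j] * pr[j]
                         for j in range(n) if out[j] > 0)
               for i in range(n)]
        if max(abs(a - b) for a, b in zip(new, pr)) < tol:
            break
        pr = new
    return pr
```

On a symmetric graph with uniform weights every node receives the same score 1/N; asymmetric weights shift score mass toward strongly connected nodes.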
C. Structure of the Existing System: The existing technique uses the PageRank score of an object as a measure of its centrality within a cluster, which is then treated as a likelihood. To optimize the parameters, namely the cluster membership values and the mixing coefficients, the algorithm uses the Probability-Maximization method.
i) A text file containing sentences is given as input.
ii) Sentence similarity is measured between the sentences using cosine similarity.
iii) FRECCA (Fuzzy Relational Eigenvector Centrality-based Clustering Algorithm) is then applied.
iv) Finally, clusters are formed, with sentences having differing degrees of membership.

Fig 1: Structure of the existing technique (sentences from a .txt file, sentence similarity measurement with cosine similarity, the FRECCA algorithm, and the formation of clusters C1, C2).

FRECCA Algorithm:
Input: pairwise similarity values S = {s_ij}, where s_ij is the similarity between sentences i and j, and the number of clusters, C.
Output: cluster membership values {u_im | i = 1...N, m = 1...C}.

i) Initialization: the cluster membership values are initialized randomly and then normalized, and the mixing coefficients are initialized so that the priors of all clusters are equal.

// initialize and normalize membership values
for i = 1 to N
    for m = 1 to C
        u_im = rnd()                    // random number on [0, 1]
    end for
    for m = 1 to C
        u_im = u_im / Σ_l u_il          // normalize
    end for
end for
for m = 1 to C
    π_m = 1/C                           // equal priors
end for

ii) E-step (Expectation step): the PageRank value of each object in each cluster is calculated in the E-step.
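A minimal, self-contained sketch of the E/M loop described here follows. The variable names and the exact form of the membership-weighted affinity matrix are my assumptions; the paper's own implementation may differ:

```python
import random

def _pagerank(W, d=0.85, iters=100):
    """Plain power iteration used as the per-cluster centrality score."""
    n = len(W)
    pr = [1.0 / n] * n
    out = [sum(row) for row in W]
    for _ in range(iters):
        pr = [(1 - d) / n + d * sum(W[j][i] / out[j] * pr[j]
                                    for j in range(n) if out[j] > 0)
              for i in range(n)]
    return pr

def frecca(S, C, d=0.85, iters=25):
    """EM-style fuzzy relational clustering sketch.

    S       : N x N pairwise sentence-similarity matrix
    C       : number of clusters
    u[i][m] : membership of sentence i in cluster m"""
    N = len(S)
    u = [[random.random() for _ in range(C)] for _ in range(N)]
    for i in range(N):                    # normalize memberships per sentence
        t = sum(u[i])
        u[i] = [x / t for x in u[i]]
    pi = [1.0 / C] * C                    # equal priors
    for _ in range(iters):
        # E-step: per-cluster PageRank scores act as likelihoods
        L = []
        for m in range(C):
            # membership-weighted affinities for cluster m (assumed form)
            W = [[u[i][m] * u[j][m] * S[i][j] for j in range(N)]
                 for i in range(N)]
            L.append(_pagerank(W, d))
        for i in range(N):
            z = sum(pi[m] * L[m][i] for m in range(C))
            u[i] = [pi[m] * L[m][i] / z for m in range(C)]
        # M-step: update mixing coefficients from the new memberships
        pi = [sum(u[i][m] for i in range(N)) / N for m in range(C)]
    return u
```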
repeat until convergence
    for m = 1 to C
        // create the weighted affinity matrix for cluster m
        for i = 1 to N
            for j = 1 to N
                W_ij = u_im * u_jm * s_ij
            end for
        end for
        // calculate PageRank scores for cluster m
        repeat until convergence
            PR_m(i) = (1 - d)/N + d * Σ_j (W_ji / Σ_k W_jk) * PR_m(j)
        end repeat
        // assign PageRank scores to likelihoods
        L_m(i) = PR_m(i)
    end for
    // calculate new cluster membership values
    for i = 1 to N
        for m = 1 to C
            u_im = (π_m * L_m(i)) / Σ_l (π_l * L_l(i))
        end for
    end for

iii) M-step (Maximization step): the M-step performs a single update of the mixing coefficients based on the values calculated in the E-step.

    // update mixing coefficients
    for m = 1 to C
        π_m = (1/N) * Σ_i u_im
    end for
end repeat

Drawbacks of the Existing System:
i) The cosine similarity coefficient, a measure commonly used in clustering, measures the similarity between clusters.
ii) FRECCA's use of the cosine similarity coefficient greatly increases the time complexity.

III. PROPOSED SYSTEM
i) The proposed system implements the same algorithm as the existing system, but the Jaro-Winkler similarity measure is used instead of cosine similarity.
ii) Cosine similarity is replaced with the Jaro-Winkler similarity measure because of the higher time complexity introduced by the former.

Function of the Jaro-Winkler similarity measure:
Intuition 1: similarity of the first few letters is most important.
- Let p be the length of the common prefix of x and y.
- sim_winkler(x, y) = sim_jaro(x, y) + (p/10) * (1 - sim_jaro(x, y))
- sim_winkler(x, y) = 1 if the common prefix is >= 10 characters.
Intuition 2: longer strings with even more common letters.
- sim_winkler,long(x, y) = sim_winkler(x, y) + (1 - sim_winkler(x, y)) * (c - p) / (|x| + |y| - 2p)
- where c is the overall number of common letters.
- Apply only if:
  - the strings are long: min(|x|, |y|) >= 5;
  - there are at least two additional common letters: c - p >= 2;
  - at least half of the remaining letters of the shorter string are common: c - p >= (min(|x|, |y|) - p) / 2.

IV. DATASET: In this paper, the "Quotations dataset" is taken from "famous quotations.com".
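The two intuitions above can be sketched as follows. This is a standard Jaro/Jaro-Winkler implementation, not the paper's code, and it uses the common convention of capping the rewarded prefix at four characters with a 0.1 scaling factor (rather than the p/10-up-to-10 variant described above):

```python
def jaro(x: str, y: str) -> float:
    """Jaro similarity: matches within a sliding window, penalized by transpositions."""
    if x == y:
        return 1.0
    if not x or not y:
        return 0.0
    window = max(len(x), len(y)) // 2 - 1
    mx, my = [False] * len(x), [False] * len(y)
    m = 0
    for i, cx in enumerate(x):                     # count matching characters
        lo, hi = max(0, i - window), min(len(y), i + window + 1)
        for j in range(lo, hi):
            if not my[j] and y[j] == cx:
                mx[i] = my[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    t, k = 0, 0                                    # count half-transpositions
    for i in range(len(x)):
        if mx[i]:
            while not my[k]:
                k += 1
            if x[i] != y[k]:
                t += 1
            k += 1
    t //= 2
    return (m / len(x) + m / len(y) + (m - t) / m) / 3

def jaro_winkler(x: str, y: str, scale: float = 0.1) -> float:
    """Jaro similarity boosted by the common prefix (capped at 4 characters)."""
    sj = jaro(x, y)
    p = 0
    for a, b in zip(x, y):
        if a != b or p == 4:
            break
        p += 1
    return sj + p * scale * (1 - sj)
```

For the classic record-linkage example, jaro("MARTHA", "MARHTA") is about 0.944 and the shared "MAR" prefix lifts the Jaro-Winkler score to about 0.961.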
On this dataset, cosine-similarity-based FRECCA and Jaro-Winkler-similarity-based FRECCA are each applied to 1000 quotations of size 70 KB.

V. PERFORMANCE ANALYSIS: After applying the cosine-similarity-based FRECCA approach and the Jaro-Winkler-based FRECCA approach to the "Quotations dataset", the following observations are made:

Processing time per sentence:
    Cosine similarity:       0.008 s   0.006 s   0.003 s
    Jaro-Winkler similarity: 0.002 s   0.003 s   0.001 s
Table 1: Comparison between the similarity measures based on the processing time of each sentence.

From Table 1 it is clear that the high processing time, a drawback of the cosine-similarity-based FRECCA approach, is reduced in the Jaro-Winkler-based FRECCA approach.

No. of clusters:
    Cosine similarity:       5   6   6
    Jaro-Winkler similarity: 2   5   2
Table 2: Comparison between the similarity measures based on the number of clusters.

From Table 2 it is clear that not only the time complexity but also the cluster quality, in terms of the number of clusters, has improved. Thus Jaro-Winkler does a much better job of determining the similarity of strings because it takes character order into account, using positional indexes to estimate relevancy.

VI. CONCLUSION
The proposed technique was constructed with the motivation of the FRECCA algorithm. The results show that the technique achieves effective performance with Spectral Clustering and k-Medoids algorithms as benchmarks, and it can also be applied to news articles. Its foremost application is document summarization, and it is generally useful in text mining. Low time complexity and low cost are its major advantages, and it can be applied to both relational data and attribute data. Graph-based representation of the results is of interest to researchers. The main future objective is to develop a hierarchical fuzzy relational clustering technique.

VII. REFERENCES
[1] G.
Salton, Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley, 1989.
[2] J.C. Dunn, "A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated Clusters," J. Cybernetics, vol. 3, no. 3, pp. 32-57, 1973.
[3] T. Geweniger, D. Zühlke, B. Hammer, and T. Villmann, "Fuzzy Variant of Affinity Propagation in Comparison to Median Fuzzy c-Means," Proc. Seventh Int'l Workshop on Advances in Self-Organizing Maps, pp. 72-79, 2009.
[4] A. Budanitsky and G. Hirst, "Evaluating WordNet-Based Measures of Lexical Semantic Relatedness," Computational Linguistics, vol. 32, no. 1, pp. 13-47, 2006.
[5] L. Kaufman and P.J. Rousseeuw, "Clustering by Means of Medoids," Statistical Data Analysis Based on the L1-Norm, Y. Dodge, ed., pp. 405-416, North Holland/Elsevier, 1987.