DOI 10.4010/2016.1898
ISSN 2321 3361 © 2016 IJESC
Research Article
Volume 6 Issue No. 7
PageRank Technique Along With Probability-Maximization in
Sentence-Clustering
M.Lakshmi Tara1, Dr.G.V.S.N.R.V.Prasad2
M.Tech Scholar1, Professor and Dean-A.A2
Department of CSE
Gudlavalleru Engineering College, Gudlavalleru, AP, India
[email protected]
Abstract:
The cosine similarity coefficient, a measure commonly used in clustering, quantifies the similarity between groups. FRECCA's use of the cosine similarity coefficient greatly increases its time complexity, so here the cosine coefficient is replaced with the Jaro-Winkler similarity measure for cluster similarity matching. Jaro-Winkler does a better job of determining the similarity of strings because it takes the order of characters into account, using positional indexes to estimate relevance. It is expected that a Jaro-Winkler-driven FRECCA offers enhanced performance on one-to-many data linkages compared with the cosine-driven FRECCA. In contrast to hard clustering schemes, in which a pattern belongs to a single cluster, fuzzy clustering techniques permit patterns to belong to all clusters with differing degrees of membership. This is significant in the sentence clustering domain, where a sentence may relate to more than one subject or topic within a document or set of documents. Classical fuzzy clustering strategies based on mixtures of Gaussians are not applicable to sentence clustering, because most sentence similarity measures do not represent sentences in a common metric space. An improved fuzzy clustering technique is therefore proposed that works on relational input data, taking the pairwise similarities among data objects in a square matrix format. The technique uses a graph representation of the data and interprets the graph centrality of an object as a likelihood. Finally, the results demonstrate the capability of identifying overlapping clusters of semantically related sentences, so the technique can be used for text mining and applied to benchmark data sets in various domains.
Keywords: sentence clustering, graph representation, fuzzy clustering, graph centrality, text mining.
I INTRODUCTION
Sentence clustering plays a vital role in many text-processing activities. Whatever the specific task, whether document summarization or text mining, most documents contain interrelated topics or themes, and many sentences are related to several of these to some degree. This paper concerns the application of such fuzzy relationships to sentence clustering, which increases the breadth and scope of problems that clustering can address. Clustering is comparatively easy when applied to larger segments of text such as documents, but challenging at the sentence level, so these two levels of clustering need to be examined separately [1]. In Information Retrieval, document-level clustering of text is well established: documents are treated as data points represented in a high-dimensional vector space in which each dimension corresponds to a unique keyword, yielding a table whose rows are documents and whose columns are properties of those documents. Data represented in this way is called "attribute data" and is amenable to clustering using a wide range of techniques.
Prototype-based techniques such as K-Means, ISODATA [2], and Fuzzy c-Means, together with the closely related mixture-model strategy, are applicable to attribute data because the data lie in a metric space; all of these techniques represent clusters in terms of parameters such as means and covariances.
International Journal of Engineering Science and Computing, July 2016
Alternatively, pairwise similarities or dissimilarities between data points can be computed from the attribute data using a measure such as cosine similarity, and relational clustering techniques such as Affinity Propagation and Spectral Clustering [3] can then be applied; a wide range of hierarchical clustering techniques has also been proposed. For document-level text the vector-space approach has been successful in IR, because semantically related documents are likely to share many words and therefore appear similar under vector-space measures, such as cosine similarity, that are based on word co-occurrence. Measuring similarity in terms of word co-occurrence is thus valid at the document level, but it does not work for small units of text such as sentences. To overcome this difficulty, rather than representing sentences in a common vector space, these measures define sentence similarity as some function of inter-sentence word-to-word similarities [4]. Many researchers have proposed algorithms and programming techniques along these lines, but all have drawbacks, and these results motivate an innovative fuzzy relational clustering technique. As in the mixture-model strategy, the data is structured as a combination of components; however, instead of any external density model, a graph-based representation of the data is used in which nodes are the objects and weighted edges encode the similarity between objects. By applying the PageRank technique to each cluster and interpreting the PageRank score of an object within a cluster as a likelihood, the Probability-Maximization methodology [5] is used to identify the model parameters. The result is a fuzzy relational clustering technique that can be applied in any domain where the relationships among objects are expressed as pairwise similarities.
II. EXISTING SYSTEM
This section begins with a discussion of the PageRank technique as a general graph centrality measure and of the mixture-model strategy. It then describes how PageRank is used within a Probability-Maximization framework to build the proposed relational fuzzy clustering algorithm, named the Fuzzy Relational Eigenvector Centrality-based Clustering Technique because PageRank centrality can be viewed as a special case of eigenvector centrality. Finally, problems related to convergence, duplicate clusters, and implementation issues are discussed.
A. Description of PageRank Centrality: PageRank is fundamentally a centrality measure: the importance of a node within a graph depends on global information computed recursively from the entire graph, with connections to high-scoring nodes contributing more to a node's score than connections to low-scoring nodes. Each node in a directed graph is assigned a numerical value between 0 and 1, called its PageRank score. For a directed weighted graph with damping factor d, vertices v_i and v_j, and affinity-matrix weights w_jk, the PageRank score can be calculated iteratively as

PR(v_i) = (1 - d)/N + d * Σ_j [ w_ji / Σ_k w_jk ] * PR(v_j)

where N is the number of nodes and the outer sum runs over the nodes v_j that link to v_i.
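As an illustration, the update above can be computed by simple power iteration. The following is a minimal sketch (not the authors' code), assuming an affinity matrix W in which W[j][i] is the weight of the edge from node j to node i (symmetric for an undirected similarity graph):

```python
def pagerank_weighted(W, d=0.85, tol=1e-8, max_iter=200):
    """Power iteration for weighted PageRank on an affinity matrix W.

    W[j][i] is the weight of the edge from node j to node i.
    With no dangling nodes, the returned scores sum to 1.
    """
    n = len(W)
    out_sum = [sum(W[j]) for j in range(n)]   # total outgoing weight of each node j
    pr = [1.0 / n] * n                        # uniform starting distribution
    for _ in range(max_iter):
        new = []
        for i in range(n):
            # mass received from every neighbour j, proportional to w_ji
            s = sum(pr[j] * W[j][i] / out_sum[j]
                    for j in range(n) if out_sum[j] > 0)
            new.append((1.0 - d) / n + d * s)
        if max(abs(a - b) for a, b in zip(new, pr)) < tol:
            pr = new
            break
        pr = new
    return pr
```

On a small symmetric similarity graph, the node with the strongest connections receives the highest score, which is exactly the notion of centrality used later to rank sentences within a cluster.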
B. Description of the Gaussian Mixture Method and the Probability-Maximization Technique:
The existing Fuzzy Relational Eigenvector Centrality-based Clustering Technique is motivated by the mixture-model strategy, in which a density is modeled as a linear combination of C component densities p(x|m) in the form p(x) = Σ_m π_m p(x|m), where the mixing coefficients π_m represent the prior probabilities of a data point x having been generated from component m of the mixture.
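For intuition, the mixture density and the resulting per-component memberships ("responsibilities") can be sketched for the one-dimensional Gaussian case; the function names and parameters here are illustrative, not from the paper:

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of a one-dimensional Gaussian N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def mixture_density(x, priors, params):
    """p(x) = sum_m pi_m * p(x | m) for a 1-D Gaussian mixture.

    priors: mixing coefficients pi_m; params: list of (mu, sigma) pairs.
    """
    return sum(pi * gaussian_pdf(x, mu, sigma)
               for pi, (mu, sigma) in zip(priors, params))

def responsibilities(x, priors, params):
    """Posterior p(m | x): the degree of membership of x in each component."""
    joint = [pi * gaussian_pdf(x, mu, sigma)
             for pi, (mu, sigma) in zip(priors, params)]
    total = sum(joint)
    return [j / total for j in joint]
```

The responsibility computation is the same Bayes-rule step the clustering algorithm below performs, except that there the component densities p(x|m) are replaced by graph-centrality scores.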
C. Structure of the Existing System: The existing technique uses the PageRank score of an object as a measure of its centrality within a cluster, which is then treated as a likelihood. To optimize the parameters, namely the cluster membership values and the mixing coefficients, the algorithm uses the Probability-Maximization method.
i) A text file containing sentences is given as input.
ii) Sentence similarity is measured between the sentences using cosine similarity.
iii) FRECCA (Fuzzy Relational Eigenvector Centrality-based Clustering Algorithm) is then applied.
iv) Finally, clusters are formed, with each sentence having a different degree of membership in each. This structure is shown in Fig. 1.
[Fig 1: Structure of the existing technique — sentences (.txt file) → sentence similarity measurement (cosine similarity) → FRECCA algorithm → formation of clusters C1, C2]
FRECCA Algorithm:Input: Pair wise similarity values S = {
where
is the similarity between sentences i and j.
Number of clusters, C.
Output: cluster membership values {
|i=1…N, m=1…C}
i) Initialization:Assuming that initialization of values of cluster membership
are done randomly, and then normalized. Mixing coefficients
are initialized such that priors for all clusters are equal.
//initialize and normalize membership values
for i=1 to N
for m=1 to C
= rnd //random number on [0, 1]
End for
For m=1 to C
//normalize
End for
End for
For m=1 to C
//equal priors
end for
ii) E-step (Expectation step):The Page Rank value of the object in each cluster is calculated
by using the E-step.
repeat until convergence
for m=1 to C
//create weighted affinity matrix for cluster
‘m’
for i=1 to N
for j=1 to N
end for
International Journal of Engineering Science and Computing, July 2016
8123
http://ijesc.org/
end for
//calculate Page Rank scores for cluster m
Repeat until convergence
= (1-d) +d *
End repeat
//assign Page Rank scores to likelihoods
End for
//calculate new cluster membership values
For i=1 to N
For m=1 to C
)/
)
End for
End for
iii) M-step (Maximization step):- The M-step is used as a
single step of updating the mixing coefficients based on values
calculated in E-step.
//update mixing coefficients
For m=1 to C
=
End for
End repeat
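One possible reading of the pseudocode above, sketched in Python. It assumes the weighted affinity for cluster m is w_ij = p_im * p_jm * s_ij and that PageRank scores serve directly as likelihoods; fixed iteration counts stand in for the convergence tests, and this is an illustration, not the authors' implementation:

```python
import random

def frecca(S, C, d=0.85, iters=20, seed=0):
    """Sketch of the FRECCA loop: S is an N x N symmetric similarity
    matrix, C the number of clusters.  Returns N x C fuzzy memberships."""
    rng = random.Random(seed)
    N = len(S)
    # i) random memberships, normalized per object; equal priors
    p = [[rng.random() for _ in range(C)] for _ in range(N)]
    for i in range(N):
        t = sum(p[i])
        p[i] = [v / t for v in p[i]]
    mix = [1.0 / C] * C

    for _ in range(iters):
        # ii) E-step: per-cluster PageRank scores serve as likelihoods
        like = [[0.0] * C for _ in range(N)]
        for m in range(C):
            # weighted affinity: edges shrink when either endpoint
            # is unlikely to belong to cluster m
            W = [[p[i][m] * p[j][m] * S[i][j] for j in range(N)]
                 for i in range(N)]
            out = [sum(row) for row in W]
            pr = [1.0 / N] * N
            for _ in range(100):              # inner power iteration
                pr = [(1 - d) / N +
                      d * sum(pr[j] * W[j][i] / out[j]
                              for j in range(N) if out[j] > 0)
                      for i in range(N)]
            for i in range(N):
                like[i][m] = pr[i]
        # membership update: Bayes rule with mixing coefficients as priors
        for i in range(N):
            z = sum(mix[m] * like[i][m] for m in range(C))
            p[i] = [mix[m] * like[i][m] / z for m in range(C)]
        # iii) M-step: mixing coefficients = average membership
        mix = [sum(p[i][m] for i in range(N)) / N for m in range(C)]
    return p
```

Because the PageRank base term (1 - d)/N is strictly positive, every likelihood is positive and the membership normalization never divides by zero; each row of the returned matrix sums to one, which is the fuzzy-membership property the algorithm relies on.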
Drawbacks of the Existing System:
i) The cosine similarity coefficient, the measure commonly used in clustering, measures the similarity between clusters.
ii) FRECCA's use of the cosine similarity coefficient greatly increases its time complexity.
III. PROPOSED SYSTEM
i) The proposed system implements the same algorithm as the existing system, but the "Jaro-Winkler" similarity measure is used instead of cosine similarity.
ii) Cosine similarity is replaced with the Jaro-Winkler similarity measure because of the higher time complexity introduced by the former.
Function of the Jaro-Winkler similarity measure:
Intuition 1: Similarity of the first few letters is most important.
- Let p be the length of the common prefix of x and y.
- sim_winkler(x, y) = sim_jaro(x, y) + (p/10) * (1 - sim_jaro(x, y))
- = 1 if the common prefix length is >= 10
Intuition 2: Longer strings with even more letters in common should score higher.
- sim_winkler,long(x, y) = sim_winkler(x, y) + (1 - sim_winkler(x, y)) * (c - p) / (|x| + |y| - 2p)
- where c is the overall number of common letters.
- Apply only if:
  Long strings: min(|x|, |y|) >= 5
  At least two additional common letters: c - p >= 2
  At least half of the remaining letters of the shorter string are common: c - p >= (min(|x|, |y|) - p)/2
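A sketch of Intuition 1 in Python. One deliberate deviation from the notes above: this version uses the common Winkler defaults (prefix capped at 4 characters, scale 0.1) rather than the uncapped p/10 weighting, and it omits the long-string variant:

```python
def jaro(x, y):
    """Plain Jaro similarity: common characters within a sliding
    window, penalized by the number of transpositions."""
    if not x or not y:
        return float(x == y)
    window = max(max(len(x), len(y)) // 2 - 1, 0)
    mx, my = [False] * len(x), [False] * len(y)
    m = 0
    for i, cx in enumerate(x):
        lo, hi = max(0, i - window), min(len(y), i + window + 1)
        for j in range(lo, hi):
            if not my[j] and y[j] == cx:
                mx[i] = my[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    xs = [c for c, f in zip(x, mx) if f]   # matched chars in x-order
    ys = [c for c, f in zip(y, my) if f]   # matched chars in y-order
    t = sum(a != b for a, b in zip(xs, ys)) // 2   # transpositions
    return (m / len(x) + m / len(y) + (m - t) / m) / 3

def jaro_winkler(x, y, max_prefix=4, scale=0.1):
    """Jaro similarity boosted by the common-prefix length (Intuition 1)."""
    sj = jaro(x, y)
    p = 0
    for a, b in zip(x, y):
        if a != b or p >= max_prefix:
            break
        p += 1
    return sj + p * scale * (1 - sj)
```

For the classic pair "MARTHA"/"MARHTA" this gives a Jaro score of 0.944 (six matches, one transposition) boosted to about 0.961 by the three-letter common prefix.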
IV. DATASET:
In this paper the "Quotations dataset" is taken from "famous quotations.com". Cosine-similarity-based FRECCA and Jaro-Winkler-similarity-based FRECCA are each applied to 1000 quotations from this dataset, about 70 KB in size.
V. PERFORMANCE ANALYSIS:
After applying the cosine-similarity-based FRECCA approach and the Jaro-Winkler-based FRECCA approach to the "Quotations dataset", the following observations are made:

Processing time    Cosine similarity    Jaro-Winkler similarity
                   0.008 secs           0.002 secs
                   0.006 secs           0.003 secs
                   0.003 secs           0.001 secs

Table 1: Comparison between similarity measures based on the processing time of each sentence.
From the table above it is clear that the high time complexity that is a drawback of the cosine-similarity-based FRECCA approach is reduced in the Jaro-Winkler-based FRECCA approach.
No. of clusters    Cosine similarity    Jaro-Winkler similarity
                   5                    2
                   6                    5
                   6                    2

Table 2: Comparison between similarity measures based on the number of clusters.
From the table above it is clear that not only the time complexity but also the cluster quality, in terms of the number of clusters, has improved. Jaro-Winkler thus does a much better job of determining the similarity of strings because it takes character order into account, using positional indexes to estimate relevance.
VI. CONCLUSION
The proposed technique was constructed with the FRECCA algorithm as its motivation. The results show that the technique performs well when benchmarked against the Spectral Clustering and k-Medoids algorithms, and it can also be applied to news articles. Its foremost application is document summarization, and it is used generally in text mining. Low time complexity and low cost are the major advantages of this technique, which can be applied to both relational data and attribute data. Graph-based representation of the results is of interest to researchers, and the main future objective is to develop a hierarchical fuzzy relational clustering technique.
VII. REFERENCES
[1] G. Salton, Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley, 1989.
[2] J.C. Dunn, "A Fuzzy Relative of the ISODATA Process and its Use in Detecting Compact Well-Separated Clusters," J. Cybernetics, vol. 3, no. 3, pp. 32-57, 1973.
[3] T. Geweniger, D. Zühlke, B. Hammer, and T. Villmann, "Fuzzy Variant of Affinity Propagation in Comparison to Median Fuzzy c-Means," Proc. Seventh Int'l Workshop Advances in Self-Organizing Maps, pp. 72-79, 2009.
[4] A. Budanitsky and G. Hirst, "Evaluating WordNet-Based Measures of Lexical Semantic Relatedness," Computational Linguistics, vol. 32, no. 1, pp. 13-47, 2006.
[5] L. Kaufman and P.J. Rousseeuw, "Clustering by Means of Medoids," Statistical Data Analysis Based on the L1 Norm, Y. Dodge, ed., pp. 405-416, North Holland/Elsevier, 1987.