Download document 8900388

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Expectation–maximization algorithm wikipedia , lookup

K-nearest neighbors algorithm wikipedia , lookup

Nonlinear dimensionality reduction wikipedia , lookup

K-means clustering wikipedia , lookup

Nearest-neighbor chain algorithm wikipedia , lookup

Cluster analysis wikipedia , lookup

Transcript
American Journal of Sustainable Cities and Society
Available online on http://www.rspublication.com/ajscs/ajsas.html
Issue 3, Vol. 1 January 2014
ISSN 2319 – 7277
Clustering Sentence-Level Text Using a Hierarchical Fuzzy Relational Clustering
Algorithm
AS. Rosemary,
PG student,
Prist University, Pondicherry.
Miss. J.R.Thresphine,
Assistant Professor,
Prist University, Puducherry.
Abstract— In comparison with hard clustering methods, in which a pattern belongs to a
single cluster, hierarchical fuzzy clustering algorithms allow patterns to belong to all clusters
with differing degrees of membership. This is important in domains such as sentence
clustering, since a sentence is likely to be related to more than one theme or topic present
within a document or set of documents. However, because most sentence similarity measures
do not represent sentences in a common metric space, conventional hierarchical fuzzy
clustering approaches based on prototypes or mixtures of Gaussians are generally not
applicable to sentence clustering.
Keywords—hierarchical relational clustering; language processing;
I.
INTRODUCTION
SENTENCE clustering plays an important role in many text processing activities. For
example, various authors have argued that incorporating sentence clustering into extractive
multi document summarization helps avoid problems of content overlap, leading to better
coverage. However, sentence clustering can also be used within more general text mining
tasks. For example, consider web mining, where the specific objective might be to discover
some novel information from a set of documents initially retrieved in response to some query.
By clustering the sentences of those documents we would intuitively expect at least one of the
clusters to be closely related to the concepts described by the query terms; however, other
clusters may contain information pertaining to the query in some way hitherto unknown to us,
and in such a case we would have successfully mined new information.
Irrespective of the specific task (e.g., summarization, text mining, etc.), most documents
will contain interrelated topics or themes, and many sentences will be related to some degree
to a number of these. In this paper is motivated by the belief that successfully being able to
capture such fuzzy relationships will lead to an increase in the breadth and scope of problems
to which sentence clustering can be applied. However, clustering text at the sentence level
poses specific challenges not present when clustering larger segments of text, such as
documents.
We now highlight some important differences between clustering at these two levels, and
examine some existing approaches to hierarchical clustering. Clustering text at the document
level is well established in the Information Retrieval (IR) literature, where documents are
typically represented as data points in a high dimensional vector space in which each
dimension corresponds to a unique keyword, leading to a rectangular representation in which
rows represent documents and columns represent attributes of those documents (e.g., tf-idf
R S. Publication, [email protected]
Page 404
American Journal of Sustainable Cities and Society
Available online on http://www.rspublication.com/ajscs/ajsas.html
Issue 3, Vol. 1 January 2014
ISSN 2319 – 7277
values of the keywords). This type of data, which we refer to as “attribute data,” is amenable
to clustering by a large range of algorithms. Since data points lie in a metric space, we can
readily apply prototype-based algorithms such as k-Means , Isodata , Fuzzy c-Means (FCM) ,
and the closely related mixture model approach , all of which represent clusters in terms of
parameters such as means and covariances, and therefore assume a common metric input
space. Since pairwise similarities or dissimilarities between data points can readily be
calculated from the attribute data using similarity measures such as cosine similarity, we can
also apply relational clustering algorithms such as Spectral Clustering and Affinity
Propagation, which take input data in the form of a square matrix W ¼ fwijg where wij is the
relationship between the ith and jth data object. To distinguish it from attribute data, we refer
to such data as “relational data.” A broad range of hierarchical clustering algorithms can also
be applied.
The vector space model has been successful in IR because it is able to adequately capture
much of the semantic content of document-level text. This is because documents that are
semantically related are likely to contain many words in common, and thus are found to be
similar according to popular vector space measures such as cosine similarity, which are based
on word co-occurrence.
II. PROBLEM DEFINITION
Data mining is the computational process of discovering patterns in large datasets involving
methods at the intersection of artificial intelligence, machine learning, statistics and data base
systems. The goal of data mining process is to extract information from a dataset and
transform it into understandable format. Data mining functionalities are used to specify the
kind of patterns to be found in data mining tasks. It can be classified into two categories such
as descriptive and predictive. Descriptive mining tasks characterize the general properties of
the data in database. Predictive mining tasks perform inference on the current data in order to
make predictions. It is important to have a data mining system that can mine multiple kinds of
patterns to accommodate different user expectation or application. Data mining system should
be able to discover patterns at various granularities i.e. different levels of abstraction.
III. SYSTEM ANALYSIS
a. Existing System:
Clustering text at the document level is well establishedin the Information Retrieval (IR)
literature, where documents are typically represented as data points in a high
dimensionalvector space in which each dimension corresponds to a unique keyword, leading
to a rectangular representation in which rows represent documents and columns represent
attributes of those documents. The vector space model has been successful in IR because it is
able to adequately capture much of the semantic content of document-level text. This is
because documents that are semantically related are likely to contain many words in common,
and thus are found to be similar according to popular vector space measures such as cosine
similarity, which are based on word co-occurrence. However, while the assumption that
(semantic) similarity can be measured in terms of word co-occurrence may be valid at the
level, the assumption does not hold for small-sized text fragments such as sentences, since two
sentences may be semantically related despite having few, if any, words in common.
b. Proposed System:
This section presents the proposed hierarchical fuzzy relational clustering algorithm.
We first describe the use of Page Rank as a general graph centrality measure, and review the
Gaussian mixture model approach. We then describe how Page Rank can be used within an
R S. Publication, [email protected]
Page 405
American Journal of Sustainable Cities and Society
Available online on http://www.rspublication.com/ajscs/ajsas.html
Issue 3, Vol. 1 January 2014
ISSN 2319 – 7277
Expectation-Maximization framework to construct a complete relational fuzzy clustering
algorithm.
Fig:- Text mining process.
The final section discusses issues relating to convergence, duplicate clusters, and
various other implementation issues. Since Page Rank centrality can be viewed as a special
case of eigenvector centrality, we name the algorithm Fuzzy Relational Eigenvector
Centrality-based Clustering Algorithm (FRECCA).
IV FEASIBILTY STUDY

Anomaly detection, the identification of unusual data records that might be interesting or
data errors that require further investigation.

Association rule learning, searches for relationships between variables.

Clustering, is the task of discovering groups and structures in the data that are in some
way or another similar without using known structures in the data.

Classification, is the task of generalizing known structure to apply to new data.

Regression, attempts to find a function which models the data with the least error.
A.
Summarization, providing a more compact representation of the data set, including
visualization and report generationAbbreviations and Acronyms.
R S. Publication, [email protected]
Page 406
American Journal of Sustainable Cities and Society
Available online on http://www.rspublication.com/ajscs/ajsas.html
Issue 3, Vol. 1 January 2014
ISSN 2319 – 7277
ALGORITHM USED:
Fig:- FRECCA Algorithm.
R S. Publication, [email protected]
Page 407
American Journal of Sustainable Cities and Society
Available online on http://www.rspublication.com/ajscs/ajsas.html
Issue 3, Vol. 1 January 2014
ISSN 2319 – 7277
This section presents the proposed clustering algorithm. Fuzzy Relational Eigenvector
Centrality-based hierarchical Clustering Algorithm (FRECCA). We first describe the use
of PageRank as a general graph centrality measure, and review the Gaussian mixture
model approach. We then describe how PageRank can be used within an ExpectationMaximization framework to construct a complete relational hierarchical clustering
algorithm. The final section discusses issues relating to convergence, duplicate clusters,
and various other implementation issues.
CONCLUSION
Graph-based methods are an exciting area of research within the pattern recognition
community. We have already mentioned some of the new work we are conducting in this
area; however, what we are most excited about is extending the technique to perform
hierarchical clustering. The concepts present in natural language documents usually display
some type of hierarchical structure, whereas the algorithm we have presented in this paper
identifies only flat clusters. Our main future objective is to extend these ideas to the
development of a hierarchical fuzzy relational clustering algorithm.
REFERENCES
[1]
Andrew saba[2013],” clustering sentence-level text using a novel fuzzy relational
clustering algorithm”,ieee transactions on knowledge and data engineering,vol 25,no
1,january 2013.
[2] Radamihalcea and courtneycorley[2012],” corpus-based and knowledge-based measures
of text semantic similarity” ieee transactions on national language engineering,vol 8,no 4 may
2012.
[3] Alexander budanitsky, graemehirst [2011],”evaluatingwordnet-based measures of lexical
semantic relatedness”ieee transactions on multimedia systems,vol 5 no 7 march 2011.
[4] Marina meila, jingoshi[2012],” learning segmentation by random walks” future
generation computer systems,vol 9,no 5 Elsevier 2012.
[5] Stella x. yu, jianboshi[2011],”multiclass spectral clustering”ieee transactions on
computer aided design of integrated circuit and systems,vol 12,no 7,december 2011.
[6] Dingdingwang, tao li [2012],”multi-document summarization via sentence-levelsemantic
analysis and symmetric matric factorization”ieee transactions on pattern recognition and
machine learningvol 6,no.5.august 2012.
[7]
Jianboshi and jitendramalik[2013],”normalized cuts and image segmentation”ieee
transactions on pattern analysis and machine intelligence”,vol 22,no.8,august 2013
[8] Yuhua li, davidmclean, zuhair a. bandar, james d. o’shea, and keeleycrockett’[2012],”
sentence similarity based on semantic nets and corpus statistics”,ieee transactions on
knowledge and data engineering,vol 18,no 8,august 2012.
[9]
Andrewrosenberg, juliahirschberg [2011], “v-measure: a conditional entropy-based
external cluster evaluation measure”ieee transactions on knowledge and data engineering vol
20,no.3 september 2011.
[10] Andrew y. ng, michaeli.jordan, yairweiss[2011], “on spectral clustering :analysis and an
algorithm”,ieee transactions on neural information processing system”vol 15 no 12 november
2011.
R S. Publication, [email protected]
Page 408