Download document 8900388

American Journal of Sustainable Cities and Society Available online on http://www.rspublication.com/ajscs/ajsas.html Issue 3, Vol. 1 January 2014 ISSN 2319 – 7277 Clustering Sentence-Level Text Using a Hierarchical Fuzzy Relational Clustering Algorithm AS. Rosemary, PG student, Prist University, Pondicherry. Miss. J.R.Thresphine, Assistant Professor, Prist University, Puducherry. Abstract— In comparison with hard clustering methods, in which a pattern belongs to a single cluster, hierarchical fuzzy clustering algorithms allow patterns to belong to all clusters with differing degrees of membership. This is important in domains such as sentence clustering, since a sentence is likely to be related to more than one theme or topic present within a document or set of documents. However, because most sentence similarity measures do not represent sentences in a common metric space, conventional hierarchical fuzzy clustering approaches based on prototypes or mixtures of Gaussians are generally not applicable to sentence clustering. Keywords—hierarchical relational clustering; language processing; I. INTRODUCTION SENTENCE clustering plays an important role in many text processing activities. For example, various authors have argued that incorporating sentence clustering into extractive multi document summarization helps avoid problems of content overlap, leading to better coverage. However, sentence clustering can also be used within more general text mining tasks. For example, consider web mining, where the specific objective might be to discover some novel information from a set of documents initially retrieved in response to some query. By clustering the sentences of those documents we would intuitively expect at least one of the clusters to be closely related to the concepts described by the query terms; however, other clusters may contain information pertaining to the query in some way hitherto unknown to us, and in such a case we would have successfully mined new information. Irrespective of the specific task (e.g., summarization, text mining, etc.), most documents will contain interrelated topics or themes, and many sentences will be related to some degree to a number of these. In this paper is motivated by the belief that successfully being able to capture such fuzzy relationships will lead to an increase in the breadth and scope of problems to which sentence clustering can be applied. However, clustering text at the sentence level poses specific challenges not present when clustering larger segments of text, such as documents. We now highlight some important differences between clustering at these two levels, and examine some existing approaches to hierarchical clustering. Clustering text at the document level is well established in the Information Retrieval (IR) literature, where documents are typically represented as data points in a high dimensional vector space in which each dimension corresponds to a unique keyword, leading to a rectangular representation in which rows represent documents and columns represent attributes of those documents (e.g., tf-idf R S. Publication, [email protected] Page 404 American Journal of Sustainable Cities and Society Available online on http://www.rspublication.com/ajscs/ajsas.html Issue 3, Vol. 1 January 2014 ISSN 2319 – 7277 values of the keywords). This type of data, which we refer to as “attribute data,” is amenable to clustering by a large range of algorithms. Since data points lie in a metric space, we can readily apply prototype-based algorithms such as k-Means , Isodata , Fuzzy c-Means (FCM) , and the closely related mixture model approach , all of which represent clusters in terms of parameters such as means and covariances, and therefore assume a common metric input space. Since pairwise similarities or dissimilarities between data points can readily be calculated from the attribute data using similarity measures such as cosine similarity, we can also apply relational clustering algorithms such as Spectral Clustering and Affinity Propagation, which take input data in the form of a square matrix W ¼ fwijg where wij is the relationship between the ith and jth data object. To distinguish it from attribute data, we refer to such data as “relational data.” A broad range of hierarchical clustering algorithms can also be applied. The vector space model has been successful in IR because it is able to adequately capture much of the semantic content of document-level text. This is because documents that are semantically related are likely to contain many words in common, and thus are found to be similar according to popular vector space measures such as cosine similarity, which are based on word co-occurrence. II. PROBLEM DEFINITION Data mining is the computational process of discovering patterns in large datasets involving methods at the intersection of artificial intelligence, machine learning, statistics and data base systems. The goal of data mining process is to extract information from a dataset and transform it into understandable format. Data mining functionalities are used to specify the kind of patterns to be found in data mining tasks. It can be classified into two categories such as descriptive and predictive. Descriptive mining tasks characterize the general properties of the data in database. Predictive mining tasks perform inference on the current data in order to make predictions. It is important to have a data mining system that can mine multiple kinds of patterns to accommodate different user expectation or application. Data mining system should be able to discover patterns at various granularities i.e. different levels of abstraction. III. SYSTEM ANALYSIS a. Existing System: Clustering text at the document level is well establishedin the Information Retrieval (IR) literature, where documents are typically represented as data points in a high dimensionalvector space in which each dimension corresponds to a unique keyword, leading to a rectangular representation in which rows represent documents and columns represent attributes of those documents. The vector space model has been successful in IR because it is able to adequately capture much of the semantic content of document-level text. This is because documents that are semantically related are likely to contain many words in common, and thus are found to be similar according to popular vector space measures such as cosine similarity, which are based on word co-occurrence. However, while the assumption that (semantic) similarity can be measured in terms of word co-occurrence may be valid at the level, the assumption does not hold for small-sized text fragments such as sentences, since two sentences may be semantically related despite having few, if any, words in common. b. Proposed System: This section presents the proposed hierarchical fuzzy relational clustering algorithm. We first describe the use of Page Rank as a general graph centrality measure, and review the Gaussian mixture model approach. We then describe how Page Rank can be used within an R S. Publication, [email protected] Page 405 American Journal of Sustainable Cities and Society Available online on http://www.rspublication.com/ajscs/ajsas.html Issue 3, Vol. 1 January 2014 ISSN 2319 – 7277 Expectation-Maximization framework to construct a complete relational fuzzy clustering algorithm. Fig:- Text mining process. The final section discusses issues relating to convergence, duplicate clusters, and various other implementation issues. Since Page Rank centrality can be viewed as a special case of eigenvector centrality, we name the algorithm Fuzzy Relational Eigenvector Centrality-based Clustering Algorithm (FRECCA). IV FEASIBILTY STUDY  Anomaly detection, the identification of unusual data records that might be interesting or data errors that require further investigation.  Association rule learning, searches for relationships between variables.  Clustering, is the task of discovering groups and structures in the data that are in some way or another similar without using known structures in the data.  Classification, is the task of generalizing known structure to apply to new data.  Regression, attempts to find a function which models the data with the least error. A. Summarization, providing a more compact representation of the data set, including visualization and report generationAbbreviations and Acronyms. R S. Publication, [email protected] Page 406 American Journal of Sustainable Cities and Society Available online on http://www.rspublication.com/ajscs/ajsas.html Issue 3, Vol. 1 January 2014 ISSN 2319 – 7277 ALGORITHM USED: Fig:- FRECCA Algorithm. R S. Publication, [email protected] Page 407 American Journal of Sustainable Cities and Society Available online on http://www.rspublication.com/ajscs/ajsas.html Issue 3, Vol. 1 January 2014 ISSN 2319 – 7277 This section presents the proposed clustering algorithm. Fuzzy Relational Eigenvector Centrality-based hierarchical Clustering Algorithm (FRECCA). We first describe the use of PageRank as a general graph centrality measure, and review the Gaussian mixture model approach. We then describe how PageRank can be used within an ExpectationMaximization framework to construct a complete relational hierarchical clustering algorithm. The final section discusses issues relating to convergence, duplicate clusters, and various other implementation issues. CONCLUSION Graph-based methods are an exciting area of research within the pattern recognition community. We have already mentioned some of the new work we are conducting in this area; however, what we are most excited about is extending the technique to perform hierarchical clustering. The concepts present in natural language documents usually display some type of hierarchical structure, whereas the algorithm we have presented in this paper identifies only flat clusters. Our main future objective is to extend these ideas to the development of a hierarchical fuzzy relational clustering algorithm. REFERENCES [1] Andrew saba[2013],” clustering sentence-level text using a novel fuzzy relational clustering algorithm”,ieee transactions on knowledge and data engineering,vol 25,no 1,january 2013. [2] Radamihalcea and courtneycorley[2012],” corpus-based and knowledge-based measures of text semantic similarity” ieee transactions on national language engineering,vol 8,no 4 may 2012. [3] Alexander budanitsky, graemehirst [2011],”evaluatingwordnet-based measures of lexical semantic relatedness”ieee transactions on multimedia systems,vol 5 no 7 march 2011. [4] Marina meila, jingoshi[2012],” learning segmentation by random walks” future generation computer systems,vol 9,no 5 Elsevier 2012. [5] Stella x. yu, jianboshi[2011],”multiclass spectral clustering”ieee transactions on computer aided design of integrated circuit and systems,vol 12,no 7,december 2011. [6] Dingdingwang, tao li [2012],”multi-document summarization via sentence-levelsemantic analysis and symmetric matric factorization”ieee transactions on pattern recognition and machine learningvol 6,no.5.august 2012. [7] Jianboshi and jitendramalik[2013],”normalized cuts and image segmentation”ieee transactions on pattern analysis and machine intelligence”,vol 22,no.8,august 2013 [8] Yuhua li, davidmclean, zuhair a. bandar, james d. o’shea, and keeleycrockett’[2012],” sentence similarity based on semantic nets and corpus statistics”,ieee transactions on knowledge and data engineering,vol 18,no 8,august 2012. [9] Andrewrosenberg, juliahirschberg [2011], “v-measure: a conditional entropy-based external cluster evaluation measure”ieee transactions on knowledge and data engineering vol 20,no.3 september 2011. [10] Andrew y. ng, michaeli.jordan, yairweiss[2011], “on spectral clustering :analysis and an algorithm”,ieee transactions on neural information processing system”vol 15 no 12 november 2011. R S. Publication, [email protected] Page 408

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download document 8900388