College of Science & Technology
Dep. of Computer Science & IT
BSc of Information Technology

Data Mining
Chapter 6_3: Clustering Methods
2013
Prepared by: Mahmoud Rafeek Al-Farra
www.cst.ps/staff/mfarra

Course Outline
Introduction
Data Preparation and Preprocessing
Data Representation
Classification Methods
Evaluation
Clustering Methods
Mid Exam
Association Rules
Knowledge Representation
Special case study: document clustering
Discussion of case studies by students

Outline
Cluster validation
Similarity measure
Overall similarity
Entropy
Examples of document clustering algorithms:
Suffix Tree Clustering algorithm
DIG for document representation
A SOM-based document clustering using phrases
Text clustering using SOM on semantic graphs
Graph-based Growing Hierarchical Self-Organizing Map

Cluster validation - intro.
The results of any clustering algorithm should be evaluated using an informative quality measure that reflects the "goodness" of the resulting clusters. Such a measure also makes it possible to compare clustering algorithms, since different approaches usually lead to different clusters.

Generally, there are two main measures for testing the quality of clusters; which one to use depends on whether we have labeled data or no prior knowledge about the classification of the data objects:
- Internal quality measure, e.g. overall similarity
- External quality measure, e.g. entropy and F-measure

Similarity measure
Similarity in the vector space model (VSM) is based on the distance or angle between two vectors.
One of the most widely used distance measures is the family of Minkowski distances, defined as:

d(x, y) = ( Σ_{i=1..n} |x_i − y_i|^p )^(1/p)

where x and y are the vectors of two objects, i indexes the features, n is the number of features, and p takes values greater than or equal to 1 (p = 1 gives the Manhattan distance, p = 2 the Euclidean distance).

Similarity measure (cont.)
A more common similarity measure, used specifically in document clustering, is the cosine correlation measure, defined as:

cos(x, y) = (x · y) / (||x|| ||y||)

where (·) denotes the vector dot product and ||·|| denotes the length of a vector:
||x|| = √(x · x),   x · y = x_1·y_1 + x_2·y_2 + … + x_n·y_n

Overall Similarity
The overall similarity is an internal measure used to compute cluster cohesiveness in the absence of any external knowledge. It averages the pairwise similarities inside the cluster:

OverallSimilarity(C_u) = (1 / |C_u|²) Σ_{O_1, O_2 ∈ C_u} Sim(O_1, O_2)

where C_u is the cluster under consideration, Sim(O_1, O_2) is the similarity between the two objects O_1 and O_2 belonging to C_u, and |C_u| is the number of documents in the cluster.

Entropy Measure
Entropy is an external measure which provides a measure of "goodness" for un-nested clusters or for the clusters at one level of a hierarchical clustering. Using entropy, a perfect score is obtained when each cluster contains data points of only one class (in the degenerate extreme, when each cluster contains exactly one data point). The lower the entropy, the better the clustering; the best achievable entropy value is 0.

Suffix Tree Clustering algorithm
The STC method is based on using a compact tree structure to represent the phrases shared between documents.
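The similarity and validation measures defined above can be sketched in Python. This is a minimal illustration: the vectors and class labels are made up, cosine is used as the Sim function inside the overall-similarity formula, and the entropy formula shown is the standard Shannon form, which the slides describe only qualitatively.

```python
import math
from collections import Counter

def minkowski(x, y, p=2):
    """Minkowski distance; p = 1 gives Manhattan, p = 2 Euclidean."""
    return sum(abs(xi - yi) ** p for xi, yi in zip(x, y)) ** (1.0 / p)

def cosine(x, y):
    """Cosine correlation: dot(x, y) / (||x|| * ||y||)."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    return dot / (math.sqrt(sum(xi * xi for xi in x)) *
                  math.sqrt(sum(yi * yi for yi in y)))

def overall_similarity(cluster):
    """Internal measure: (1/|Cu|^2) * sum of Sim(O1, O2) over all
    pairs of objects in the cluster (cosine used as Sim here)."""
    n = len(cluster)
    return sum(cosine(a, b) for a in cluster for b in cluster) / (n * n)

def cluster_entropy(labels):
    """External measure: Shannon entropy of the class labels inside
    one cluster; 0 when all members belong to a single class."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n)
               for c in Counter(labels).values())

print(minkowski([1, 2, 3], [4, 6, 3], p=1))    # Manhattan: 3 + 4 + 0 = 7.0
print(minkowski([1, 2, 3], [4, 6, 3], p=2))    # Euclidean: sqrt(25) = 5.0
print(round(cosine([1, 1, 0], [1, 0, 0]), 3))  # 1/sqrt(2) -> 0.707
print(cluster_entropy(["sports", "sports", "sports"]))  # single class -> 0.0
```

Note that overall similarity sums over all ordered pairs, including each object with itself, which is why the normalization is 1/|Cu|² rather than one over the number of distinct pairs.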
For example, consider the documents:
D1: cat ate cheese
D2: mouse ate cheese too
D3: cat ate mouse too
The suffix tree built from these sentences exposes shared phrases such as "ate cheese" (shared by D1 and D2) and "cat ate" (shared by D1 and D3).

Document Index Graph for clustering (DIG)
DIG is based on constructing an incremental cumulative graph that represents the collection of documents: each node represents a term and stores all the required information about that term, while the edges represent the relations between terms and thereby capture phrases. Documents are then clustered incrementally using a histogram-based method that maximizes the tightness of clusters by carefully watching the similarity distribution inside each cluster.

A SOM-based document clustering using phrases
This algorithm represents each document as a vector of phrases instead of single terms. Phrases are extracted by a phrase grammar extraction technique based on mutual information. The documents are then represented in a phrase vector space and given as input to a self-organizing map (SOM).
(Pipeline: Documents → Phrase Grammar Extraction → Phrase Feature Vectors → SOM)

Text clustering using SOM on semantic graphs
The semantic relations are captured by the Universal Networking Language (UNL), which expresses a document in the form of a semantic graph, with disambiguated words as nodes and the semantic relations between them as edges. These graphs are then converted into vectors, and SOM is applied to them as the clustering method.
(Pipeline: Documents → UNL Semantic Graphs → Feature Vectors → SOM)

Graph-based Growing Hierarchical Self-Organizing Map
This approach clusters well-structured XML web documents. In a preprocessing step each web document is converted into a graph representation; the individual graphs are combined into a cumulative document graph representation, a graph similarity measure is applied, and a growing hierarchical SOM produces the document clusters.
(Pipeline: Web Documents → Preprocessing → Graph Rep. → Cumulative Document Graph Rep. → Similarity Measure → Hierarchy Growing SOM → Document Clusters)

Next: Association Rules

Thanks
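The shared-phrase information that STC's suffix tree captures for the D1–D3 example above can be illustrated with a simplified Python sketch. This uses word n-gram intersection rather than an actual generalized suffix tree, and it omits the base-cluster scoring and merging steps of real STC; it only shows which phrases are shared by which documents.

```python
from collections import defaultdict

def shared_phrases(docs, min_docs=2, max_len=3):
    """Map each word n-gram (n <= max_len) to the set of documents that
    contain it, keeping only phrases shared by >= min_docs documents.

    A simplified stand-in for the generalized suffix tree, which stores
    the same phrase-to-documents information in a compact structure.
    """
    phrase_docs = defaultdict(set)
    for doc_id, text in docs.items():
        words = text.split()
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                phrase_docs[tuple(words[i:i + n])].add(doc_id)
    return {p: ids for p, ids in phrase_docs.items() if len(ids) >= min_docs}

docs = {
    "D1": "cat ate cheese",
    "D2": "mouse ate cheese too",
    "D3": "cat ate mouse too",
}
shared = shared_phrases(docs)
print(sorted(shared[("ate", "cheese")]))  # ['D1', 'D2']
print(sorted(shared[("cat", "ate")]))     # ['D1', 'D3']
print(sorted(shared[("ate",)]))           # ['D1', 'D2', 'D3']
```

Each surviving phrase together with the documents containing it corresponds to a "base cluster" in STC terminology; phrases occurring in only one document (such as "cheese too") are dropped.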