V. Clustering
2007.2.10. Artificial Intelligence Lab, Lee Seung-hee (이승희)
Text: Text mining, pages 82-93

Outline
V.1 Clustering tasks in text analysis
V.2 The general clustering problem
V.3 Clustering algorithm
V.4 Clustering of textual data

Clustering
• An unsupervised process through which objects are classified into groups called clusters. (cf. categorization is a supervised process.)
• Applications: data mining, document retrieval, image segmentation, pattern classification.

V.1 Clustering tasks in text analysis (1/2)
Cluster hypothesis: "Relevant documents tend to be more similar to each other than to nonrelevant ones."
If the cluster hypothesis holds for a particular document collection, then clustering the documents may help to improve search effectiveness.
• Improving search recall: when a query matches a document, its whole cluster can be returned.
• Improving search precision: by grouping the documents into a much smaller number of groups of related documents.

V.1 Clustering tasks in text analysis (2/2)
• Scatter/gather browsing method
  Purpose: to enhance the efficiency of human browsing of a document collection when a specific search query cannot be formulated.
  Session 1: a document collection is scattered into a set of clusters.
  Session 2: the selected clusters are then gathered into a new subcollection, with which the process may be repeated.
  Reference site: http://www2.parc.com/istl/projects/ia/sg-background.html
• Query-specific clustering is also possible. Hierarchical clustering is appealing.

V.2 Clustering problem (1/2)
Cluster tasks
• problem representation
• definition of proximity measures
• actual clustering of objects
• data abstraction
• evaluation

Problem representation
• Basically, an optimization problem.
• Goal: select the best among all possible groupings of objects.
• Similarity function: clustering quality function.
• Feature extraction / feature selection.
• In a vector space model, objects are vectors in the high-dimensional feature space;
  the similarity function is the distance between the vectors in some metric.

V.2 Clustering problem (2/2)
Similarity measures
• Euclidean distance
  $$D(x_i, x_j) = \sqrt{\sum_k (x_{ik} - x_{jk})^2}$$
• Cosine similarity measure (the most common), for vectors normalized to unit length
  $$\mathrm{Sim}(x_i, x_j) = (x_i \cdot x_j) = \sum_k x_{ik} x_{jk}$$

V.3 Clustering algorithm (1/9)
• Flat clustering: a single partition of a set of objects into disjoint groups.
• Hierarchical clustering: a nested series of partitions.
• Hard clustering: every object belongs to exactly one cluster.
• Soft clustering: objects may belong to several clusters, with a fractional degree of membership in each.

V.3 Clustering algorithm (2/9)
• Agglomerative algorithm: begin with each object in a separate cluster and successively merge clusters until a stopping criterion is satisfied.
• Divisive algorithm: begin with a single cluster containing all objects and perform splitting until a stopping criterion is satisfied.
• Shuffling algorithm: iteratively redistribute objects among clusters.

V.3 Clustering algorithm (3/9)
k-means algorithm (1/2)
• Hard, flat, shuffling algorithm

V.3 Clustering algorithm (4/9)
• Example of the k-means algorithm

V.3 Clustering algorithm (5/9)
k-means algorithm (2/2)
• Simple, efficient
• Complexity O(kn)
• A bad initial selection of seeds can lead to a local optimum.
• k-means can also produce suboptimal results -> Buckshot algorithm. ISO-DATA algorithm.
• Maximizes the quality function Q, where $M_i$ is the centroid of cluster $C_i$:
  $$Q(C_1, C_2, \ldots, C_k) = \sum_{C_i} \sum_{x \in C_i} \mathrm{Sim}(x, M_i)$$

V.3 Clustering algorithm (6/9)
EM-based probabilistic clustering algorithm (1/2)
• Soft, flat, probabilistic

V.3 Clustering algorithm (7/9)

V.3 Clustering algorithm (8/9)
Hierarchical agglomerative clustering
• Single-link method
• Complete-link method
• Average-link method

V.3 Clustering algorithm (9/9)
Other clustering algorithms
  minimal spanning tree
  nearest neighbor clustering
  Buckshot algorithm

V.4 Clustering of textual data (1/6)
Representation of text clustering problem
• Objects are very complex and have a rich internal structure.
• Documents must be converted into vectors in the feature space.
• Bag-of-words document representation.
• Reducing the dimensionality
  Local method: delete unimportant components from individual document vectors.
  Global method: latent semantic indexing (LSI)

V.4 Clustering of textual data (2/6)
Latent semantic indexing
• Maps the N-dimensional feature space F onto a lower-dimensional subspace V.
• LSI is based upon applying the SVD to the term-document matrix.

V.4 Clustering of textual data (3/6)
Singular value decomposition (SVD)
  $$A = U D V^T$$
  U: column-orthonormal m × r matrix
  D: diagonal r × r matrix; its diagonal elements are the singular values of A
  V: column-orthonormal n × r matrix
  $$U^T U = V^T V = I$$
Dimension reduction
  $$A = U D V^T$$
  $$A^T A = V D^T U^T U D V^T = V D^T D V^T$$

V.4 Clustering of textual data (4/6)
Medoids: actual documents that are most similar to the centroids
Using Naïve Bayes mixture models with the EM clustering algorithm
  $$P(C_i \mid x) = \frac{P(C_i) \prod_f P(f \mid C_i)}{\sum_C P(C) \prod_f P(f \mid C)}$$

V.4 Clustering of textual data (5/6)
Data abstraction in text clustering
• Generating a meaningful and concise description of the cluster.
• Methods of generating the label automatically:
  the title of the medoid document
  several words common to the cluster documents
  a distinctive noun phrase

V.4 Clustering of textual data (6/6)
Evaluation of text clustering: the quality of the result?
• Purity: assume {L1, L2, ..., Ln} are the manually labeled classes of documents and {C1, C2, ..., Cm} the clusters returned by the clustering process.
  $$\mathrm{Purity}(C_i) = \max_j |L_j \cap C_i| / |C_i|$$
• Entropy, mutual information between classes and clusters
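
The purity measure can be illustrated with a short sketch. This is only a minimal example, not part of the original slides: the function name `purity_per_cluster` and the toy labels are assumptions made for illustration. For each cluster it counts how many documents fall into the most frequent manually labeled class and divides by the cluster size.

```python
from collections import Counter


def purity_per_cluster(cluster_ids, class_labels):
    """Return {cluster id: purity} for parallel lists of cluster ids and manual class labels.

    Purity(C_i) = max_j |L_j ∩ C_i| / |C_i|
    (hypothetical helper, named here only for illustration).
    """
    if len(cluster_ids) != len(class_labels):
        raise ValueError("cluster_ids and class_labels must have the same length")

    # Group the manual class labels of the documents by the cluster they were assigned to.
    labels_by_cluster = {}
    for cluster_id, label in zip(cluster_ids, class_labels):
        labels_by_cluster.setdefault(cluster_id, []).append(label)

    # For each cluster: size of the largest class inside it, divided by the cluster size.
    return {
        cluster_id: max(Counter(labels).values()) / len(labels)
        for cluster_id, labels in labels_by_cluster.items()
    }


if __name__ == "__main__":
    # Toy data (assumed): five documents, two clusters, two manual classes.
    clusters = [0, 0, 0, 1, 1]
    classes = ["sports", "sports", "politics", "politics", "politics"]
    print(purity_per_cluster(clusters, classes))
```

In the toy data, cluster 0 holds two "sports" documents out of three, so its purity is 2/3, while cluster 1 contains only "politics" documents and has purity 1.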