V. Clustering
2007.2.10. Artificial Intelligence Lab, Lee Seung-hee
Text: Text mining, pp. 82-93

Outline
V.1 Clustering tasks in text analysis
V.2 The general clustering problem
V.3 Clustering algorithm
V.4 Clustering of textual data

Clustering
• An unsupervised process through which objects are classified into groups called clusters. (cf. categorization is a supervised process.)
• Applications: data mining, document retrieval, image segmentation, pattern classification.

V.1 Clustering tasks in text analysis (1/2)
• Cluster hypothesis: "Relevant documents tend to be more similar to each other than to nonrelevant ones."
• If the cluster hypothesis holds for a particular document collection, then clustering the documents may help to improve search effectiveness.
  - Improving search recall: when a query matches a document, its whole cluster can be returned.
  - Improving search precision: by grouping the documents into a much smaller number of groups of related documents.

V.1 Clustering tasks in text analysis (2/2)
• Scatter/gather browsing method
  - Purpose: to enhance the efficiency of human browsing of a document collection when a specific search query cannot be formulated.
  - Session 1: the document collection is scattered into a set of clusters.
  - Session 2: the selected clusters are then gathered into a new subcollection, with which the process may be repeated.
  - Reference site: http://www2.parc.com/istl/projects/ia/sg-background.html
• Query-specific clustering is also possible; hierarchical clustering is appealing here.

V.2 Clustering problem (1/2)
• Cluster tasks
  - problem representation
  - definition of proximity measures
  - actual clustering of objects
  - data abstraction
  - evaluation
• Problem representation
  - Basically an optimization problem.
  - Goal: select the best among all possible groupings of objects.
  - Similarity function: clustering quality function.
  - Feature extraction / feature selection.
  - In a vector space model:
    - objects are vectors in the high-dimensional feature space.
    - the similarity function is the distance between the vectors in some metric.

V.2 Clustering problem (2/2)
• Similarity measures
  - Euclidean distance: $D(x_i, y_j) = \sqrt{\sum_k (x_{ik} - y_{jk})^2}$
  - Cosine similarity, the most common measure: $\mathrm{Sim}(x_i, x_j) = (x_i \cdot x_j) = \sum_k x_{ik} \cdot x_{jk}$, with the vectors normalized to unit length.

V.3 Clustering algorithm (1/9)
• Flat clustering: a single partition of a set of objects into disjoint groups.
• Hierarchical clustering: a nested series of partitions.
• Hard clustering: every object belongs to exactly one cluster.
• Soft clustering: an object may belong to several clusters, with a fractional degree of membership in each.

V.3 Clustering algorithm (2/9)
• Agglomerative algorithm: begin with each object in a separate cluster and successively merge clusters until a stopping criterion is satisfied.
• Divisive algorithm: begin with a single cluster containing all objects and keep splitting until a stopping criterion is satisfied.
• Shuffling algorithm: iteratively redistribute objects among clusters.

V.3 Clustering algorithm (3/9)
• k-means algorithm (1/2)
  - hard, flat, shuffling algorithm

V.3 Clustering algorithm (4/9)
• Example of the k-means algorithm

V.3 Clustering algorithm (5/9)
• k-means algorithm (2/2)
  - Simple, efficient.
  - Complexity O(kn).
  - A bad initial selection of seeds leads to a local optimum.
  - k-means can also settle on a suboptimal solution. -> Buckshot algorithm
  - ISO-DATA algorithm
  - Maximizes the quality function Q, where $M_i$ is the centroid of cluster $C_i$: $Q(C_1, C_2, \ldots, C_k) = \sum_{C_i} \sum_{x \in C_i} \mathrm{Sim}(x - M_i)$

V.3 Clustering algorithm (6/9)
• EM-based probabilistic clustering algorithm (1/2)
  - Soft, flat, probabilistic

V.3 Clustering algorithm (7/9)

V.3 Clustering algorithm (8/9)
• Hierarchical agglomerative clustering
  - Single-link method
  - Complete-link method
  - Average-link method

V.3 Clustering algorithm (9/9)
• Other clustering algorithms
  - minimal spanning tree
  - nearest neighbor clustering
  - Buckshot algorithm

V.4 Clustering of textual data (1/6)
• Representation of the text clustering problem
  - Objects are very complex, with rich internal structure.
  - Documents must be converted into vectors in the feature space.
  - Bag-of-words document representation.
  - Reducing the dimensionality
    - Local method: delete unimportant components from individual document vectors.
    - Global method: latent semantic indexing (LSI)

V.4 Clustering of textual data (2/6)
• Latent semantic indexing
  - Maps the N-dimensional feature space F onto a lower-dimensional subspace V.
  - LSI is based upon applying the SVD to the term-document matrix.

V.4 Clustering of textual data (3/6)
• Singular value decomposition (SVD): $A = U D V^T$
  - U: column-orthonormal m x r matrix
  - D: diagonal r x r matrix; the diagonal elements are the singular values of A
  - V: column-orthonormal n x r matrix
  - $U^T U = V^T V = I$
• Dimension reduction
  - $A \approx U D V^T$, with D truncated to the largest singular values
  - $A^T A = V D^T U^T U D V^T = V D^T D V^T$

V.4 Clustering of textual data (4/6)
• Medoids: actual documents that are most similar to the centroids
• Using Naïve Bayes mixture models with the EM clustering algorithm:
  $P(C_i \mid x) = P(C_i) \prod_f P(f \mid C_i) \,/\, \sum_C P(C) \prod_f P(f \mid C)$

V.4 Clustering of textual data (5/6)
• Data abstraction in text clustering
  - Generating a meaningful and concise description of the cluster.
  - Methods of generating the label automatically:
    - the title of the medoid document
    - several words common to the cluster documents
    - a distinctive noun phrase
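The medoid and common-word labels described above can be sketched in a few lines of Python. This is only an illustrative sketch: it assumes documents are already bag-of-words count vectors over a shared vocabulary, it uses cosine similarity to the cluster centroid to pick the medoid, and the function names (`medoid_index`, `common_words`) and the toy data are hypothetical, not taken from the slides or the textbook.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def centroid(vectors):
    """Component-wise mean of the document vectors in one cluster."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def medoid_index(vectors):
    """Index of the actual document most similar to the cluster centroid."""
    center = centroid(vectors)
    return max(range(len(vectors)), key=lambda i: cosine(vectors[i], center))


def common_words(vectors, vocabulary, top_n=3):
    """Words occurring in the most documents of the cluster.

    Ties keep vocabulary order because Python's sort is stable.
    """
    doc_freq = Counter()
    for vec in vectors:
        for word, count in zip(vocabulary, vec):
            if count > 0:
                doc_freq[word] += 1
    ranked = sorted(vocabulary, key=lambda w: -doc_freq[w])
    return [w for w in ranked if doc_freq[w] > 0][:top_n]


# Toy cluster (hypothetical data): three documents over a four-word vocabulary.
vocabulary = ["cluster", "document", "kernel", "query"]
titles = ["Doc A", "Doc B", "Doc C"]
cluster = [[2, 1, 0, 0], [1, 1, 0, 1], [0, 3, 1, 0]]

label_title = titles[medoid_index(cluster)]
label_words = common_words(cluster, vocabulary, top_n=2)
print(label_title, label_words)
```

Either the medoid title or the common words could serve as the cluster label; a distinctive noun phrase would need a linguistic preprocessing step that this sketch does not attempt.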
V.4 Clustering of textual data (6/6)
• Evaluation of text clustering: what is the quality of the result?
  - Purity: assume {L1, L2, ..., Ln} are the manually labeled classes of documents and {C1, C2, ..., Cm} are the clusters returned by the clustering process. Then
    $\mathrm{Purity}(C_i) = \max_j |L_j \cap C_i| \,/\, |C_i|$
  - Entropy, mutual information between classes and clusters
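As a small worked example of the purity measure, the sketch below computes Purity(C_i) for each cluster from the manual class labels of its documents. It assumes each cluster is given simply as a list of class labels; the function names and the toy labels are hypothetical, and the size-weighted overall figure is a commonly used aggregate that the slides do not define.

```python
from collections import Counter


def cluster_purity(cluster_labels):
    """Purity(C_i) = max_j |L_j ∩ C_i| / |C_i| for one cluster.

    cluster_labels holds the manually assigned class of every document in C_i.
    """
    if not cluster_labels:
        raise ValueError("purity is undefined for an empty cluster")
    largest_class_count = Counter(cluster_labels).most_common(1)[0][1]
    return largest_class_count / len(cluster_labels)


def overall_purity(clusters):
    """Size-weighted purity over all clusters (an assumed aggregate, not from the slides)."""
    total_docs = sum(len(c) for c in clusters)
    majority_docs = sum(Counter(c).most_common(1)[0][1] for c in clusters)
    return majority_docs / total_docs


# Toy result (hypothetical data): two clusters, labels are the manual classes.
clusters = [
    ["sports", "sports", "politics"],
    ["politics", "politics", "politics", "sports"],
]
for i, labels in enumerate(clusters, start=1):
    print(f"Purity(C{i}) = {cluster_purity(labels):.3f}")
print(f"Overall purity = {overall_purity(clusters):.3f}")
```

A purity of 1.0 means every document in the cluster shares one manual class; note that purity alone rewards many tiny clusters, which is one reason entropy and mutual information are listed alongside it.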
