WWW-newsgroup-document Clustering by
Means of Dynamic Self-organizing Neural
Networks
Marian B. Gorzałczany and Filip Rudziński
Department of Electrical and Computer Engineering
Kielce University of Technology
Al. 1000-lecia P.P. 7, 25-314 Kielce, Poland
{m.b.gorzalczany, f.rudzinski}@tu.kielce.pl
Abstract. The paper presents a clustering technique based on dynamic
self-organizing neural networks and its application to a large-scale and
highly multidimensional WWW-newsgroup-document clustering problem. The collection of 19 997 documents (e-mail messages of different Usenet-News newsgroups) available at the WWW server of the School of Computer Science, Carnegie Mellon University (www.cs.cmu.edu/ TextLearning/datasets.html) has been the subject of clustering. A broad comparative analysis with nine alternative clustering techniques has also been carried out, demonstrating the superiority of the proposed approach in the considered problem.
1 Introduction
The rapidly increasing volume of electronically available World-Wide-Web resources makes it more and more important to help users locate and access relevant information as well as to organize it in an intelligible way. Since text and hypertext documents are among the most important and most readily available online WWW resources, text-processing techniques play a central role in this field. Among them, thematic WWW-document clustering techniques (thematic text clustering techniques) are of special interest. In general, given a collection of WWW documents, the task of document clustering is to group documents together in such a way that the documents within each cluster are as "similar" as possible to each other and as "dissimilar" as possible to those of the other clusters.
This paper presents a clustering technique based on dynamic self-organizing neural networks (introduced by the same authors in [7], [8] and some earlier papers such as [5], [6]) and its application to a large-scale and highly multidimensional WWW-newsgroup-document clustering problem. First, the paper presents the concept of a dynamic self-organizing neural network for WWW-document clustering. Then, the Vector-Space-Model representation of WWW documents is outlined, and some approaches to its dimensionality reduction are briefly presented. Next, the application of the proposed technique to the clustering of the collection of 19 997 documents (e-mail messages of different Usenet-News newsgroups) available at the WWW server of the School of Computer Science, Carnegie Mellon University (www.cs.cmu.edu/ TextLearning/datasets.html) is presented. Finally, a broad comparative analysis with several alternative document clustering techniques is carried out.
2 The Concept of a Dynamic Self-Organizing Neural Network for WWW-Document Clustering
A dynamic self-organizing neural network is a generalization of the conventional self-organizing neural network with one-dimensional neighbourhood. Consider the latter case of a network that has n inputs x_1, x_2, ..., x_n and consists of m neurons arranged in a chain; their outputs are y_1, y_2, ..., y_m, where $y_j = \sum_{i=1}^{n} w_{ji} x_i$, j = 1, 2, ..., m, and w_ji are the weights connecting the output of the j-th neuron with the i-th input of the network. Using vector notation (x = (x_1, x_2, ..., x_n)^T, w_j = (w_j1, w_j2, ..., w_jn)^T), y_j = w_j^T x. The learning data consist of L input vectors x_l (l = 1, 2, ..., L). The first stage of any Winner-Takes-Most (WTM) learning algorithm that can be applied to the considered network consists in determining the neuron j_x winning in the competition of neurons when the learning vector x_l is presented to the network. Assuming the normalization of the learning vectors, the winning neuron j_x is selected such that
$$ d(x_l, w_{j_x}) = \min_{j=1,2,\ldots,m} d(x_l, w_j), \qquad (1) $$
where d(x_l, w_j) is a distance measure between x_l and w_j; throughout this paper, a distance measure d based on the cosine similarity function S (most often used for determining the similarity of text documents [1]) will be applied:

$$ d(x_l, w_j) = 1 - S(x_l, w_j) = 1 - \frac{x_l^T w_j}{\|x_l\|\,\|w_j\|} = 1 - \frac{\sum_{i=1}^{n} x_{li} w_{ji}}{\sqrt{\sum_{i=1}^{n} x_{li}^2}\,\sqrt{\sum_{i=1}^{n} w_{ji}^2}}, \qquad (2) $$

($\|\cdot\|$ denotes the Euclidean norm).
The WTM learning rule can be formulated as follows:
$$ w_j(k+1) = w_j(k) + \eta_j(k)\, N(j, j_x, k)\, [x(k) - w_j(k)], \qquad (3) $$
where k is the iteration number, ηj (k) is the learning coefficient, and N (j, jx , k)
is the neighbourhood function. In this paper, the Gaussian-type neighbourhood
function will be used:
$$ N(j, j_x, k) = e^{-\frac{(j - j_x)^2}{2\lambda^2(k)}}, \qquad (4) $$
where λ(k) is the "radius" of the neighbourhood (the width of the Gaussian "bell").
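To make the learning step concrete, the following minimal sketch (an illustration only, not the authors' implementation; the function and variable names are assumptions) shows how the winner of Eq. (1) is found with the cosine-based distance of Eq. (2) and how the weights are updated according to Eqs. (3) and (4):

import numpy as np

def wtm_step(W, x, eta, lam):
    # W: (m, n) matrix of weight vectors w_j (one row per neuron in the chain),
    # x: (n,) learning vector, eta: learning coefficient eta(k), lam: radius lambda(k)
    norms = np.linalg.norm(W, axis=1) * np.linalg.norm(x)
    d = 1.0 - (W @ x) / np.maximum(norms, 1e-12)        # Eq. (2): 1 - cosine similarity
    jx = int(np.argmin(d))                              # Eq. (1): winning neuron
    j = np.arange(W.shape[0])
    N = np.exp(-((j - jx) ** 2) / (2.0 * lam ** 2))     # Eq. (4): Gaussian neighbourhood
    return W + eta * N[:, None] * (x - W)               # Eq. (3): WTM update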
After each learning epoch, five successive operations are activated (under
some conditions) [7]: 1) the removal of single, low-active neurons, 2) the disconnection of a neuron chain, 3) the removal of short neuron sub-chains, 4) the
insertion of additional neurons into the neighbourhood of high-active neurons,
and 5) the reconnection of two selected sub-chains of neurons.
Operations nos. 1, 3, and 4 are the components of the mechanism for the automatic adjustment of the number of neurons in the chain, whereas operations nos. 2 and 5 govern the disconnection and reconnection mechanisms, respectively. Based on experimental investigations, the following conditions for the particular operations have been formulated (the numbering of conditions and operations is the same); an illustrative sketch of the first three checks is given after Condition 5 below. A given operation takes place between neuron no. i and neuron no. i+1, i ∈ {1, 2, ..., r−1}, where r is the number of neurons in the original neuron chain or in a given sub-chain.
Condition 1: win_i < β1, where win_i is the number of wins of the i-th neuron and β1 is an experimentally selected parameter (usually, for complex multidimensional WWW-document clustering, β1 assumes a value of around 50). This condition allows the removal of a single neuron whose activity (measured by the number of its wins) is below the assumed level represented by the parameter β1.
Condition 2: $d_{i,i+1} > \alpha_1 \frac{\sum_{j=1}^{r-1} d_{j,j+1}}{r}$, where d_{i,i+1} is the distance between neurons no. i and no. i+1 (see [1] for details) and α1 is an experimentally selected parameter (usually, for complex problems, α1 ∈ [1, 10]). This condition prevents the excessive disconnection of the neuron chain or sub-chain by allowing only relatively remote neurons to be disconnected.
Condition 3: r_S < β2, where r_S is the number of neurons in sub-chain S and β2 is an experimentally selected parameter (usually β2 ∈ {3, 4}). This condition allows the removal of an r_S-element neuron sub-chain S whose length is shorter than the assumed acceptable value β2.
The operation of inserting additional neurons into the neighbourhood of high-active neurons, in order to take over some of their activity, covers three cases, denoted 4a, 4b, and 4c, respectively.
Condition 4a (the insertion of a new neuron (n) between two neighbouring high-active neurons no. i and no. i+1): IF win_i > β3 AND win_{i+1} > β3 THEN the weight vector w_(n) of the new neuron (n) is calculated as $w_{(n)} = \frac{w_i + w_{i+1}}{2}$, where win_i, win_{i+1} are as in Condition 1 and β3 is an experimentally selected parameter (usually β3 is comparable to β1, which governs Condition 1).
Condition 4b (the replacement of a high-active neuron no. i, accompanied by low-active neurons no. i−1 and no. i+1, by two new neurons (n) and (n+1)): IF win_i > β3 AND win_{i−1} < β3 AND win_{i+1} < β3 THEN the weight vectors w_(n) and w_(n+1) of the new neurons (n) and (n+1) are calculated as $w_{(n)} = \frac{w_{i-1} + w_i}{2}$ and $w_{(n+1)} = \frac{w_i + w_{i+1}}{2}$ (β3 as in Condition 4a).
Condition 4c (the insertion of a new neuron in the neighbourhood of an end-of-chain high-active neuron accompanied by a low-active neighbour; the case of the r-th neuron is considered here; the case of the 1st neuron is analogous): IF win_r > β3 AND win_{r−1} < β3 THEN the weight vector w_{r+1} of the new neuron (r+1) is calculated as $w_{r+1} = w_r + \frac{w_r - w_{r-1}}{d_{r,r-1}}\, d_{avr}$, where $d_{avr} = \frac{1}{r-1}\sum_{j=1}^{r-1} d_{j,j+1}$ (β3 as in Condition 4a and d_{j,j+1} as in Condition 2).
Condition 5: $d_{e_{S1},e_{S2}} < \alpha_2 \left[ \frac{1}{2} \left( \frac{\sum_{j=1}^{r_{S1}-1} d_{j,j+1}}{r_{S1}} + \frac{\sum_{j=1}^{r_{S2}-1} d_{j,j+1}}{r_{S2}} \right) \right]$, where S1 and S2 are two sub-chains (containing r_S1 and r_S2 neurons, respectively) whose appropriate ends e_S1 ∈ {1, r_S1} and e_S2 ∈ {1, r_S2} are closest to each other; sub-chains S1 and S2 are the candidates for connection by combining their ends e_S1 and e_S2 (d_{j,j+1} as in Condition 2, α2 is an experimentally selected parameter, usually comparable to α1, which governs Condition 2). This condition allows two sub-chains to be connected only if their closest ends are about as close to each other as neighbouring neurons within a sub-chain, i.e., if both sub-chains correspond to compact pieces of the same cluster of data.
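The following minimal sketch (illustrative only; the helper names and default parameter values are assumptions, not the authors' code) shows how the checks behind operations 1-3 could be applied to one sub-chain after a learning epoch:

import numpy as np

def adjust_chain(W, wins, beta1=50, alpha1=3.0, beta2=3):
    # W: (r, n) weight vectors of one sub-chain, wins: (r,) win counters of its neurons.
    # Returns the surviving sub-chains as index arrays into the original chain.
    keep = np.flatnonzero(wins >= beta1)                    # Condition 1: remove low-active neurons
    W = W[keep]
    r = keep.size
    if r < 2:
        return [keep] if r >= beta2 else []
    # cosine-based distances between neighbouring neurons (cf. Eq. (2))
    denom = np.maximum(np.linalg.norm(W[:-1], axis=1) * np.linalg.norm(W[1:], axis=1), 1e-12)
    d = 1.0 - np.einsum('ij,ij->i', W[:-1], W[1:]) / denom
    cut = np.flatnonzero(d > alpha1 * d.sum() / r) + 1      # Condition 2: disconnect remote links
    sub_chains = np.split(keep, cut)
    return [s for s in sub_chains if s.size >= beta2]       # Condition 3: drop short sub-chains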
3 Vector Space Model of WWW Documents - an Outline [8]
Consider a collection of L WWW documents. In the Vector Space Model (VSM) [3, 4, 9, 12, 13], every document in the collection is represented by a vector x_l = (x_l1, x_l2, ..., x_ln)^T (l = 1, 2, ..., L). Component x_li (i = 1, 2, ..., n) of such a vector represents the i-th key word or term that occurs in the l-th document. The value of x_li depends on the degree of relationship between the i-th term and the l-th document. Among various schemes for measuring this relationship (very often referred to as term weighting), three are the most popular: a) binary term weighting: x_li = 1 when the i-th term occurs in the l-th document and x_li = 0 otherwise, b) tf-weighting (tf stands for term frequency): x_li = tf_li, where tf_li denotes how many times the i-th term occurs in the l-th document, and c) tf-idf-weighting (tf-idf stands for term frequency - inverse document frequency): x_li = tf_li log(L/df_i), where tf_li is the term frequency as in tf-weighting, df_i denotes the number of documents in which the i-th term appears, and L is the total number of documents in the collection. In this paper, tf-weighting will be applied. Once the way of determining x_li is selected, the Vector Space Model can be formulated in matrix form:
$$ VSM_{(n \times L)} = X_{(n \times L)} = [x_l]_{l=1,2,\ldots,L} = [x_{li}]^T_{l=1,2,\ldots,L;\; i=1,2,\ldots,n}, \qquad (5) $$

where the index (n×L) represents its dimensionality.
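As an illustration of the above weighting schemes (a sketch only; the matrix layout and function name are assumptions made for this example), the VSM matrix of Eq. (5) can be assembled from raw term counts as follows:

import numpy as np

def build_vsm(counts, scheme="tf"):
    # counts: (L, n) matrix of raw term frequencies tf_li (one row per document).
    # Returns the (n x L) VSM matrix X of Eq. (5) with the chosen term weighting.
    L = counts.shape[0]
    if scheme == "binary":
        X = (counts > 0).astype(float)                 # x_li in {0, 1}
    elif scheme == "tf":                               # the weighting used in this paper
        X = counts.astype(float)                       # x_li = tf_li
    elif scheme == "tf-idf":
        df = np.count_nonzero(counts, axis=0)          # df_i: documents containing term i
        X = counts * np.log(L / np.maximum(df, 1))     # x_li = tf_li * log(L / df_i)
    else:
        raise ValueError("unknown weighting scheme: " + scheme)
    return X.T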
The issue of VSM dimensionality reduction is of essential significance as far as the practical usage of VSMs is concerned. There are two main classes of techniques for VSM dimensionality reduction [11]: a) feature selection methods, and b) feature transformation methods. Among the techniques that can be included in the afore-mentioned class a) are: filtering, stemming, stop-word removal, and the proper feature-selection methods, which sort the terms and then eliminate some of them on the basis of numerical measures computed from the considered collection of documents. The first three techniques are sometimes classified as text-preprocessing methods; however, since they significantly contribute to VSM dimensionality reduction, here, for simplicity, they have been included in the afore-mentioned class a).
During filtering (and tokenization), special characters such as %, #, $, etc. are removed from the original text, and word and sentence boundaries are identified in it. As a result, the initial VSM_(nini×L) is obtained, where nini is the number of different words isolated from all documents.

During stemming, all words in the initial model are replaced by their respective stems (a stem is the portion of a word left after removing its suffixes and prefixes). As a result, VSM_(nstem×L) is obtained, where nstem < nini.

During stop-word removal (the removal of words from a so-called stop list), words that on their own do not have identifiable meanings and are therefore of little use in various text processing tasks are eliminated from the model. As a result, VSM_(nstpl×L) is obtained, where nstpl < nstem.
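A toy sketch of these three preprocessing steps is given below (illustrative only; the suffix list and the tiny stop list are assumptions, not the actual stemmer or stop list used in the experiments):

import re

STOP_WORDS = {"the", "a", "an", "of", "and", "or", "to", "in", "is", "it"}   # tiny example stop list
SUFFIXES = ("ing", "ed", "es", "s")        # toy stand-in for a real stemming algorithm

def preprocess(text):
    tokens = re.findall(r"[a-z]+", text.lower())             # filtering and tokenization
    stems = [next((t[:-len(s)] for s in SUFFIXES
                   if t.endswith(s) and len(t) > len(s) + 2), t)
             for t in tokens]                                 # crude suffix stripping ("stemming")
    return [t for t in stems if t not in STOP_WORDS]          # stop-word removal

# e.g. preprocess("Clustering of the WWW documents") returns ['cluster', 'www', 'document']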
Feature selection methods usually operate on a term quality q_i, i = 1, 2, ..., nstpl, defined for each term occurring in the latest VSM. Terms characterized by q_i < qtres, where qtres is a pre-defined threshold value, are removed from the model. In this paper, the document-frequency-based method will be used to determine q_i, that is, q_i = df_i, where df_i is the number of documents in which the i-th term occurs. As a result, the final VSM_(nfin×L) is obtained, where nfin < nstpl.
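A minimal sketch of this document-frequency-based selection (illustrative only; it assumes a tf-weighted matrix laid out as in the previous sketch) is:

import numpy as np

def select_by_document_frequency(X, q_tres):
    # X: (n, L) tf-weighted VSM matrix. Terms with quality q_i = df_i < q_tres are removed.
    df = np.count_nonzero(X, axis=1)        # q_i = df_i, number of documents containing term i
    keep = np.flatnonzero(df >= q_tres)
    return X[keep, :], keep                 # reduced (nfin x L) matrix and retained term indices
    # e.g. q_tres = 1000 and q_tres = 400 yield the "Small" and "Large" models of Section 4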
4 Application to a Complex, Multidimensional WWW-newsgroup-document Clustering Problem
The proposed clustering technique based on dynamic self-organizing neural networks will now be applied to a real-life WWW-newsgroup-document clustering problem, that is, to the clustering of the multidimensional, large-scale collection of 19 997 documents (e-mail messages of different Usenet-News newsgroups) available at the WWW server of the School of Computer Science, Carnegie Mellon University (www.cs.cmu.edu/ TextLearning/datasets.html). Henceforward, the collection will be called 20 newsgroups. The considered collection is partitioned (nearly) evenly across 20 different newsgroups, each corresponding to a different topic. It is worth emphasizing that some of the newsgroups are very closely related to each other, while others are highly unrelated (see the subsequent part of the paper). Since the assignments of documents to newsgroups are known here, a direct verification of the obtained results is possible. Obviously, this knowledge of the newsgroup assignments is by no means used by the clustering system (it works in a fully unsupervised way).
The process of dimensionality reduction of the initial VSM for the 20 newsgroups document collection is presented in Table 1, with the use of the notation introduced in Section 3 (additionally, in square brackets, the overall numbers of occurrences of all terms in all documents of the collection are given). For the considered document collection, two final numerical models (identified in Table 1 as the "Small" and "Large" data sets) have been obtained. For this purpose, two values of the threshold parameter qtres have been considered: qtres = 1 000, to get a model of reduced dimensionality (the "Small"-type data set), and qtres = 400, to get a model of higher dimensionality but also of higher accuracy (the "Large"-type data set).
Table 1. The dimensionality reduction of the initial VSM for the 20 newsgroups document collection

VSM                Dimensionality of VSM for the 20 newsgroups document collection
VSM_(nini×L)       (nini×L)  = (122 005×19 997)   [2 644 002]
VSM_(nstem×L)      (nstem×L) = (99 072×19 997)    [2 526 731]
VSM_(nstpl×L)      (nstpl×L) = (98 599×19 997)    [1 677 316]
VSM_(nfin×L)       20 newsgroups "Small" (qtres = 1 000): (nfin×L) = (232×19 997)   [461 512]
                   20 newsgroups "Large" (qtres = 400):   (nfin×L) = (725×19 997)   [524 321]
Figs. 1 and 2 present the performance of the proposed clustering technique for the 20 newsgroups "Small" and "Large" numerical models of the 20 newsgroups collection of documents. As the learning progresses, both systems adjust the overall numbers of neurons in their networks (Figs. 1a and 2a), which finally equal 273 and 251, respectively, as well as the numbers of sub-chains (Figs. 1b and 2b), which finally reach the value of 11 in both cases; the number of sub-chains is equal to the number of clusters detected in a given numerical model of the document collection. The envelopes of the nearness histograms for the routes in the attribute spaces of the 20 newsgroups "Small" and "Large" data sets (Figs. 1c and 2c) reveal perfectly clear images of the cluster distributions in them, including the numbers of clusters and the cluster boundaries (indicated by 10 local minima in the plots of Figs. 1c and 2c). After performing the so-called calibration of both neural networks, class labels (represented by letters 'A' through 'K') can be assigned to particular sub-chains of the networks, as shown in Figs. 1c and 2c. The difference between the number of detected clusters and the number of newsgroups results from the above-mentioned fact that some of the newsgroups are very closely related to each other and are thus perceived by the clustering system as one cluster (details on the calibration of both neural networks are presented below). It is worth mentioning that both systems detect the same number of clusters, which confirms the internal consistency of the proposed approach.
Since the newsgroup assignments are known in the original collection of documents, a direct verification of the obtained results is also possible. Detailed numerical results of the clustering and of the network calibration are collected in Tables 2 and 3 for the 20 newsgroups "Small" and "Large" data sets, respectively. For every newsgroup, the sub-chain (class) that received the largest number of that newsgroup's documents determines the class (a letter from 'A' to 'K') assigned to the newsgroup; the resulting numbers of correct and wrong decisions and the percentages of correct decisions are reported for each newsgroup and for the whole collection.
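The verification itself is straightforward; the following sketch (illustrative only, with an assumed data layout and function name) builds the newsgroup-versus-class contingency table, assigns each newsgroup to its majority sub-chain and computes the percentage of correct decisions (PCD) reported in Tables 2 and 3:

import numpy as np

def verify_clustering(newsgroup_of_doc, subchain_of_doc, n_groups, n_subchains):
    # Both inputs are integer arrays of length L (one entry per document).
    table = np.zeros((n_groups, n_subchains), dtype=int)
    np.add.at(table, (newsgroup_of_doc, subchain_of_doc), 1)   # contingency table
    assigned_class = table.argmax(axis=1)                      # majority sub-chain per newsgroup
    ncd = table[np.arange(n_groups), assigned_class]           # correct decisions per newsgroup
    pcd = 100.0 * ncd / table.sum(axis=1)                      # per-newsgroup PCD [%]
    overall_pcd = 100.0 * ncd.sum() / len(newsgroup_of_doc)    # overall PCD [%]
    return assigned_class, ncd, pcd, overall_pcd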
Figure 1. The plots of the number of neurons (a) and the number of sub-chains (b) vs. epoch number, and (c) the envelope of the nearness histogram for the route in the attribute space of the 20 newsgroups "Small" data set (sub-chains labelled 'A' through 'K').
Figure 2. The plots of the number of neurons (a) and the number of sub-chains (b) vs. epoch number, and (c) the envelope of the nearness histogram for the route in the attribute space of the 20 newsgroups "Large" data set (sub-chains labelled 'A' through 'K').
Table 2. Clustering results for the 20 newsgroups "Small" data set (Class = sub-chain to which the newsgroup is assigned, cf. Table 5)

Newsgroup name              Class   NCD¹     NWD²     PCD³ [%]
misc.forsale                  I      729      271      72.90
comp.graphics                 H      612      388      61.20
comp.windows.x                H      633      367      63.30
comp.os.ms-windows.misc       H      586      414      58.60
comp.sys.mac.hardware         C      632      368      63.20
comp.sys.ibm.pc.hardware      C      599      401      59.90
sci.electronics               E      391      609      39.10
rec.motorcycles               B      602      398      60.20
rec.autos                     B      411      589      41.10
talk.politics.mideast         G      487      513      48.70
talk.politics.guns            G      583      417      58.30
talk.politics.misc            G      394      606      39.40
rec.sport.hockey              K      578      422      57.80
rec.sport.baseball            K      681      319      68.10
sci.crypt                     D      512      488      51.20
sci.space                     A      538      462      53.80
sci.med                       J      429      571      42.90
alt.atheism                   F      533      467      53.30
talk.religion.misc            F      562      438      56.20
soc.religion.christian        F      735      262      73.72
ALL:                                11227     8770     56.14

¹ NCD = Number of correct decisions, ² NWD = Number of wrong decisions,
³ PCD = Percentage of correct decisions.
The detailed results of Tables 2 and 3 are summarized in Tables 4 and 5. Table 4 presents the overall numbers of documents that have been assigned to particular sub-chains for both the "Small" and "Large" data sets. Additionally, the numbers of neurons belonging to particular sub-chains of the overall chains (see also Figs. 1c and 2c) have been included for both data sets. In turn, Table 5 presents the assignments of particular newsgroups to successive sub-chains (classes). It is worth emphasizing that the proposed clustering technique not only detects the same number of clusters in both the "Small" and "Large" data sets (as already mentioned earlier in this section) but also assigns the same newsgroups to the corresponding clusters in both data sets. This is another confirmation of the internal consistency of the proposed clustering technique.
The important issue of the accuracy of the proposed technique will now be considered in the framework of a broad comparative analysis with several alternative clustering methods applied to the 20 newsgroups "Small" and "Large" data sets as well as to some modifications of the original 20 newsgroups document collection. Part I of Table 6 presents the results of the comparative analysis with three alternative approaches (listed under Part I of Table 6) applied to the "Small" and "Large" data sets. In order to carry out the clustering of the 20 newsgroups "Small" and "Large" data sets with these techniques, the WEKA (Waikato Environment for Knowledge Analysis) application, which implements them, has been used. The WEKA application as well as details on the clustering techniques can be found on the WWW site of the University of Waikato, New Zealand (www.cs.waikato.ac.nz/ml/weka).
Table 3. Clustering results for the 20 newsgroups "Large" data set (Class = sub-chain to which the newsgroup is assigned, cf. Table 5)

Newsgroup name              Class   NCD¹     NWD²     PCD³ [%]
misc.forsale                  I      794      206      79.40
comp.graphics                 H      702      298      70.20
comp.windows.x                H      650      350      65.00
comp.os.ms-windows.misc       H      591      409      59.10
comp.sys.mac.hardware         C      644      356      64.40
comp.sys.ibm.pc.hardware      C      613      387      61.30
sci.electronics               E      405      595      40.50
rec.motorcycles               B      623      377      62.30
rec.autos                     B      432      568      43.20
talk.politics.mideast         G      503      497      50.30
talk.politics.guns            G      609      391      60.90
talk.politics.misc            G      412      588      41.20
rec.sport.hockey              K      601      399      60.10
rec.sport.baseball            K      690      310      69.00
sci.crypt                     D      523      477      52.30
sci.space                     A      543      457      54.30
sci.med                       J      437      563      43.70
alt.atheism                   F      547      453      54.70
talk.religion.misc            F      579      421      57.90
soc.religion.christian        F      798      199      80.04
ALL:                                11680     8317     58.41

¹ NCD = Number of correct decisions, ² NWD = Number of wrong decisions,
³ PCD = Percentage of correct decisions.
In order to extend the comparative-analysis aspects of this paper, Part II of Table 6 includes the results, reported in the literature, of clustering some modifications of the original 20 newsgroups document collection. This time, the performance of the dynamic self-organizing neural network clustering technique has been compared with seven alternative approaches, which are listed under Part II of Table 6.
Table 4. Sub-chain (class) labels, numbers of documents assigned to particular sub-chains and numbers of neurons in the chains for the 20 newsgroups "Small" (a) and 20 newsgroups "Large" (b) data sets

a) 20 newsgroups "Small"                         b) 20 newsgroups "Large"
Sub-chain      Documents      Neurons in        Sub-chain      Documents      Neurons in
(class) label  assigned       the chain         (class) label  assigned       the chain
A                887            1 - 12          K               1 434           1 - 16
B              1 241           13 - 29          B               1 544          17 - 35
C              2 852           30 - 68          J                 713          36 - 44
D                952           69 - 81          I               2 093          45 - 72
E              1 469           82 - 101         A                 798          73 - 82
F              2 832          102 - 140         F               3 198          83 - 123
G              2 268          141 - 171         G               1 989         124 - 148
H              2 229          172 - 201         C               2 791         149 - 183
I              1 838          202 - 226         E                 731         184 - 192
J              1 602          227 - 248         D                 939         193 - 204
K              1 827          249 - 273         H               3 767         205 - 251
Table 5. Assignments of particular newsgroups to sub-chains (classes) for the 20 newsgroups "Small" (a) and 20 newsgroups "Large" (b) data sets

a) 20 newsgroups "Small"
Sub-chain label   Name(s) of newsgroup(s) assigned to sub-chain
A                 sci.space
B                 rec.motorcycles, rec.autos
C                 comp.sys.mac.hardware, comp.sys.ibm.pc.hardware
D                 sci.crypt
E                 sci.electronics
F                 alt.atheism, talk.religion.misc, soc.religion.christian
G                 talk.politics.mideast, talk.politics.guns, talk.politics.misc
H                 comp.graphics, comp.windows.x, comp.os.ms-windows.misc
I                 misc.forsale
J                 sci.med
K                 rec.sport.hockey, rec.sport.baseball

b) 20 newsgroups "Large"
Sub-chain label   Name(s) of newsgroup(s) assigned to sub-chain
K                 rec.sport.hockey, rec.sport.baseball
B                 rec.motorcycles, rec.autos
J                 sci.med
I                 misc.forsale
A                 sci.space
F                 alt.atheism, talk.religion.misc, soc.religion.christian
G                 talk.politics.mideast, talk.politics.guns, talk.politics.misc
C                 comp.sys.mac.hardware, comp.sys.ibm.pc.hardware
E                 sci.electronics
D                 sci.crypt
H                 comp.graphics, comp.windows.x, comp.os.ms-windows.misc
Table 6. Results of comparative analysis for the 20 newsgroups numerical models

Part I
                 Percentage of correct decisions
Clustering       20 newsgroups "Small"              20 newsgroups "Large"
method           (dimensionality of VSM:            (dimensionality of VSM:
                 (n×L) = (232×19 997))              (n×L) = (725×19 997))
DSONN            56.14%                             58.41%
EM               47.52%                             49.12%
FFTA             27.60%                             33.98%
k-means          42.59%                             48.12%

DSONN = Dynamic Self-Organizing Neural Network, EM = Expectation Maximization method,
FFTA = Farthest First Traversal Algorithm

Part II
                 Modifications of the 20 newsgroups document collection
Clustering       Dimensionality (n×L)               Percentage of correct
method           of VSM                             decisions
COS              28 101×~19 216                     38.28%
BOW              28 101×~19 216                     33.97%
sIB              2 000×17 446                       57.50%
sL1              2 000×17 446                       15.30%
sKL              2 000×17 446                       28.80%
k-means          2 000×17 446                       53.40%
sk-means         2 000×17 446                       54.10%

COS = COncept Space representation for paragraphs method [2], BOW = simple Bag-Of-Words
characterization of paragraphs method [2], sIB = sequential Information Bottleneck approach [10],
sL1 and sKL = variations of the sIB approach [10], sk-means = sequential k-means (presented
together with the k-means algorithm in [10]).
Taking into account the results reported in this paper, it is clear that the clustering technique based on dynamic self-organizing neural networks is a powerful tool for large-scale and highly multidimensional cluster-analysis problems such as WWW-newsgroup-document clustering. It provides better or much better clustering accuracy than the alternative techniques applied in this field. Moreover, it is extremely important that the proposed technique automatically determines (adjusts in the course of learning) the number of clusters in a given data set, whereas all the alternative approaches can operate only when the number of clusters is set in advance.
5 Conclusions
The application of the clustering technique based on dynamic self-organizing neural networks (introduced by the same authors in [7], [8] and some earlier papers) to a large-scale and highly multidimensional WWW-newsgroup-document clustering task has been reported in this paper. The collection of 19 997 documents (e-mail messages of different Usenet-News newsgroups) available at the WWW server of the School of Computer Science, Carnegie Mellon University (www.cs.cmu.edu/ TextLearning/datasets.html) has been the subject of clustering. A broad comparative analysis with nine alternative clustering techniques has also been carried out, demonstrating the superiority of the proposed approach in the considered task. In particular, it is worth emphasizing the ability of the proposed technique to automatically determine the number of clusters in the considered data set and its high clustering accuracy. The proposed technique has already been successfully applied to the clustering of another WWW-document collection [8] as well as to several multidimensional data sets [7].
References
1. Berry M.W.: Survey of Text Mining, Springer Verlag, New York, 2004.
2. Caillet M., Pessiot J., Amini M., Gallinari P.: Unsupervised Learning with Term
Clustering For Thematic Segmentation of Texts, Proc. of RIAO 2004 (Recherche
d'Information Assistée par Ordinateur), Toulouse, France, 2004.
3. Chakrabarti S.: Mining the Web: Analysis of Hypertext and Semi Structured Data,
Morgan Kaufmann Publishers, San Francisco, 2002.
4. Franke J., Nakhaeizadeh G., Renz I. (Eds.): Text Mining: Theoretical Aspects and
Applications, Physica Verlag/Springer Verlag, Heidelberg, 2003.
5. Gorzałczany M.B., Rudziński F.: Application of Genetic Algorithms and Kohonen
Networks to Cluster Analysis, in L. Rutkowski, J. Siekmann, R. Tadeusiewicz, L.A.
Zadeh (Eds.), Artificial Intelligence and Soft Computing - ICAISC 2004, Proc. of
7th Int. Conference, Lecture Notes in Artificial Intelligence 3070, Springer-Verlag,
Berlin, Heidelberg, New York, 2004, pp. 556-561.
6. Gorzałczany M.B., Rudziński F.: Modified Kohonen Networks for Complex Cluster-analysis Problems, in L. Rutkowski, J. Siekmann, R. Tadeusiewicz, L.A. Zadeh
(Eds.), Artificial Intelligence and Soft Computing - ICAISC 2004, Proc. of 7th Int.
Conference, Lecture Notes in Artificial Intelligence 3070, Springer-Verlag, Berlin,
Heidelberg, New York, 2004, pp. 562-567.
7. Gorzałczany M.B., Rudziński F.: Cluster Analysis via Dynamic Self-organizing
Neural Networks, in L. Rutkowski, R. Tadeusiewicz, L.A. Zadeh, J. Zurada (Eds.),
Artificial Intelligence and Soft Computing - ICAISC 2006, Proc. of 8th Int. Conference, Lecture Notes in Artificial Intelligence 4029, Springer-Verlag, Berlin, Heidelberg, New York, 2006, pp. 593-602.
8. Gorzałczany M.B., Rudziński F.: Application of dynamic self-organizing neural networks to WWW-document clustering, International Journal of Information Technology and Intelligent Computing, Vol. 1, No. 1, 2006, pp. 89-101 (also presented
at 8th Int. Conference on Artificial Intelligence and Soft Computing ICAISC 2006,
Zakopane).
9. Salton G., McGill M.J.: Introduction to Modern Information Retrieval, McGraw-Hill Book Co., New York, 1983.
10. Slonim N., Friedman N., Tishby N.: Unsupervised Document Classification using Sequential Information Maximization, Proc. of the Twenty-Fifth Annual International ACM SIGIR Conference, Tampere, Finland, 2002, pp. 129-136.
11. Tang B., Shepherd M., Milios E., Heywood M.I.: Comparing and combining dimension reduction techniques for efficient text clustering, in Proc. of Int. Workshop on
Feature Selection and Data Mining, Newport Beach, 2005.
12. Weiss S., Indurkhya N., Zhang T., Damerau F.: Text Mining: Predictive Methods
for Analyzing Unstructured Information, Springer, New York, 2004.
13. Zanasi A. (Ed.): Text Mining and its Applications to Intelligence, CRM and Knowledge Management, WIT Press, Southampton, 2005.