WWW-newsgroup-document Clustering by
Means of Dynamic Self-organizing Neural
Networks
Marian B. Gorzałczany and Filip Rudziński
Department of Electrical and Computer Engineering
Kielce University of Technology
Al. 1000-lecia P.P. 7, 25-314 Kielce, Poland
{m.b.gorzalczany, f.rudzinski}@tu.kielce.pl
Abstract. The paper presents a clustering technique based on dynamic
self-organizing neural networks and its application to a large-scale and
highly multidimensional WWW-newsgroup-document clustering problem. The collection of 19 997 documents (e-mail messages of different Usenet-News newsgroups) available at the WWW server of the School of Computer Science, Carnegie Mellon University (www.cs.cmu.edu/ TextLearning/datasets.html) has been the subject of clustering. A broad comparative analysis with nine alternative clustering techniques has also been carried out, demonstrating the superiority of the proposed approach in the considered problem.
1 Introduction
The rapidly increasing volume of electronically available World-Wide-Web resources makes it more and more important to help users locate and access relevant information as well as to organize it in an intelligible way. Since text and hypertext documents are among the most important and most readily available online WWW resources, text-processing techniques play a central role in this field. Among them, thematic WWW-document clustering techniques (thematic text clustering techniques) are of special interest. In general, given a collection of WWW documents, the task of document clustering is to group documents together in such a way that the documents within each cluster are as "similar" as possible to each other and as "dissimilar" as possible to those of the other clusters.
This paper presents a clustering technique based on dynamic self-organizing neural networks (introduced by the same authors in [7], [8] and some earlier papers such as [5], [6]) and its application to a large-scale and highly multidimensional WWW-newsgroup-document clustering problem. First, the paper presents the concept of a dynamic self-organizing neural network for WWW-document clustering. Then, the Vector-Space-Model representation of WWW documents is outlined, and some approaches to its dimensionality reduction are briefly presented. Next, the application of the proposed technique to the clustering of the collection of 19 997 documents (e-mail messages of different Usenet-News newsgroups) available at the WWW server of the School of Computer Science, Carnegie Mellon University (www.cs.cmu.edu/ TextLearning/datasets.html) is presented. Finally, a broad comparative analysis with several alternative document clustering techniques is carried out.
2 The Concept of a Dynamic Self-Organizing Neural Network for WWW-Document Clustering
A dynamic self-organizing neural network is a generalization of the conventional self-organizing neural network with one-dimensional neighbourhood. Consider the latter case of a network that has n inputs x_1, x_2, ..., x_n and consists of m neurons arranged in a chain; their outputs are y_1, y_2, ..., y_m, where $y_j = \sum_{i=1}^{n} w_{ji} x_i$, j = 1, 2, ..., m, and w_ji are the weights connecting the output of the j-th neuron with the i-th input of the network. Using vector notation (x = (x_1, x_2, ..., x_n)^T, w_j = (w_j1, w_j2, ..., w_jn)^T), y_j = w_j^T x. The learning data consist of L input vectors x_l (l = 1, 2, ..., L). The first stage of any Winner-Takes-Most (WTM) learning algorithm that can be applied to the considered network consists in determining the neuron j_x winning in the competition of neurons when the learning vector x_l is presented to the network. Assuming the normalization of the learning vectors, the winning neuron j_x is selected such that
$$ d(x_l, w_{j_x}) = \min_{j=1,2,\ldots,m} d(x_l, w_j), \qquad (1) $$
where d(x_l, w_j) is a distance measure between x_l and w_j; throughout this paper, a distance measure d based on the cosine similarity function S (most often used for determining the similarity of text documents [1]) will be applied:

$$ d(x_l, w_j) = 1 - S(x_l, w_j) = 1 - \frac{x_l^T w_j}{\|x_l\|\,\|w_j\|} = 1 - \frac{\sum_{i=1}^{n} x_{li} w_{ji}}{\sqrt{\sum_{i=1}^{n} x_{li}^2}\,\sqrt{\sum_{i=1}^{n} w_{ji}^2}}, \qquad (2) $$

($\|\cdot\|$ denotes the Euclidean norm).
The WTM learning rule can be formulated as follows:
$$ w_j(k+1) = w_j(k) + \eta_j(k)\, N(j, j_x, k)\, [x(k) - w_j(k)], \qquad (3) $$
where k is the iteration number, ηj (k) is the learning coefficient, and N (j, jx , k)
is the neighbourhood function. In this paper, the Gaussian-type neighbourhood
function will be used:
$$ N(j, j_x, k) = e^{-\frac{(j - j_x)^2}{2\lambda^2(k)}}, \qquad (4) $$
where λ(k) is the "radius" of the neighbourhood (the width of the Gaussian "bell").
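To make the learning step concrete, the following minimal sketch (an illustration only, not the authors' implementation; the function and variable names are assumptions) shows how the winner of Eq. (1) is found with the cosine-based distance of Eq. (2) and how the weights are updated according to Eqs. (3) and (4):

import numpy as np

def wtm_step(W, x, eta, lam):
    # W: (m, n) matrix of weight vectors w_j (one row per neuron in the chain),
    # x: (n,) learning vector, eta: learning coefficient eta(k), lam: radius lambda(k)
    norms = np.linalg.norm(W, axis=1) * np.linalg.norm(x)
    d = 1.0 - (W @ x) / np.maximum(norms, 1e-12)        # Eq. (2): 1 - cosine similarity
    jx = int(np.argmin(d))                              # Eq. (1): winning neuron
    j = np.arange(W.shape[0])
    N = np.exp(-((j - jx) ** 2) / (2.0 * lam ** 2))     # Eq. (4): Gaussian neighbourhood
    return W + eta * N[:, None] * (x - W)               # Eq. (3): WTM update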
After each learning epoch, five successive operations are activated (under
some conditions) [7]: 1) the removal of single, low-active neurons, 2) the disconnection of a neuron chain, 3) the removal of short neuron sub-chains, 4) the
insertion of additional neurons into the neighbourhood of high-active neurons,
and 5) the reconnection of two selected sub-chains of neurons.
Operations nos. 1, 3, and 4 are the components of the mechanism for the automatic adjustment of the number of neurons in the chain, whereas operations nos. 2 and 5 govern the disconnection and reconnection mechanisms, respectively. Based on experimental investigations, the following conditions for the particular operations have been formulated (the numbering of conditions and operations is the same); an illustrative sketch of the first three checks is given after Condition 5 below. A given operation takes place between neuron no. i and neuron no. i+1, i ∈ {1, 2, ..., r−1}, where r is the number of neurons in the original neuron chain or in a given sub-chain.
Condition 1: win_i < β1, where win_i is the number of wins of the i-th neuron and β1 is an experimentally selected parameter (usually, for complex multidimensional WWW-document clustering, β1 assumes a value of around 50). This condition allows the removal of a single neuron whose activity (measured by the number of its wins) is below the assumed level represented by the parameter β1.
Condition 2: $d_{i,i+1} > \alpha_1 \frac{\sum_{j=1}^{r-1} d_{j,j+1}}{r}$, where d_{i,i+1} is the distance between neurons no. i and no. i+1 (see [1] for details) and α1 is an experimentally selected parameter (usually, for complex problems, α1 ∈ [1, 10]). This condition prevents the excessive disconnection of the neuron chain or sub-chain by allowing only relatively remote neurons to be disconnected.
Condition 3: r_S < β2, where r_S is the number of neurons in sub-chain S and β2 is an experimentally selected parameter (usually β2 ∈ {3, 4}). This condition allows the removal of an r_S-element neuron sub-chain S whose length is shorter than the assumed acceptable value β2.
The operation of inserting additional neurons into the neighbourhood of high-active neurons, in order to take over some of their activity, covers three cases, denoted 4a, 4b, and 4c, respectively.
Condition 4a (the insertion of a new neuron (n) between two neighbouring high-active neurons no. i and no. i+1): IF win_i > β3 AND win_{i+1} > β3 THEN the weight vector w_(n) of the new neuron (n) is calculated as $w_{(n)} = \frac{w_i + w_{i+1}}{2}$, where win_i, win_{i+1} are as in Condition 1 and β3 is an experimentally selected parameter (usually β3 is comparable to β1, which governs Condition 1).
Condition 4b (the replacement of a high-active neuron no. i, accompanied by low-active neurons no. i−1 and no. i+1, by two new neurons (n) and (n+1)): IF win_i > β3 AND win_{i−1} < β3 AND win_{i+1} < β3 THEN the weight vectors w_(n) and w_(n+1) of the new neurons (n) and (n+1) are calculated as $w_{(n)} = \frac{w_{i-1} + w_i}{2}$ and $w_{(n+1)} = \frac{w_i + w_{i+1}}{2}$ (β3 as in Condition 4a).
Condition 4c (the insertion of a new neuron in the neighbourhood of an end-of-chain high-active neuron accompanied by a low-active neighbour; the case of the r-th neuron is considered here; the case of the 1st neuron is analogous): IF win_r > β3 AND win_{r−1} < β3 THEN the weight vector w_{r+1} of the new neuron (r+1) is calculated as $w_{r+1} = w_r + \frac{w_r - w_{r-1}}{d_{r,r-1}}\, d_{avr}$, where $d_{avr} = \frac{1}{r-1}\sum_{j=1}^{r-1} d_{j,j+1}$ (β3 as in Condition 4a and d_{j,j+1} as in Condition 2).
Condition 5: $d_{e_{S1},e_{S2}} < \alpha_2 \left[ \frac{1}{2} \left( \frac{\sum_{j=1}^{r_{S1}-1} d_{j,j+1}}{r_{S1}} + \frac{\sum_{j=1}^{r_{S2}-1} d_{j,j+1}}{r_{S2}} \right) \right]$, where S1 and S2 are two sub-chains (containing r_S1 and r_S2 neurons, respectively) whose appropriate ends e_S1 ∈ {1, r_S1} and e_S2 ∈ {1, r_S2} are closest to each other; sub-chains S1 and S2 are the candidates for connection by combining their ends e_S1 and e_S2 (d_{j,j+1} as in Condition 2, α2 is an experimentally selected parameter, usually comparable to α1, which governs Condition 2). This condition allows two sub-chains to be connected only if their closest ends are about as close to each other as neighbouring neurons within a sub-chain, i.e., if both sub-chains correspond to compact pieces of the same cluster of data.
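The following minimal sketch (illustrative only; the helper names and default parameter values are assumptions, not the authors' code) shows how the checks behind operations 1-3 could be applied to one sub-chain after a learning epoch:

import numpy as np

def adjust_chain(W, wins, beta1=50, alpha1=3.0, beta2=3):
    # W: (r, n) weight vectors of one sub-chain, wins: (r,) win counters of its neurons.
    # Returns the surviving sub-chains as index arrays into the original chain.
    keep = np.flatnonzero(wins >= beta1)                    # Condition 1: remove low-active neurons
    W = W[keep]
    r = keep.size
    if r < 2:
        return [keep] if r >= beta2 else []
    # cosine-based distances between neighbouring neurons (cf. Eq. (2))
    denom = np.maximum(np.linalg.norm(W[:-1], axis=1) * np.linalg.norm(W[1:], axis=1), 1e-12)
    d = 1.0 - np.einsum('ij,ij->i', W[:-1], W[1:]) / denom
    cut = np.flatnonzero(d > alpha1 * d.sum() / r) + 1      # Condition 2: disconnect remote links
    sub_chains = np.split(keep, cut)
    return [s for s in sub_chains if s.size >= beta2]       # Condition 3: drop short sub-chains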
3 Vector Space Model of WWW Documents - an Outline [8]
Consider a collection of L WWW documents. In the Vector Space Model (VSM) [3, 4, 9, 12, 13], every document in the collection is represented by a vector x_l = (x_l1, x_l2, ..., x_ln)^T (l = 1, 2, ..., L). Component x_li (i = 1, 2, ..., n) of such a vector represents the i-th key word or term that occurs in the l-th document. The value of x_li depends on the degree of relationship between the i-th term and the l-th document. Among various schemes for measuring this relationship (very often referred to as term weighting), three are the most popular: a) binary term weighting: x_li = 1 when the i-th term occurs in the l-th document and x_li = 0 otherwise, b) tf-weighting (tf stands for term frequency): x_li = tf_li, where tf_li denotes how many times the i-th term occurs in the l-th document, and c) tf-idf-weighting (tf-idf stands for term frequency - inverse document frequency): x_li = tf_li log(L/df_i), where tf_li is the term frequency as in tf-weighting, df_i denotes the number of documents in which the i-th term appears, and L is the total number of documents in the collection. In this paper, tf-weighting will be applied. Once the way of determining x_li is selected, the Vector Space Model can be formulated in matrix form:
$$ VSM_{(n \times L)} = X_{(n \times L)} = [x_l]_{l=1,2,\ldots,L} = [x_{li}]^T_{l=1,2,\ldots,L;\; i=1,2,\ldots,n}, \qquad (5) $$

where the index (n×L) represents its dimensionality.
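As an illustration of the above weighting schemes (a sketch only; the matrix layout and function name are assumptions made for this example), the VSM matrix of Eq. (5) can be assembled from raw term counts as follows:

import numpy as np

def build_vsm(counts, scheme="tf"):
    # counts: (L, n) matrix of raw term frequencies tf_li (one row per document).
    # Returns the (n x L) VSM matrix X of Eq. (5) with the chosen term weighting.
    L = counts.shape[0]
    if scheme == "binary":
        X = (counts > 0).astype(float)                 # x_li in {0, 1}
    elif scheme == "tf":                               # the weighting used in this paper
        X = counts.astype(float)                       # x_li = tf_li
    elif scheme == "tf-idf":
        df = np.count_nonzero(counts, axis=0)          # df_i: documents containing term i
        X = counts * np.log(L / np.maximum(df, 1))     # x_li = tf_li * log(L / df_i)
    else:
        raise ValueError("unknown weighting scheme: " + scheme)
    return X.T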
The issue of VSM dimensionality reduction is of essential significance as far as the practical usage of VSMs is concerned. There are two main classes of techniques for VSM dimensionality reduction [11]: a) feature selection methods, and b) feature transformation methods. Among the techniques that can be included in the afore-mentioned class a) are: filtering, stemming, stop-word removal, and the proper feature-selection methods, which sort the terms and then eliminate some of them on the basis of numerical measures computed from the considered collection of documents. The first three techniques are sometimes classified as text-preprocessing methods; however, since they significantly contribute to VSM dimensionality reduction, here, for simplicity, they have been included in the afore-mentioned class a).
During filtering (and tokenization), special characters such as %, #, $, etc. are removed from the original text, and word and sentence boundaries are identified in it. As a result, the initial VSM_(nini×L) is obtained, where nini is the number of different words isolated from all documents.

During stemming, all words in the initial model are replaced by their respective stems (a stem is the portion of a word left after removing its suffixes and prefixes). As a result, VSM_(nstem×L) is obtained, where nstem < nini.

During stop-word removal (the removal of words from a so-called stop list), words that on their own do not have identifiable meanings and are therefore of little use in various text processing tasks are eliminated from the model. As a result, VSM_(nstpl×L) is obtained, where nstpl < nstem.
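A toy sketch of these three preprocessing steps is given below (illustrative only; the suffix list and the tiny stop list are assumptions, not the actual stemmer or stop list used in the experiments):

import re

STOP_WORDS = {"the", "a", "an", "of", "and", "or", "to", "in", "is", "it"}   # tiny example stop list
SUFFIXES = ("ing", "ed", "es", "s")        # toy stand-in for a real stemming algorithm

def preprocess(text):
    tokens = re.findall(r"[a-z]+", text.lower())             # filtering and tokenization
    stems = [next((t[:-len(s)] for s in SUFFIXES
                   if t.endswith(s) and len(t) > len(s) + 2), t)
             for t in tokens]                                 # crude suffix stripping ("stemming")
    return [t for t in stems if t not in STOP_WORDS]          # stop-word removal

# e.g. preprocess("Clustering of the WWW documents") returns ['cluster', 'www', 'document']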
Feature selection methods usually operate on a term quality q_i, i = 1, 2, ..., nstpl, defined for each term occurring in the latest VSM. Terms characterized by q_i < qtres, where qtres is a pre-defined threshold value, are removed from the model. In this paper, the document-frequency-based method will be used to determine q_i, that is, q_i = df_i, where df_i is the number of documents in which the i-th term occurs. As a result, the final VSM_(nfin×L) is obtained, where nfin < nstpl.
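A minimal sketch of this document-frequency-based selection (illustrative only; it assumes a tf-weighted matrix laid out as in the previous sketch) is:

import numpy as np

def select_by_document_frequency(X, q_tres):
    # X: (n, L) tf-weighted VSM matrix. Terms with quality q_i = df_i < q_tres are removed.
    df = np.count_nonzero(X, axis=1)        # q_i = df_i, number of documents containing term i
    keep = np.flatnonzero(df >= q_tres)
    return X[keep, :], keep                 # reduced (nfin x L) matrix and retained term indices
    # e.g. q_tres = 1000 and q_tres = 400 yield the "Small" and "Large" models of Section 4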
4 Application to a Complex, Multidimensional WWW-newsgroup-document Clustering Problem
The proposed clustering technique based on dynamic self-organizing neural networks will now be applied to a real-life WWW-newsgroup-document clustering problem, that is, to the clustering of the multidimensional, large-scale collection of 19 997 documents (e-mail messages of different Usenet-News newsgroups) available at the WWW server of the School of Computer Science, Carnegie Mellon University (www.cs.cmu.edu/ TextLearning/datasets.html). Henceforward, the collection will be called 20 newsgroups. The considered collection is partitioned (nearly) evenly across 20 different newsgroups, each corresponding to a different topic. It is worth emphasizing that some of the newsgroups are very closely related to each other, while others are highly unrelated (see the subsequent part of the paper). Since the assignments of documents to newsgroups are known here, a direct verification of the obtained results is possible. Obviously, this knowledge of the newsgroup assignments is by no means used by the clustering system (it works in a fully unsupervised way).
The process of dimensionality reduction of the initial VSM for the 20 newsgroups document collection is presented in Table 1, with the use of the notation introduced in Section 3 (additionally, in square brackets, the overall numbers of occurrences of all terms in all documents of the collection are given). For the considered document collection, two final numerical models (identified in Table 1 as the "Small" and "Large" data sets) have been obtained. For this purpose, two values of the threshold parameter qtres have been considered: qtres = 1 000, to get a model of reduced dimensionality (the "Small"-type data set), and qtres = 400, to get a model of higher dimensionality but also of higher accuracy (the "Large"-type data set).
Table 1. The dimensionality reduction of the initial VSM for the 20 newsgroups document collection

VSM                Dimensionality of VSM for the 20 newsgroups document collection
VSM_(nini×L)       (nini×L)  = (122 005×19 997)   [2 644 002]
VSM_(nstem×L)      (nstem×L) = (99 072×19 997)    [2 526 731]
VSM_(nstpl×L)      (nstpl×L) = (98 599×19 997)    [1 677 316]
VSM_(nfin×L)       20 newsgroups "Small" (qtres = 1 000): (nfin×L) = (232×19 997)   [461 512]
                   20 newsgroups "Large" (qtres = 400):   (nfin×L) = (725×19 997)   [524 321]
Figs. 1 and 2 present the performance of the proposed clustering technique for the 20 newsgroups "Small" and "Large" numerical models of the 20 newsgroups collection of documents. As the learning progresses, both systems adjust the overall numbers of neurons in their networks (Figs. 1a and 2a), which finally equal 273 and 251, respectively, as well as the numbers of sub-chains (Figs. 1b and 2b), which finally reach the value of 11 in both cases; the number of sub-chains is equal to the number of clusters detected in a given numerical model of the document collection. The envelopes of the nearness histograms for the routes in the attribute spaces of the 20 newsgroups "Small" and "Large" data sets (Figs. 1c and 2c) reveal perfectly clear images of the cluster distributions in them, including the numbers of clusters and the cluster boundaries (indicated by 10 local minima in the plots of Figs. 1c and 2c). After performing the so-called calibration of both neural networks, class labels (represented by letters 'A' through 'K') can be assigned to particular sub-chains of the networks, as shown in Figs. 1c and 2c. The difference between the number of detected clusters and the number of newsgroups results from the above-mentioned fact that some of the newsgroups are very closely related to each other and are thus perceived by the clustering system as one cluster (details on the calibration of both neural networks are presented below). It is worth mentioning that both systems detect the same number of clusters, which confirms the internal consistency of the proposed approach.
Since the newsgroup assignments are known in the original collection of documents, a direct verification of the obtained results is also possible. Detailed numerical results of the clustering and of the network calibration are collected in Tables 2 and 3 for the 20 newsgroups "Small" and "Large" data sets, respectively. For every newsgroup, the sub-chain (class) that received the largest number of that newsgroup's documents determines the class (a letter from 'A' to 'K') assigned to the newsgroup; the resulting numbers of correct and wrong decisions and the percentages of correct decisions are reported for each newsgroup and for the whole collection.
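The verification itself is straightforward; the following sketch (illustrative only, with an assumed data layout and function name) builds the newsgroup-versus-class contingency table, assigns each newsgroup to its majority sub-chain and computes the percentage of correct decisions (PCD) reported in Tables 2 and 3:

import numpy as np

def verify_clustering(newsgroup_of_doc, subchain_of_doc, n_groups, n_subchains):
    # Both inputs are integer arrays of length L (one entry per document).
    table = np.zeros((n_groups, n_subchains), dtype=int)
    np.add.at(table, (newsgroup_of_doc, subchain_of_doc), 1)   # contingency table
    assigned_class = table.argmax(axis=1)                      # majority sub-chain per newsgroup
    ncd = table[np.arange(n_groups), assigned_class]           # correct decisions per newsgroup
    pcd = 100.0 * ncd / table.sum(axis=1)                      # per-newsgroup PCD [%]
    overall_pcd = 100.0 * ncd.sum() / len(newsgroup_of_doc)    # overall PCD [%]
    return assigned_class, ncd, pcd, overall_pcd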
Figure 1. The plots of the number of neurons (a) and the number of sub-chains (b) vs. epoch number, and (c) the envelope of the nearness histogram for the route in the attribute space of the 20 newsgroups "Small" data set (sub-chains labelled 'A' through 'K').
Figure 2. The plots of the number of neurons (a) and the number of sub-chains (b) vs. epoch number, and (c) the envelope of the nearness histogram for the route in the attribute space of the 20 newsgroups "Large" data set (sub-chains labelled 'A' through 'K').
Table 2. Clustering results for the 20 newsgroups "Small" data set (Class = sub-chain to which the newsgroup is assigned, cf. Table 5)

Newsgroup name              Class   NCD¹     NWD²     PCD³ [%]
misc.forsale                  I      729      271      72.90
comp.graphics                 H      612      388      61.20
comp.windows.x                H      633      367      63.30
comp.os.ms-windows.misc       H      586      414      58.60
comp.sys.mac.hardware         C      632      368      63.20
comp.sys.ibm.pc.hardware      C      599      401      59.90
sci.electronics               E      391      609      39.10
rec.motorcycles               B      602      398      60.20
rec.autos                     B      411      589      41.10
talk.politics.mideast         G      487      513      48.70
talk.politics.guns            G      583      417      58.30
talk.politics.misc            G      394      606      39.40
rec.sport.hockey              K      578      422      57.80
rec.sport.baseball            K      681      319      68.10
sci.crypt                     D      512      488      51.20
sci.space                     A      538      462      53.80
sci.med                       J      429      571      42.90
alt.atheism                   F      533      467      53.30
talk.religion.misc            F      562      438      56.20
soc.religion.christian        F      735      262      73.72
ALL:                                11227     8770     56.14

¹ NCD = Number of correct decisions, ² NWD = Number of wrong decisions,
³ PCD = Percentage of correct decisions.
The detailed results of Tables 2 and 3 are summarized in Tables 4 and 5. Table 4 presents the overall numbers of documents that have been assigned to particular sub-chains for both the "Small" and "Large" data sets. Additionally, the numbers of neurons belonging to particular sub-chains of the overall chains (see also Figs. 1c and 2c) have been included for both data sets. In turn, Table 5 presents the assignments of particular newsgroups to successive sub-chains (classes). It is worth emphasizing that the proposed clustering technique not only detects the same number of clusters in both the "Small" and "Large" data sets (as already mentioned earlier in this section) but also assigns the same newsgroups to the corresponding clusters in both data sets. This is another confirmation of the internal consistency of the proposed clustering technique.
The important issue of the accuracy of the proposed technique will now be considered in the framework of a broad comparative analysis with several alternative clustering methods applied to the 20 newsgroups "Small" and "Large" data sets as well as to some modifications of the original 20 newsgroups document collection. Part I of Table 6 presents the results of the comparative analysis with three alternative approaches (listed under Part I of Table 6) applied to the "Small" and "Large" data sets. In order to carry out the clustering of the 20 newsgroups "Small" and "Large" data sets with these techniques, the WEKA (Waikato Environment for Knowledge Analysis) application, which implements them, has been used. The WEKA application as well as details on the clustering techniques can be found on the WWW site of the University of Waikato, New Zealand (www.cs.waikato.ac.nz/ml/weka).
Table 3. Clustering results for the 20 newsgroups "Large" data set (Class = sub-chain to which the newsgroup is assigned, cf. Table 5)

Newsgroup name              Class   NCD¹     NWD²     PCD³ [%]
misc.forsale                  I      794      206      79.40
comp.graphics                 H      702      298      70.20
comp.windows.x                H      650      350      65.00
comp.os.ms-windows.misc       H      591      409      59.10
comp.sys.mac.hardware         C      644      356      64.40
comp.sys.ibm.pc.hardware      C      613      387      61.30
sci.electronics               E      405      595      40.50
rec.motorcycles               B      623      377      62.30
rec.autos                     B      432      568      43.20
talk.politics.mideast         G      503      497      50.30
talk.politics.guns            G      609      391      60.90
talk.politics.misc            G      412      588      41.20
rec.sport.hockey              K      601      399      60.10
rec.sport.baseball            K      690      310      69.00
sci.crypt                     D      523      477      52.30
sci.space                     A      543      457      54.30
sci.med                       J      437      563      43.70
alt.atheism                   F      547      453      54.70
talk.religion.misc            F      579      421      57.90
soc.religion.christian        F      798      199      80.04
ALL:                                11680     8317     58.41

¹ NCD = Number of correct decisions, ² NWD = Number of wrong decisions,
³ PCD = Percentage of correct decisions.
In order to extend the comparative-analysis aspects of this paper, Part II of Table 6 includes the results, reported in the literature, of clustering some modifications of the original 20 newsgroups document collection. This time, the performance of the dynamic self-organizing neural network clustering technique has been compared with seven alternative approaches, which are listed under Part II of Table 6.
Table 4. Sub-chain (class) labels, numbers of documents assigned to particular sub-chains and numbers of neurons in the chains for the 20 newsgroups "Small" (a) and 20 newsgroups "Large" (b) data sets

a) 20 newsgroups "Small"                         b) 20 newsgroups "Large"
Sub-chain      Documents      Neurons in        Sub-chain      Documents      Neurons in
(class) label  assigned       the chain         (class) label  assigned       the chain
A                887            1 - 12          K               1 434           1 - 16
B              1 241           13 - 29          B               1 544          17 - 35
C              2 852           30 - 68          J                 713          36 - 44
D                952           69 - 81          I               2 093          45 - 72
E              1 469           82 - 101         A                 798          73 - 82
F              2 832          102 - 140         F               3 198          83 - 123
G              2 268          141 - 171         G               1 989         124 - 148
H              2 229          172 - 201         C               2 791         149 - 183
I              1 838          202 - 226         E                 731         184 - 192
J              1 602          227 - 248         D                 939         193 - 204
K              1 827          249 - 273         H               3 767         205 - 251
Table 5. Assignments of particular newsgroups to sub-chains (classes) for the 20 newsgroups "Small" (a) and 20 newsgroups "Large" (b) data sets

a) 20 newsgroups "Small"
Sub-chain label   Name(s) of newsgroup(s) assigned to sub-chain
A                 sci.space
B                 rec.motorcycles, rec.autos
C                 comp.sys.mac.hardware, comp.sys.ibm.pc.hardware
D                 sci.crypt
E                 sci.electronics
F                 alt.atheism, talk.religion.misc, soc.religion.christian
G                 talk.politics.mideast, talk.politics.guns, talk.politics.misc
H                 comp.graphics, comp.windows.x, comp.os.ms-windows.misc
I                 misc.forsale
J                 sci.med
K                 rec.sport.hockey, rec.sport.baseball

b) 20 newsgroups "Large"
Sub-chain label   Name(s) of newsgroup(s) assigned to sub-chain
K                 rec.sport.hockey, rec.sport.baseball
B                 rec.motorcycles, rec.autos
J                 sci.med
I                 misc.forsale
A                 sci.space
F                 alt.atheism, talk.religion.misc, soc.religion.christian
G                 talk.politics.mideast, talk.politics.guns, talk.politics.misc
C                 comp.sys.mac.hardware, comp.sys.ibm.pc.hardware
E                 sci.electronics
D                 sci.crypt
H                 comp.graphics, comp.windows.x, comp.os.ms-windows.misc
Table 6. Results of comparative analysis for the 20 newsgroups numerical models

Part I
                 Percentage of correct decisions
Clustering       20 newsgroups "Small"              20 newsgroups "Large"
method           (dimensionality of VSM:            (dimensionality of VSM:
                 (n×L) = (232×19 997))              (n×L) = (725×19 997))
DSONN            56.14%                             58.41%
EM               47.52%                             49.12%
FFTA             27.60%                             33.98%
k-means          42.59%                             48.12%

DSONN = Dynamic Self-Organizing Neural Network, EM = Expectation Maximization method,
FFTA = Farthest First Traversal Algorithm

Part II
                 Modifications of the 20 newsgroups document collection
Clustering       Dimensionality (n×L)               Percentage of correct
method           of VSM                             decisions
COS              28 101×~19 216                     38.28%
BOW              28 101×~19 216                     33.97%
sIB              2 000×17 446                       57.50%
sL1              2 000×17 446                       15.30%
sKL              2 000×17 446                       28.80%
k-means          2 000×17 446                       53.40%
sk-means         2 000×17 446                       54.10%

COS = COncept Space representation for paragraphs method [2], BOW = simple Bag-Of-Words
characterization of paragraphs method [2], sIB = sequential Information Bottleneck approach [10],
sL1 and sKL = variations of the sIB approach [10], sk-means = sequential k-means (presented
together with the k-means algorithm in [10]).
Taking into account the results reported in this paper, it is clear that the clustering technique based on dynamic self-organizing neural networks is a powerful tool for large-scale and highly multidimensional cluster-analysis problems such as WWW-newsgroup-document clustering. It provides better or much better clustering accuracy than the alternative techniques applied in this field. Moreover, it is extremely important that the proposed technique automatically determines (adjusts in the course of learning) the number of clusters in a given data set, whereas all the alternative approaches can operate only when the number of clusters is set in advance.
5 Conclusions
The application of the clustering technique based on dynamic self-organizing neural networks (introduced by the same authors in [7], [8] and some earlier papers) to a large-scale and highly multidimensional WWW-newsgroup-document clustering task has been reported in this paper. The collection of 19 997 documents (e-mail messages of different Usenet-News newsgroups) available at the WWW server of the School of Computer Science, Carnegie Mellon University (www.cs.cmu.edu/ TextLearning/datasets.html) has been the subject of clustering. A broad comparative analysis with nine alternative clustering techniques has also been carried out, demonstrating the superiority of the proposed approach in the considered task. In particular, it is worth emphasizing the ability of the proposed technique to automatically determine the number of clusters in the considered data set and its high clustering accuracy. The proposed technique has already been successfully applied to the clustering of another WWW-document collection [8] as well as to several multidimensional data sets [7].
References
1. Berry M.W.: Survey of Text Mining, Springer Verlag, New York, 2004.
2. Caillet M., Pessiot J., Amini M., Gallinari P.: Unsupervised Learning with Term
Clustering For Thematic Segmentation of Texts, Proc. of RIAO 2004 (Recherche
d'Information Assistée par Ordinateur), Toulouse, France, 2004.
3. Chakrabarti S.: Mining the Web: Analysis of Hypertext and Semi Structured Data,
Morgan Kaufmann Publishers, San Francisco, 2002.
4. Franke J., Nakhaeizadeh G., Renz I. (Eds.): Text Mining: Theoretical Aspects and
Applications, Physica Verlag/Springer Verlag, Heidelberg, 2003.
5. Gorzałczany M.B., Rudziński F.: Application of Genetic Algorithms and Kohonen
Networks to Cluster Analysis, in L. Rutkowski, J. Siekmann, R. Tadeusiewicz, L.A.
Zadeh (Eds.), Artificial Intelligence and Soft Computing - ICAISC 2004, Proc. of
7th Int. Conference, Lecture Notes in Artificial Intelligence 3070, Springer-Verlag,
Berlin, Heidelberg, New York, 2004, pp. 556-561.
6. Gorzałczany M.B., Rudziński F.: Modified Kohonen Networks for Complex Cluster-analysis Problems, in L. Rutkowski, J. Siekmann, R. Tadeusiewicz, L.A. Zadeh
(Eds.), Artificial Intelligence and Soft Computing - ICAISC 2004, Proc. of 7th Int.
Conference, Lecture Notes in Artificial Intelligence 3070, Springer-Verlag, Berlin,
Heidelberg, New York, 2004, pp. 562-567.
7. Gorzałczany M.B., Rudziński F.: Cluster Analysis via Dynamic Self-organizing
Neural Networks, in L. Rutkowski, R. Tadeusiewicz, L.A. Zadeh, J. Zurada (Eds.),
Artificial Intelligence and Soft Computing - ICAISC 2006, Proc. of 8th Int. Conference, Lecture Notes in Artificial Intelligence 4029, Springer-Verlag, Berlin, Heidelberg, New York, 2006, pp. 593-602.
8. Gorzałczany M.B., Rudziński F.: Application of dynamic self-organizing neural networks to WWW-document clustering, International Journal of Information Technology and Intelligent Computing, Vol. 1, No. 1, 2006, pp. 89-101 (also presented
at 8th Int. Conference on Artificial Intelligence and Soft Computing ICAISC 2006,
Zakopane).
9. Salton G., McGill M.J.: Introduction to Modern Information Retrieval, McGraw-Hill Book Co., New York, 1983.
10. Slonim N., Friedman N., Tishby N.: Unsupervised Document Classification using Sequential Information Maximization, Proc. of the Twenty-Fifth Annual International ACM SIGIR Conference, Tampere, Finland, 2002, pp. 129-136.
11. Tang B., Shepherd M., Milios E., Heywood M.I.: Comparing and combining dimension reduction techniques for efficient text clustering, in Proc. of Int. Workshop on
Feature Selection and Data Mining, Newport Beach, 2005.
12. Weiss S., Indurkhya N., Zhang T., Damerau F.: Text Mining: Predictive Methods
for Analyzing Unstructured Information, Springer, New York, 2004.
13. Zanasi A. (Ed.): Text Mining and its Applications to Intelligence, CRM and Knowledge Management, WIT Press, Southampton, 2005.