Text Documents Clustering

Greta Ciganaitė
Faculty of Informatics
Vytautas Magnus University
Kaunas, Lithuania
Email: [email protected]

Aušra Mackutė-Varoneckienė
Faculty of Informatics
Vytautas Magnus University
Kaunas, Lithuania
Email: [email protected]

Tomas Krilavičius
Vytautas Magnus University
Baltics Institute of Advanced Technology
Vilnius, Lithuania
Email: [email protected]
Abstract—Large amounts of textual information are generated every day, and existing techniques can hardly cope with such an information flow. However, users expect fast and accurate information management and retrieval tools. Clustering is a well-known technique for grouping similar data, making it more manageable and usable. Text clustering is an adaptation of clustering to a very specific kind of data: documents. However, it does not transfer directly to every language, i.e. the specifics of a language strongly influence performance, as results for English and other well-investigated languages show. In this paper we apply different distance measures and clustering approaches to Lithuanian data, discuss the results, and provide recommendations for clustering documents in Lithuanian.
Keywords—text document clustering, similarity measures.

I. INTRODUCTION
Large amounts of textual information are generated every day, and existing techniques can hardly cope with such an information flow. However, users expect fast and accurate information management and retrieval tools. Clustering is a well-known technique for grouping similar data, making it more manageable and usable. Text clustering is an adaptation of clustering to a very specific kind of data: documents. However, it does not transfer directly to every language, i.e. the specifics of a language strongly influence performance, as results for English and other well-investigated languages show.

There is quite a number of different clustering techniques [1, 2], which can be used with different similarity measures [1, 2]. In this paper we apply the most common similarity measures (Euclidean and cosine) to textual data. Moreover, we experiment with the well-known K-means [1, 2] and Expectation-Maximization clustering approaches. These approaches, as well as the similarity measures, have been well investigated for the English language [1, 2]; it is therefore reasonable to apply them to the Lithuanian language as well.

The paper is organized as follows. In the second section we discuss different clustering techniques and evaluation measures. We then continue with experiments and conclude the paper in section IV.

II. CLUSTERING

Clustering is an unsupervised learning technique whose essence is to divide a set of documents into smaller sets according to the meaning of each document. Sets made up of many documents with similar meaning are called clusters. The main property of clusters is that each cluster must be as dissimilar as possible from any other cluster, while documents within one cluster must be as similar as possible [1, 2]. Clustering can be performed in two ways: hard and soft clustering [1]. Hard clustering means that each document must belong to exactly one cluster (K-means, see subsection II.B), while in soft clustering (EM, see subsection II.C) each document can belong to more than one cluster, with a probability for each cluster which indicates how strongly the document depends on that cluster.

The document clustering process consists of the following steps (a minimal end-to-end sketch follows the list):
1. Document preparation.
2. Similarity measure choice.
3. Clustering algorithm choice.
4. Method evaluation.
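To make the four steps concrete, the following minimal Python sketch wires them together with scikit-learn. This is an illustration under assumed tooling (the experiments in this paper used .ARFF files and their own pre-processing), and the document strings are hypothetical.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["pirmas dokumentas", "antras dokumentas", "trecias tekstas"]  # hypothetical documents

# Step 1: document preparation - turn raw texts into a term-document matrix.
X = TfidfVectorizer().fit_transform(docs)

# Steps 2-3: similarity measure and algorithm choice - K-means with Euclidean distance.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Step 4: method evaluation - here we simply inspect the assignment of documents to clusters.
print(labels)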
A. Similarity

Generally, a similarity measure defines the similarity between two objects. In the case of clustering, a similarity measure is used to describe the homogeneity between two clusters. It is a very important part of clustering because, as mentioned before, documents in the same cluster must be as homogeneous as possible, and clusters have to be as heterogeneous as possible [1, 2].
There are many ways to measure distance. The most popular distance measure is the Euclidean distance, and we additionally extend the investigation to the cosine distance.

The Euclidean distance [3, 4] can be interpreted geometrically as the length of the straight line between two objects:
d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}    (1)

where x and y are two objects, the i-th component defines an attribute of an object, and n is the number of attributes (dimensions).
Cosine similarity [3, 4] measures the cosine of the angle between two documents:

\cos(d_i, d_j) = \frac{d_i \cdot d_j}{\|d_i\| \, \|d_j\|}    (2)

where d_i and d_j are two different documents.
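As a concrete illustration (not from the paper), a minimal Python sketch computing the Euclidean distance (1) and cosine similarity (2) between two term-frequency vectors:

import numpy as np

def euclidean_distance(x, y):
    # Eq. (1): straight-line distance between two n-dimensional objects.
    return np.sqrt(np.sum((x - y) ** 2))

def cosine_similarity(d_i, d_j):
    # Eq. (2): cosine of the angle between two document vectors.
    return np.dot(d_i, d_j) / (np.linalg.norm(d_i) * np.linalg.norm(d_j))

# Hypothetical term-frequency vectors of two documents.
d1 = np.array([2.0, 0.0, 1.0, 3.0])
d2 = np.array([1.0, 1.0, 0.0, 2.0])
print(euclidean_distance(d1, d2))  # 2.0
print(cosine_similarity(d1, d2))   # ~0.873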
It is also possible to measure the distance between two clusters. The distance between two clusters can be understood as the average distance between all documents in those clusters, as the distance between the nearest documents in those clusters, or as the distance between the furthest documents in those clusters [4].
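These three cluster-to-cluster distances are commonly known as average, single and complete linkage. The sketch below (an illustration, not the authors' code) expresses them directly:

import numpy as np
from itertools import product

def pairwise(A, B):
    # All Euclidean distances between documents of cluster A and cluster B.
    return [np.linalg.norm(a - b) for a, b in product(A, B)]

def average_linkage(A, B):
    # Average distance between all documents of the two clusters.
    d = pairwise(A, B)
    return sum(d) / len(d)

def single_linkage(A, B):
    # Distance between the nearest documents of the two clusters.
    return min(pairwise(A, B))

def complete_linkage(A, B):
    # Distance between the furthest documents of the two clusters.
    return max(pairwise(A, B))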
B. K-means

K-means is a simple unsupervised learning algorithm used to solve the clustering problem. Its essence is to repeatedly re-elect the centroids of the clusters, while every object that is not a centroid is assigned to the cluster of the nearest centroid. In each iteration the method calculates the distances between the centroids and each object, and each object is assigned to the centroid to which the distance is shortest [1, 3, 4].

This algorithm is very simple and fits big data sets, but it has one significant problem: the number of clusters. This number has to be selected before the first K-means iteration and cannot be changed during the whole process, and there is no general method for setting it. The simplest way is to experiment with different numbers of clusters, compare the results, and select the best one.

To sum up, the K-means algorithm step by step (a runnable sketch follows the list):

1. Fix the number of clusters.
2. Identify the initial centroids.
3. Measure the distance between each centroid and each element.
4. Identify the new centroids of each cluster.
5. Re-assign each document to the cluster to whose centroid the distance is shortest.
6. Stop if the centroids are the same after re-identification and each element is assigned to its nearest centroid; otherwise go to step 3.
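The six steps translate almost line for line into code. A minimal NumPy sketch (an illustration of the textbook algorithm, not the implementation used in the experiments; empty clusters are ignored for simplicity):

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Steps 1-2: fix the number of clusters and identify initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: distances between each centroid and each element.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # Step 5: re-assign each document to the nearest centroid's cluster.
        labels = dists.argmin(axis=1)
        # Step 4: identify the new centroid of each cluster.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 6: stop once the centroids no longer change, else repeat.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

Trying several values of k and keeping the best result, as suggested above, is then a simple loop over calls to kmeans(X, k).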
C. EM (Expectation-Maximization) [5]

EM is an iterative algorithm for likelihood maximization in problems with incomplete data. It consists of the following steps [5]:

1. (E-step) Filling in the missing data.
2. (M-step) Computing new parameters which maximize the likelihood of the completed data.
3. Repeating both steps until convergence (in practice the method converges only to a local maximum).

In EM the completed data are assumed to be generated by one distribution (e.g. Gaussian) in the hard EM version, or by several distributions in the soft EM version [8].

The EM algorithm is quite precise, but it is not very suitable for big corpora: as the experiments below show, its execution time for big data sets is very long and the results are not very satisfactory.
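In practice, a Gaussian mixture fitted by EM is available off the shelf. A hedged sketch using scikit-learn's GaussianMixture (a substitution for illustration; the paper does not specify this library, and the input vectors are hypothetical):

import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical low-dimensional document vectors (e.g. after dimensionality reduction).
X = np.random.default_rng(0).normal(size=(100, 5))

gm = GaussianMixture(n_components=3, covariance_type="diag", random_state=0).fit(X)
hard_labels = gm.predict(X)        # hard view: one cluster per document
soft_labels = gm.predict_proba(X)  # soft view: a probability for every cluster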
D. Evaluation

External evaluation of a clustering model uses a confusion matrix (the result of the clustering). The confusion matrix is a table which shows the number of documents that were assigned to clusters correctly and incorrectly. The following contingency table is computed in accordance with the confusion matrix [1, 2]:

TABLE I. CONTINGENCY TABLE

                Relevant               Not relevant
Retrieved       True positives (tp)    False positives (fp)
Not retrieved   False negatives (fn)   True negatives (tn)

The external evaluation measures (precision, recall and F-measure) are computed from the values of the contingency table (Table I).

Precision [1, 2] is the proportion of retrieved information that is relevant, or simply the average probability of correct retrieval:

P = \frac{tp}{tp + fp}    (3)

Recall [1, 2] is a measure which describes the ratio of retrieved relevant information to all relevant information; in other words, it is the average probability of complete retrieval:

R = \frac{tp}{tp + fn}    (4)

The F-measure (F1) [1, 2] is the weighted harmonic mean of precision and recall:

F_1 = \frac{2PR}{P + R} = \frac{2\,tp}{2\,tp + fn + fp}    (5)

Inherently these measures lie in the interval [0, 1] or, converted to a percentage scale, between 0 and 100. Different users require different values: users such as students, who want to get precise information, usually require a higher precision value than recall; conversely, users such as professional researchers prefer a higher recall value than precision [1].
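A small sketch computing (3)-(5) directly from contingency-table counts (illustrative values only):

def precision(tp, fp):
    return tp / (tp + fp)               # Eq. (3)

def recall(tp, fn):
    return tp / (tp + fn)               # Eq. (4)

def f1(tp, fp, fn):
    return 2 * tp / (2 * tp + fn + fp)  # Eq. (5)

# Hypothetical counts read off a contingency table (Table I).
tp, fp, fn = 80, 20, 40
print(precision(tp, fp), recall(tp, fn), f1(tp, fp, fn))  # 0.8 0.666... 0.727...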
III. EXPERIMENTS
We used two different algorithms (K-means, with Euclidean distance and with cosine similarity, and EM) to cluster articles of the Internet daily newspaper Lrytas.lt and posts from the largest Internet forum for women, supermama.lt. Stemming and Bag of Words (BOW) techniques were used for document pre-processing. Stemming is the process which reduces words to their stem (e.g. pienas, pieno, pieną would be stemmed to pien) [6]. Bag of Words (BOW) is the technique which splits a text into distinct words, disregarding grammar and word order [9]. For clustering we produced an Attribute-Relation File Format (.ARFF) document which contained the attributes and the data for each attribute [7]; in this case, the attributes are words. The articles from Lrytas.lt are classified into 3 groups marked by the labels "1", "3" and "5", which correspond to the real classes "Aktualijos", "Įvykiai" and "Komentarai". The posts from supermama.lt are classified into 14 groups labeled by the numbers 1 to 14. These labels also correspond to real classes: "Apie nėštumą ir gimdymą", "Bendras", "Grožis ir sveikata", "Konsultacijų centras", "Mamų bendravimas, susitikimai, akcijos", "Motinystė ir tėvystė", "Mūsų namai", "Naudingi patarimai", "Paramos skyrelis", "Poilsis, pomėgiai ir šventės", "Skelbmų lenta", "Socialinis gyvenimas", "Tarp mūsų, mergaičių", "Vaikų auklėjimas ir ugdymas". Data set statistics after pre-processing are presented in Table II.
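As an illustration of this pre-processing, the sketch below builds a bag of words with a crude, hypothetical suffix-stripping stemmer (real stemming of Lithuanian is considerably more involved than this):

import re
from collections import Counter

def crude_stem(word):
    # Hypothetical suffix stripping: pienas / pieno / pieną -> pien.
    return re.sub(r"(as|ai|us|is|o|ą|ų|a|e)$", "", word)

def bag_of_words(text):
    # BOW: split the text into distinct words, disregarding grammar and word order.
    tokens = re.findall(r"\w+", text.lower())
    return Counter(crude_stem(t) for t in tokens)

print(bag_of_words("Pienas pieno pieną"))  # Counter({'pien': 3})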
TABLE II. NUMBER OF DATA OBJECTS AND CLASS DISTRIBUTIONS

Data set: Lrytas.lt
# of attributes: 6886
# of data objects: 1249
Classes: 1: 402, 2: 286, 3: 561

Data set: Supermama.lt
# of attributes: 3841
# of data objects: 11353
Classes: 1: 481, 2: 511, 3: 987, 4: 416, 5: 1552, 6: 186, 7: 1536, 8: 416, 9: 620, 10: 1387, 11: 1304, 12: 712, 13: 824, 14: 412

The analysis of the confusion matrices obtained after clustering with each chosen algorithm is presented in the result tables, Table III and Table IV (all values are rounded):
TABLE III. LRYTAS.LT CLUSTERING RESULT TABLE

Method               | # of clusters | Incorrect, % | P, %  | R, %  | F1, %
K-means (cosine)     | K = 3         | 35.87        | 57.26 | 53.60 | 51.98
K-means (Euclidean)  | K = 3         | 73.42        | 24.01 | 24.58 | 23.88
EM                   | K = 3         | 37.55        | 59.00 | 59.50 | 58.41

TABLE IV. SUPERMAMA.LT CLUSTERING RESULT TABLE

Method               | # of clusters | Incorrect, % | P, %  | R, %  | F1, %
K-means (cosine)     | K = 14        | 91.06        | 7.93  | 11.44 | 6.47
K-means (Euclidean)  | K = 14        | 94.79        | 3.39  | 7.01  | 1.98
EM                   | K = 14        | 95.04        | 7.24  | 4.37  | 2.83
TABLE V. THE EXECUTION TIME TAKEN TO BUILD EACH MODEL

Algorithm            | Lrytas.lt, min | Supermama.lt, min
K-means (cosine)     | ~1.5           | ~149
K-means (Euclidean)  | ~0.05          | ~48
EM                   | ~3.5           | ~200

The computer used was a DELL XPS with the following specifications:
1. Intel® Core™ i7-3537U processor,
2. 8.00 GB RAM,
3. Windows 8 64-bit operating system.
In this case, the percentage values of F1 and of incorrectly clustered documents are the most important for evaluating clustering effectiveness. The results show that K-means and EM are not equally suitable for all corpus sizes: for the small corpus EM is more effective than K-means, while for the big corpus EM works very slowly and is less effective. Comparing the F1 values, and knowing that clustering is best when F1 reaches its highest value (F1 = 1 = 100%), we can state that the best clustering results on the Lrytas.lt data set were reached by EM, while on supermama.lt the best method was K-means with cosine similarity. It is important to note that on the Lrytas.lt data set K-means with cosine similarity reached almost the same values of F1 and of incorrectly clustered documents; given the small difference in F1 and the time taken for clustering (Table V), the K-means with cosine similarity results can be considered good because of the considerably shorter execution time. In the supermama.lt case the situation is a little different: K-means, as expected, reached the best results, and its execution time is about 51 minutes shorter than EM's.
Fig. 1. Clustering results
IV. CONCLUSION
Document clustering techniques were analyzed, all steps of document clustering (similarity measures, clustering algorithms and evaluation of clustering models) were presented, and an experimental analysis was performed. The analysis has shown the differences between clustering algorithms. K-means is suitable for big corpora because it works quite fast, but it is important to choose a suitable distance measure: in our case, despite the fact that the execution time taken to build K-means with Euclidean distance is significantly shorter than with cosine similarity (Table V), K-means with cosine similarity is more effective than K-means with Euclidean distance (Table III, Table IV). The EM algorithm is more precise than K-means (Table III), but with a large set of attributes it works very slowly (Table V), and its precision then falls below that of K-means with cosine similarity (Table IV).
We are planning to experiment with different clustering techniques, e.g. hierarchical clustering, as well as with different similarity measures, and with dimensionality reduction techniques that minimize the number of features in order to speed up clustering.

ACKNOWLEDGMENT

Research was funded by ESFA (VP1-3.1-ŠMM-10-V-02025).

REFERENCES
[1] C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval. Cambridge University Press, 2008.
[2] P.-N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining. Pearson International ed.: Pearson Education, Inc., 2006.
[3] N. Sandhya, Y. Sri Lalitha, A. Govardhan, and K. Anuradha, "Analysis of Similarity Measures for Text Clustering," International Journal of Data Engineering, vol. 2, no. 4, July 2011.
[4] Improved Outcomes Software Inc., "Distance Metrics Overview," web page: http://www.improvedoutcomes.com/docs/WebSiteDocs/Clustering/Clustering_Parameters/Distance_Metrics_Overview.htm (last visited: March 8, 2014).
[5] B. Liu, W. S. Lee, P. S. Yu, and X. Li, "Partially Supervised Classification of Text Documents," in Proceedings of the International Conference on Machine Learning (ICML 2002), pp. 387-394, 2002.
[6] Stemming [interactive]: http://www.comp.lancs.ac.uk/computing/research/stemming/general/ (last visited: March 8, 2014).
[7] Attribute-Relation File Format [interactive]: http://www.cs.waikato.ac.nz/ml/weka/arff.html (last visited: March 9, 2014).
[8] Expectation Maximization. Hierarchical Clustering [interactive]: http://www.facweb.iitkgp.ernet.in/~sudeshna/courses/ML06/ml-lecture19.pdf, 2005.
[9] Bag-of-words model [interactive]: http://en.wikipedia.org/wiki/Bag-of-words_model (last visited: March 20, 2014).