BULETINUL INSTITUTULUI POLITEHNIC DIN IAŞI
Published by Universitatea Tehnică „Gheorghe Asachi" din Iaşi
Volume LVII (LXI), Fascicle 1, 2011
Section: AUTOMATICĂ şi CALCULATOARE

A PRACTICAL CASE STUDY ON THE PERFORMANCE
OF TEXT CLASSIFIERS

BY

MIRCEA IONUŢ ASTRATIEI* and ALEXANDRU ARCHIP

"Gheorghe Asachi" Technical University of Iaşi,
Faculty of Automatic Control and Computer Engineering

* Corresponding author; e-mail: [email protected]

Received: January 11, 2011
Accepted for publication: March 14, 2011
Abstract. This paper aims to improve the results of the K-Means clustering and K-NN classification algorithms in order to aid the human expert in choosing the number of clusters and their initial centers for the K-Means algorithm, and the variable K for the K-NN algorithm. We present a set of comparative results between classifications performed using only the human expert as trainer and an automatic approach that uses clustering results as training sets for the classification of text documents.
Key words: data mining, K-Means, K-Nearest-Neighbor, clustering, classification, English/Romanian documents.
2000 Mathematics Subject Classification: 53B25, 53C15.
1. Introduction
Data mining is the process of extracting patterns from data and establishing relationships. It is used in many areas such as mathematics, cybernetics, genetics, marketing, web search engines, etc. The classification of web search results represents only one of the newest applications for data mining: web documents are clustered/classified based on their content, and relevant search results with respect to a given set of keywords are better presented to the client (Grossman & Frieder, 2004).
Considering the huge amount of data gathered from the Internet, it is all-important to aid the human expert in applying data mining algorithms. The present paper introduces an automated method to classify the collected documents based on text analysis and similarity determination using the Cosine Similarity metric.
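As an illustration only (the paper itself lists no code), a minimal C sketch of this metric over term-frequency vectors might look as follows; the vector representation and the function name are assumptions of the sketch, not the authors' implementation:

#include <math.h>

/* Cosine similarity between two term-frequency vectors of length n.
   For non-negative frequencies the result lies in [0, 1]; 1 means the
   two documents have identical term distributions. */
double cosine_similarity(const double *a, const double *b, int n)
{
    double dot = 0.0, na = 0.0, nb = 0.0;
    for (int i = 0; i < n; i++) {
        dot += a[i] * b[i];
        na  += a[i] * a[i];
        nb  += b[i] * b[i];
    }
    if (na == 0.0 || nb == 0.0)
        return 0.0;  /* convention: an empty document matches nothing */
    return dot / (sqrt(na) * sqrt(nb));
}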
Finally, the paper compares the results of the generic and improved K-Means and K-NN algorithms and highlights the fact that a human expert cannot cope with the huge amount of data, unlike the computer, which can determine similarity more consistently by using formulas.
2. Clustering and Classification – Generic Data Mining Algorithms
2.1. Clustering
Clustering represents the process of grouping data into classes called clusters, based on attribute values that are characteristic of all the objects. The elements in one cluster must have similar attribute values and must differ from the elements of the other classes. This similarity is often calculated using distances between the attributes that define the objects.
Clustering methods can be organized into several categories: partitioning methods, hierarchical methods, density-based methods, grid-based methods, model-based methods, methods for high-dimensional data (such as frequent pattern-based methods), and constraint-based clustering (Han & Kamber, 2006; Hornick et al., 2007). Partitioning methods divide the given data into a known number of groups that represent the clusters. A cluster must satisfy two conditions: it has to contain at least one object, and each object must belong to exactly one cluster. Partitioning methods involve determining all clusters during the first iteration and then improving the results by moving objects between clusters at each new iteration. Final results are produced when no further efficient change can be made. Throughout each iteration, the clusters are defined by a dominant object or by a new object obtained from processing the values of the characteristic attributes.
Hierarchical methods can be either agglomerative or divisive, also called the bottom-up and top-down strategies, and consist of building a hierarchy of clusters. Data is not partitioned into clusters in a single step; instead, a series of partitions takes place. The agglomerative methods proceed by successively merging objects into groups. The divisive approach, on the other hand, starts with one or a few groups that are successively separated into a larger number of classes.
Density-based methods focus on the notion of compactness, and the main idea is to start from a cluster already created. Each new object is added to the cluster that contains its closest neighborhood. One of the partitioning methods used in statistics and machine learning is the K-Means algorithm, which aims to create a group of K clusters from an initial set of n objects, each object being assigned to the cluster with the closest mean. The centroid-based technique for K-Means uses a number n of objects and a constant K, which represents the number of classes into which the human expert desires to split the data (Han & Kamber, 2006; Hornick et al., 2007).
Brief description of how the algorithm works:
For the given K clusters, K objects are needed to represent the centers of the respective clusters. These initial centroids are in most cases chosen randomly from the lot of n objects. The remaining n − K objects are assigned to the most similar cluster with respect to the metric function used. Usually, the Euclidean metric is employed to compute the distance between the center of a cluster and an object. After all n objects have been assigned, a new center for each cluster is determined. This new center may be the most representative object in the lot, or it can be defined as a new object obtained by computing the mean values of the significant attributes of all members (Han & Kamber, 2006; Hornick et al., 2007; Moore, 2001). The algorithm reaches a state of convergence when no objects are moved between the clusters or when the centroids of the clusters no longer change their attribute values.
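To make these steps concrete, below is a minimal C sketch of a single K-Means iteration over numeric attribute vectors, assuming mean-valued centroids and the Euclidean metric; the flat array layout and the function name are ours, not the authors':

#include <stdlib.h>
#include <math.h>

/* One K-Means iteration over n objects of dimension d: assign every object
   to the nearest centroid (squared Euclidean distance), then recompute each
   centroid as the mean of its members. Returns the number of reassignments;
   the caller initializes label[] to -1 and iterates until 0 is returned. */
int kmeans_pass(const double *obj, int n, int d,
                double *centroid, int K, int *label)
{
    int moved = 0;
    for (int i = 0; i < n; i++) {
        int best = 0;
        double best_dist = HUGE_VAL;
        for (int c = 0; c < K; c++) {
            double dist = 0.0;
            for (int j = 0; j < d; j++) {
                double diff = obj[i * d + j] - centroid[c * d + j];
                dist += diff * diff;
            }
            if (dist < best_dist) { best_dist = dist; best = c; }
        }
        if (label[i] != best) { label[i] = best; moved++; }
    }
    /* Recompute every centroid as the mean of its assigned objects. */
    int *count = calloc(K, sizeof *count);
    double *sum = calloc((size_t)K * d, sizeof *sum);
    for (int i = 0; i < n; i++) {
        count[label[i]]++;
        for (int j = 0; j < d; j++)
            sum[label[i] * d + j] += obj[i * d + j];
    }
    for (int c = 0; c < K; c++)
        if (count[c] > 0)
            for (int j = 0; j < d; j++)
                centroid[c * d + j] = sum[c * d + j] / count[c];
    free(sum);
    free(count);
    return moved;
}

In this sketch a cluster that loses all its members simply keeps its previous centroid; production implementations usually re-seed such clusters.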
2.2. Classification
Classification is a form of data analysis where a model is used to predict the category to which a new object belongs. The main difference between clustering and classification is that in classification the number of classes and their labels are predetermined.
This process has two steps. The first step, also known as the learning step, consists of analyzing a predetermined training set of data classes or concepts. Unlike the clustering process, this first stage uses a supervised learning strategy. In the second step, the classifier estimates whether the new data belongs to one of the classes already known, based on the rules obtained in the first step. There are many classification algorithms: decision trees, Bayesian classification, rule-based classification, and learning from your neighbors.
Decision tree algorithms are based on a tree structure where internal nodes represent test conditions for various attributes. For each such node, an edge represents the decision taken to reach another node. Finally, the leaves indicate the classes determined by a set of possible decisions (Han & Kamber, 2006; Hornick et al., 2007). The same idea is used in rule-based classification algorithms, where the learning model is represented by a set of IF-THEN rules (Han & Kamber, 2006; Hornick et al., 2007). Bayesian classification relies on statistical data and probabilities computed with Bayes' Theorem (Han & Kamber, 2006). Lazy learners (or learning-from-your-neighbors methods) do not construct a general model from the training data set; instead, they delay generalization until a new object has to be classified.
Brief description of how the algorithm works:
K-Nearest-Neighbor is a lazy learner algorithm whose efficiency has been demonstrated on large amounts of data. The principle of K-NN classification consists in comparing the new data with the training set and learning from this analogy. Given a new object, the algorithm searches within the training data set for the K most similar objects (similarity is again considered with respect to a given metric function). The result is given by the simple majority of these similar objects. K represents the number of training data objects that will be compared against the new object; it must be a positive, non-null integer. The value of K is important, and this choice influences the predictions. For K = 1, the unclassified object is assigned to the class of the most similar object in the training data (Han & Kamber, 2006; Hornick et al., 2007; Moore, 2001).
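As a hedged sketch of this vote (reusing the cosine_similarity helper shown in the Introduction; the array layout and function names are assumptions, not the authors' code):

#include <stdlib.h>

/* Classify one new object by majority vote among its K most similar
   training objects. train holds n term-frequency vectors of dimension d
   (row i starts at train[i * d]) and cls[i] is the class id of row i,
   in 0..n_classes-1. */
int knn_classify(const double *train, const int *cls, int n, int d,
                 int n_classes, const double *query, int K)
{
    double *sim   = malloc(n * sizeof *sim);
    char   *used  = calloc(n, 1);
    int    *votes = calloc(n_classes, sizeof *votes);

    for (int i = 0; i < n; i++)
        sim[i] = cosine_similarity(&train[(size_t)i * d], query, d);

    /* Select the K most similar training objects and record their votes. */
    for (int k = 0; k < K && k < n; k++) {
        int best = -1;
        for (int i = 0; i < n; i++)
            if (!used[i] && (best < 0 || sim[i] > sim[best]))
                best = i;
        used[best] = 1;
        votes[cls[best]]++;
    }

    /* The predicted class is the simple majority of the K votes. */
    int winner = 0;
    for (int c = 1; c < n_classes; c++)
        if (votes[c] > votes[winner])
            winner = c;

    free(votes);
    free(used);
    free(sim);
    return winner;
}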
We chose these two forms of data analysis to group a large lot of diverse text documents into clusters based on their semantic meaning. The results can be used successfully to organize the documents extracted by a web crawler. The semiautomatic method is very useful when the amount of data is very large, as in the web crawler case, where millions of pages are brought in every day.
3. Improvements
3.1. K-Means
Our implementation of the K-Means algorithm eliminates the need for a human expert to choose the number of clusters and the initial centroids. Furthermore, after the clusters are obtained, a re-clustering is performed for the documents that belong to none of the clusters already obtained. Similar work is done in (Pelleg & Moore, 2000; Muhr & Granitzer, 2009), but unlike those approaches, ours determines, in a semiautomatic way, the variable K and also the initial centers for each of the K clusters.
The number of clusters and initial centroids
We suppose that there is at least one cluster and we randomly choose the center of the first cluster. Then the distance between this centroid and each of the remaining documents is computed, and the most similar documents are eliminated. We consider documents similar if their distance does not exceed a threshold. In our case this threshold was obtained by observing the distances between all the documents we tested (details in (Astratiei & Archip, 2010)). The threshold can be varied if a more refined or a more compact classification is needed.
Once we have a centroid, we compute the average distance, using Cosine Similarity (Astratiei & Archip, 2010), between the centroid and the documents that are most similar to it. This average is used to determine the center of the next cluster: the remaining document farthest from the last determined centroid, relative to the average already calculated, indicates the next centroid. These steps are repeated until no documents remain. Finally, we obtain a set of automatically determined centroids that represent the input data for the K-Means algorithm.
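A minimal sketch of this seeding procedure as we read it (not the authors' code): the distance is taken as 1 − cosine similarity, the threshold value is an assumed placeholder, and the average-distance refinement described above is simplified to taking the farthest remaining document:

#include <stdlib.h>
#include <string.h>

#define THRESHOLD 0.4   /* assumed value; the paper derives it empirically */

/* Semiautomatic seeding: choose a random first centroid, absorb every
   document within THRESHOLD of it, then take the farthest remaining
   document as the next centroid; repeat until no documents remain.
   doc holds n term-frequency vectors of dimension d; the indices of the
   chosen centroid documents are written to centroid_idx. Returns K. */
int seed_centroids(const double *doc, int n, int d, int *centroid_idx)
{
    char *remaining = malloc(n);
    memset(remaining, 1, n);
    int K = 0;
    int current = rand() % n;          /* first center: a random document */
    for (;;) {
        centroid_idx[K++] = current;
        remaining[current] = 0;
        int farthest = -1;
        double far_dist = -1.0;
        for (int i = 0; i < n; i++) {
            if (!remaining[i])
                continue;
            double dist = 1.0 - cosine_similarity(&doc[(size_t)i * d],
                                                  &doc[(size_t)current * d], d);
            if (dist <= THRESHOLD)
                remaining[i] = 0;      /* similar enough: joins this cluster */
            else if (dist > far_dist) {
                far_dist = dist;
                farthest = i;
            }
        }
        if (farthest < 0)
            break;                     /* no unassigned documents remain */
        current = farthest;
    }
    free(remaining);
    return K;                          /* number of clusters and centers */
}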
Having the number of clusters and the initial centroids, we can apply the classic K-Means algorithm. Iterations are performed until there are no changes in cluster membership. After each iteration, a new center is computed; as the new centroid we choose the document that is closest to all the other documents in the same cluster.
The results are then filtered: if there is at least one document in a cluster that is farther from the final centroid than the input threshold, a re-clustering is performed that applies the same steps to the lot of "rejected" files. After a re-clustering, at least one new cluster must appear. This condition reduces the probability of placing dissimilar documents in the same cluster. This improvement aids the human expert in better detecting the described fields (Salvador & Chan, 2003; Li et al., 2003; Matveeva, 2006; Arthur & Vassilvitskii, 2007).
3.2. K-NN
The classic K-NN algorithm is independent of other data mining methods and needs a set of input data to learn from. The input data for the K-NN algorithm are the variable K and the training set, which represents the model for learning. K is generally determined experimentally, starting with K = 1 and observing the error rate. This determination also involves a human; therefore, for an automatic classification algorithm, we choose the K value depending on the number of files in the clusters. For a minimum error rate we set K to double the size of the smallest cluster.
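In code form this rule is a one-liner; a hedged helper (names assumed) might read:

/* Pick K for K-NN as twice the size of the smallest training cluster,
   following the rule described above. */
int choose_k(const int *cluster_size, int n_clusters)
{
    int min = cluster_size[0];
    for (int c = 1; c < n_clusters; c++)
        if (cluster_size[c] < min)
            min = cluster_size[c];
    return 2 * min;
}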
If we associate these two data mining methods, clustering and classification, we obtain an automatic training data set for K-NN. This set consists of the results of the K-Means clustering. The classes obtained therefore represent the training model for a classifier, the newly brought documents being associated with the clusters to which they are most similar.
After testing over 90 files in Romanian and 90 files in English using this association of algorithms and the automatic detection of the input data for K-Means, we managed to improve the precision.
4. Results
The test application was developed in ANSI C. Tests were performed on a total of 180 text documents: 90 in Romanian and 90 in English. Tests for K-NN classification were performed on 52 new text documents for English and 46 new text documents for Romanian. For testing the Romanian documents we chose 12 fields: history, physics, animals, gothic art, heart diseases, psychology, computer science, chemistry, philosophy, economy, astrology and geography. The English documents belonged to the following fields: advertising, AIDS, London, Cold War, cloning, Shakespeare, music, pollution, racism, operating systems, Greek mythology, internet, and drugs in sports. In Tables 1-4, the columns marked with "*" denote the choices made from the human expert's perspective.
4.1. K-Means Tests
For K-Means we performed 6 tests: one for our automatic implementation and 5 for the classic method, with centers chosen randomly and centers chosen from the human expert's perspective. We recorded the total number of resulting classes and the number of incorrect clusters, and we calculated the percentage of correct clusters relative to the total number.
Results interpretation
For the Romanian documents there is a significant difference between the automatic approach, which in our case correctly clustered all the files, and the classic K-Means approach, where the percentage of correct clusters in the randomly-chosen-centroids case is around 50-60%. This method gives incorrect results in tests, so it cannot be a reliable solution. We obtained better results when we chose the centroids ourselves, but in a real-world scenario this would require a good expert. During testing we discovered some cases where the expert tends to extend the meaning of a document to a larger class because of the ideas exposed in the file. More details are given in (Astratiei & Archip, 2010).
Table 1
Results of K-Means Clustering for Romanian Documents

                                  Auto   Classic centroids
                                         Random                  Chosen*
                                  No 1   No 1    No 2    No 3    No 4    No 5
Total clusters                      20     15      15      15      15      15
Correct clusters                    20      9       8       9      11      11
Incorrect clusters                   0      6       7       6       4       4
Percentage of correct clusters    100%    60%  53.33%     60%  73.33%  73.33%
Table 2
Results of K-Means Clustering for English Documents

                                  Auto   Classic centroids
                                         Random                  Chosen*
                                  No 1   No 1    No 2    No 3    No 4    No 5
Total clusters                      21     14      14      14      14      14
Correct clusters                    20     11      10       6      10      13
Incorrect clusters                   1      3       4       8       4       1
Percentage of correct clusters  95.23%  78.57%  71.42%  42.85%  71.42%  92.85%
4.2. K-NN Tests
For K-NN we performed 4 tests, using as training sets the automatic K-Means results, the classic K-Means results (centroids chosen randomly and from the human expert's perspective), and a training set chosen entirely from the human expert's perspective. We recorded the total number of documents to classify, the number of training set clusters and the number of incorrectly classified documents, and we calculated the percentage of correctly classified documents. We cannot rely on the randomly-chosen-centroids K-Means method because a classification using those classes as training data gives ambiguous results.
For better results, the new files used in testing the K-NN algorithm belonged to the same fields as the documents we had clustered with K-Means. Test results for K-NN are presented in detail in Tables 3 and 4 and in (Astratiei & Archip, 2010).
Table 3
Results of K-NN Classification for Romanian Documents

                              Auto      Classic K-Means results   Chosen
                              K-Means   Chosen       Random       clusters*
                              results   centroids*   centroids
Total training clusters            20           15          15           15
Number of documents                46           46          46           46
Correctly classified               45           41          42           44
Incorrectly classified              1            5           4            2
Correctly classified, [%]       97.82        89.13        91.3        95.65
Table 4
Results of K-NN Classification for English Documents

                              Auto      Classic K-Means results   Chosen
                              K-Means   Chosen       Random       clusters*
                              results   centroids*   centroids
Total training clusters            21           14          14           14
Number of documents                52           52          52           52
Correctly classified               50           48          45           45
Incorrectly classified              2            4           7            7
Correctly classified, [%]       96.15        92.30       86.53        86.53
5. Practical Case Study
These two data mining algorithms can serve many practical purposes, for example in games, or in business, where new marketing solutions can be derived by analysing customer data sets. Data mining has also been widely used in areas of science and engineering such as bioinformatics, genetics, medicine and electrical power engineering.
In this paper we present a practical case study that involves the K-Means algorithm and techniques for extracting information from the web.
Most search engines extract information from the World Wide Web and offer it to users in a mixed fashion, depending on the importance of the resource that contained the searched information.
To test our approach for the K-Means algorithm, we performed a real-time test by applying the algorithm over a set of documents fetched by a web crawler. The crawler started from a seed list containing links to websites describing the life of the jaguar (the animal) and links leading to pages describing the Jaguar car. The crawler fetched about 600 pages from both domains: cars and wild animals.
The results after applying the semiautomatic K-Means were interesting. We obtained four different clusters that included all 600 documents:
− the first cluster contained links to pages describing the Jaguar car, its performance, the available features and the prices;
− the second one contained links to pages describing accidents involving Jaguar cars;
− the third cluster was composed of links to pages describing the life of the jaguar in the wild;
− the last cluster was composed of links to pages listing the dealers for the Jaguar car.
Another practical example of testing the algorithm involves a set of documents concerning operating systems. After reading all the documents, we classified them as one cluster labeled Operating Systems and chose one initial centroid. The automatic method, however, split the documents into two clusters. Indeed, most of the documents drew parallels between Windows and Unix or Mac, or described one of these operating systems, apart from some that explained how to configure an OS. Those files were clustered in a separate class, which is beneficial because, even though all the files touched the same field, the underlying idea was different. For a better understanding of the usefulness of such fine clustering, imagine this case: you want to configure Windows, and the search for information leads you to a document drawing a parallel between Windows and Linux instead of leading you directly to the desired information.
We managed to create a simple search engine that, using data mining algorithms, can offer the user a set of results ordered in an intelligent and intuitive manner. This can be a very powerful tool considering the huge amount of information that can be found on the web.
6. Conclusion and Future Work
In this paper we have presented a method of clustering and classification that aids the human expert in choosing the initial number of clusters and the initial centroids for the K-Means algorithm, and the training data set for the K-NN algorithm. We have focused on combining two different data mining techniques (a descriptive method as trainer, K-Means clustering, and a predictive technique, the K-NN classifier) in order to improve the accuracy of a classifier. In order to obtain more precise results, we intend to devise an analysis that will automatically detect derived and compound words. This analysis should take into consideration other descriptive data mining methods (such as frequent sequence miners) in order to further improve a potential classifier. One such improvement consists in determining the dominant phrases for a set of computed clusters. Future development also involves addressing larger datasets.
REFERENCES

Arthur D., Vassilvitskii S., K-Means++: The Advantages of Careful Seeding. In SODA '07: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, Philadelphia, PA, USA: Society for Industrial and Applied Mathematics, 2007, 1027–1035.
Astratiei M.I., Archip A., A Case Study on Improving the Performance of Text Classifiers. Proceedings of the 14th International Conference on System Theory and Control, Sinaia, 2010.
Garcia E., An Information Retrieval Tutorial on Cosine Similarity Measures, Dot Products and Term Weight Calculations. 2006, http://www.miislita.com/information-retrievaltutorial/cosine-similarity-tutorial.html
Grossman D.A., Frieder O., Information Retrieval – Algorithms and Heuristics. Second Edition, Springer, 2004.
Han J., Kamber M., Data Mining – Concepts and Techniques. Morgan Kaufmann Publications, 2006.
Hornick M.F., Marcadé E., Venkayala S., Java Data Mining: Strategy, Standard, and Practice. Morgan Kaufmann Publications, 2007.
Li B., Yu S., Lu Q., An Improved k-Nearest Neighbor Algorithm for Text Categorization. CoRR, vol. cs.CL/0306099, 2003.
Matveeva I., Document Representation and Multilevel Measures of Document Similarity. In Proceedings of the 2006 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Morristown, NJ, USA: Association for Computational Linguistics, 2006, 235–238.
Moore A.W., K-Means and Hierarchical Clustering. 2001. Available: http://www.cs.cmu.edu/afs/cs/user/awm/web/tutorials/kmeans11.pdf
Muhr M., Granitzer M., Automatic Cluster Number Selection Using a Split and Merge k-Means Approach. International Workshop on Database and Expert Systems Applications, 2009, 363–367.
Pelleg D., Moore A., X-Means: Extending k-Means with Efficient Estimation of the Number of Clusters. In Proceedings of the Seventeenth International Conference on Machine Learning, San Francisco: Morgan Kaufmann, 2000, 727–734.
Salvador S., Chan P., Determining the Number of Clusters/Segments in Hierarchical Clustering/Segmentation Algorithms. Dept. of Computer Sciences, Florida Institute of Technology, Melbourne, 2003.
A PRACTICAL CASE STUDY ON THE PERFORMANCE
OF TEXT CLASSIFIERS
(Summary)
The present paper proposes a solution for improving text document classifiers by approaching new variants of the clustering and classification algorithms. It has thus been demonstrated experimentally that by implementing new approaches for K-Means and K-NN, clearly improved results are obtained.
Adapting the K-Means algorithm so that it can be applied to text documents implies an initial choice of the centers, the random-choice variant not being a solution in this case. The initial centers must be chosen efficiently in order to achieve a correct clustering of the documents, unlike the case in which the objects to be clustered have numerical characteristic values and the recomputation of the centers is performed with higher precision, it even being possible to obtain new objects that become reference objects as the most representative element of the set.
Adapting the K-NN algorithm consists in choosing a metric appropriate to the context, one able to compute the similarity between two documents in order to find those closest with respect to the training set. The choice of the training set and of the variable K also plays an important role in the precision of the results. As in the clustering case, the inference of the human expert becomes difficult on a large data set. In this case, the training set must be well organized and structured according to the field of activity referred to, and this organization implies a large volume of work on the part of a human factor whose technical knowledge must relate to the context in question.