Download FullText - Brunel University Research Archive

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Nonlinear dimensionality reduction wikipedia , lookup

Expectation–maximization algorithm wikipedia , lookup

K-means clustering wikipedia , lookup

Transcript
Enhanced Topic Identification Algorithm for Arabic Corpora
Amal Alsaad
Maysam Abbod
Department of Electronic & Computer Engineering
Brunel University London
London, UK
[email protected]
Department of Electronic & Computer Engineering
Brunel University London
London, UK
[email protected]
Abstract - During the past few years, the construction of
digitalized content is rapidly increasing, raising the demand of
information retrieval, data mining and automatic data tagging
applications. There are few researches in this field for Arabic
data due to the complex nature of Arabic language and the lack
of standard corpora. In addition, most work focuses on
improving Arabic stemming algorithms, or topic identification
and classification methods and experiments. No work has been
conducted to include an efficient stemming method within the
classification algorithm, which would lead to more efficient
outcome. In this paper, we propose a new approach to identify
significant keywords for Arabic corpora. That is done by
implementing advanced stemming and root extraction
algorithm, as well as Term Frequency/Inverse Document
Frequency (TFIDF) topic identification method. Our results
show that combining advanced stemming, root extraction and
TFIDF techniques, lead to extracting a highly significant terms
represented by Arabic roots. These roots weights higher TFIDF
values than terms extracted without the use of advanced
stemming and root extraction methods. Decreasing the size of
indexed words and improving the feature selection process.
Keywords - root extraction; feature selection; topic
identification; natural language processing; data mining; text
mining
have been conducted to get more efficient data mining results
in Arabic Information Retrieval systems, such as text
stemming and text classification algorithms [2, 3].
1000
800
600
400
200
0
Figure 1. Top Ten Languages in the Internet 2013 – in millions of users.
6000
5000
4000
3000
I. INTRODUCTION
2000
Digitalized textual data and the use of Internet have been
vastly increasing during the past few years. It was estimated
that the online language population is 2,802,478,934 people
in December 2013 by Internet World Stats (www.
internetworldstats.com). Internet growth between 2000 and
2014 has reached 676.3%, delivering the idea about how
great is the amount of data on the web nowadays. Internet
World Stats shows that Arabic is the 4th most used language
with more than 135.6 millions of users, as seen in Fig. 1. As
well, Arabic is on the top of the list as the fastest growing
language on the web, with a growth rate of 5,296.6% between
2000 and 2014, as shown in Fig. 2. This indicates that the
construction of Arabic data is growing rapidly, rising the
need for standard data mining and natural language
processing tools to represent and analyse this data in the most
efficient way.
In contract, there are no standard Arabic text mining and
text classification tools until recently [1]. This is due to its
complex structure and difficult linguistic rules compared to
English and other languages. Nevertheless, many studies
1000
0
Figure 2. Percentages of Users Growth in the Internet by Language
between 2000–2013.
Text classification, which is also known as text
categorization or topic identification, is the assignment of
discovering if a piece of text belongs to any of a predefined
set of classes [3]. Another definition states that, the goal of
text classification is to learn classification methods which can
be implemented to classify documents automatically [4].
Text classification requires the use of text pre-processing
methods to represent the text before processing text
classification. Such methods are referred to as text stemming
methods, which include removal of insignificant characters,
affixes and stop words.
Text stemming, is a significant approach for text
representation in the fields of text mining and natural
language processing. For Arabic text data mining, the main
two stemming schemes consist of light stemming and rootbased stemming [5]. Light stemming methods are used to
remove prefixes, suffixes to generate a stem. They are used
to derive a more efficient form of representative indexing for
words [6]. In root-based stemming, roots of the words are
extracted by defining morphological analysis techniques,
where the roots are extracted, the words are then grouped
accordingly.
Research that has been done on Arabic information
retrieval algorithms focuses on improving root-based
stemming techniques, or on comparing and improving text
classification algorithms using light stemming for text preprocessing. There are no work done that includes enhanced
root stemming within text classification combined. In this
paper, we present a new approach to identify keywords
representing features of Arabic text collections. The
keywords are used to create features to be used while
classifying unclassified text documents. Our approach
comprises implementing an advanced root-stemming
algorithm combined with Term Frequency/Inverse
Document Frequency (TFIDF) topic identification method.
II. RELATED W ORK
Many methods and algorithms were developed for Arabic
text mining the fields of natural language processing and
information retrieval. In this section, we discuss related work
that has been conducted on Arabic text stemming and Arabic
topic identification and feature selection consecutively.
A. Arabic Text Stemming Methods
Light stemmers and root-based stemmers are the
fundamental two approaches of Arabic text stemming [5].
Light stemmers are used mainly in information retrieval. The
main concept of light stemmers is to remove prefixes and
suffixes from a word and then create a stem. In this process,
the ideal forms of representative indexing for words are
derived [6]. An example of overall steps of Arabic light
stemming algorithms represented in Fig. 3.
1. Normalize word
 Remove diacritics
 Replace ‫ آ‬،‫ أ‬،‫ إ‬with ‫ا‬
 Replace ‫ ة‬with ‫ه‬
 Replace ‫ ى‬with ‫ي‬
2. Stem prefixes
 Remove prefixes: ‫ و‬، ‫ للـ‬، ‫ فالـ‬، ‫ كالـ‬، ‫ بالـ‬، ‫ والـ‬، ‫الـ‬
3. Stem prefixes
 Remove suffixes: ‫ ي‬، ‫ ه‬، ‫ ية‬، ‫ ين‬، ‫ ون‬، ‫ ات‬، ‫ ان‬، ‫ها‬
Figure 3. Steps of Arabic Light Stemming Algorithms [1].
The second approach of Arabic text stemming is rootbased stemming. In this approach, roots of the words are
extracted by defining morphological analysis techniques. As
the roots are extracted, the words are then grouped
accordingly [5]. There are two main steps of root-based
stemmers approach. First step is removing prefixes and
suffixes, and secondly extracting the roots by analysing the
words depending on their morphological components. Root
based stemmers take into consideration the great amount of
lexical variation that caused by Arabic complex morphology.
As mentioned previously, this concern would enlarge the
indexing structure volume and reduce the performance of the
system. The other reason of using root-based stemmers is that
words which are entered as user query in information
retrieval systems are not exactly matching those included in
the relevant documents [2].
Khoja’s stemming algorithm is considered as one of the
earliest and highly accurate Arabic root-based stemmers in
the literature [7]. Khoja designed an algorithm which
eliminates the longest suffix and prefix. Next the algorithm
extracts the root by matching the rest of the word against a
list of verbal and noun patterns to extract the root. Checking
the root against a list of roots is an important performed step
of Khoja’s stemming algorithm. This step is important to
check the correctness of the root. Once the extracted root is
found, the algorithm assigns it as the root of the word. The
algorithm utilizes several linguistic data files that contain lists
of all diacritic characters, punctuation characters, definite
article, and stop words. In addition, the stemmer handles
some cases of Arabic tri-literal words that are weak,
hamzated, geminated or eliminated-long-vowel.
However, Khoja’s algorithm has a number of weaknesses.
One of the issues is that the word (‫ )منظمات‬which means
(organizations) is stemmed to the root (‫ )ظمآ‬which means (he
became thirsty) instead of the correct root (‫)نظم‬. One more
issue about this algorithm is happened when the word is
deducted to a tri-literal word, the weak letter is deleted in the
first place, and then the last letter is doubled, or another weak
letter or an alif is added to the word. This issue leads to
extracting a root of another word which may not be related to
the word. As an example, the root of the word (‫ )روايات‬is
extracted as (‫)ريي‬, where the correct root is (‫)روي‬. Another
example of Khoja’s algorithm weaknesses is extracting the
root for the word (‫ )آخر‬as (‫)خرر‬, where the correct root is
(‫)أخر‬.
Alsaad and Abbod [8] proposed an enhanced root-based
stemming algorithm that considers weak, hamzated,
eliminated-long-vowel and two-letter geminated words, as
well as tackling the main issues which occurred in Khoja’s
stemming algorithm. For example, the shortest affix/prefix is
eliminated from a word, and then the word is checked against
Arabic word patterns according to its length. If no match is
found, it is checked if there are more suffixes/prefixes to
eliminate, and so on. That is because, while removing the
longest suffixes and prefixes from the word in the first place,
a constant letter or more could be eliminated from the word,
leading to extracting the root incorrectly, such as extracting
the root (‫ )ظمآ‬of (‫)منظمات‬. Also, in Khoja’s algorithm,
incorrect root extraction may occur in weak words, because
the weak letter of the word is deleted without implementing
the litter exchange process, or incorrect replacement of the
weak letter. However, in [8] root-based stemmer takes in
account of the letter exchange rule, and the popularity weak
root types in the Arabic language. Overall, the root-based
stemmer outperforms Khoja’s stemmer extracting more
accurate roots as seen in Fig. 4.
3500
3000
2500
Khoja's
stemmer
2000
1500
Alsaad's
stemmer
1000
500
0
Roots
Extracted
Accurate
Roots
Inaccurate
Roots
Figure 4. Alsaad’s Root-stemming vs Khoja’s Testing Result – in number
extracted roots [8].
B. Topic Identification and Feature Selection
Many research studies for topic classification have been
carried out. Most approaches to discover the topic of a
document or a group of documents are based on clustering
algorithms. The term clustering or cluster analysis is the
assignment of a set of observations into subsets, referred to
as clusters, so that observations in the same cluster are related
to each other in a way [9].
Another approach is to determine a text-document topic by
using self-organizing networks. For example, a new method
was presented in [10] to identify a text-document topic based
on self-organizing neural networks studies. The authors in
[10] have exploited a class of self-organizing neural netstudies. This class is called Adaptive Resonance Theory
(ART) networks studies. Fuzzy ART incorporates
computations from fuzzy set theory into ART network
studies.
A three-step algorithm is developed to identify a Web
document topic in [11]. This algorithm is succeeded to
identify the topic with an accuracy rate reaching maximum
69.8%. The algorithm first extracts extract the text part form
web document based on predefined tags. The second step of
the algorithm is to run a mapping module. The purpose of
using this module is to map the extracted keywords on the
words of ontology concepts that have been stemmed and
sense-tagged. The module exploits Yahoo ontology and
Word Net as extended ontology database. The algorithm
finally runs an optimization module to shrink the ontology
tree into an optimized tree where only active concepts and the
intermediate active concepts are chosen. An algorithm was
developed also in [9] to search for a node with the greatest
accumulated mixture distribution among the optimized tree.
This algorithm helps determining the most suspicious nodes
to be the topic.
III. P ROPOSED ALGORITHM
In this work, we present a new Arabic text feature
selection method by implementing Alsaad and Abbod’s rootbased stemming algorithm [8], as well as the TFIDF
algorithm to extract the significant terms in the data set.
Alsaad and Abbod’s root extraction algorithm is
constructed by implementing three main phases. The first
phase concentrates on eliminating suffixes and prefixes
depending on the length of the processed word, as well as
employing a pattern matching function to eliminate infixes
and extract the root of the word. The words are matched
against patterns of similar length after every prefix/suffix
deletion, as it can be seen in Table I. The reason of that is to
advance the speed of root extraction and to evade removing
original letters of the word, which are equal to a suffix/prefix.
After that, if the word root is still not found, the word is
passed to the second phase of the algorithm where it is
decided to remove suffixes and prefixes that are of one letter
as long as the word is more than three letters long. If the word
is three letters long, it is then processed depending on it being
hamzated, weak, geminated, or a word with eliminated long
vowel. Finally, if the word is of two letters, it is processed
depending on its being a geminated or a long-voweleliminated word.
TABLE I.
Length of Patterns/Roots
Length 4
ARABIC PATTERNS AND ROOTS
Patterns
‫ فعلل‬،‫ فعلة‬،‫ مفعل‬،‫ أفعل‬،‫ فعول‬،‫ فعيل‬،‫ فعال‬،‫فاعل‬
Length 5 patterns of
tri-literal roots
،‫ تفوعل‬،‫ تفاعل‬،‫ متفعل‬،‫ افعال‬،‫ مفاعل‬،‫ تفعلة‬،‫تفعيل‬
،‫ مفعال‬،‫ فعالل فعالء‬،‫ فعالن‬،‫ مفتعل‬،‫ افتعل‬،‫انفعل‬
،‫ فاعول‬،‫ تفعيل‬،‫ مفعول‬،‫ أفاعل‬،‫ فعائل‬،‫ فعيلة‬،‫فواعل‬
‫ أفعلة‬،‫ مفعلة‬،‫ تفعلة‬،‫ فعالة‬،‫فعلى‬
Length 5 patterns of
quad roots
‫ فعالل‬،‫ فعللة‬،‫ مفعلل‬،‫تفعلل‬
Length 6 patterns of
tri-literal roots
،‫ مفاعلة‬،‫ متفاعل‬،‫ انفعال‬،‫ افتعال‬،‫ مستفعل‬،‫استفعل‬
‫ مفعوعل‬،‫ افعوعل‬،‫ أفاعيل‬،‫ أفعالء‬،‫مفاعيل‬
Length 6 patterns of quad
roots
Length 6 or more
‫ متفعلل‬،‫ افتعلل‬،‫ افعالل‬،‫فعاليل‬
‫ افعيعالل‬،‫استفعال‬
After stemming and indexing the data set, the TFIDF
algorithm is used to calculate the weights of the roots to
identify the highly significant topics. A prototype vector is
then built using a training data set of a particular topic. The
weight for each element in the vector is obtained as the
combination of the term frequency TF (w,d), that is the
number of times the word w is repeated in the document d,
and IDF (w), which is the inverse document frequency [12,
13]. The weight of the word 𝑤𝑖 in document d is called 𝑑𝑖 and
is obtained using the following equation:
𝑑𝑖 = 𝑇𝐹(𝑤, 𝑑) × 𝐼𝐷𝐹(𝑤)
(1)
The IDF (w), which is the inverse document frequency of the
term, is calculated by applying equation (2) below:
𝐼𝐷𝐹(𝑤) = log (
𝑁
𝐷𝐹(𝑤)
)
root-based stemming as well as TFIDF gives a result of less
indexed terms and a more efficient terms weighting than
when using light stemming methods.
TABLE II.
(2)
where N is the total number of documents and DF is the
number of documents in which the term has occurred in [14].
As we calculate the TFIDF value for each term in the data set,
we extract the terms with the highest TFIDF to present the
highly significant terms of the topic.
IV. EXPERIMENTS AND RESULTS
Until recently, there are no standard Arabic text corpora
for data mining and classification research purposes.
However, a number of studies are trying to scientifically
compile representative training data sets for Arabic text
classification, which cover different text topics that can be
used in future as a benchmark [15]. Therefore, many works
dealing with topic identification or text categorization for
Arabic language were conducted out using non representative
and small corpora. In order to support and test our algorithm,
we selected a text corpus of 1000 articles which corresponded
to thousands of words. The corpus was retrieved from an
online Arabic database resource providing thousands of
Arabic
online
newspaper
articles
(www.sourceforge.net/projects/arabiccorpus). The text we
chose to process belongs to the ‘Culture and Education’
category (‫ )المقاالت الثقافية‬and should extract culture related
terms as the highly significant topics.
In our experiment, a data pre-processing step was
conducted before the stemming and weighting stage. Every
article was processed to remove punctuation marks and digits
and eliminate stop words. After that, we implement Alsaad
and Abbod’s [8] root-based stemmer to extract and index the
words as roots, reducing the number of indexed terms and to
achieve a better result covering the main terms of the
category avoiding repetitive listing of words that belongs to
the same morpheme. Sequentially, the TFIDF values are
calculated to extract the highly significant terms.
As we calculated the TFIDF values, we extracted the top
ten terms to represent the category, as shown in Table II.
Subsequently, the extracted terms are compared with the top
TFIDF terms extracted from the same category articles where
a light stemmer is implemented to stem the words [16], as
shown in Table III. We can see that the terms represented by
roots in our results have a higher TFIDF value as more than
one world relates to the same morpheme. For example, our
results list the root (‫ )فنن‬of the noun (‫ )فن‬instead of listing
more than one word that belongs to the same noun, such as
(‫)فنية‬, (‫ (فن‬and (‫)فنان‬. As well as for the words (‫ )عربية‬and
(‫)عربي‬, the term extracted using our method is ‫عرب‬.
Therefore, implementing a feature selection algorithm with
TABLE III.
TERMS EXTRACTED VIA ROOT-STEMMING AND TFIDF
TFIDF
Extracted Term
4318
3891
3816
3306
3182
3160
3303
2334
2267
2156
‫علم‬
‫عمل‬
‫عرض‬
‫كتب‬
‫عرب‬
‫شعر‬
‫فنن‬
‫سرح‬
‫ثقف‬
‫حدث‬
TERMS EXTRACTED VIA LIGHT STEMMING AND TFIDF
TFIDF
Extracted Term
1867
‫عربية‬
1308
‫عربي‬
1192
‫عالم‬
1179
‫عمل‬
1063
‫كتاب‬
1056
‫فنية‬
1053
‫فن‬
1042
‫ثقافة‬
1030
‫معرض‬
948
‫فنان‬
V. CONCLUSIONS
In this paper, we presented an enhanced approach to
identify significant keywords for Arabic corpora. That is
achieved by combining advanced stemming and root
extraction algorithm, and the well-known Term
Frequency/Inverse Document Frequency (TFIDF) topic
classification method. A root-based stemmer is implemented
which handles the problems of infixes removal by
eliminating prefixes, suffixes while checking the word
against a predefined list of patterns. Also, the problem of
extracting the roots of weak, hamzated, eliminated-longvowel and tow-letter geminated words has been resolved. Our
results show that combining advanced stemming, root
extraction and TFIDF techniques, lead to extracting a highly
significant terms representation of Arabic roots. The
proposed algorithm leads to a smaller index size and a better
vectors selection of terms for each topic, representing the
most important aspects of the topic and avoiding term
repetition of similar words, or words belonging to the same
root.
[4]
VI. FUTURE W ORK
This work is to be developed to comprise documents
classification. Where more documents of six different
categories will be processed to generate vectors for each
topic. Consequently, a group of unclassified documents will
be classified by, first processing them using our root-based
stemmer, calculating the document’s terms frequencies, and
then finding the similarity measure between these terms and
the terms of each topic vector. The document topic is then
selected depending on the topic vector with the highest
similarity measure. There are different equations where the
similarity measure is calculated [17]. In this work, the
similarity measure is to be calculated using cosine function
as in (3):
[5]
[6]
Y. Kadri and J.Y. Nie, “Effective stemming for Arabic Information
Retrieval,” International conference at the British Computer Society,
London, 23 Oct. 2006, pp. 68-74.
[7]
S. Khoja and R. Garside. ‘Stemming Arabic text’, Computer Science
Department, Lancaster University, Lancaster, UK, 1999.
[8]
A. Alsaad and M. Abbod, “Arabic Text Root Extraction via
Morphological Analysis and Linguistic Constraints,” UKSIM '14
Proceedings of the 2014 UKSim-AMSS 16th International Conference
on Computer Modelling and Simulation, Cambridge, UK, 2014, pp.
125-130.
[9]
H. S. Baghdadi and B Ranaivo-Malancon, “An Automatic Topic
Identification Algorithm,” Journal of Computer Science, vol 7, no. 9,
2011, pp. 1363-1367.
K. Rajaraman and A. H. Tan, “Topic detection, tracking, and trend
analysis using self-organizing neural networks”, Advanced Knowledge
Discovery Data Mining, 2001.
S. Tiun, R. Abdullah and T.E. Kong, “Automatic topic identification
using ontology hierarchy,” Comput. Linguistic Intell, 2010.
T. Joachims, ‘A Probabilistic Analysis of the Rocchio Algorithm with
TFIDF for Text Categorization’, Technical report, School of Computer
Science Carnegie, Mellon University, Pittsburgh, 1996.
G. Salton, ‘Developments in Automatic Text Retrieval’, Science 253,
1991, pp. 974–979.
K. Seymore and R. Rosenfeld. “Using Story Topics for Language
Model Adaptation”. In Proceeding of the European Conference on
Speech Communication and Technology, 1997.
S. Al-Harbi and A. Almuhareb, A. Al-Thubaity, M. S. Khorsheed & A.
Al-Rajeh, “Automatic Arabic Text Classification”, 9es Journees
Internationales d’Analyse Statistique des Donnees Textuelles (JADT),
2008, pp. 77-83, Lyon, France.
M. Abbas, K. Smaili and D. Berkani, “Evaluation of Topic
Identification Methods on Arabic Corpora,” Journal of Digital
Information Management, vol. 9, no. 5, 2011, pp. 185-192.
H. Ayeldeen, M. Mahmood and A. Hassanien, “Effective
Classification and Categorization for Categorical Sets: Distance
Similarity Measures,” In Proceedings of Proceedings of Second
International Conference, India 2015, Advances in Intelligent Systems
and Computing, vol. 1, no. 339, 2015.
|𝒗|
𝑺𝒊𝒎(𝑫𝒋 , 𝑫𝒊 ) =
∑𝒌=𝟏 𝒅𝒋𝒌 𝒅𝒊𝒌
√∑|𝒗| (𝒅𝒋𝒌 )𝟐 ∑|𝒗| (𝒅𝒊𝒌 )𝟐
𝒌=𝟏
𝒌=𝟏
(3)
[10]
After obtaining the results, they will be compared with the
results obtained using the same classification method, yet a
light stemming approach, which is implemented by Abbas et
al [16]. In their work, Abbas et al. applies the TFIDF
weighting method via light stemming and the similarity
measure above, over news articles of six categories attained
from Arabic online news articles database, achieving the
results shown in Table IV below [16]. The proposed
classification method is to be implemented over the same
news articles, while the results projected to outperform those
achieved by Abbas et al [16].
TABLE IV.
Recall (%)
Precision (%)
Culture
71.33
88.43
Religion
93.33
86.95
Economy
83.33
80.64
80
76.92
93.33
84.33
Local
International
[1]
[11]
[12]
[13]
[14]
[15]
RESULTS OF ABBAS’S TFIDF CLASSIFICATION APPROACH
Topic
Sports
94
100
average
85.88
86.21
REFERENCES
M. K. Saad and W. Ashour, “Arabic morphological tools for text
mining,” Proc. of the 6th EEECS International Conference on Electrical
and Computer Systems, Nov. 2010.
[2]
S. Ghwanmeh, R. Al-Shalabi, G. Kanaan and S. Rabab’ah, “Enhanced
algorithm for extracting the root of Arabic words,” Proc. of the 6th IEEE
International Conference Computer Graphics and Visualization, 2009,
pp. 388-391.
[3]
G. Kanaan, R. Al-Shalabi, S. Ghwanmeh and H. Al-Ma’adeed, “A
Comparison of Text-Classification Techniques Applied to Arabic
Text,” Journal of the American Society for Information Science and
Technology, vol. 60, issue. 9, 2009, pp. 1836-1844.
G. Guo, H. Wang, D. Bell, Y. Bi, and K. Greer, “A KNN model based
approach and its application in text categorization,” In G. Goos, J.
Hartmanis, & J. van Leeuwen (Eds.), Lecture Notes in Computer
Science, Vol. 2945: Computational Linguistics and Intelligent Text
Processing, 5th International Conference, Berlin, 2004, pp. 559–570.
R. Alshalabi, “Pattern-based stemmer for finding Arabic roots,”
Information Technology Journal, vol. 4, no. I, 2005, pp. 38- 43.
[16]
[17]