Download FullText - Brunel University Research Archive

Enhanced Topic Identification Algorithm for Arabic Corpora Amal Alsaad Maysam Abbod Department of Electronic & Computer Engineering Brunel University London London, UK [email protected] Department of Electronic & Computer Engineering Brunel University London London, UK [email protected] Abstract - During the past few years, the construction of digitalized content is rapidly increasing, raising the demand of information retrieval, data mining and automatic data tagging applications. There are few researches in this field for Arabic data due to the complex nature of Arabic language and the lack of standard corpora. In addition, most work focuses on improving Arabic stemming algorithms, or topic identification and classification methods and experiments. No work has been conducted to include an efficient stemming method within the classification algorithm, which would lead to more efficient outcome. In this paper, we propose a new approach to identify significant keywords for Arabic corpora. That is done by implementing advanced stemming and root extraction algorithm, as well as Term Frequency/Inverse Document Frequency (TFIDF) topic identification method. Our results show that combining advanced stemming, root extraction and TFIDF techniques, lead to extracting a highly significant terms represented by Arabic roots. These roots weights higher TFIDF values than terms extracted without the use of advanced stemming and root extraction methods. Decreasing the size of indexed words and improving the feature selection process. Keywords - root extraction; feature selection; topic identification; natural language processing; data mining; text mining have been conducted to get more efficient data mining results in Arabic Information Retrieval systems, such as text stemming and text classification algorithms [2, 3]. 1000 800 600 400 200 0 Figure 1. Top Ten Languages in the Internet 2013 – in millions of users. 6000 5000 4000 3000 I. INTRODUCTION 2000 Digitalized textual data and the use of Internet have been vastly increasing during the past few years. It was estimated that the online language population is 2,802,478,934 people in December 2013 by Internet World Stats (www. internetworldstats.com). Internet growth between 2000 and 2014 has reached 676.3%, delivering the idea about how great is the amount of data on the web nowadays. Internet World Stats shows that Arabic is the 4th most used language with more than 135.6 millions of users, as seen in Fig. 1. As well, Arabic is on the top of the list as the fastest growing language on the web, with a growth rate of 5,296.6% between 2000 and 2014, as shown in Fig. 2. This indicates that the construction of Arabic data is growing rapidly, rising the need for standard data mining and natural language processing tools to represent and analyse this data in the most efficient way. In contract, there are no standard Arabic text mining and text classification tools until recently [1]. This is due to its complex structure and difficult linguistic rules compared to English and other languages. Nevertheless, many studies 1000 0 Figure 2. Percentages of Users Growth in the Internet by Language between 2000–2013. Text classification, which is also known as text categorization or topic identification, is the assignment of discovering if a piece of text belongs to any of a predefined set of classes [3]. Another definition states that, the goal of text classification is to learn classification methods which can be implemented to classify documents automatically [4]. Text classification requires the use of text pre-processing methods to represent the text before processing text classification. Such methods are referred to as text stemming methods, which include removal of insignificant characters, affixes and stop words. Text stemming, is a significant approach for text representation in the fields of text mining and natural language processing. For Arabic text data mining, the main two stemming schemes consist of light stemming and rootbased stemming [5]. Light stemming methods are used to remove prefixes, suffixes to generate a stem. They are used to derive a more efficient form of representative indexing for words [6]. In root-based stemming, roots of the words are extracted by defining morphological analysis techniques, where the roots are extracted, the words are then grouped accordingly. Research that has been done on Arabic information retrieval algorithms focuses on improving root-based stemming techniques, or on comparing and improving text classification algorithms using light stemming for text preprocessing. There are no work done that includes enhanced root stemming within text classification combined. In this paper, we present a new approach to identify keywords representing features of Arabic text collections. The keywords are used to create features to be used while classifying unclassified text documents. Our approach comprises implementing an advanced root-stemming algorithm combined with Term Frequency/Inverse Document Frequency (TFIDF) topic identification method. II. RELATED W ORK Many methods and algorithms were developed for Arabic text mining the fields of natural language processing and information retrieval. In this section, we discuss related work that has been conducted on Arabic text stemming and Arabic topic identification and feature selection consecutively. A. Arabic Text Stemming Methods Light stemmers and root-based stemmers are the fundamental two approaches of Arabic text stemming [5]. Light stemmers are used mainly in information retrieval. The main concept of light stemmers is to remove prefixes and suffixes from a word and then create a stem. In this process, the ideal forms of representative indexing for words are derived [6]. An example of overall steps of Arabic light stemming algorithms represented in Fig. 3. 1. Normalize word  Remove diacritics  Replace ‫ آ‬،‫ أ‬،‫ إ‬with ‫ا‬  Replace ‫ ة‬with ‫ه‬  Replace ‫ ى‬with ‫ي‬ 2. Stem prefixes  Remove prefixes: ‫ و‬، ‫ للـ‬، ‫ فالـ‬، ‫ كالـ‬، ‫ بالـ‬، ‫ والـ‬، ‫الـ‬ 3. Stem prefixes  Remove suffixes: ‫ ي‬، ‫ ه‬، ‫ ية‬، ‫ ين‬، ‫ ون‬، ‫ ات‬، ‫ ان‬، ‫ها‬ Figure 3. Steps of Arabic Light Stemming Algorithms [1]. The second approach of Arabic text stemming is rootbased stemming. In this approach, roots of the words are extracted by defining morphological analysis techniques. As the roots are extracted, the words are then grouped accordingly [5]. There are two main steps of root-based stemmers approach. First step is removing prefixes and suffixes, and secondly extracting the roots by analysing the words depending on their morphological components. Root based stemmers take into consideration the great amount of lexical variation that caused by Arabic complex morphology. As mentioned previously, this concern would enlarge the indexing structure volume and reduce the performance of the system. The other reason of using root-based stemmers is that words which are entered as user query in information retrieval systems are not exactly matching those included in the relevant documents [2]. Khoja’s stemming algorithm is considered as one of the earliest and highly accurate Arabic root-based stemmers in the literature [7]. Khoja designed an algorithm which eliminates the longest suffix and prefix. Next the algorithm extracts the root by matching the rest of the word against a list of verbal and noun patterns to extract the root. Checking the root against a list of roots is an important performed step of Khoja’s stemming algorithm. This step is important to check the correctness of the root. Once the extracted root is found, the algorithm assigns it as the root of the word. The algorithm utilizes several linguistic data files that contain lists of all diacritic characters, punctuation characters, definite article, and stop words. In addition, the stemmer handles some cases of Arabic tri-literal words that are weak, hamzated, geminated or eliminated-long-vowel. However, Khoja’s algorithm has a number of weaknesses. One of the issues is that the word (‫ )منظمات‬which means (organizations) is stemmed to the root (‫ )ظمآ‬which means (he became thirsty) instead of the correct root (‫)نظم‬. One more issue about this algorithm is happened when the word is deducted to a tri-literal word, the weak letter is deleted in the first place, and then the last letter is doubled, or another weak letter or an alif is added to the word. This issue leads to extracting a root of another word which may not be related to the word. As an example, the root of the word (‫ )روايات‬is extracted as (‫)ريي‬, where the correct root is (‫)روي‬. Another example of Khoja’s algorithm weaknesses is extracting the root for the word (‫ )آخر‬as (‫)خرر‬, where the correct root is (‫)أخر‬. Alsaad and Abbod [8] proposed an enhanced root-based stemming algorithm that considers weak, hamzated, eliminated-long-vowel and two-letter geminated words, as well as tackling the main issues which occurred in Khoja’s stemming algorithm. For example, the shortest affix/prefix is eliminated from a word, and then the word is checked against Arabic word patterns according to its length. If no match is found, it is checked if there are more suffixes/prefixes to eliminate, and so on. That is because, while removing the longest suffixes and prefixes from the word in the first place, a constant letter or more could be eliminated from the word, leading to extracting the root incorrectly, such as extracting the root (‫ )ظمآ‬of (‫)منظمات‬. Also, in Khoja’s algorithm, incorrect root extraction may occur in weak words, because the weak letter of the word is deleted without implementing the litter exchange process, or incorrect replacement of the weak letter. However, in [8] root-based stemmer takes in account of the letter exchange rule, and the popularity weak root types in the Arabic language. Overall, the root-based stemmer outperforms Khoja’s stemmer extracting more accurate roots as seen in Fig. 4. 3500 3000 2500 Khoja's stemmer 2000 1500 Alsaad's stemmer 1000 500 0 Roots Extracted Accurate Roots Inaccurate Roots Figure 4. Alsaad’s Root-stemming vs Khoja’s Testing Result – in number extracted roots [8]. B. Topic Identification and Feature Selection Many research studies for topic classification have been carried out. Most approaches to discover the topic of a document or a group of documents are based on clustering algorithms. The term clustering or cluster analysis is the assignment of a set of observations into subsets, referred to as clusters, so that observations in the same cluster are related to each other in a way [9]. Another approach is to determine a text-document topic by using self-organizing networks. For example, a new method was presented in [10] to identify a text-document topic based on self-organizing neural networks studies. The authors in [10] have exploited a class of self-organizing neural netstudies. This class is called Adaptive Resonance Theory (ART) networks studies. Fuzzy ART incorporates computations from fuzzy set theory into ART network studies. A three-step algorithm is developed to identify a Web document topic in [11]. This algorithm is succeeded to identify the topic with an accuracy rate reaching maximum 69.8%. The algorithm first extracts extract the text part form web document based on predefined tags. The second step of the algorithm is to run a mapping module. The purpose of using this module is to map the extracted keywords on the words of ontology concepts that have been stemmed and sense-tagged. The module exploits Yahoo ontology and Word Net as extended ontology database. The algorithm finally runs an optimization module to shrink the ontology tree into an optimized tree where only active concepts and the intermediate active concepts are chosen. An algorithm was developed also in [9] to search for a node with the greatest accumulated mixture distribution among the optimized tree. This algorithm helps determining the most suspicious nodes to be the topic. III. P ROPOSED ALGORITHM In this work, we present a new Arabic text feature selection method by implementing Alsaad and Abbod’s rootbased stemming algorithm [8], as well as the TFIDF algorithm to extract the significant terms in the data set. Alsaad and Abbod’s root extraction algorithm is constructed by implementing three main phases. The first phase concentrates on eliminating suffixes and prefixes depending on the length of the processed word, as well as employing a pattern matching function to eliminate infixes and extract the root of the word. The words are matched against patterns of similar length after every prefix/suffix deletion, as it can be seen in Table I. The reason of that is to advance the speed of root extraction and to evade removing original letters of the word, which are equal to a suffix/prefix. After that, if the word root is still not found, the word is passed to the second phase of the algorithm where it is decided to remove suffixes and prefixes that are of one letter as long as the word is more than three letters long. If the word is three letters long, it is then processed depending on it being hamzated, weak, geminated, or a word with eliminated long vowel. Finally, if the word is of two letters, it is processed depending on its being a geminated or a long-voweleliminated word. TABLE I. Length of Patterns/Roots Length 4 ARABIC PATTERNS AND ROOTS Patterns ‫ فعلل‬،‫ فعلة‬،‫ مفعل‬،‫ أفعل‬،‫ فعول‬،‫ فعيل‬،‫ فعال‬،‫فاعل‬ Length 5 patterns of tri-literal roots ،‫ تفوعل‬،‫ تفاعل‬،‫ متفعل‬،‫ افعال‬،‫ مفاعل‬،‫ تفعلة‬،‫تفعيل‬ ،‫ مفعال‬،‫ فعالل فعالء‬،‫ فعالن‬،‫ مفتعل‬،‫ افتعل‬،‫انفعل‬ ،‫ فاعول‬،‫ تفعيل‬،‫ مفعول‬،‫ أفاعل‬،‫ فعائل‬،‫ فعيلة‬،‫فواعل‬ ‫ أفعلة‬،‫ مفعلة‬،‫ تفعلة‬،‫ فعالة‬،‫فعلى‬ Length 5 patterns of quad roots ‫ فعالل‬،‫ فعللة‬،‫ مفعلل‬،‫تفعلل‬ Length 6 patterns of tri-literal roots ،‫ مفاعلة‬،‫ متفاعل‬،‫ انفعال‬،‫ افتعال‬،‫ مستفعل‬،‫استفعل‬ ‫ مفعوعل‬،‫ افعوعل‬،‫ أفاعيل‬،‫ أفعالء‬،‫مفاعيل‬ Length 6 patterns of quad roots Length 6 or more ‫ متفعلل‬،‫ افتعلل‬،‫ افعالل‬،‫فعاليل‬ ‫ افعيعالل‬،‫استفعال‬ After stemming and indexing the data set, the TFIDF algorithm is used to calculate the weights of the roots to identify the highly significant topics. A prototype vector is then built using a training data set of a particular topic. The weight for each element in the vector is obtained as the combination of the term frequency TF (w,d), that is the number of times the word w is repeated in the document d, and IDF (w), which is the inverse document frequency [12, 13]. The weight of the word 𝑤𝑖 in document d is called 𝑑𝑖 and is obtained using the following equation: 𝑑𝑖 = 𝑇𝐹(𝑤, 𝑑) × 𝐼𝐷𝐹(𝑤) (1) The IDF (w), which is the inverse document frequency of the term, is calculated by applying equation (2) below: 𝐼𝐷𝐹(𝑤) = log ( 𝑁 𝐷𝐹(𝑤) ) root-based stemming as well as TFIDF gives a result of less indexed terms and a more efficient terms weighting than when using light stemming methods. TABLE II. (2) where N is the total number of documents and DF is the number of documents in which the term has occurred in [14]. As we calculate the TFIDF value for each term in the data set, we extract the terms with the highest TFIDF to present the highly significant terms of the topic. IV. EXPERIMENTS AND RESULTS Until recently, there are no standard Arabic text corpora for data mining and classification research purposes. However, a number of studies are trying to scientifically compile representative training data sets for Arabic text classification, which cover different text topics that can be used in future as a benchmark [15]. Therefore, many works dealing with topic identification or text categorization for Arabic language were conducted out using non representative and small corpora. In order to support and test our algorithm, we selected a text corpus of 1000 articles which corresponded to thousands of words. The corpus was retrieved from an online Arabic database resource providing thousands of Arabic online newspaper articles (www.sourceforge.net/projects/arabiccorpus). The text we chose to process belongs to the ‘Culture and Education’ category (‫ )المقاالت الثقافية‬and should extract culture related terms as the highly significant topics. In our experiment, a data pre-processing step was conducted before the stemming and weighting stage. Every article was processed to remove punctuation marks and digits and eliminate stop words. After that, we implement Alsaad and Abbod’s [8] root-based stemmer to extract and index the words as roots, reducing the number of indexed terms and to achieve a better result covering the main terms of the category avoiding repetitive listing of words that belongs to the same morpheme. Sequentially, the TFIDF values are calculated to extract the highly significant terms. As we calculated the TFIDF values, we extracted the top ten terms to represent the category, as shown in Table II. Subsequently, the extracted terms are compared with the top TFIDF terms extracted from the same category articles where a light stemmer is implemented to stem the words [16], as shown in Table III. We can see that the terms represented by roots in our results have a higher TFIDF value as more than one world relates to the same morpheme. For example, our results list the root (‫ )فنن‬of the noun (‫ )فن‬instead of listing more than one word that belongs to the same noun, such as (‫)فنية‬, (‫ (فن‬and (‫)فنان‬. As well as for the words (‫ )عربية‬and (‫)عربي‬, the term extracted using our method is ‫عرب‬. Therefore, implementing a feature selection algorithm with TABLE III. TERMS EXTRACTED VIA ROOT-STEMMING AND TFIDF TFIDF Extracted Term 4318 3891 3816 3306 3182 3160 3303 2334 2267 2156 ‫علم‬ ‫عمل‬ ‫عرض‬ ‫كتب‬ ‫عرب‬ ‫شعر‬ ‫فنن‬ ‫سرح‬ ‫ثقف‬ ‫حدث‬ TERMS EXTRACTED VIA LIGHT STEMMING AND TFIDF TFIDF Extracted Term 1867 ‫عربية‬ 1308 ‫عربي‬ 1192 ‫عالم‬ 1179 ‫عمل‬ 1063 ‫كتاب‬ 1056 ‫فنية‬ 1053 ‫فن‬ 1042 ‫ثقافة‬ 1030 ‫معرض‬ 948 ‫فنان‬ V. CONCLUSIONS In this paper, we presented an enhanced approach to identify significant keywords for Arabic corpora. That is achieved by combining advanced stemming and root extraction algorithm, and the well-known Term Frequency/Inverse Document Frequency (TFIDF) topic classification method. A root-based stemmer is implemented which handles the problems of infixes removal by eliminating prefixes, suffixes while checking the word against a predefined list of patterns. Also, the problem of extracting the roots of weak, hamzated, eliminated-longvowel and tow-letter geminated words has been resolved. Our results show that combining advanced stemming, root extraction and TFIDF techniques, lead to extracting a highly significant terms representation of Arabic roots. The proposed algorithm leads to a smaller index size and a better vectors selection of terms for each topic, representing the most important aspects of the topic and avoiding term repetition of similar words, or words belonging to the same root. [4] VI. FUTURE W ORK This work is to be developed to comprise documents classification. Where more documents of six different categories will be processed to generate vectors for each topic. Consequently, a group of unclassified documents will be classified by, first processing them using our root-based stemmer, calculating the document’s terms frequencies, and then finding the similarity measure between these terms and the terms of each topic vector. The document topic is then selected depending on the topic vector with the highest similarity measure. There are different equations where the similarity measure is calculated [17]. In this work, the similarity measure is to be calculated using cosine function as in (3): [5] [6] Y. Kadri and J.Y. Nie, “Effective stemming for Arabic Information Retrieval,” International conference at the British Computer Society, London, 23 Oct. 2006, pp. 68-74. [7] S. Khoja and R. Garside. ‘Stemming Arabic text’, Computer Science Department, Lancaster University, Lancaster, UK, 1999. [8] A. Alsaad and M. Abbod, “Arabic Text Root Extraction via Morphological Analysis and Linguistic Constraints,” UKSIM '14 Proceedings of the 2014 UKSim-AMSS 16th International Conference on Computer Modelling and Simulation, Cambridge, UK, 2014, pp. 125-130. [9] H. S. Baghdadi and B Ranaivo-Malancon, “An Automatic Topic Identification Algorithm,” Journal of Computer Science, vol 7, no. 9, 2011, pp. 1363-1367. K. Rajaraman and A. H. Tan, “Topic detection, tracking, and trend analysis using self-organizing neural networks”, Advanced Knowledge Discovery Data Mining, 2001. S. Tiun, R. Abdullah and T.E. Kong, “Automatic topic identification using ontology hierarchy,” Comput. Linguistic Intell, 2010. T. Joachims, ‘A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization’, Technical report, School of Computer Science Carnegie, Mellon University, Pittsburgh, 1996. G. Salton, ‘Developments in Automatic Text Retrieval’, Science 253, 1991, pp. 974–979. K. Seymore and R. Rosenfeld. “Using Story Topics for Language Model Adaptation”. In Proceeding of the European Conference on Speech Communication and Technology, 1997. S. Al-Harbi and A. Almuhareb, A. Al-Thubaity, M. S. Khorsheed & A. Al-Rajeh, “Automatic Arabic Text Classification”, 9es Journees Internationales d’Analyse Statistique des Donnees Textuelles (JADT), 2008, pp. 77-83, Lyon, France. M. Abbas, K. Smaili and D. Berkani, “Evaluation of Topic Identification Methods on Arabic Corpora,” Journal of Digital Information Management, vol. 9, no. 5, 2011, pp. 185-192. H. Ayeldeen, M. Mahmood and A. Hassanien, “Effective Classification and Categorization for Categorical Sets: Distance Similarity Measures,” In Proceedings of Proceedings of Second International Conference, India 2015, Advances in Intelligent Systems and Computing, vol. 1, no. 339, 2015. |𝒗| 𝑺𝒊𝒎(𝑫𝒋 , 𝑫𝒊 ) = ∑𝒌=𝟏 𝒅𝒋𝒌 𝒅𝒊𝒌 √∑|𝒗| (𝒅𝒋𝒌 )𝟐 ∑|𝒗| (𝒅𝒊𝒌 )𝟐 𝒌=𝟏 𝒌=𝟏 (3) [10] After obtaining the results, they will be compared with the results obtained using the same classification method, yet a light stemming approach, which is implemented by Abbas et al [16]. In their work, Abbas et al. applies the TFIDF weighting method via light stemming and the similarity measure above, over news articles of six categories attained from Arabic online news articles database, achieving the results shown in Table IV below [16]. The proposed classification method is to be implemented over the same news articles, while the results projected to outperform those achieved by Abbas et al [16]. TABLE IV. Recall (%) Precision (%) Culture 71.33 88.43 Religion 93.33 86.95 Economy 83.33 80.64 80 76.92 93.33 84.33 Local International [1] [11] [12] [13] [14] [15] RESULTS OF ABBAS’S TFIDF CLASSIFICATION APPROACH Topic Sports 94 100 average 85.88 86.21 REFERENCES M. K. Saad and W. Ashour, “Arabic morphological tools for text mining,” Proc. of the 6th EEECS International Conference on Electrical and Computer Systems, Nov. 2010. [2] S. Ghwanmeh, R. Al-Shalabi, G. Kanaan and S. Rabab’ah, “Enhanced algorithm for extracting the root of Arabic words,” Proc. of the 6th IEEE International Conference Computer Graphics and Visualization, 2009, pp. 388-391. [3] G. Kanaan, R. Al-Shalabi, S. Ghwanmeh and H. Al-Ma’adeed, “A Comparison of Text-Classification Techniques Applied to Arabic Text,” Journal of the American Society for Information Science and Technology, vol. 60, issue. 9, 2009, pp. 1836-1844. G. Guo, H. Wang, D. Bell, Y. Bi, and K. Greer, “A KNN model based approach and its application in text categorization,” In G. Goos, J. Hartmanis, & J. van Leeuwen (Eds.), Lecture Notes in Computer Science, Vol. 2945: Computational Linguistics and Intelligent Text Processing, 5th International Conference, Berlin, 2004, pp. 559–570. R. Alshalabi, “Pattern-based stemmer for finding Arabic roots,” Information Technology Journal, vol. 4, no. I, 2005, pp. 38- 43. [16] [17]

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download FullText - Brunel University Research Archive