Stemming Arabic Conjunctions and Prepositions

Abdusalam F.A. Nwesri, S.M.M. Tahaghoghi, and Falk Scholer

School of Computer Science and Information Technology, RMIT University, GPO Box 2476V, Melbourne 3001, Australia
{nwesri, saied, fscholer}@cs.rmit.edu.au

Abstract. Arabic is the fourth most widely spoken language in the world, and is characterised by a high rate of inflection. To cater for this, most Arabic information retrieval systems incorporate a stemming stage. Most existing Arabic stemmers are derived from English equivalents; however, unlike English, most affixes in Arabic are difficult to discriminate from the core word. Removing incorrectly identified affixes sometimes results in a valid but incorrect stem, and in most cases reduces retrieval precision. Conjunctions and prepositions form an interesting class of these affixes. In this work, we present novel approaches for dealing with these affixes. Unlike previous work, our approaches focus on retaining valid Arabic core words, while maintaining high retrieval performance.

1 Introduction

Arabic is a Semitic language, with a morphology based on building inflected words from roots that have three, four, or sometimes five letters. For example, each verb can be written in sixty-two different forms [14]. Words are inflected and morphologically marked according to gender (masculine and feminine); case (nominative, genitive, and accusative); number (singular, dual, and plural); and determination (definite and indefinite).

Arabic has three types of affixation: prefixes, suffixes and infixes. In contrast with English, some Arabic affixes are very difficult to remove without proper identification. It is common to have multiple affixes on a word. A clear example is the use of pronouns. Unlike English, where words and possessive pronouns are written separately, Arabic possessive pronouns are attached at the end of the word in most cases.
For instance, the English sentence “they will teach it to you” can be written as a single Arabic word (see Table 1 for the mapping between this word and its English translation). A good stemmer identifies the stem of this word, meaning 'teach' or 'knew'.

Table 1. Mapping between the Arabic word and its English translation. The English components, in the order given, are: it, you, they, teach, will.

M. Consens and G. Navarro (Eds.): SPIRE 2005, LNCS 3772, pp. 206–217, 2005. © Springer-Verlag Berlin Heidelberg 2005

Prefixes are widely used in Arabic text. Some of these prefixes are used with verbs, some with nouns, and others with both. Affixes that are three letters in length are easily identified; however, the shorter the affix is, the more difficult it is to identify. A sub-class of prefixes is formed by prepositions and conjunctions; this sub-class is of particular interest because, if its members are identified and removed correctly, we obtain valid Arabic core words. Some of these prefixes are letters attached to the beginning of the word. However, these same letters also frequently appear as part of affix-free words. For example, the letter waw in the word meaning 'respect' is part of the word, whereas it is a conjunction in the word meaning 'and the student'. Removing this letter from the first word results in the word meaning 'asphalt'; this should be avoided, because the meaning of the returned word is changed. However, in the second word, it is essential to remove the letter in order to obtain the stem meaning 'student'.

Although it is important to remove such prefixes, current popular search engines do not do so. Moukdad [10] showed that four well-known search engines return different results for each of three forms of the same word, meaning 'the university', 'and the university', and 'and by the university'. All four search engines performed badly when searching for the second and third forms.

1.1 Arabic Conjunctions and Prepositions

Arabic has nine conjunctions.
The majority of these are written separately, except the inseparable conjunctions waw and faa, which are usually attached to a noun or a verb.

In Arabic, prepositions are added before nouns. There are twenty prepositions. Five of these are usually attached to the beginning of a word. These inseparable prepositions [13] are: lam, waw, kaf, baa, and taa.

Prepositions and conjunctions occur frequently in Arabic text. To aid information retrieval, they should be removed, so that variant forms of the same word are conflated to a single form. Separable prepositions can be easily detected and removed as stopwords. However, inseparable prepositions are difficult to remove without inadvertently changing the meaning of other words in the text.

The set of inseparable prepositions and conjunctions consists of six letters: lam, waw, faa, taa, baa, and kaf. These letters differ in their function in Arabic text and can be divided into three groups:

– waw and faa can be added to any Arabic word, as they are conjunctions. While waw is also a valid preposition, the fact that it is a conjunction means that this letter can be added to any Arabic word.
– kaf, taa, and baa are prepositions that can only be used before a noun. taa is also used as a prefix for verbs in the present simple tense. It is rarely used as a preposition in modern standard Arabic.
– lam, in addition to its purpose as a preposition like kaf and baa, can also be used with verbs as the “lam of command”. Here, it is usually prefixed to the third person to give it an imperative sense, as in the word meaning 'say it'. It is also used to indicate the purpose for which an action is performed [13].

In this paper, our main focus is on single-letter inseparable prepositions and conjunctions, and their effects on Arabic stemming.
For the remainder of this paper, the term particles will be used to represent the class of inseparable prepositions and conjunctions together. The particle taa is not considered to be a member of this class, due to its rare usage as a preposition in modern standard Arabic.

The rest of this paper is structured as follows. We first present related background, examining previous approaches for dealing with particles. We then propose several new techniques for removing particles from Arabic text, with the aim of retaining correct core words. The effectiveness of these techniques is evaluated experimentally, based both on the characteristics of the terms produced by the various algorithms and on their impact on retrieval performance.

2 Arabic Stemming

There are two main classes of stemming algorithms: heavy stemming and light stemming [2]. In both cases, the aim of stemming is to remove affixes from an input string, returning the stem of the word as an output.

Heavy (or root-based) stemming usually starts by removing well-known prefixes and suffixes. It aims to return the actual root of a word as the remaining stem, usually by applying patterns of fixed consonants [7]. The best-known pattern is a three-letter pattern that is often used to represent three-letter root words. For example, the word meaning 'wrote' can be represented by this pattern by mapping each of its three letters to the corresponding letter of the pattern.

Many stems can be generated from this root using different standard patterns. For instance, three different patterns form the singular noun, the nomen agentis, and the present-tense verb from the base pattern. By fixing the core letters and adding additional letters in each pattern, we can generate the words meaning 'book', 'writer', and 'write', respectively. The new words can accept Arabic prefixes and suffixes.

Heavy stemmers usually reverse this process by first removing any prefixes and suffixes from the word.
They then identify the pattern that the remaining word corresponds to, and usually return the root by extracting the letters that match the three core letters of the base pattern. For example, to find the root of the word meaning 'and the writer', a heavy stemmer has to remove the prefixes to get the stem meaning 'writer', and then use the pattern that matches this stem (it has the same length, and the same additional letter in the same position). The three-letter root is then returned (footnote 1).

Footnote 1: One letter of this root appears in two written forms in the example; both are forms of the same letter.

Heavy stemming has been shown to produce good results in the context of information retrieval. For example, Larkey et al. [8] show that mean average precision is improved by 75.77% using the Khoja heavy stemmer. We discuss retrieval metrics further in Sect. 4.

Light stemming stops after removing prefixes and suffixes, and does not attempt to identify the actual root. It has been demonstrated that light stemming outperforms other techniques in Arabic information retrieval. Aljlayl et al. [2] demonstrate an increase in mean average precision of 87.7%, and Larkey et al. [8] report an increase in mean average precision of 100.52%.

The core of both approaches involves the removal of affixes. Generally, removing prefixes has been dealt with in the same manner as for many European languages: the first characters of the word are matched against a pre-prepared list of prefixes, and any letters that match are truncated without first checking whether they form a real prefix [3,4,7,8]. For Arabic text, this frequently results in incorrect root extraction in heavy stemming, and an incorrect stem in light stemming. It is therefore doubtful that these simple approaches are appropriate when dealing with Arabic text. For example, the word meaning 'two boys' returns the root meaning 'soft' instead of the root meaning 'gave birth' using the Khoja stemmer, and a stem that has no meaning instead of the stem meaning 'a boy' using the Larkey stemmer.
In the last two cases, the incorrect root (stem) was due to removal of the first letter waw after incorrectly identifying it as a particle.

2.1 Current Approaches for Stemming Particles

Many stemmers have been developed for Arabic [1]. However, none deals with the removal of all particles. Some particles, such as waw, are removed by all existing stemmers; other particles, such as kaf, have never been considered on their own in existing stemming approaches.

The way in which existing stemmers deal with particles can be grouped into three general categories:

– Matching the first letter with a pre-prepared list of particles. If a match is found, the first letter is removed as long as the remaining word consists of three or more letters. This approach is used by most of the current stemmers to deal with a small subset of particles [3,4,7,8]. We call this approach Match and Truncate (MT).
– Matching the first letter with a list of particles. If a particle is found, the remaining word is checked against the list of all words that occur in the document collection being stemmed. If the stemmed word occurs in the collection, the first letter is considered a particle and removed. This approach was used by Chen and Gey [3] in conjunction with the other two approaches. We call this approach Remove and Check (RC).
– Removing particles together with other letters. For example, removing a combination of a particle and the definite article ('the'), in particular wal, fal, bal, and kal. These combinations are removed whenever they occur at the beginning of any word, and this approach is used by most current stemmers. We call this approach Remove With Other Letters (RW).

Existing stemmers often use a combination of these approaches. They usually start by using the third approach, then continue by removing other particles, particularly waw and lam.

2.2 Evaluation of Current Approaches

To check the effectiveness of current approaches for particle removal in Arabic text, we extracted all correct words that start with a particle from a collection of Arabic documents used in the TREC 2001 Cross-Language Information Retrieval track and the TREC 2002 Arabic/English CLIR track [6,11]. Further collection details are provided in Sect. 4. Words that start with a possible particle constitute 24.4% of this collection. To ensure that we extracted only correct words, we checked them using the Microsoft Office 2003 Arabic spellchecker [9]. Stopwords such as pronouns and separable particles were then removed. This procedure resulted in a list of 152,549 unique correct words that start with a possible inseparable particle.

We use three measures to evaluate the effectiveness of the above approaches:

– The number of incorrect words produced. Although correct words are not the main target of stemming, an incorrect stem can have a completely different meaning and correspond to a wrong index cluster. This is particularly true when a core letter is removed from an Arabic word.
– The number of words that remain with an initial letter that could be a particle. This indicates how many possible particles remain after an approach is applied. In Arabic, the second character could possibly be a particle if the first character is a conjunction.
– The number of words actually changed. This shows the strength of each approach [5] by counting the stems that differ from the unstemmed words.

Using the assumption that a correct Arabic word with a particle should also be correct without that particle, we experimentally applied the MT, RC, and RW approaches to every word in our collection of unique correct words. The results are shown in Table 2.

Table 2.
Removing particles using current approaches

Approach   Incorrect words   Possible particles   Altered words
MT                   5,164               21,945         151,040
RC                     220               41,599         133,163
RW                     724              122,878          33,847

It can be seen that the MT approach produces a large number of incorrect words (3.39% of all correct words). The results also show that when the MT approach truncates the first letter as a particle, there is a chance that the second letter is also a particle. The words that still start with letters that could be particles constitute 14.39% of the total number of correct words. Manual examination of the stemmed list showed that many words have another particle that should be removed, and that many words have their first letter removed despite this letter not being a particle.

The RC approach produces fewer incorrect words. This is because no prefix removal is carried out when the remaining word is not found in the collection. The incorrect words we obtain are due to the collection itself containing many incorrect words. Approximately twice as many words still start with possible particles as with the first approach. This implies that the RC approach leaves the first letter of many words unchanged. This might be desirable, since these might be valid words that do not actually start with a particle. Indeed, manual examination of the result list revealed that many words with particles have been recognised, and particles have been removed correctly. However, the result list also contained a large proportion of words that still start with particles as their first letter.

The RW approach produces a smaller number of incorrect words than the first approach, but generates a very large number of words still starting with possible particles (80.55% of the list of correct words). Moreover, many words are left entirely unchanged.

To conclude, the first approach is too aggressive.
It affects Arabic words by removing their first letter, regardless of whether this letter is actually a particle. The second approach, while better at recognising particles in the text, leaves a considerable portion of words with real particles untouched. More importantly, in many cases a word is modified to one with a completely different meaning. The third approach leaves a large portion of words without removing particles at all, and only deals with a small subset of particles in the text. It also affects words that start with a combination of particles and other letters, especially proper nouns and foreign words such as the names of the Iraqi city of Fallujah and the US city of Baltimore. It is also very hard to recognise such combinations if they are preceded by another particle (a conjunction), as in the word meaning 'and by the land'.

3 New Approaches

Given the incomplete way in which particles have been dealt with in previous approaches, we have investigated techniques to identify and remove inseparable conjunctions and prepositions from core words in a principled manner. Our methods are based on removing particles using grammatical rules, aiming to decrease the number of incorrect words that are produced by the stemming process, and to increase the completeness of the process by reducing the number of words that still start with a particle after stemming.

A requirement for being able to recognise affixes in text is a good lexicon. We use the Microsoft Office 2003 Arabic lexicon; this contains more than 15,500,000 words covering mainly modern Arabic usage [9].

We introduce four rules, based on consideration of Arabic grammar, to identify particles in Arabic text. Let L be an Arabic lexicon, P be the set of prepositions {kaf, baa, lam}, C be the set of two conjunctions {waw, faa}, c be a letter in C, p be a letter in P, and w be any word in L. Then:

– Rule 1: Based on grammatical rules of the Arabic language, a correct Arabic word that is prefixed by a particle is also a correct word after that particle is removed. More formally:
  ∀(p + w) ∈ L ⇒ w ∈ L and ∀(c + w) ∈ L ⇒ w ∈ L
– Rule 2: Any correct Arabic word should be correct if prefixed by either conjunction, waw or faa:
  ∀w ∈ L ⇒ (c + w) ∈ L
– Rule 3: Based on the above two rules, any correct word with a particle prefix should be correct if we replace that prefix with waw or faa:
  ∀(p + w) ∈ L ⇒ (c + w) ∈ L
– Rule 4: Any correct Arabic word that is prefixed by a particle should not be correct if prefixed by the same particle twice, except for the particle lam, which can occur twice at the beginning of a word. Let p1 and p2 be two particles in (P ∪ C), with p1 = p2 ≠ lam; then:
  ∀(p1 + w) ∈ L ⇒ (p2 + p1 + w) ∉ L

Based on these rules, we define three new algorithms: Remove and Check in Lexicon (RCL); Replace and Remove (RR); and Replicate and Remove (RPR).

Due to the peculiarities of the letter lam, we deal with this letter as a common first step before applying any of our algorithms. Removing the particle lam from words that start with the combination lam-lam results in some incorrect words. We therefore deal with this prefix before we deal with the particle lam by itself. The lam-lam prefix is a result of adding the particle lam to one of the following:

– A noun that starts with the definite article. When the particle is added to a word whose first two letters are the definite article, the first letter of the article is usually replaced with the letter lam. For example, the word meaning 'the university' becomes the word meaning 'for the university'. However, if the letter following the definite article is also the letter lam, then the next case applies.
– A noun that starts with the letter lam. For example, the word meaning 'surname' or 'championship' is written with two initial lams when prefixed by the particle lam.
– A verb that starts with the letter lam. For example, the word meaning 'wrapped' is written with two initial lams when prefixed by the particle lam.
To stem this combination, we first check whether removing the prefix produces a correct word. If so, we remove the prefix. Otherwise, we try adding the letter before this word. If the new word is correct, we drop one lam from the original word. To remove the particle lam from words that originally start with the definite article, we replace the first lam with the first letter of the definite article and check whether the word exists in the lexicon. If so, we can stem the prefix without needing to check whether the remaining part is correct. If not, we remove the first letter and check whether we can drop the first lam. This algorithm is used before we start dealing with any other particles in the three following algorithms.

Remove and Check in Lexicon (RCL). In our first algorithm we start by checking the first letter of the word. If it is a possible particle (that is, it is a member of the set P or C), we remove it and check the remaining word in our dictionary. If the remainder is a valid word, the first letter is considered to be a particle, and is removed. Otherwise, the original word is returned unchanged. This approach differs from the RC approach in that we check the remaining word against a dictionary, rather than against all words occurring in the collection. We expect that this will allow us to better avoid invalid words.

Replace and Remove (RR). Our second algorithm is based on Rule 3. If the first letter of the word is a possible particle, we first test whether the remaining string appears in our dictionary. If it does, we replace the first letter of the original string with waw and faa in turn, and test whether the new string is also a valid word. If both of the new instances are correct, the evidence suggests that the original first letter was a particle, and it is removed, with the remainder of the string being returned. The string is returned unchanged if any of the new strings are incorrect.
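The RCL and RR descriptions above can be sketched as two small functions. This is a minimal illustration under stated assumptions, not the authors' implementation: the lexicon is a hypothetical toy set rather than the Microsoft Office lexicon, words are written with ASCII stand-ins in the style of Buckwalter transliteration (w = waw, f = faa, k = kaf, b = baa, l = lam), the function names `rcl` and `rr` are invented for the example, and the preliminary lam-lam step is omitted.

```python
# Hypothetical sketch of RCL and RR over a toy lexicon.
# ASCII stand-ins (an assumption of this sketch): w=waw, f=faa, k=kaf, b=baa, l=lam.
CONJUNCTIONS = {"w", "f"}          # the set C
PREPOSITIONS = {"k", "b", "l"}     # the set P
PARTICLES = CONJUNCTIONS | PREPOSITIONS

# Toy lexicon, illustrative only; a real lexicon would be far larger.
TOY_LEXICON = {"TAlb", "wTAlb", "fTAlb", "bTAlb", "qAr", "wqAr"}


def rcl(word: str, lexicon: set) -> str:
    """Remove and Check in Lexicon: strip a possible particle
    only if the remainder is itself a lexicon word."""
    if len(word) > 1 and word[0] in PARTICLES and word[1:] in lexicon:
        return word[1:]
    return word


def rr(word: str, lexicon: set) -> str:
    """Replace and Remove: strip a possible particle only if the remainder
    is a lexicon word AND replacing the first letter with waw and with faa
    both give lexicon words (Rule 3)."""
    if len(word) < 2 or word[0] not in PARTICLES:
        return word
    rest = word[1:]
    if rest not in lexicon:
        return word
    if all(conj + rest in lexicon for conj in ("w", "f")):
        return rest
    return word


if __name__ == "__main__":
    for w in ("wTAlb", "bTAlb", "wqAr", "ktb"):
        print(w, "->", rcl(w, TOY_LEXICON), "(RCL)", rr(w, TOY_LEXICON), "(RR)")
```

With this toy lexicon, `rcl` strips the initial w from `wqAr` because `qAr` is listed, whereas `rr` leaves it alone only because `fqAr` happens to be absent; a fuller lexicon could change that outcome.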
Manual examination of the output list of the RR algorithm shows some interesting trends. The algorithm achieves highly accurate particle recognition (few false positives). However, it sometimes fails to recognise that the first letter is a core letter of the word rather than a particle, because replacing the first letter with faa and waw will often produce valid new words. For example, consider the word meaning 'clever'. Applying the RR algorithm results in two valid words, both meaning 'and look after'. The first letter of the original word is therefore removed, giving the word meaning 'look after' instead of the original word meaning 'clever'.

Replicate and Remove (RPR). Our third algorithm performs two independent tests on a candidate string. First, the initial letter is removed, and the remaining word is checked against the dictionary. Second, based on Rule 4 above, the initial letter is duplicated, and the result is tested for correctness against the dictionary. If either test succeeds, the unchanged original word is returned (no stemming takes place).

We have noticed that if the word is a verb starting with baa or kaf, the first letter is removed whether or not it is a particle, since these are particles that cannot precede verbs. Duplicating them in verbs produces incorrect words, and causes the first letter of the original word to be removed. We can use the letter lam to recognise verbs that start with those particles. Accordingly, we add a new step where we add the letter lam to the word and check it for correctness. If the word is incorrect with the letter lam and also incorrect with the first letter replicated, then we conclude that the word is not a verb, and we remove the first letter.

For words starting with the letter lam, we add both baa and kaf instead of replicating the lam, since replication would result in a correct word and lead to the particle lam being preserved. If both new instances are incorrect, we remove the first lam.
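One possible reading of the RPR description can be sketched as follows. The interpretation is an assumption of this sketch: the first letter is removed only when the remainder is a lexicon word and none of the "probe" forms (the replicated letter, the lam-prefixed form for baa and kaf, or the baa- and kaf-prefixed forms for lam) is a lexicon word. The toy lexicon, the ASCII stand-ins for Arabic letters, and the name `rpr` are all hypothetical.

```python
# Hypothetical sketch of Replicate and Remove (RPR) over a toy lexicon.
# ASCII stand-ins (an assumption of this sketch): w=waw, f=faa, k=kaf, b=baa, l=lam.
WAW, FAA, KAF, BAA, LAM = "w", "f", "k", "b", "l"
PARTICLES = {WAW, FAA, KAF, BAA, LAM}

# Toy lexicon, illustrative only.
TOY_LEXICON = {"TAlb", "qAr", "wqAr", "wwqAr", "ktb", "tb", "lktb", "wld"}


def rpr(word: str, lexicon: set) -> str:
    """Strip the first letter only if the remainder is a lexicon word and
    every probe form built from the original word is NOT a lexicon word."""
    if len(word) < 2 or word[0] not in PARTICLES:
        return word
    first, rest = word[0], word[1:]
    if rest not in lexicon:
        return word                      # removal would give an invalid word
    if first == LAM:
        # Replicating lam can give a valid word, so probe with baa and kaf.
        probes = [BAA + word, KAF + word]
    else:
        probes = [first + word]          # Rule 4: replicate the first letter
        if first in (BAA, KAF):
            probes.append(LAM + word)    # extra verb check using lam
    if any(probe in lexicon for probe in probes):
        return word                      # evidence that the letter is a core letter
    return rest


if __name__ == "__main__":
    for w in ("wTAlb", "wqAr", "ktb", "lTAlb", "wld", "bTAlb"):
        print(w, "->", rpr(w, TOY_LEXICON))
```

In this toy setting `wqAr` is kept because its replicated form `wwqAr` is listed, and `ktb` is kept because the lam-prefixed form `lktb` is listed, while `wTAlb`, `bTAlb`, and `lTAlb` lose their first letter.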
The above algorithms may be applied repeatedly. In particular, if stemming a word starting with either waw or faa results in a new word of three or more characters that has either waw, kaf, baa, or lam as its first character, the particle removal operation is repeated.

3.1 Evaluation of Our Approaches

We have evaluated our new algorithms using the same data set described in Sect. 2.2. As seen from Table 3, all three algorithms result in a low number of incorrect words after stemming, with similar strength. However, they differ in the number of words with possible particles after stemming. The RPR approach leaves many words with possible particles (around 5,000 more than the RR approach and nearly 4,000 more than the RCL approach).

Table 3. Results of the new approaches, showing significantly fewer incorrect words, fewer possible particles, and comparable strength relative to the baseline in Table 2

Approach   Incorrect words   Possible particles   Altered words
RCL                     82               17,037         146,032
RR                      82               15,907         146,779
RPR                     82               20,869         142,082

Compared to the previous approaches for handling particles, our algorithms result in 82 incorrect words, compared to 5,164 using MT, 724 using RW, and 220 using RC. The number of words that start with possible particles has also dropped dramatically using both RCL and RR.

Using the RPR approach we extracted all words that have not been stemmed (words still having a first letter that is a possible particle). The list had 10,476 unique words. To check algorithm accuracy, we randomly selected and examined 250 of these. We found that only 12 words are left with particles that we believe should be stemmed; this indicates an accuracy of around 95%.

As stemming particles can result in correct but completely different words, we decided to pass the list we extracted using the RPR approach to the other approaches and check whether the stemmed words would be correct. We extracted the correctly stemmed words changed by each approach. Out of the 10,476 words, RR resulted in 4,864 new correct stems.
We noticed that about 90% of these are ambiguous, in that the first character could be interpreted either as a particle or as a core character of the stemmed word; the meaning is different in the two cases. For example, the word meaning 'film' could also mean 'and he collects' when the first letter is considered a particle. MT and RCL resulted in 3,950 similar stems, while RC resulted in 2,706 stems. Examples are shown in Table 4.

Table 4. Words with different meaning when stemmed by RPR and RR

Meaning of RPR stem   Meaning of RR stem
my ID card            my power
they missed it        it came to them
pillow                masters

Further glosses in the table: her coffin, her recipes, my mate, made his promise, her art, her characteristics.

RPR keeps any letter that is possibly a core part of the word, even though it might also be considered as a particle. In contrast, RR removes such letters. In most cases, keeping the letter appears to be the best choice.

4 Information Retrieval Evaluation

While the ability to stem particles into valid Arabic words is valuable for tasks such as machine translation, document summarisation, and information extraction, stemming is usually applied with the intention of increasing the effectiveness of an information retrieval system. We therefore evaluate our approaches in the context of an ad-hoc retrieval experiment.

We use a collection of 383,872 Arabic documents, mainly newswire stories published by Agence France Presse (AFP) between 1994 and 2000. This collection was used for information retrieval experiments in the TREC 2001 and TREC 2002 Arabic tracks [6,11]. Standard TREC queries and ground truth have been generated for this collection: 25 queries defined as part of TREC 2001, and 50 additional queries as part of TREC 2002. Both sets of queries have corresponding relevance judgements, indicating which documents are correct answers for which queries.
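Relevance judgements of this kind are the input to retrieval measures such as average precision, precision at 10 documents, and R-precision. The sketch below follows the standard textbook definitions of these measures; it is not the evaluation code used in the experiments, and the function names and document identifiers are hypothetical.

```python
# Hypothetical sketch of three standard retrieval measures.
from typing import Iterable, Sequence, Set, Tuple


def precision_at_k(ranking: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k


def r_precision(ranking: Sequence[str], relevant: Set[str]) -> float:
    """Precision at rank R, where R is the number of relevant documents."""
    r = len(relevant)
    return precision_at_k(ranking, relevant, r) if r else 0.0


def average_precision(ranking: Sequence[str], relevant: Set[str]) -> float:
    """Sum of the precision at each rank holding a relevant document,
    divided by the total number of relevant documents."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """Mean of average precision over (ranking, relevant set) pairs, one per query."""
    scores = [average_precision(ranking, relevant) for ranking, relevant in runs]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    ranking = ["d3", "d1", "d7", "d2", "d9"]   # system output for one query
    relevant = {"d1", "d2", "d5"}              # judged relevant documents
    print(average_precision(ranking, relevant))
    print(precision_at_k(ranking, relevant, 10))
    print(r_precision(ranking, relevant))
```

Note that `precision_at_k` divides by k even when fewer than k documents are returned, which is the usual convention for P10.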
As most stemmers in the literature start by using the RW approach and then proceed to stem prefixes, we decided to likewise not use this approach on its own, but instead use it in conjunction with other approaches.

To form our baseline collection, we preprocessed the TREC collection by first removing all stopwords, using the Larkey light9 stopword list (footnote 2). Then we removed all definite article combinations and ran each algorithm on this baseline collection. For retrieval evaluation, we used the public domain Zettair search engine developed at RMIT University (footnote 3). We evaluate retrieval performance based on three measures: mean average precision (MAP), precision at 10 documents (P10), and R-precision (RP) [6].

Footnote 2: http://www.lemurproject.org

Table 5. Performance of different approaches

             TREC 2001                   TREC 2002
Approach     MAP      P10      RP        MAP      P10      RP
Baseline     0.2400   0.5320   0.3015    0.2184   0.3200   0.2520
MT           0.2528   0.5400   0.3193    0.2405   0.3440   0.2683
RC           0.2382   0.5080   0.3037    0.2319   0.3360   0.2663
LarkeyPR     0.2368   0.4800   0.3102    0.2345   0.3280   0.2679
AlstemPR     0.2328   0.4800   0.2998    0.2194   0.3180   0.2582
BerkeleyPR   0.1953   0.4520   0.2460    0.2072   0.2680   0.2423
RCL          0.2387   0.5080   0.3041    0.2320   0.3360   0.2654
RPR          0.2586   0.5440   0.3246    0.2379   0.3420   0.2654
RR           0.2543   0.5320   0.3200    0.2394   0.3440   0.2681

Table 5 shows the results recorded for each approach. Both RC and RCL perform poorly: on the TREC 2001 queries they yield lower MAP and P10 than the baseline. In contrast, MT, RPR, and RR improve on the baseline for almost every measure; the one exception is RR on TREC 2001, whose P10 equals that of the baseline. The improvement for MT, RPR and RR is statistically significant for the TREC 2001 and TREC 2002 queries when using the t-test; this test has been demonstrated to be particularly suited to evaluation of IR experiments [12]. The RR and RPR approaches produce results comparable to previous prefix removal approaches.
By way of comparison, we also show the performance of the particle removal stages only of three well-known stemmers: the Larkey stemmer, the Alstem stemmer, and the Berkeley stemmer. Our approaches performed better for both the TREC 2001 and TREC 2002 query sets.

Footnote 3: http://www.seg.rmit.edu.au/zettair/

5 Conclusion

In this work, we have presented three new approaches for the stemming of prepositions and conjunctions in Arabic text. Using a well-known collection of Arabic newswire documents, we have demonstrated that our algorithms for removing these affixes offer two significant advantages over previous approaches while achieving information retrieval results that are comparable to previous work. First, our algorithms identify particles more consistently than previous approaches; and second, they retain a higher ratio of correct words after removing particles. In particular, we believe that by producing correct words as an output, our approach will be of benefit for machine translation, document summarisation, information extraction, and cross-language information retrieval applications. We plan to extend this work to handle suffixes in Arabic text.

Acknowledgements

We thank Microsoft Corporation for providing us with a copy of Microsoft Office Proofing Tools 2003.

References

1. I. A. Al-Sughaiyer and I. A. Al-Kharashi. Arabic morphological analysis techniques: A comprehensive survey. Journal of the American Society for Information Science and Technology, 55(3):189–213, 2004.
2. M. Aljlayl and O. Frieder. On Arabic search: improving the retrieval effectiveness via a light stemming approach. In Proceedings of the International Conference on Information and Knowledge Management, pages 340–347. ACM Press, 2002.
3. A. Chen and F. Gey. Building an Arabic stemmer for information retrieval. In Proceedings of the Eleventh Text REtrieval Conference (TREC 2002). National Institute of Standards and Technology, November 2002.
4. K. Darwish and D. W. Oard. Term selection for searching printed Arabic. In Proceedings of the ACM-SIGIR International Conference on Research and Development in Information Retrieval, pages 261–268. ACM Press, 2002.
5. W. B. Frakes and C. J. Fox. Strength and similarity of affix removal stemming algorithms. SIGIR Forum, 37(1):26–30, 2003.
6. F. C. Gey and D. W. Oard. The TREC-2001 cross-language information retrieval track: Searching Arabic using English, French or Arabic queries. In Proceedings of TREC10, Gaithersburg: NIST, 2001.
7. S. Khoja and R. Garside. Stemming Arabic text. Technical report, Computing Department, Lancaster University, Lancaster, September 1999.
8. L. S. Larkey, L. Ballesteros, and M. E. Connell. Improving stemming for Arabic information retrieval: light stemming and co-occurrence analysis. In Proceedings of the ACM-SIGIR International Conference on Research and Development in Information Retrieval, pages 275–282. ACM Press, 2002.
9. Microsoft Corporation. Arabic proofing tools in Office 2003, 2002. URL: http://www.microsoft.com/middleeast/arabicdev/office/office2003/Proofing.asp.
10. H. Moukdad. Lost in cyberspace: How do search engines handle Arabic queries. In Proceedings of CAIS/ACSI 2004 Access to Information: Skills, and Socio-political Context, June 2004.
11. D. W. Oard and F. C. Gey. The TREC-2002 Arabic/English CLIR track. In TREC, 2002.
12. M. A. Sanderson and J. Zobel. Information retrieval system evaluation: Effort, sensitivity, and reliability. In Proceedings of the ACM-SIGIR International Conference on Research and Development in Information Retrieval. ACM Press, 2005. To appear.
13. W. Wright. A Grammar of the Arabic Language, volume 1. Librairie du Liban, 1874. Third edition.
14. A. B. Yagoub. Mausooat Annaho wa Assarf. Dar Alilm Lilmalayn, 1988. Third reprint.