
Long Title: An Auto-Indexing System for Arabic Information Retrieval
Short Title: Auto-Indexing
Authors: R. A. Haraty (corresponding author), N. M. Mansour and W. Daher
Lebanese American University
P.O. Box 13-5053 Chouran
Beirut, Lebanon 1102 3801
Email: [email protected]
Telephone: 961 1 867621 ext. 1285
Fax: 961 1 867098

AN AUTO-INDEXING SYSTEM FOR ARABIC INFORMATION RETRIEVAL

Abstract

This work tackles the problem of auto-indexing Arabic documents. Auto-indexing text documents refers to using words found in a document to build an index automatically. These indexes, which are referred to as keywords, are then used to build subject headings that describe the topic of the document. We present an algorithm for extracting Arabic stem words. We also introduce a new technique to calculate the weight of a term relative to its container document. Traditionally, the weight of a term relied entirely on the rate of occurrence of that term. We propose also considering the word's spread within the document. In other words, a word that is concentrated in a specific part of a document is less likely to reflect its document than it would be if it were more spread throughout the document. This assumption is mathematically proven, and is illustrated by real examples.

Keywords: Arabic documents, document auto-indexing, stem words, and word spread.

1. Introduction

Manual indexing of text documents is considered a cumbersome task by people who work in the domain of information retrieval. The people who perform indexing for a newspaper, magazine, or any other information resource are well-trained specialists with a solid linguistic background. A solid background means that these people are fluent in the language, have a rich vocabulary, and, most importantly, are experts in matters that concern the grammar of the language. The people responsible for doing this job are called documenters, or 'موثقين' in Arabic.
The process of manual indexing requires immense human effort, since the documenter must read the whole document before selecting the candidate indexes for it. Indexing is of two types: Thesaurus-based indexing and Full-Text based indexing [1].

In Thesaurus-based indexing, the documenter may choose words to represent a document that do not even exist in the document, although their synonyms do. The documenter may choose the synonym of a word as an index when he/she knows in advance that users are more likely to search for that particular document using the synonym rather than the term itself. A synonym need not be the directly corresponding term in the dictionary. If, for example, a document is about the president of a country, then a valid index might be the name of that president, although his/her name might not occur at all in the document. Thesaurus-based indexing is difficult, though possible, to implement as an automated system, because human intervention is needed to select synonyms instead of terms that already exist in the document. One way of implementing a solution is to build a thesaurus file as part of the automated system; that file, in turn, has to be monitored and updated by an individual, so human intervention is needed again. However, some systems do build thesauruses intelligently.

Full-Text based indexing, on the other hand, is much easier in concept and much easier to implement. It relies entirely on terms, as well as phrases, within the document itself. Nothing is brought in from outside the document.

The difficulty of auto-indexing varies from one language to another. Languages with sophisticated grammatical rules, such as Arabic or Chinese, make the process of auto-indexing quite difficult. The practical solution is to implement an algorithm that covers most of the grammatical rules, since writing an algorithm that covers all rules is very difficult, if not impossible.
Additionally, the algorithm should be well designed and modularized in a manner that easily allows any missing grammatical rule to be plugged into it.

Whether Thesaurus-based indexing or Full-Text based indexing is used, the output is the same: a set of keywords. Indexes, when extracted from the documents, are referred to as "keywords"; this is the term documenters use to signify an index. Keywords, in turn, are used to build "subject headings". Usually, subject headings are phrases composed of more than one keyword. A single document may have as many subject headings as needed. The more subject headings a document is assigned, the more likely a user is to hit that document when searching for a topic. Composing subject headings is what documenters actually do, and they follow certain rules in order to build them. A subject heading is composed of the following fields:

• Name – The name of the person or the organization the document is about.
• Position – Social position, for example, "The President of Lebanon" or "President".
• Country/City/Town/…Place – for example, "Beirut, Lebanon".
• Activity – for example, "Meeting with the Prime Minister".

An example of a subject heading is the following: United Nations>UNICEF>Somalia>Famine>Donating Food and Medicine. Once subject headings are built, the overall search engine system searches within the subject headings rather than the whole text. Subject headings allow users to search within categories of topics.

In this paper we present an algorithm for extracting Arabic stem words to build an index for Arabic text documents. We also introduce a new technique to calculate the weight of a term relative to its container document.

The rest of the paper is organized as follows: Section 2 provides an overall description of the model used for auto-indexing. Section 3 focuses on the second layer of the proposed model.
Section 4 describes how the weight of a certain term is calculated. Section 5 presents the index selection mechanism, and Section 6 contains the conclusion. Appendix A presents the verification of the weight calculation formulae. Appendix B presents an example using the weight calculation formula. Appendix C presents an example of our proposed auto-indexing techniques.

2. Four Layer Model

The model consists of four layers, as shown in Figure 1. Each layer is a module that is implemented separately and is independent of the other modules. The layers do exchange information, however: the output of one layer is the input of the layer above it. Notice in the figure that the second layer, where words are stemmed to their original form, is drawn apart from the others. This illustrates that the module responsible for stemming Arabic words can easily be unplugged from the system and replaced by another module that stems words in any other language, while everything else keeps working unchanged. This separation of tasks gives the system high flexibility, in the sense that only one part of the system, and not the whole system, needs to be replaced in order to auto-index documents in other languages.

Figure 1. The four-layer model. From bottom to top: (1) read whole document; (2) apply algorithm to extract stem words for Arabic language; (3) perform weight calculations; (4) select appropriate words/phrases.

The first layer is merely concerned with reading each word and inserting it into an array called Document. The Document array consists only of words as they are: no stemming is performed, no weight is calculated, and no words are omitted. Once the text file has been read and loaded into memory as an array of words, the real work starts. The second step is to go over the whole array, scan it word by word, and apply the stemming algorithm (see Figure 2). Each word is checked individually. Words that belong to the "stop-list terms"¹ are omitted. Phrases that belong to the "stop-list phrases" are omitted as well.
Like stop-list terms, stop-list phrases are sentences that occur within a document, yet do not contribute to the meaning of the document. For instance, a document may contain the phrase "Ladies and Gentlemen...", yet the document probably tells nothing about "Ladies" or "Gentlemen". If, however, the word is a candidate to be a valid index, then the word is stemmed and returned to its original word, which most probably has three letters. The details of the stemming algorithm are given in the next section.

The output of this module is two sets of words (or two arrays of strings). The first is called the "Words" array. It is an array of records, each having four fields: (1) the word itself, (2) the stem word, (3) the count of the word, and (4) the weight of the word. At this stage, the weight field is left undefined. It is updated in the third layer, when all criteria for weight calculation are available. The second array is called the Stem_Word array. It is an array of records consisting of five fields: (1) the stem word, (2) the count, (3) the ideal distance or id, (4) the average ideal distance or aid, and (5) the average distance or ad.

¹ Stop-list terms are often referred to as "noise". The list consists of all words that do not contribute to the meaning of a sentence, yet help in forming a proper sentence. Such words constitute a major part of any document.

Figure 2. Flow chart of the auto-indexing algorithm.

2.1 General Algorithm

Figure 3 gives a high-level algorithm that outlines the layers of the model. Note that the algorithm contains new terms that are explained in the following sections.

  // Layer 1 starts here
  1. Read the whole document and put its words into an array (ArrayName = Document)
     Assign each term a distance value that is auto-incremented by one
  2. Set N = count of words in document

  // Layer 2 starts here
  3. If EndOfArray Then Goto 6
     Else Read word from array
  4. If Word belongs to stop-list terms/phrases Then
        Read next word
        Set PrecededWord = Document.CurrentWord
        Goto 3
  // For each word in the Document array do the following block
  5. Set ThisWord = Document.CurrentWord
     If ThisWord exists in WordArray Then
        Increment its Count by 1
     Else
        Insert ThisWord into WordArray; Set its Count = 1
     Set StemWord = ExtractStemWord(PrecededWord, ThisWord)
     If StemWord exists in StemWordArray Then
        Increment its Count by 1
        // Accumulate the sum of distances, so that the sum (ad)
        // can be divided by the count when done
        Set ad = ad + distance
     Else
        Insert StemWord into StemWordArray; Set its Count = 1
        Set average distance: ad = distance
     Set PrecededWord = ThisWord
     Goto 3

  // Layer 3 starts here – calculate weights
  6. For each Word in StemWordArray
        Set average distance: ad = ad / Count
        Set ideal distance: id = N / (Count + 1)
        Set average ideal distance: aid = N / 2     // as derived in Section 4.2
        Read next word from array
  // Now it is time to assign the weight to each word in WordArray
  7. For each Word in WordArray
        Find the matching node in StemWordArray using binary search
        // g is the difference between the average of the ideal distances
        // and the average of the real distances
        Set the gap variable: g = aid – ad
        Set F(g) = 1 + (N – 1) / 2^|g|
        Set Weight = Count × CountStemWord × F(g)
        Read next word

  // Layer 4 starts here
  8. Select the words with the highest weights as valid indexes for the document.

Figure 3. Outline of the layers of the auto-indexing model.

3. Stem Word Extraction

"Every language whether natural or artificial, is characterized by its vocabulary, its syntax, its logical structure, and its domain" [2]. After the whole document has been read, it is analyzed word by word: stop-list terms, which compose a major part of the document, are discarded; the remaining terms are identified as nouns or verbs; and the appropriate stemming techniques are applied to each term. Stemming a word to its root term is an important stage that a document has to undergo during auto-indexing [3].
Suppose that a certain noun occurs once in the singular form and once in the plural form. Moreover, suppose that the same noun occurs once as an adjective, once as a subject, and once as an object. The same term is appearing in different suits, yet with the same body. If the auto-indexing algorithm did not perform word stemming, it would treat each form of this noun as a different and independent term. That is not a correct way to auto-index documents, since a noun remains the same noun no matter what form it appears in.

3.1 The Rhyming Algorithm

The Rhyming algorithm is a basic function for performing word stemming. It is not the core of the word-stemming algorithm, but rather an essential utility for the stemming task. The Rhyming algorithm is used to decide whether a certain word is a noun or a verb, whether a noun is in its singular or plural form, whether a verb is in the past, present, or future tense, whether certain pronouns are attached to a word, and so on. When the Rhyming algorithm is applied to a word, that word is compared against a special set of rhythms. The set of rhythms changes according to the module calling the Rhyming algorithm. For example, the set of rhythms used to decide whether a noun is in its singular or plural form is different from the set used to determine the attached prefix/suffix pronouns. Note that all words are rhymed with the derivations of the word 'فعل'. The rhythms of the verb 'فعل' are used as a standard in all books that teach Arabic grammar.
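As a concrete illustration of what rhyming against a derivation of 'فعل' means, the following is a minimal Python sketch, not the authors' implementation. It assumes undiacritized input strings; the function name `rhymes` and the sample words are chosen here for illustration only. The three template letters ف, ع and ل act as wildcards, and every other letter of the rhythm must match the word exactly.

```python
# Letters of the template verb 'فعل'; inside a rhythm they act as wildcards.
TEMPLATE_LETTERS = {"ف", "ع", "ل"}


def rhymes(rhythm: str, word: str) -> bool:
    """Return True if `word` has the same shape as `rhythm`.

    Assumption: both strings are plain letters without diacritics.
    """
    if len(rhythm) != len(word):
        return False  # words of different length never rhyme
    for r, w in zip(rhythm, word):
        # Template letters stand for "any root letter"; all other
        # letters (the added ones) must be identical in both strings.
        if r not in TEMPLATE_LETTERS and r != w:
            return False
    return True


print(rhymes("يفعل", "يكتب"))    # True: same length, leading 'ي' matches
print(rhymes("فاعل", "كاتب"))    # True: 'ا' in second position matches
print(rhymes("فاعل", "مكتب"))    # False: second letter is not 'ا'
print(rhymes("يفعل", "يكتبون"))  # False: lengths differ
```

One limitation of this sketch is that a rhythm whose added letters happen to be ف, ع or ل would be treated too permissively, since those letters are always taken as wildcards.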
The Rhyming algorithm goes as follows:

  Boolean RhymeWords(Rhythm, Word)
  {
    If Length(Rhythm) <> Length(Word) Then
      Return False;               // Words do not rhyme
    Else
    {
      Len = Length(Rhythm);       // Or Length(Word)
      // Now compare Rhythm and Word letter by letter
      i = 0;
      WordsRhyme = True;
      While (i < Len) && (WordsRhyme)
      {
        // Ignore the letters of the word 'فعل' while rhyming
        If Not (Rhythm(i) In ['ف', 'ع', 'ل']) Then
          WordsRhyme = (Rhythm(i) == Word(i));
        i++;
      } // While
    }
    Return WordsRhyme;
  }

3.2 Analyzing a Word

Recall that the output of the stem-word extraction module is two sets of words². The first is the list of words that are candidate indexes, and the second contains the corresponding stem words. Each stem word in the latter set may have one or more corresponding words in the former set. Two processes precede word stemming, and each word in the document has to be read and analyzed separately. First, after reading a word, the algorithm checks whether the word is for use or for disposal. The check is simple: if the word is a stop-list term, then it is for disposal; otherwise, it is for use. Second, the algorithm decides whether the word under consideration is a noun or a verb. Based on the result, the appropriate stemming techniques are used.

3.2.1 Stop-list Terms and Phrases

Stop-list terms are excluded from the candidate set of indexes since they do not contribute to the meaning of the document [4]. Stop-list terms are categorized according to their type. Categorizing them helps to a great extent in determining whether the following word is a noun or a verb, so that the appropriate stemming rules can be applied. Like stop-list terms, stop-list phrases do not contribute to the meaning of the document. However, stop-list phrases give no indication of the type of the following word.
Examples of stop-list phrases are many: 'السلام عليكم و رحمة الله', 'تحية طيبة و بعد', 'شكرا لتعاونكم', etc. Stop-list phrases are detected by comparing the first word of the phrase with a set of words that holds the starting words of all stop-list phrases. If a matching word is detected, then the rest of the phrase is compared with the set of stop-list phrases that begin with the same word.

² Programmatically speaking, these sets of words are arrays of words or strings.

3.2.2 Identifying Verbs and Nouns

Many information retrieval systems perform natural language processing in order to auto-index certain components. This strategy relies on acquiring lexical, syntactic, and semantic information from the component³. Following this strategy requires the algorithm to cater for almost all the grammatical rules of the language under consideration [5] [6]. As a result, it is quite difficult to do natural language processing for languages with sophisticated grammatical rules, such as Arabic.

Our algorithm decides whether a word is a noun or a verb by examining two clues. The first clue is the word preceding the word in question, especially when that preceding word is a stop-list term. Some stop-list terms precede nouns only; others precede verbs only. For example, the stop-list terms that fall under the category 'إسم موصول' precede only verbs. The same applies to the categories 'أدوات النصب' and 'أدوات الجزم'. On the other hand, stop-list terms categorized as 'أحرف الجر' precede only nouns. The second clue is the rhythm of the word itself. If, for example, a word rhymes with 'يفعل' or 'إفعل', then it is a verb. If, on the other hand, it rhymes with 'فاعل' or 'مفعول', then it is most likely a noun. Other clues might be the attached pronouns: some pronouns are attached to verbs only, others to nouns. The stemming techniques for nouns and those for verbs are applied separately.
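The two clues can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the marker sets below are small samples of each stop-list category picked for the example (the real lists are much longer), the function name `guess_word_type` is hypothetical, and the preceding-word clue is simply given priority over the rhythm clue.

```python
from typing import Optional

# Small illustrative samples of each stop-list category (assumption:
# the real system uses complete lists).
VERB_MARKERS = {
    "أن", "لن", "كي",         # أدوات النصب
    "لم",                      # أدوات الجزم
    "الذي", "التي", "الذين",   # إسم موصول
}
NOUN_MARKERS = {"من", "إلى", "عن", "على", "في"}  # أحرف الجر

VERB_RHYTHMS = ("يفعل", "إفعل")
NOUN_RHYTHMS = ("فاعل", "مفعول")
TEMPLATE_LETTERS = {"ف", "ع", "ل"}


def rhymes(rhythm: str, word: str) -> bool:
    """Shape comparison where the letters of 'فعل' are wildcards."""
    return len(rhythm) == len(word) and all(
        r in TEMPLATE_LETTERS or r == w for r, w in zip(rhythm, word)
    )


def guess_word_type(preceding: Optional[str], word: str) -> str:
    """Return 'verb', 'noun' or 'unknown' for `word`."""
    # Clue 1: the category of the preceding stop-list term.
    if preceding in VERB_MARKERS:
        return "verb"
    if preceding in NOUN_MARKERS:
        return "noun"
    # Clue 2: the rhythm of the word itself.
    if any(rhymes(r, word) for r in VERB_RHYTHMS):
        return "verb"
    if any(rhymes(r, word) for r in NOUN_RHYTHMS):
        return "noun"
    return "unknown"


print(guess_word_type("لم", "يذهب"))   # verb (preceded by a jussive particle)
print(guess_word_type("في", "البيت"))  # noun (preceded by a preposition)
print(guess_word_type(None, "مكتوب"))  # noun (rhymes with 'مفعول')
print(guess_word_type(None, "كتاب"))   # unknown (no clue applies)
```

Returning "unknown" rather than forcing a decision leaves room for both sets of stemming techniques to be tried on the word.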
If either or both sets of stemming techniques (those for nouns and those for verbs) succeed in reducing the word to its three-letter form, then that form is taken as the stem of the word under consideration. The algorithm for deciding the type of the word is presented below and can be amended later for enhancements:

  WordType DecideVerbOrNoun(PrecededWord, Word)
  {
    If PrecededWord belongs to 'أدوات النصب' or 'أدوات الجزم' Then
      Return Verb;
    Else If PrecededWord is 'إسم موصول' Then
      Return Verb;
    Else If Word rhymes with 'فعل' Then
      Return Verb;
    Else If PrecededWord is Verb Then
      Return Noun;                // Two verbs cannot precede each other
    Else If PrecededWord is 'حرف جر' Then
      Return Noun;
    Else If Word has one of the following prefixes attached: 'ال', 'بال', 'كال', 'فال' Then
      Return Noun;
    Else If Word rhymes with 'فاعل' or 'مفعول' Then
      Return Noun;
    Else
      Return Unknown;
  }

³ A component could be a document, an image, audio information, etc.

3.3 Extracting Stem Words from a Verb

Recall that the whole idea behind knowing the type of a word is to know which stemming techniques ought to be used. A successful algorithm for extracting stem words from verbs is one that returns any verb to its original three-letter form.

3.3.1 Checking Attached Prefix/Suffix Pronouns

The first stemming technique applied is to check whether a word contains attached pronouns. Pronouns in the Arabic language come in two forms: attached pronouns ('ضمائر متصلة') and discrete pronouns ('ضمائر منفصلة'). The discrete pronouns are considered stop-list terms, and thus they are ignored by the algorithm. The attached pronouns, however, are part of the word itself. Hence, the algorithm has to spot and identify them in order to separate them from the verb. Attached pronouns come at the beginning of the word, at the end of the word, or at both ends. The list of all attached pronouns is a finite, well-defined set.
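Because the set of attached pronouns is finite, removing suffix pronouns can be sketched as plain suffix matching. The Python sketch below is an illustration, not the authors' code: the suffix list is read off the verb forms in the suffix-pronoun tables of this section, the function name is hypothetical, and two assumptions are added here, namely that longer suffixes are tried first and that a suffix is never removed if fewer than three letters would remain.

```python
# Suffix pronouns as they appear at the end of the verb forms in the
# suffix-pronoun tables; longest suffixes are tried first (assumption).
SUFFIX_PRONOUNS = sorted(
    ["ه", "ك", "هما", "كما", "هم", "كم", "هن", "كن",
     "ا", "وا", "ت", "تا", "نا", "ن", "تما", "تم", "تن"],
    key=len,
    reverse=True,
)


def strip_suffix_pronouns(verb: str, max_passes: int = 2, min_len: int = 3) -> str:
    """Remove up to `max_passes` attached suffix pronouns from `verb`.

    Two passes are enough because a verb carries at most a combination
    of two suffix pronouns. `min_len` is a guard added for this sketch
    so that root letters are not stripped by mistake.
    """
    for _ in range(max_passes):
        for suffix in SUFFIX_PRONOUNS:
            if verb.endswith(suffix) and len(verb) - len(suffix) >= min_len:
                verb = verb[: -len(suffix)]
                break
        else:
            break  # no suffix matched in this pass, so stop early
    return verb


print(strip_suffix_pronouns("جعلناه"))   # جعل  (removes 'ه', then 'نا')
print(strip_suffix_pronouns("أطعمتهم"))  # أطعم (removes 'هم', then 'ت')
print(strip_suffix_pronouns("كتب"))      # كتب  (nothing to remove)
```

This naive matching can over-strip a verb whose root happens to end in one of these letters, which is one reason the later rhyming steps are still needed.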
The algorithm loops over the whole set of attached pronouns and performs pattern matching in order to check for the existence of any attached pronoun. If it matches a pronoun, it removes it and returns the verb stripped of all suffix/prefix pronouns. The following tables contain all possible attached pronouns that exist in the Arabic language. The first table lists the four prefix pronouns, whereas the second and third tables list all suffix pronouns and their possible combinations [7].

Table 1 – List of prefix pronouns.

| Prefix pronoun | Examples |
| الألف – للمتكلم | أفعل / سأفعل |
| الياء – للغائب المذكر | يفعل / سيفعل |
| النون – لجمع المتكلم | نفعل / سنفعل |
| التاء – للغائب المؤنث | تفعل / ستفعل |

Table 2 – List of suffix pronouns.

| Suffix pronoun | Examples |
| الألف – للمثنى الغائب | فعلا |
| الواو و الألف – لجمع المذكر الغائب | فعلوا |
| التاء – للمؤنث الغائب | فعلت |
| التاء و الألف – للمؤنث المثنى الغائب | فعلتا |
| النون و الألف – نون الجمع المتكلم | فعلنا |
| النون – نون النسوة | فعلن |
| التاء – تاء مفرد المتكلم | فعلت |
| للمخاطب | فعلت / ـك / ـه |
| للمخاطب المثنى | فعلتما / ـكما / ـهما |
| للمخاطب جمع الذكر | فعلتم / ـكم / ـهم |
| للمخاطب جمع المؤنث | فعلتن / ـكن / ـهن |

In addition to the pronouns listed in Table 2, things get more complicated when combinations of these pronouns occur together. Table 3, which is an extension of Table 2, has three columns. The first is a list of possible attached pronouns. The second is also a list of attached pronouns, each of which can precede the pronouns in the first column, so that the result is a combination of more than one attached pronoun. The third column contains only two examples per row out of the 7 x 4 x 2 = 56 possible examples [7].

Table 3 – List of combinations of suffix pronouns.

| Attached pronoun | Pronoun that can precede it | Examples |
| ـه / ـك | فعلا | جعلناه – رأيتك |
| ـهما / ـكما | فعلوا | فعلاهما – علموكما |
| ـهم / ـكم | فعلت | أطعمتهم – داعبتكم |
| ـهن / ـكن | فعلتا | جعلناهن – أحببتكن |
| | فعلن | |
| | فعلت | |
| | فعلنا | |

3.3.2 Checking Verb against Common Five Verbs

The common five verbs are known as 'الأفعال الخمسة'. These verbs come in a special form and have special properties: they always come in the present tense, and they always end with the letter 'ن'. If, however, these verbs are preceded by either 'أدوات نصب' or 'أدوات جزم', then the letter 'ن' must be removed [8]. The five verbs are listed in Table 4 with and without the letter 'ن' at the end of the word.

Table 4 – The common five verbs (الأفعال الخمسة).

| الأفعال الخمسة - الصيغة الأولى (first form) | الأفعال الخمسة - الصيغة الثانية (second form) | الشرح (usage) |
| يفعلون | يفعلوا | تستعمل مع جمع المذكر الغائب |
| تفعلون | تفعلوا | تستعمل مع جمع المذكر المخاطب |
| يفعلان | يفعلا | تستعمل مع المثنى الغائب |
| تفعلان | تفعلا | تستعمل مع المثنى المخاطب |
| تفعلين | تفعلي | تستعمل مع المؤنث المفرد المخاطب |

Unlike the case of attached pronouns, the algorithm does not perform pattern matching to detect whether a verb belongs to the five common verbs. Instead, it rhymes the verb against each of the ten rhythms listed in Table 4. If the words rhyme, then the letters shown in red in the original table (the letters added around the root) are discarded, and the stem word is the word composed of the black letters only.

3.3.3 Checking Verb against the "10-Verb-Additions"

In the Arabic language, every verb is based on an original verb of only three letters. Verbs consisting of more than three letters are merely derivations of their original three-letter verb. The derivations of any verb occur in ten different formats. Three of these formats are obtained by adding a single letter to the original verb, five are obtained by adding two letters, and the other two are obtained by adding three letters. These ten formats, also called derivations, are known in Arabic grammar as 'الزيادات العشرة' [9].
The derivations, together with an example of each, are presented in Table 5.

Table 5 – List of the ten derivations (الزيادات العشرة).

| الزيادات (derivation) | مثال (example) | أصل الفعل (original verb) |
| أفعل | أضرم النيران | ضرم |
| فعّل⁴ | سرّع البحث | سرع |
| فاعل | قاتل الأعداء | قتل |
| تفعّل | تسبّب في وفاته | سبب |
| تفاعل | تعاطف مع صديقه | عطف |
| إنفعل | إنهزم الأعداء | هزم |
| إفتعل | إقترف خطاً فادحاً | قرف |
| إفعلّ | إزهرّ الورد | زهر |
| إفعوعل | إغرورقت عيناه | غرق |
| إستفعل | إستخرج النفط | خرج |

⁴ The stressing character, known as 'شدة', is considered a letter by itself in the Arabic language.

As with the five common verbs, the algorithm detects one of the ten derivations by rhyming the verb against all ten rhythms listed in the table above. If the algorithm detects that a verb is in the form of one of those derivations, it extracts the stem word by removing the added letters (colored red in the original table).

3.4 Extracting Stem Words from a Noun

Extracting a stem word from a noun is a more complicated process than stemming a verb. The difficulty results from many factors. One is that a noun may appear in the singular, dual, or plural form. Additionally, each of these three forms differs depending on whether the noun is masculine or feminine. Furthermore, there are many exceptions for the dual and plural forms. Besides, a noun may have many derivations that follow no specific format. In summary, stemming a noun is not as easy as stemming a verb. The process of extracting a stem word from a noun is not described in this paper for brevity. Interested readers, however, can refer to [10] for more details.

3.5 Algorithm for Arabic Word Extraction

The algorithm contains the WordBelongsToList function, which accepts two parameters: the first is an Arabic word, and the second is a list of words (an array of strings) that holds a certain set of rhythms.
For instance, such a set of rhythms could be the list of the common nouns. The function searches the passed list of rhythms for the rhythm that corresponds to the passed word. If it finds one, that rhythm is the return value of the function. Otherwise, the function returns an empty string to indicate that the word does not rhyme with any item in the list.

The algorithm also uses the GetStemWord function. As its name indicates, GetStemWord gets the stem word from the initial word. It accepts two parameters: the first is the word that has to be stemmed, and the second is the rhythm that was returned by WordBelongsToList. According to the second parameter, the function performs the appropriate stemming and returns the stem word. The algorithm for Arabic word extraction follows:

  String ExtractStemWord (Word)
  {
    // wordType holds the type of the word: noun, verb, or unknown
    WordType wordType;
    String v_StemWord, n_StemWord, temp_StemWord;
    String wordRhythm;
    Char firstChar;

    // WordIsStopListTerm returns false if the word is not a stop-list term.
    // Otherwise it returns true, in addition to the type of the word that
    // follows (since stop-list terms usually indicate the type of the
    // following word)
    If WordIsStopListTerm (Word, wordType) Then
      return;

    If wordType == Unknown Then
      // Try to guess the type of the word
      wordType = DecideWordType (Word);

    If (wordType == Verb Or wordType == Unknown) Then
    {
      v_StemWord = Word;

      // The following loop iterates at most twice, since a verb may have
      // none, one, or a combination of two attached pronouns
      While ThereAreSuffixPronouns (v_StemWord)
        v_StemWord = RemoveSuffixPronouns (v_StemWord);

      // Now check if the verb is in the future tense
      If (v_StemWord starts with letter 'س') And
         (second letter is in ['ت', 'ي', 'ن', 'أ']) Then
        v_StemWord = RemovePrefixPronouns (v_StemWord);

      // If the length of v_StemWord <= 3 then we are done
      If Length (v_StemWord) <= 3 Then return v_StemWord;

      // Check if the verb is one of the common five verbs
      wordRhythm = WordBelongsToList (v_StemWord, lstFiveVerbs[]);
      If (wordRhythm != "") Then
        v_StemWord = GetStemWord (v_StemWord, wordRhythm);
      If Length (v_StemWord) <= 3 Then return v_StemWord;

      Word_is_ten_derivations:    // a label that is referenced by a goto statement
      // Check if the verb is one of the ten derivations
      wordRhythm = WordBelongsToList (v_StemWord, lstTenDerivs[]);
      If (wordRhythm != "") Then
        v_StemWord = GetStemWord (v_StemWord, wordRhythm);
      If Length (v_StemWord) <= 3 Then return v_StemWord;

      // Check if the verb is one of the ten derivations but in the
      // present tense
      If (FirstChar (v_StemWord) in ['ت', 'ي', 'ن', 'أ']) Then
      {
        temp_StemWord = v_StemWord;
        v_StemWord = v_StemWord - FirstChar (v_StemWord);
        goto Word_is_ten_derivations;
        v_StemWord = temp_StemWord;
      }
    } // End of "Word is verb" block

    // A plain If (not Else If) is used here so that a word of unknown
    // type goes through both the verb and the noun techniques
    If (wordType == Noun Or wordType == Unknown) Then
    {
      // If they exist, remove prefixes and suffixes
      n_StemWord = RemovePrefixes (Word);
      n_StemWord = RemoveSuffixes (n_StemWord);

      If n_StemWord is in its regular plural form Then
        // Get the stem word by performing pattern matching
        n_StemWord = GetStemWord (n_StemWord);
      Else
        // The word might be in its irregular plural form
        wordRhythm = WordBelongsToList (n_StemWord, lstPluralIrregularNouns[]);
        If (wordRhythm != "") Then
          n_StemWord = GetStemWord (n_StemWord, wordRhythm);

      // At this point, the noun is in its singular form
      // Check if the word is one of the five common nouns
      wordRhythm = WordBelongsToList (n_StemWord, lstFiveCommonNouns[]);
      If (wordRhythm != "") Then
        n_StemWord = GetStemWord (n_StemWord, wordRhythm);
      Else
      {
        // Check if the word belongs to the M, T, or Miscellaneous derivations
        firstChar = GetFirstChar (n_StemWord);
        Switch (firstChar)
        {
          Case is = 'م':
            wordRhythm = WordBelongsToList (n_StemWord, lst_M_Derivations[]);
            If (wordRhythm != "") Then
              n_StemWord = GetStemWord (n_StemWord, wordRhythm);
          Case is = 'ت':
            wordRhythm = WordBelongsToList (n_StemWord, lst_T_Derivations[]);
            If (wordRhythm != "") Then
              n_StemWord = GetStemWord (n_StemWord, wordRhythm);
          Case Else:
            wordRhythm = WordBelongsToList (n_StemWord, lst_Misc_Derivations[]);
            If (wordRhythm != "") Then
              n_StemWord = GetStemWord (n_StemWord, wordRhythm);
        }
      }
    } // End of "Word is noun" block

    // Supposedly, at this point the word has undergone the stemming
    // process and is in its stem form.
    // Now return the stem word according to whether the word is a
    // noun, a verb, or unknown
    Switch (wordType)
    {
      Case is = Noun
        return n_StemWord;
      Case is = Verb
        return v_StemWord;
      Case is = Unknown
        If (v_StemWord == n_StemWord) Then
          return v_StemWord;      // Or n_StemWord; both are the same
        Else If Length (v_StemWord) == 3 Then
          return v_StemWord;
        Else If Length (n_StemWord) == 3 Then
          return n_StemWord;
        Else If (Length (v_StemWord) < Length (n_StemWord)) Then
          return v_StemWord;
        Else
          return n_StemWord;
    } // End of Switch block
  } // End of function

4. Weight Calculation

After the word stemming stage is over, it is time to calculate the weights of these words relative to their document [11].
The weight of a term relies basically on three factors.

4.1 Factors Affecting Weight

Three factors affect the significance of a certain word to a document. The first factor is obviously the count of that term in its container document [4][12][13][14]. The second factor is the count of the stem word of that word. The third factor, which is our contribution to the field of auto-indexing, is the spread of that word over the document. It rests on the assumption that a word concentrated in a specific part of a document is less likely to reflect its document than it would be if it were more spread throughout that document.

The weight of a term is directly proportional to its count [4] [12] as well as to the count of its stem word. As either count increases, the weight should increase correspondingly. What is still missing is the factor that determines how much the term is spread within the document. This factor should increase as the term spreads evenly among all parts of the document. Likewise, it should decrease as the term concentrates in a certain part of the document.

4.2 Formula Verification

Based on what has been stated above, the weight of a certain word with respect to its document becomes:

weight = Word_count × Stem_word_count × Spread_factor

The count of a word, as well as that of its stem word, can be obtained by simply counting the occurrences of each in the document. However, the calculation of the spread factor is not as easy. Consider the following terms, which are used in the formulas for weight calculation:

• N: count of all terms in the document.
• m: count of a certain word in the document.
• sm: count of the stem word of a certain word in the document.
• f: a factor that indicates how much a word is spread within the document (remember: the more a term is spread, the larger its factor becomes).
Therefore, the weight w of a word becomes:

$$w = m \times \mathit{sm} \times f$$

The next step is to find a formula for f such that it increases as the term spreads over the document, and decreases as the term concentrates in a specific section. Consider the following terms for spread calculation:

• d (distance): the distance of a term is simply its position in the document [15]. In other words, it is the count of words up to and including it. For example, the distance of the very first term in the document is 1, and the distance of the very last term is N.

• ad (average distance): the average of all distances for a stem word,

$$\mathit{ad} = \frac{\sum_{i=1}^{\mathit{sm}} d_i}{\mathit{sm}}$$

where $d_i$ is the distance of the i-th occurrence.

• id (ideal distance): the ideal distance between every two consecutive occurrences of a stem word. The ideal distance should be the same between every two occurrences; if every such distance equals the ideal distance, the term is perfectly spread over the document. The ideal distance for a specific term is:

$$\mathit{id} = \frac{N}{\mathit{sm} + 1}$$

• aid (average ideal distance): the average of all ideal distances for a stem word, where the i-th ideal position is $i \times \mathit{id}$:

$$\mathit{aid} = \frac{\sum_{i=1}^{\mathit{sm}} i \times \mathit{id}}{\mathit{sm}}$$

Notice, however, that $\sum_{i=1}^{\mathit{sm}} i = \frac{\mathit{sm}(\mathit{sm}+1)}{2}$ and $\mathit{id} = \frac{N}{\mathit{sm}+1}$. Thus:

$$\mathit{aid} = \frac{1}{\mathit{sm}} \times \frac{\mathit{sm}(\mathit{sm}+1)}{2} \times \frac{N}{\mathit{sm}+1} = \frac{N}{2}$$

Notice that aid is independent of sm. As a result, all stem words have the same average ideal distance, which depends only on the number of terms in the document. Unlike the previously defined terms, aid is an attribute of the document rather than of the stem word: the distance, the ideal distance, and the average distance vary between stem words, whereas aid remains constant.

• g (gap): the difference between aid and ad,

$$g = \mathit{aid} - \mathit{ad}$$

(notice that g may have a positive or a negative value).
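The quantities defined above can be computed directly from the positions at which a stem occurs. The following Python sketch is illustrative only: the function name `spread_gap` and its parameters are assumptions made for this example and are not part of the original system, and positions are assumed to be 1-based.

```python
def spread_gap(positions, n_terms):
    """Return (ad, id, aid, g) for one stem word.

    positions: 1-based positions (distances) of every occurrence of the stem.
    n_terms:   N, the total number of terms in the document.
    All names here are hypothetical; this is a sketch of the definitions only.
    """
    if not positions:
        raise ValueError("the stem must occur at least once")
    sm = len(positions)                      # count of stem words
    ad = sum(positions) / sm                 # average distance
    ideal = n_terms / (sm + 1)               # ideal distance id = N / (sm + 1)
    # Average of the ideal positions id, 2*id, ..., sm*id (equals N / 2).
    aid = sum(i * ideal for i in range(1, sm + 1)) / sm
    g = aid - ad                             # gap, may be positive or negative
    return ad, ideal, aid, g


if __name__ == "__main__":
    # Five occurrences spaced exactly one ideal distance apart in 120 terms.
    print(spread_gap([20, 40, 60, 80, 100], 120))
    # Five occurrences packed at the very start of the same document.
    print(spread_gap([1, 2, 3, 4, 5], 120))
```

In the first call the occurrences sit exactly at the ideal positions, so ad equals aid and the gap is zero; in the second call the occurrences are concentrated at the start of the document and the gap is large.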
4.3 Weight Calculation

A value of g close to zero indicates that the term is well spread over the document, which should affect the weight of that term positively. The converse also holds: an increase in g indicates that the term is concentrated in a specific part (or parts) of the document, perhaps only in the first paragraph, the last paragraph, or even a single sentence. Such a term weakly reflects the content of the document. Thus, as g increases, the weight of the term should decrease.

Weight calculation requires f to be a function of g, f(g), such that:

(1) $\alpha < f(g) \le \beta$; $\alpha, \beta \in \mathbb{R}$; $\alpha < \beta$; $\alpha \ge 1$; $\beta \le N$
(2) $\lim_{g \to 0} f(g) = \beta$ (the maximum value)
(3) $\lim_{g \to \infty} f(g) = \alpha$ (the minimum value)

Assumption: f(g) may be defined as:

$$f(g) = \alpha + \frac{\beta - \alpha}{K^{g}}$$

where $K \in \mathbb{R}$; $K > 1$ (see footnote 5).

5. Index Selection

At this stage, we are ready to select the indexes from the candidate terms. What qualifies a term to be an index is basically its weight. The index selection mechanism varies according to the task that the overall auto-indexing system is assigned. For example, an auto-indexing system that is part of a general newspaper archiving system may behave differently from one that is part of an Internet search engine. This difference in behavior is embodied in the very last stage of the system, namely the "index selection" stage, which is the final stage a document undergoes during auto-indexing. One interesting mechanism is the one used to create an index for a book [4].
The index of a book, usually found at the end of the book, is composed of keywords that are sorted alphabetically, grouped by their first letter, and listed together with the page(s) on which they occur. One slight modification to the system is to add a page number field and fill it in from the very first stage, when the whole document is read into an array of strings. Once the words within a document are stemmed and the weights are calculated, the user may set a threshold on the weight to select the indexes. For example, the user may decide that not all words that occur in a book need to be mentioned in the index, and that only terms with a weight greater than or equal to a certain threshold are selected. In this way, an appropriate index for the book in question is obtained.

Footnote 5: Note that K is a constant. The goal of the function is to obtain a factor that is a function of g and lies between α and β.

Another index selection technique is the one used by some search engines, such as Internet search engines. In such information retrieval systems, all terms extracted from the document are selected as indexes. The whole system works as follows: the document under consideration is assigned a unique ID, which is stored in the system's database along with the document's name and its physical path (see footnote 6), i.e., where the document physically resides on the hard disk drive. Additionally, the system assigns each new word a unique ID and inserts it into its database as well. Every extracted term, whether new or already present in the database, is then given an entry in which the term ID is stored with the document ID along with the term's weight. In other words, the system stores the weight of each term separately for each document in which it occurs [15]. The idea is illustrated in Figure 4.

Figure 4. Organization of terms versus documents.
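The two selection mechanisms described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the system's actual database schema: the names `IndexStore`, `add_document`, and `select_indexes`, the in-memory dictionaries standing in for database tables, and the sequential integer IDs are all hypothetical.

```python
from dataclasses import dataclass, field


def select_indexes(term_weights, threshold):
    """Book-style selection: keep terms whose weight is >= threshold."""
    kept = [(term, w) for term, w in term_weights.items() if w >= threshold]
    # Heaviest terms first.
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


@dataclass
class IndexStore:
    """Search-engine-style storage: every extracted term is kept."""
    documents: dict = field(default_factory=dict)  # doc_id -> (name, path)
    terms: dict = field(default_factory=dict)      # term -> term_id
    doc_term: dict = field(default_factory=dict)   # (term_id, doc_id) -> weight

    def add_document(self, name, path, term_weights):
        """Register a document and the weight of each of its terms."""
        doc_id = len(self.documents) + 1           # hypothetical ID scheme
        self.documents[doc_id] = (name, path)
        for term, weight in term_weights.items():
            # Reuse the term ID if the term is already known, else assign one.
            term_id = self.terms.setdefault(term, len(self.terms) + 1)
            # The weight belongs to the (term, document) pair, not to the term.
            self.doc_term[(term_id, doc_id)] = weight
        return doc_id


if __name__ == "__main__":
    store = IndexStore()
    store.add_document("rocks.txt", "/docs/rocks.txt", {"rock": 12.5, "sand": 3.0})
    store.add_document("sand.txt", "/docs/sand.txt", {"sand": 7.25})
    print(store.doc_term)
    print(select_indexes({"rock": 12.5, "sand": 3.0}, threshold=5.0))
```

Keeping the weight on the (term, document) pair mirrors the middle table of Figure 4: the same term can carry a different weight in each document in which it occurs. The weights used in the example are arbitrary.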
Footnote 6: Alternatively, some systems may save the whole document in the database rather than the document's path.

Notice in the figure above how the database is structured. Each document contains many terms, and each term may appear in one or more documents. Hence, the relationship between documents and terms is many-to-many. However, the weight of a single term varies from one document to another, which is why it is placed in the middle table. Some researchers claim that, to make the best use of indexes in reducing retrieval time, it is best to arrange documents versus terms in matrix form, with the weights residing within the matrix to signify the relevance of each term to each document [8]. The entries in the middle table link documents with terms.

An end user usually types a few words to search for certain documents. The information retrieval system reads these words and fetches their IDs from the database. It then uses these term IDs to look up the corresponding document IDs in the index table labeled "Doc-Term" in Figure 4. Once the document IDs are known, the system fetches the documents' names, together with their paths, from the document table. The weight signifies how relevant a document is to the user's request. In other words, the information retrieval system should respond to the user's query by listing the relevant documents sorted in descending order of weight [12][16].

The index selection mechanism used by the auto-indexing system in this work is different from both techniques discussed above. The index selection stage simply lists all keywords for the documenter. It is like preparing the ingredients for a chef in a kitchen. The auto-indexing system reads the whole document, performs word stemming, calculates term weights, and finally produces a list of keywords sorted in descending order of weight.
In other words, the whole auto-indexing system is there to do what a documenter has to do manually while preparing subject headings for a document.

6. Conclusion

Auto-indexing varies in difficulty from one language to another, since the module that reduces words to their original stem depends entirely on the grammar of the language under consideration. Our contribution to this field involves presenting a model for auto-indexing that is composed of four interdependent layers. This model provides flexibility for the overall system, since the system is not bound to a specific language. In this report, Arabic was chosen, so the overall system performs auto-indexing on Arabic documents.

We also presented a new criterion to consider when calculating the weight, or relevance, of a term with respect to its container document. This new dimension is the level of spreading of a term in the document. The idea is based on the assumption that the more a word is spread in the document, the more likely it is to signify the document. This assumption is mathematically proven in the report. We have also presented new ideas in word extraction for Arabic words. This step is vital for Arabic, yet far less significant in languages that do not have sophisticated grammatical rules.

Future work involves enhancing the stem word extraction algorithm. Arabic is a rich language in all aspects: its grammatical rules are numerous, with many exceptions to some of them; the Arabic thesaurus is huge; and the letters within each Arabic word are attached to each other, so each letter may be written in several forms according to its position in the word. For example, the letter 'ت' is written differently depending on whether it appears at the beginning, middle, or end of a word.

References

[1] Al Haqhaq, K.
(2002) Head of Information Technology Department at KUNA – Kuwait News Agency, Personal Communication.
[2] Harter, S. (1986) Online Information Retrieval: Concepts, Principles, and Techniques. Academic Press.
[3] Turney, P. (1999) Learning to extract key phrases from text (Technical Report), National Research Council Canada, Institute of Information Technology.
[4] McNamee, P. and Mayfield, J. (1998) Indexing using both N-grams and words, Proc. of the Seventh Text Retrieval Conference (TREC-7).
[5] Franz, M. and Roukos, S. (1998) Auto-indexing for broadcast news, Proc. of the Seventh Text Retrieval Conference (TREC-7).
[6] Brassard, G. and Bratley, P. (1996) Fundamentals of Algorithms. Prentice Hall.
[7] Peter, G. (1989) Text retrieval, the state of the art, Proc. of the Institute of Information Scientists Text Retrieval Conference.
[8] Billhardt, H., Borrajo, D. and Maojo, V. (2000) Using term co-occurrence data for document indexing and retrieval, Proc. of the BCS-IRSG 22nd Annual Colloquium on Information Retrieval Research (IRSG2000), 105-117.
[9] Ng, H., Ang, H. and Soon, W. (2000) A hybrid algorithm for the routing task, Proc. of the Eighth Text Retrieval Conference (TREC-8).
[10] Daher, W. (2002) An Arabic auto-indexing system for information retrieval. Lebanese American University – Master's Thesis.
[11] Jones, G. (2000) Exploring the incorporation of acoustic information into term weights for spoken document retrieval, Proc. of the 22nd Annual Colloquium on Information Retrieval Research.
[12] Khurshid, A., Gillman, L. and Tostevin, L. (2000) Weirdness indexing for logical document extrapolation and retrieval, Proc. of the Eighth Text Retrieval Conference (TREC-8).
[13] Becker, J. and Hayes, R. (1993) Information Storage and Retrieval: Tools, Elements, Theories. John Wiley & Sons, Inc.
[14] Vijay, V. and Shi, H. (1983) Evaluation of the 2-Poisson model as a basis for using term frequency data in searching, Proc.
of the Sixth Annual International ACM SIGIR Conference, 88-100.
[15] Convey, J. (1992) Online Information Retrieval: An Introductory Manual to Principles and Practice. Library Association Publishing, London.
[16] Siegler, M., Jin, R. and Hauptmann, A. (2000) Analysis of the role of term frequency tf, Proc. Text Retrieval Conference (TREC-8).

Appendix A - Verification of the weight calculation formula

Consider the following:

(1) $\lim_{g \to 0} f(g) = \alpha + \frac{\beta - \alpha}{K^{0}} = \alpha + \beta - \alpha = \beta$
(2) $\lim_{g \to \infty} f(g) = \alpha + 0 = \alpha$

Therefore, the formula satisfies the required limits. Figure A.1 shows a plot of the values of f(g) as g varies.

Figure A.1 Graph of f(g).

However, the value of g could be positive or negative. The factor f(g) should have the same value whether g is a positive value x or a negative value -x, because the average distance ad could be aid - x or aid + x, and the size of the gap would still be x. Thus, the formula becomes:

$$f(g) = \alpha + \frac{\beta - \alpha}{K^{|g|}}$$

and the graph becomes as follows:

Figure A.2. Enhanced graph of f(g).

As a result, the weight of the term becomes:

$$w = m \times \mathit{sm} \times \left( \alpha + \frac{\beta - \alpha}{K^{|g|}} \right)$$

In the implementation of this formula, α and β have been assigned the values 1 and N, respectively. In other words, α and β are assigned their minimum and maximum values. Additionally, K is assigned a value of 2. It was noticed that the larger K is chosen, the smaller the resulting weights become. The final shape of the formula is:

$$w = m \times \mathit{sm} \times \left( 1 + \frac{N - 1}{2^{|g|}} \right)$$

Appendix B – An example of weight calculation

Suppose that a document contains 120 words, so N = 120. Suppose also that the count of some word in that document is 3, and the number of stem words for that word is 5. Thus, m = 3 and sm = 5.
Applying the above formulas:

$$\mathit{id} = \frac{N}{\mathit{sm} + 1} = \frac{120}{5 + 1} = 20$$

This means that the word W would be perfectly spread within the document if W occurs once every 20 terms. Thus, 20 is the ideal distance between the occurrences of W. aid, on the other hand, is the average of the distances of these ideally placed occurrences (see Figure B.1):

$$\mathit{aid} = \frac{N}{2} = \frac{120}{2} = 60$$

Figure B.1 The scale.

The distance of a term is defined to be the position at which the term occurs with respect to the beginning of the document. Now suppose that the first occurrence is at location 17, the second at 34, and so on. The distances for the five occurrences are as follows (see footnote 7):

$$d_1 = 17,\quad d_2 = 34,\quad d_3 = 51,\quad d_4 = 68,\quad d_5 = 102$$

$$\mathit{ad} = \frac{\sum_{i=1}^{\mathit{sm}} d_i}{\mathit{sm}} = \frac{17 + 34 + 51 + 68 + 102}{5} = \frac{272}{5} = 54.4$$

which is rounded up to 55.

Footnote 7: Note that these distances are randomly chosen to illustrate the example.

With ad = 55, notice Figure 6, which is an extension of Figure B.1.

Figure 6. Detailed scale.

In this example, there is a difference of 5 between aid and ad.
Accordingly, the gap g is:

$$g = \mathit{aid} - \mathit{ad} = 60 - 55 = 5$$

Therefore, the weight for the term under consideration is:

$$w = m \times \mathit{sm} \times \left( 1 + \frac{N - 1}{2^{|g|}} \right) = 3 \times 5 \times \left( 1 + \frac{120 - 1}{2^{5}} \right) = 70.781$$

Appendix C – An example of auto-indexing

In this appendix, we present an example to illustrate the operation of our proposed techniques on the document shown in Figure C.1.

أقدم الصخور
يعتقد البعض أن الصخور أشياء صلبة يصعب كسرها . نعم , إن بعض الصخور كذلك ، لكن ال
كلها . إن الرمل و الحصى و الطباشير و الصلصال صخور في نظر العالم الجيولوجي . ولو أنك
تناولت قبضة رمل على الشاطئ و تركتها تنساب بين أصابعك , فإنك تكون قد تناولت صخورا
شبيهة بالحجر الرملي الصلب الذي يصلح للبناء . و هنالك أنواع عديدة من الصخور تكونت بطرق
مختلفة . و يقسمها علماء الجيولوجيا إلى ثالث مجموعات . الصخور النارية , و الصخور الرسوبية
, و الصخور المتحولة . و يجب علينا أن نحفظ هذه األسماء ألنها األسماء التي يستعملها
الجيولوجيون دائما . و صخور المجموعة الثالثة أو الصخور المتحولة هي الصخور التي بدأت
نارية أو رسوبية ثم تغيرت بفعل الحرارة الشديدة أو الضغط أو اإللتواء تغيرا كبيرا حتى صارت
شيئا آخر . و على سبيل المثال نجد أن بعض الحجارة الرملية تغيرت إلى حجارة أشد صالبة هي
المرو أو الكوارتز , و أن حجر الجير تحول إلى رخام , و أن صخر سهل التفتت يدعى الطين
الصفحي قد تحول إلى إردواز أو الحجر المشقق ، وهو صخر قاس سريع اإلنكسار.
حين نشأت األرض قبل أربع ماليين سنة , كانت مكونة من مادة ذائبة شديدة الحرارة , ثم أخذت
تبرد . و تحولت المادة الخارجة بصورة تدريجية إلى صخر صلب شديد الشبه بالصخور النارية
التي نراها في الوقت الحاضر . و قد كانت آنذاك هي النوع الوحيد من الصخور . و في خالل
ماليين السنين كانت المادة الذائبة التي على سطح األرض تبرد بسرعة معقولة و هي التي ندعوها
بالصخور البركانية ألنها تشبه الصخور التي تقذفها البراكين الحية في الوقت الحاضر . لكن
الصخور الغائرة في األرض كانت تبرد ببطئ أشد , و تعرف بالصخور الجوفية أو البلوتونية نسبة
إلى بلوتو . و يمكن لعلماء الجيولوجيا أن يميزوا بسهولة بين الصخور البركانية و الصخور
الجوفية و ذلك من طريقة
تكوينها ‪ .‬فالصخور التي بردت بسرعة معقولة تتكون من بلورات‬ ‫صغيرة ذات شكل منتظم ‪ ,‬لكن تلك التي بردت ببطئ تتكون من بلورات أكبر حجما ‪.‬‬ ‫لقد عرف علماء الجيولوجيا كيف كان الصخور البركانية تبرد في الماضي بمراقبة ما يحدث أحيانا‬ ‫عندما ينفجر بركان في الوقت الحاضر ‪ .‬فقد يقذف من فوهته جداول من صخور مذابة تدعى البة‪.‬‬ ‫و إذا بردت الالبة بسرعة جرى ذلك بدون أن تتكون فيها بلورات ‪ ,‬و ظهرت قاسية و زجاجية ‪ .‬أما‬ ‫إذا بردت بسرعة أقل ‪ ,‬فإن بلورات صغيرة تتكون فيها ‪ .‬و قد تخرج الالبة من السطح أحيانا مليئة‬ ‫بفقاقيع من الغاز ثم تبرد و تتكون منها صخر إسفنجي الشكل يدعى الخفان و هو الحجر الذي‬ ‫يستعمل إلزالة األوساخ العالقة باأليدي‪.‬‬ ‫و أكثر الصخور انتشارا هي البازلت ‪ .‬و تتكون من بلورات صغيرة ‪ .‬و قد تكون القسم األكبر منها‬ ‫قبل ماليين السنين عندما تجمعت الصخور المذابة في شقوق طويلة بقشرة األرض و اندفعت‬ ‫انهارا من الالبة بلغت عرضها كيلومترات عديدة في بعض األحيان ‪ .‬و يمكننا أن نرى ذلك في‬ ‫الوقت الحاضر في أمكنة كثيرة ‪ .‬ففي الساحل الشمالي من إيرلندا ‪ ,‬بردت الصخور المذابة بسرعة‬ ‫شديدة فتشققت و كونت أعمدة هائلة ألكثرها ستة جوانب‪ .‬و بدت و كأنها درجات يستخدمها مارد‬ ‫للنزول إلى البحر ‪ ,‬أو كأنها أنابيب أرغن ضخم على قمم الصخور ‪ .‬و في ستافا و هي جزيرة على‬ ‫الساحل الغربي من إسكتلندة فتحت مغاور في األعمدة ‪.‬‬ ‫أما الصخر الجوفي البلوتوني األكثر إنتشارا فهو الغرانيت ‪ .‬و مع أن هذا الصخر قد تكون في‬ ‫البداية ببطئ على عمق تحت األرض ‪ ,‬فإنه موجود في الوقت الحاضر على سطح األرض في‬ ‫أمكنة كثيرة على السواح ل الصخرية في الشمال الشرقي من إسكتلندة ‪ .‬و قد حدث أحيانا أن اندفع‬ ‫هذا الصخر إلى سطح األرض بفعل تحركات كبيرة في األرض ‪ ,‬أو لعل الصخور التي كانت‬ ‫تغطيه منذ ماليين السنين قد تفتت بصورة تدريجية ‪ .‬و الغرانيت صخر قاس أبيض اللون أو‬ ‫رمادي أو وردي فيه بقع سوداء المعة مما يجعله بريق ‪ .‬و كثيرا ما يكون في الغرانيت قطع‬ ‫صغيرة شبيهة بالحجارة الكريمة مثل العقيق أو التوباز‪.‬‬ ‫‪Figure C.1 Document considered for indexing.‬‬ ‫‪37‬‬ ‫‪After reading the whole document, the algorithm ignores stop list terms and‬‬ ‫‪stems the candidate words. 
The output is two arrays shown in Figure C.2. Note that‬‬ ‫‪only the top 50 terms are shown due to limited space.‬‬ ‫العدد‬ ‫‪1‬‬ ‫‪37‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪4‬‬ ‫‪1‬‬ ‫‪2‬‬ ‫‪4‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪3‬‬ ‫‪5‬‬ ‫‪2‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪3‬‬ ‫‪7‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪2‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪2‬‬ ‫‪2‬‬ ‫‪3‬‬ ‫‪3‬‬ ‫‪2‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪2‬‬ ‫‪2‬‬ ‫‪2‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪38‬‬ ‫أصل الكلمـــة‬ ‫قدم‬ ‫صخر‬ ‫عقد‬ ‫أشي‬ ‫صلب‬ ‫صعب‬ ‫كسر‬ ‫رمل‬ ‫حصى‬ ‫طبشر‬ ‫صلصل‬ ‫نظر‬ ‫علم‬ ‫جيولوج‬ ‫نول‬ ‫قبض‬ ‫شطئ‬ ‫ركت‬ ‫ساب‬ ‫أصبع‬ ‫شبه‬ ‫حجر‬ ‫صلح‬ ‫بنء‬ ‫نوع‬ ‫بطرق‬ ‫خلف‬ ‫قسم‬ ‫ثلث‬ ‫جمع‬ ‫نار‬ ‫رسب‬ ‫يجب‬ ‫حفظ‬ ‫أسم‬ ‫عمل‬ ‫حرر‬ ‫شدد‬ ‫ضغط‬ ‫لوء‬ ‫شيئ‬ ‫العدد‬ ‫‪1‬‬ ‫‪20‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪3‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪2‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪2‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪2‬‬ ‫‪3‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪2‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫أصل الكلمـــة‬ ‫قدم‬ ‫صخر‬ ‫عقد‬ ‫أشي‬ ‫صلب‬ ‫صعب‬ ‫كسر‬ ‫رمل‬ ‫حصى‬ ‫طبشر‬ ‫صلصل‬ ‫صخر‬ ‫نظر‬ ‫علم‬ ‫جيولوج‬ ‫نول‬ ‫قبض‬ ‫رمل‬ ‫شطئ‬ ‫ركت‬ ‫ساب‬ ‫أصبع‬ ‫صخر‬ ‫شبه‬ ‫حجر‬ ‫رمل‬ ‫صلب‬ ‫صلح‬ ‫بنء‬ ‫نوع‬ ‫بطرق‬ ‫خلف‬ ‫قسم‬ ‫علم‬ ‫جيولوج‬ ‫ثلث‬ ‫جمع‬ ‫نار‬ ‫رسب‬ ‫يجب‬ ‫حفظ‬ ‫الكلمـــــة‬ ‫أقدم‬ ‫الصخور‬ ‫يعتقد‬ ‫أشياء‬ ‫صلبة‬ ‫يصعب‬ ‫كسرها‬ ‫الرمل‬ ‫الحصى‬ ‫الطباشير‬ ‫الصلصال‬ ‫صخور‬ ‫نظر‬ ‫العالم‬ ‫الجيولوجي‬ ‫تناولت‬ ‫قبضة‬ ‫رمل‬ ‫الشاطئ‬ ‫تركتها‬ ‫تنساب‬ ‫أصابعك‬ ‫صخورا‬ ‫شبيهة‬ ‫بالحجر‬ ‫الرملي‬ ‫الصلب‬ ‫يصلح‬ ‫للبناء‬ ‫أنواع‬ ‫بطرق‬ ‫مختلفة‬ ‫يقسمها‬ ‫علماء‬ ‫الجيولوجيا‬ ‫ثالث‬ ‫مجموعات‬ ‫النارية‬ ‫الرسوبية‬ ‫يجب‬ ‫نحفظ‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫آخر‬ ‫سبل‬ ‫نجد‬ ‫مرو‬ ‫كوارتز‬ ‫جير‬ ‫رخم‬ ‫أسم‬ ‫عمل‬ ‫جيولوج‬ ‫جمع‬ ‫ثلث‬ ‫نار‬ ‫رسب‬ ‫‪2‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫األسماء‬ ‫يستعملها‬ ‫الجيولوجيون‬ ‫المجموعة‬ ‫الثالثة‬ ‫نارية‬ ‫رسوبية‬ ‫‪Figure C.2 Output of the proposed technique.‬‬ ‫‪Next, we calculate the weights and all the terms that are needed for weight‬‬ ‫‪calculation. 
These are given in Figure C.3.‬‬ ‫يمة اإلنتشار ‪Factor-‬‬ ‫‪1.01‬‬ ‫‪3.508‬‬ ‫‪1.01‬‬ ‫‪1.01‬‬ ‫‪1.01‬‬ ‫‪1.01‬‬ ‫‪1.01‬‬ ‫‪1.01‬‬ ‫‪1.01‬‬ ‫‪1.01‬‬ ‫‪1.01‬‬ ‫‪1.01‬‬ ‫‪1.01‬‬ ‫‪1.01‬‬ ‫‪1.01‬‬ ‫‪1.01‬‬ ‫‪1.01‬‬ ‫‪1.01‬‬ ‫‪1.01‬‬ ‫‪1.01‬‬ ‫‪41.125‬‬ ‫‪3.508‬‬ ‫‪1.01‬‬ ‫‪1.01‬‬ ‫‪1.01‬‬ ‫‪1.01‬‬ ‫‪1.01‬‬ ‫‪3.508‬‬ ‫‪1.01‬‬ ‫‪1.01‬‬ ‫‪1.01‬‬ ‫‪1.01‬‬ ‫‪1.01‬‬ ‫‪1.01‬‬ ‫‪1.01‬‬ ‫‪3.508‬‬ ‫‪1.01‬‬ ‫‪1.01‬‬ ‫‪1.01‬‬ ‫‪1.01‬‬ ‫‪39‬‬ ‫قيمة الفراغ ‪Gap -‬‬ ‫‪320‬‬ ‫‪46‬‬ ‫‪318‬‬ ‫‪314‬‬ ‫‪214‬‬ ‫‪312‬‬ ‫‪224‬‬ ‫‪254‬‬ ‫‪295‬‬ ‫‪293‬‬ ‫‪291‬‬ ‫‪288‬‬ ‫‪174‬‬ ‫‪154‬‬ ‫‪275‬‬ ‫‪281‬‬ ‫‪278‬‬ ‫‪276‬‬ ‫‪275‬‬ ‫‪273‬‬ ‫‪19‬‬ ‫‪73‬‬ ‫‪261‬‬ ‫‪260‬‬ ‫‪174‬‬ ‫‪251‬‬ ‫‪250‬‬ ‫‪64‬‬ ‫‪227‬‬ ‫‪110‬‬ ‫‪182‬‬ ‫‪218‬‬ ‫‪228‬‬ ‫‪225‬‬ ‫‪222‬‬ ‫‪61‬‬ ‫‪160‬‬ ‫‪197‬‬ ‫‪195‬‬ ‫‪193‬‬ ‫معدل مســـافة الكلمــــة‬ ‫‪1‬‬ ‫‪275‬‬ ‫‪3‬‬ ‫‪7‬‬ ‫‪107‬‬ ‫‪9‬‬ ‫‪97‬‬ ‫‪67‬‬ ‫‪26‬‬ ‫‪28‬‬ ‫‪30‬‬ ‫‪33‬‬ ‫‪147‬‬ ‫‪167‬‬ ‫‪46‬‬ ‫‪40‬‬ ‫‪43‬‬ ‫‪45‬‬ ‫‪46‬‬ ‫‪48‬‬ ‫‪302‬‬ ‫‪248‬‬ ‫‪60‬‬ ‫‪61‬‬ ‫‪147‬‬ ‫‪70‬‬ ‫‪71‬‬ ‫‪257‬‬ ‫‪94‬‬ ‫‪211‬‬ ‫‪139‬‬ ‫‪103‬‬ ‫‪93‬‬ ‫‪96‬‬ ‫‪99‬‬ ‫‪260‬‬ ‫‪161‬‬ ‫‪124‬‬ ‫‪126‬‬ ‫‪128‬‬ ‫العدد‬ ‫‪1‬‬ ‫‪37‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪4‬‬ ‫‪1‬‬ ‫‪2‬‬ ‫‪4‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪3‬‬ ‫‪5‬‬ ‫‪2‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪3‬‬ ‫‪7‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪2‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪2‬‬ ‫‪2‬‬ ‫‪3‬‬ ‫‪3‬‬ ‫‪2‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪2‬‬ ‫‪2‬‬ ‫‪2‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫أصــــل الكلمــــــة‬ ‫قدم‬ ‫صخر‬ ‫عقد‬ ‫أشي‬ ‫صلب‬ ‫صعب‬ ‫كسر‬ ‫رمل‬ ‫حصى‬ ‫طبشر‬ ‫صلصل‬ ‫نظر‬ ‫علم‬ ‫جيولوج‬ ‫نول‬ ‫قبض‬ ‫شطئ‬ ‫ركت‬ ‫ساب‬ ‫أصبع‬ ‫شبه‬ ‫حجر‬ ‫صلح‬ ‫بنء‬ ‫نوع‬ ‫بطرق‬ ‫خلف‬ ‫قسم‬ ‫ثلث‬ ‫جمع‬ ‫نار‬ ‫رسب‬ ‫يجب‬ ‫حفظ‬ ‫أسم‬ ‫عمل‬ ‫حرر‬ ‫شدد‬ ‫ضغط‬ ‫لوء‬ ‫‪188‬‬ ‫‪187‬‬ ‫‪183‬‬ ‫‪181‬‬ ‫‪170‬‬ ‫‪170‬‬ ‫‪168‬‬ ‫‪163‬‬ ‫‪160‬‬ ‫‪1.01‬‬ ‫‪1.01‬‬ ‫‪1.01‬‬ ‫‪1.01‬‬ ‫‪1.01‬‬ ‫‪1.01‬‬ ‫‪1.01‬‬ ‫‪1.01‬‬ ‫‪1.01‬‬ ‫‪133‬‬ ‫‪134‬‬ ‫‪138‬‬ ‫‪140‬‬ ‫‪151‬‬ ‫‪151‬‬ ‫‪153‬‬ ‫‪158‬‬ ‫‪161‬‬ ‫العدد‬ ‫‪1‬‬ ‫‪2595.781‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪4.039‬‬ ‫‪1‬‬ ‫‪2.02‬‬ ‫‪4.039‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪389.367‬‬ ‫‪1‬‬ ‫‪3.029‬‬ ‫‪5.049‬‬ ‫‪4.039‬‬ ‫‪1‬‬ ‫‪4.039‬‬ ‫‪1‬‬ ‫‪1‬‬ 
‫‪1‬‬ ‫‪1‬‬ ‫‪129.789‬‬ ‫‪246.75‬‬ ‫‪24.555‬‬ ‫‪4.039‬‬ ‫‪4.039‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪2.02‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪7.016‬‬ ‫‪6.059‬‬ ‫‪15.147‬‬ ‫‪2.02‬‬ ‫‪3.029‬‬ ‫‪6.059‬‬ ‫‪2.02‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪4.039‬‬ ‫‪7.016‬‬ ‫‪5.049‬‬ ‫‪3.029‬‬ ‫‪2.02‬‬ ‫‪40‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫الوزن‬ ‫‪1‬‬ ‫‪20‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪3‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪2‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪2‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪2‬‬ ‫‪3‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪2‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪2‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪1‬‬ ‫أصل الكلمة‬ ‫قدم‬ ‫صخر‬ ‫عقد‬ ‫أشي‬ ‫صلب‬ ‫صعب‬ ‫كسر‬ ‫رمل‬ ‫حصى‬ ‫طبشر‬ ‫صلصل‬ ‫صخر‬ ‫نظر‬ ‫علم‬ ‫جيولوج‬ ‫نول‬ ‫قبض‬ ‫رمل‬ ‫شطئ‬ ‫ركت‬ ‫ساب‬ ‫أصبع‬ ‫صخر‬ ‫شبه‬ ‫حجر‬ ‫رمل‬ ‫صلب‬ ‫صلح‬ ‫بنء‬ ‫نوع‬ ‫بطرق‬ ‫خلف‬ ‫قسم‬ ‫علم‬ ‫جيولوج‬ ‫ثلث‬ ‫جمع‬ ‫نار‬ ‫رسب‬ ‫يجب‬ ‫حفظ‬ ‫أسم‬ ‫عمل‬ ‫جيولوج‬ ‫جمع‬ ‫ثلث‬ ‫شيئ‬ ‫آخر‬ ‫سبل‬ ‫نجد‬ ‫مرو‬ ‫مرو‬ ‫كوارتز‬ ‫جير‬ ‫رخم‬ ‫الكلمة‬ ‫أقدم‬ ‫الصخور‬ ‫يعتقد‬ ‫أشياء‬ ‫صلبة‬ ‫يصعب‬ ‫كسرها‬ ‫الرمل‬ ‫الحصى‬ ‫الطباشير‬ ‫الصلصال‬ ‫صخور‬ ‫نظر‬ ‫العالم‬ ‫الجيولوجي‬ ‫تناولت‬ ‫قبضة‬ ‫رمل‬ ‫الشاطئ‬ ‫تركتها‬ ‫تنساب‬ ‫أصابعك‬ ‫صخورا‬ ‫شبيهة‬ ‫بالحجر‬ ‫الرملي‬ ‫الصلب‬ ‫يصلح‬ ‫للبناء‬ ‫أنواع‬ ‫بطرق‬ ‫مختلفة‬ ‫يقسمها‬ ‫علماء‬ ‫الجيولوجيا‬ ‫ثالث‬ ‫مجموعات‬ ‫النارية‬ ‫الرسوبية‬ ‫يجب‬ ‫نحفظ‬ ‫األسماء‬ ‫يستعملها‬ ‫الجيولوجيون‬ ‫المجموعة‬ ‫الثالثة‬ ‫‪3.029‬‬ ‫‪2.02‬‬ ‫‪4.039‬‬ ‫نار‬ ‫رسب‬ ‫حرر‬ ‫‪1‬‬ ‫‪1‬‬ ‫‪2‬‬ ‫نارية‬ ‫رسوبية‬ ‫الحرارة‬ ‫‪Figure C.3 Terms and weights.‬‬ ‫‪The list of the top 35 indexes is given in Figure C.4.‬‬ ‫الوزن‬ ‫‪8,075.000‬‬ ‫‪8,075.000‬‬ ‫‪2,595.781‬‬ ‫‪648.945‬‬ ‫‪616.875‬‬ ‫‪389.367‬‬ ‫‪389.367‬‬ ‫‪389.367‬‬ ‫‪323.000‬‬ ‫‪323.000‬‬ ‫‪246.750‬‬ ‫‪205.625‬‬ ‫‪205.625‬‬ ‫‪164.500‬‬ ‫‪129.789‬‬ ‫‪129.789‬‬ ‫‪129.789‬‬ ‫‪123.375‬‬ ‫‪87.695‬‬ ‫‪87.695‬‬ ‫‪87.695‬‬ ‫‪82.250‬‬ ‫‪82.250‬‬ ‫‪82.250‬‬ ‫‪82.250‬‬ ‫‪64.627‬‬ ‫‪56.125‬‬ ‫‪49.109‬‬ ‫‪31.570‬‬ ‫‪31.570‬‬ ‫‪24.555‬‬ ‫‪24.555‬‬ ‫‪8,075.000‬‬ ‫‪8,075.000‬‬ ‫‪2,595.781‬‬ ‫‪648.945‬‬ ‫‪616.875‬‬ ‫‪389.367‬‬ 
‫‪389.367‬‬ ‫‪389.367‬‬ ‫‪323.000‬‬ ‫‪323.000‬‬ ‫‪246.750‬‬ ‫‪205.625‬‬ ‫‪205.625‬‬ ‫‪164.500‬‬ ‫‪129.789‬‬ ‫‪129.789‬‬ ‫‪41‬‬ ‫الكلمــــــة‬ ‫تبرد‬ ‫بردت‬ ‫الصخور‬ ‫صخر‬ ‫البركانية‬ ‫صخور‬ ‫بالصخور‬ ‫الصخر‬ ‫المشقق‬ ‫شقوق‬ ‫شبيهة‬ ‫البراكين‬ ‫بركان‬ ‫يدعى‬ ‫صخورا‬ ‫فالصخور‬ ‫الصخرية‬ ‫الشبه‬ ‫الوقت‬ ‫الحاضر‬ ‫بلورات‬ ‫الخارجة‬ ‫تقذفها‬ ‫يقذف‬ ‫تخرج‬ ‫األرض‬ ‫ماليين‬ ‫الحجر‬ ‫ببطئ‬ ‫الالبة‬ ‫الحجارة‬ ‫حجارة‬ ‫تبرد‬ ‫بردت‬ ‫الصخور‬ ‫صخر‬ ‫البركانية‬ ‫صخور‬ ‫بالصخور‬ ‫الصخر‬ ‫المشقق‬ ‫شقوق‬ ‫شبيهة‬ ‫البراكين‬ ‫بركان‬ ‫يدعى‬ ‫صخورا‬ ‫فالصخور‬ ‫الصخرية‬ ‫الشبه‬ ‫الوقت‬ ‫الحاضر‬ ‫بلورات‬ ‫الخارجة‬ ‫تقذفها‬ ‫يقذف‬ ‫تخرج‬ ‫األرض‬ ‫ماليين‬ ‫الحجر‬ ‫ببطئ‬ ‫الالبة‬ ‫الحجارة‬ ‫حجارة‬ 129.789 123.375 87.695 87.695 87.695 82.250 82.250 82.250 82.250 64.627 56.125 49.109 31.570 31.570 24.555 24.555 Figure C.4 Indexes selected and their corresponding weight. Recall that these numbers are calculated according to the formulas mentioned in section four. 42

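The closed-form weight given in Appendix A can be sketched as follows. This is a minimal illustration of the formula as written, assuming α = 1, β = N, and K = 2 as stated in that appendix; the function names `spread_factor` and `term_weight` are hypothetical, and the sketch is not the authors' implementation, so it is not claimed to reproduce the values listed in the figures of Appendix C.

```python
def spread_factor(g, n_terms, alpha=1.0, beta=None, k=2.0):
    """f(g) = alpha + (beta - alpha) / k**|g|, with beta defaulting to N."""
    if beta is None:
        beta = float(n_terms)
    # A negative exponent lets very large gaps underflow quietly to 0.0
    # instead of overflowing in the denominator.
    return alpha + (beta - alpha) * k ** (-abs(g))


def term_weight(m, sm, g, n_terms):
    """w = m * sm * f(g), using the default alpha, beta and k above."""
    return m * sm * spread_factor(g, n_terms)


if __name__ == "__main__":
    # Arbitrary illustrative inputs: m = 2, sm = 4, N = 100.
    for gap in (0, 3, -3, 50):
        print(gap, term_weight(2, 4, gap, 100))
```

With a gap of zero the factor reaches its maximum β = N, and as the gap grows in either direction the factor falls toward its minimum α = 1, so the weight approaches m × sm.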