Long Title: An Auto-Indexing System for Arabic Information Retrieval
Short Title: Auto-Indexing
Authors: R. A. Haraty (corresponding author), N. M. Mansour and W. Daher
Lebanese American University
P.O. Box 13-5053 Chouran
Beirut, Lebanon 1102 3801
Email: [email protected]
Telephone: 961 1 867621 ext. 1285
Fax: 961 1 867098
This work tackles the problem of auto-indexing Arabic documents. Auto-indexing a text
document means using the words found in the document to build an index automatically. These
index terms, which are referred to as keywords, are then used to build subject headings that
describe the topic of the document. We present an algorithm for extracting Arabic stem words. We
also introduce a new technique for calculating the weight of a term relative to its containing
document. Traditionally, the weight of a term has relied entirely on the rate of occurrence of that
term. We propose also considering the word's spread within the document: a word that is
concentrated in one part of a document is less likely to reflect the document than a word that is
spread more evenly throughout it. This assumption is mathematically proven, and is illustrated by real examples.
Keywords: Arabic documents, document auto-indexing, stem words, and word
1. Introduction
Manual indexing of text documents is a cumbersome task for everyone who works
in the domain of information retrieval. The people who perform indexing for a
newspaper, magazine, or any other information resource are well-trained specialists
with a solid linguistic background: they are fluent in the language, have a rich
vocabulary, and, most importantly, are experts in the grammar of the language. The
people responsible for this job are called documenters, or ‘موثقين’ in Arabic. Manual
indexing requires immense human effort, since the whole document has to be read
before the candidate indexes for that document can be selected.
Indexing is of two types: Thesaurus-based indexing and Full-Text based
indexing [1]. In Thesaurus-based indexing, the documenter may choose words to
represent a document that do not even exist in the document. However, the synonyms
do exist. The documenter may choose the synonym of a word as an index when he/she
knows in advance that users are more likely to search for that particular document
using the synonym of that term rather than the term itself. A synonym need not be the
directly corresponding term in the dictionary. If, for example, a document is about a
president of a country, then a valid index might be the name of that president,
although his/her name might not occur at all in the document.
Thesaurus-based indexing is difficult, though possible, to implement. The
reason is that human intervention is needed to select synonyms in place of terms
that already exist in the document. One way of implementing a solution is to build
a thesaurus file as part of the automated system; that file, in turn, has to be
monitored and updated by an individual, so human intervention is needed again.
However, some systems do build thesauruses intelligently.
Full-Text based indexing, on the other hand, is much simpler in concept and
much easier to implement. It relies entirely on terms, as well as phrases, within the
document itself; nothing is taken from outside the document. The difficulty of
auto-indexing varies from one language to another. Languages with sophisticated
grammatical rules, such as Arabic or Chinese, make the process of auto-indexing quite
difficult. The practical solution is to implement an algorithm that covers most of the
grammatical rules, since writing an algorithm that covers all rules is very difficult, if
not impossible. Additionally, the algorithm should be well designed and modular, so
that any missing grammatical rule can easily be plugged in.
Whether Thesaurus-based indexing or Full-Text based indexing is used, the
output is the same: a set of keywords. Indexes, when extracted from the documents,
are referred to as “keywords”. Thus, “keywords” is the term used by documenters to
signify an index. Keywords, in turn, are used to build “subject headings”. Usually,
subject headings are phrases composed of more than one keyword. A single document
may have as many subject headings as needed. The more subject headings a
document is assigned, the more likely it is that a user will hit that document when
searching for a topic. Composing subject headings is what documenters actually do.
There are certain rules that documenters follow in order to build subject headings. A
subject heading is composed of the following fields:
Name – The name of a person or the organization the document is about.
Position – Social position, for example, “The President of Lebanon” or
Country/City/Town/…Place – for example, “Beirut, Lebanon”.
Activity – for example, “Meeting with the Prime Minister”.
An example of a subject heading is the following:
United Nations>UNICEF>Somalia>Famine>Donating Food and Medicine.
Once subject headings are built, the overall search engine system will search
within the subject headings rather than the whole text. Subject headings allow users to
search within categories of topics.
In this paper we present an algorithm for extracting Arabic stem words to
build an index for Arabic text documents. We also introduce a new technique to
calculate the weight of a term relevant to its container document.
The rest of the paper is organized as follows: in section 2 an overall
description of the model used for auto-indexing is provided. Section 3 tackles
particularly the second level of the proposed model for auto-indexing. Section 4
describes how the weight of a certain term is calculated. Section 5 presents the index
selection mechanism, and section 6 contains the conclusion. Appendix A presents the
verification of the weight calculation formulae. Appendix B presents an example
using the weight calculation formula. Appendix C presents an example of our
proposed auto-indexing techniques.
2. Four Layer Model
The model consists of four layers, as shown in Figure 1. Each layer is a
module that is implemented separately and is independent of the other modules.
The layers do exchange information, however: the output of one layer is the input of
the layer above it. Notice in the figure that the second layer, where words are stemmed
to their original form, is drawn apart from the others. This illustrates that the module
responsible for stemming Arabic words can easily be unplugged from the system and
replaced by another module that stems words in any other language. Everything else
continues to work unchanged. This separation of tasks gives the system high flexibility,
in the sense that only one part of the system, and not the whole system, needs to be
replaced in order to auto-index documents in other languages.
[Figure 1 content, layers listed from top to bottom]
Layer 4: Select appropriate words/phrases
Layer 3: Perform weight calculations
Layer 2: Apply algorithm to extract stem words for Arabic language
Layer 1: Read whole document
Figure 1. The four-layer model.
The first layer is merely concerned with reading each word and inserting it
into an array called Document. The Document array holds the words as they are: no
stemming is performed, no weight is calculated, and no words are omitted. Once the text
file has been read and loaded into memory as an array of words, the real work starts.
The second step is to scan the whole array word by word and apply the
stemming algorithm (see Figure 2). Each word is checked on its own. Words that belong to
the “stop-list terms” are omitted. (Stop-list terms are often referred to as “noise”: the list
consists of all words that do not contribute to the meaning of a sentence, yet help in
forming a proper sentence. Such words constitute a major part of any document.) Phrases
that belong to the “stop-list phrases” are omitted as well. Like stop-list terms, stop-list
phrases occur within a document, yet they do not contribute to the meaning of the
document. For instance, a document may contain the phrase “Ladies and Gentlemen...”, yet
the document probably says nothing about “Ladies” or “Gentlemen”. If, however, the word
is a candidate index, then it is stemmed, that is, returned to its original form, which in
most cases has three letters. The details of the stemming algorithm are given
in the next section. The output of this module is two sets of words (or two arrays of
strings). The first is called the “Words” array. It is an array of records, with each
record having four fields:
1 - The word itself,
2 - the stem word,
3 - the count of the word, and
4 - the weight of the word.
At this stage, the weight field is kept undefined. It is updated in the third layer
when all criteria for weight calculations are available.
The second array is called the Stem_Word array. It is an array of records
consisting of five fields:
1- The stem word,
2 - the count,
3 - the ideal distance or ID,
4 - the average ideal distance or aid, and
5 - the average distance or ad.
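The two record layouts listed above can be sketched as plain data classes. This is only an illustrative sketch: the class and field names (`WordRecord`, `StemRecord`, and so on) are assumptions made for the example and do not come from the original system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WordRecord:
    """One entry of the "Words" array (names are hypothetical)."""
    word: str                       # the word itself
    stem: str                       # the stem word
    count: int = 1                  # the count of the word
    weight: Optional[float] = None  # undefined until the third layer runs


@dataclass
class StemRecord:
    """One entry of the Stem_Word array (names are hypothetical)."""
    stem: str                        # the stem word
    count: int = 1                   # the count
    ideal_distance: float = 0.0      # ID
    avg_ideal_distance: float = 0.0  # aid
    avg_distance: float = 0.0        # ad


# A freshly inserted word has a count of one and no weight yet.
record = WordRecord(word="كاتب", stem="كتب")
```

Keeping the weight as `None` until layer three mirrors the statement that the weight field stays undefined until all criteria for the weight calculation are available.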
Figure 2. Flow chart of the auto-indexing algorithm.
2.1 General Algorithm
Figure 3 gives a high level algorithm that outlines the layers of the model used. Note
that the algorithm contains new terms that will be explained in the next sections.
// Layer 1 starts here
1. Read the whole document and put its words into an array
   (ArrayName = Document)
   Assign each term a distance value that is auto-incremented by one
2. Set N = count of words in the document
// Layer 2 starts here
3. If EndOfArray then goto 6 Else read word from array
4. If Word belongs to the stop-list terms/phrases then
      Read next word
      Set PrecededWord = Document.CurrentWord
      Goto 3
// For each word in the Document array do the following block of code
5. Set ThisWord = Document.CurrentWord
   If ThisWord exists in WordArray then
      Increment its Count by 1
   Else
      Insert ThisWord into WordArray; Set its Count = 1
   Set StemWord = ExtractStemWord(PrecededWord, ThisWord)
   If StemWord exists in StemWordArray then
      Increment its Count by 1
      // Accumulate the sum of distances in ad so that it can be
      // divided by the count when done
      Set ad = ad + distance
   Else
      Insert StemWord into StemWordArray
      Set its Count = 1
      Set average distance: ad = distance
   Set PrecededWord = ThisWord
   Goto 3
// Layer 3 starts here: calculate weights
6. For each Word in StemWordArray
      Set average distance: ad = ad / Count
      Set ideal distance: ID = N / (Count + 1)
      Set average ideal distance: aid = N / 2   // independent of Count, see Section 4.2
      Read next Word from array
// Now it is time to assign the weight to each word in WordArray
7. For each Word in WordArray
      Find the matching node in StemWordArray using binary search
      Set the gap variable: g = aid - ad
      // Difference between the average ideal distance and the
      // average distance
      Set F(g) = 1 + (N - 1) / 2^|g|
      Set Weight = Count x CountStemWord x F(g)
      Read next Word
// Layer 4 starts here
8. Select words with highest weights as valid indexes for that document
Figure 3. Outlines of the layers of the auto-indexing model.
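As a rough illustration of layers 1 and 2 of the outline, the following sketch assigns each token its distance, skips stop-list terms, and accumulates the count and average distance per stem. The function name `accumulate_stems`, the dictionary layout, and the `stem` callback are assumptions made for the example; a real run would pass the Arabic stemmer of Section 3 instead of the identity function used here.

```python
def accumulate_stems(tokens, stop_terms, stem):
    """Return (N, stats) where stats maps each stem to its count and ad.

    tokens     -- the Document array (all words, in order)
    stop_terms -- set of stop-list terms to skip
    stem       -- callable mapping a word to its stem (hypothetical hook)
    """
    n = len(tokens)  # N counts every word, including stop-list terms
    stats = {}
    for distance, token in enumerate(tokens, start=1):
        if token in stop_terms:
            continue
        entry = stats.setdefault(stem(token), {"count": 0, "ad": 0.0})
        entry["count"] += 1
        entry["ad"] += distance  # running sum, averaged below
    for entry in stats.values():
        entry["ad"] /= entry["count"]
    return n, stats


n, stats = accumulate_stems(["a", "the", "b", "a"], {"the"}, lambda w: w)
```

The distance is assigned before the stop-list check, so positions refer to the original document rather than to the filtered word list.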
3. Stem Word Extraction
“Every language whether natural or artificial, is characterized by its
vocabulary, its syntax, its logical structure, and its domain” [2]. After the
whole document has been read, it is analyzed word by word: stop-list terms, which make up
a major part of the document, are discarded; the remaining terms are identified as nouns
or verbs; and the appropriate stemming techniques are applied to each term. Stemming a
word to its root term is an important stage that a document has to undergo during
auto-indexing [3]. Suppose that a certain noun appears once in its singular form
and once in its plural form. Moreover, suppose that the same noun occurs
once as an adjective, another time as a subject, and a third time as an object. The same
term is actually appearing in different suits, yet with the same body. If the
auto-indexing algorithm did not perform word stemming, it would treat each form of
this noun as a totally different and independent term. That is not a correct
way to auto-index documents, since a noun is a noun no matter what form it
appears in.
3.1 The Rhyming Algorithm
The Rhyming algorithm is a basic function for performing word stemming. It
is not the core of the word-stemming algorithm, but rather an essential utility
for the stemming task. The Rhyming algorithm is used to decide
whether a certain word is a noun or a verb, whether a noun is in its singular or
plural form, whether a verb is in its past, present, or future tense, whether certain
pronouns are attached to a word, and so on.
When applying the Rhyming algorithm against a certain word, that word is
compared to a special set of rhythms. The set of rhythms changes according to the
module calling the Rhyming algorithm. For example, the set of rhythms used to
decide whether a noun is in its singular or plural form is different from the set used to
determine the attached prefix/suffix pronouns.
Note that all words are rhymed with the derivations of the word ‘‫’فعل‬. The
rhythms of the verb ‘‫ ’فعل‬have been used as a standard in all books that teach the
Arabic grammar. The Rhyming algorithm goes as follows:
Boolean RhymeWords(Rhythm, Word) {
    If Length(Rhythm) <> Length(Word) Then
        Return False;   // Words do not rhyme
    Else {
        Len = Length(Rhythm);   // Or Length(Word)
        // Now compare Rhythm and Word letter by letter
        i = 0;
        WordsRhyme = True;
        While (i < Len) && (WordsRhyme) {
            // Ignore the letters of the word ‘فعل’ while rhyming
            If Not (Rhythm(i) In [‘ف’, ‘ع’, ‘ل’]) Then
                WordsRhyme = (Rhythm(i) == Word(i));
            i = i + 1;
        } // While
        Return WordsRhyme;
    }
}
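The rhyming test can be sketched in a few lines of Python. This is a minimal sketch of the comparison described above, assuming that words are plain strings without diacritics; the names `ROOT_LETTERS` and `rhymes` are chosen for the example.

```python
# Placeholder letters of the model word 'فعل'; positions holding them
# in a rhythm match any letter of the word being tested.
ROOT_LETTERS = {"ف", "ع", "ل"}


def rhymes(rhythm: str, word: str) -> bool:
    """Return True if word has the same shape as rhythm."""
    if len(rhythm) != len(word):
        return False  # words of different lengths do not rhyme
    # Every non-placeholder letter of the rhythm must match exactly.
    return all(r in ROOT_LETTERS or r == w for r, w in zip(rhythm, word))
```

For instance, `rhymes("يفعل", "يكتب")` holds because only the leading ‘ي’ is compared literally, whereas `rhymes("فاعل", "مكتب")` fails on the second letter.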
3.2 Analyzing a Word
Recall that the output of the stem-word extraction module is two sets of
words (see footnote 2). The first is the list of words that are candidate indexes, and the
second is the set that contains the corresponding stem words. Each stem word in
the latter set may have one or more corresponding words in the former set.
Two processes precede the process of word stemming. Each word in the
document has to be read and analyzed separately. First, after reading a word, the
algorithm checks whether the word is to be used or discarded. The check is simple: if
the word is a stop-list term, it is discarded; otherwise, it is used. The second
process is to decide whether the word under consideration is a noun or a verb.
Based on the result, the appropriate stemming techniques are applied.
3.2.1 Stop-list Terms and Phrases
Stop-list terms are excluded from the candidate set of indexes since they do
not contribute to the meaning of the document whatsoever [4]. Stop-list terms are
categorized according to their type. Categorizing stop-list terms helps to a great extent
in determining the type of the following word whether it is a noun or a verb.
Consequently, the appropriate stemming rules may be applied.
Like stop-list terms, stop-list phrases do not contribute to the meaning of the
document as well. However, stop-list phrases do not hold any sign regarding the type
of the following word. Examples of stop-list phrases are many:
‘السلام عليكم و رحمة الله’, ‘تحية طيبة و بعد’, ‘شكرا لتعاونكم’, etc. Stop-list phrases are
detected by comparing the first word of the phrase with a set of words that holds the
starting words of all stop-list phrases. If a matching word is detected, then the rest of
the phrase is compared with the set of stop-list phrases that begin with the same word.
Footnote 2: Programmatically speaking, these sets of words are arrays of words, or strings.
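The two-step detection of stop-list phrases (match the first word, then compare the rest of the phrase) can be sketched as follows. The function names and the dictionary-based index are assumptions made for the example, and the phrase list is limited to two of the examples quoted above.

```python
def build_phrase_index(phrases):
    """Group stop-list phrases by their first word."""
    index = {}
    for phrase in phrases:
        words = phrase.split()
        if words:  # ignore empty entries
            index.setdefault(words[0], []).append(words)
    return index


def drop_stop_phrases(tokens, index):
    """Return tokens with every occurrence of a stop-list phrase removed."""
    kept, i = [], 0
    while i < len(tokens):
        match_len = 0
        # Only phrases that start with the current word are compared.
        for words in index.get(tokens[i], []):
            if tokens[i:i + len(words)] == words:
                match_len = max(match_len, len(words))
        if match_len:
            i += match_len  # skip the whole phrase
        else:
            kept.append(tokens[i])
            i += 1
    return kept


index = build_phrase_index(["تحية طيبة و بعد", "شكرا لتعاونكم"])
tokens = "تحية طيبة و بعد نرسل التقرير شكرا لتعاونكم".split()
remaining = drop_stop_phrases(tokens, index)
```

Indexing by first word keeps the common case cheap: a word that starts no stop-list phrase costs a single dictionary lookup.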
3.2.2 Identifying Verbs and Nouns
Many information retrieval systems perform natural language processing in
order to auto-index certain components (see footnote 3). This strategy relies basically
on acquiring lexical, syntactic, and semantic information from the component. Following
this strategy requires the algorithm to cater for almost all grammatical rules of the
language under consideration [5] [6]. As a result, it is quite difficult to do natural
language processing for languages with sophisticated grammatical rules such as
Arabic. Our algorithm decides whether a word is a noun or a verb by examining two
clues. The first clue is the word preceding the word in question. This is the case
especially if the preceding word is a stop-list term. Some stop-list terms precede
nouns only; others precede verbs only. For example, the stop-list terms that fall under
the category of ‘‫ ’إسم موصول‬precede only verbs. The same thing applies for the
category of ‘‫ ’أدوات النصب‬and ‘‫’أدوات الجزم‬. On the other hand, stop-list terms
categorized as ‘‫ ’أحرف الجر‬precede only nouns.
Footnote 3: A component could be a document, an image, audio information, etc.
The second clue is the rhythm of the word itself. If, for example, a word rhymes
with ‘يفعل’ or ‘إفعل’, then it is a verb. If, on the other hand, it rhymes with
‘فاعل’ or ‘مفعول’, then it is most likely a noun. Other clues might be the attached
pronouns: some pronouns are attached only to verbs, others only to nouns. The
techniques for nouns and the techniques for verbs are applied separately. If either or
both sets of stemming techniques succeed in stemming the word to its three-letter
form, then that form is taken as the stem of the word under consideration. The
algorithm for deciding the type of the word is presented below and can be amended
later for enhancements:
DecideVerbOrNoun (PrecededWord, Word) {
    If PrecededWord belongs to ‘أدوات النصب’ or ‘أدوات الجزم’ Then
        Return Verb;
    Else If PrecededWord is ‘إسم موصول’ Then
        Return Verb;
    Else If Word rhymes with a verb rhythm of ‘فعل’ (e.g. ‘يفعل’, ‘إفعل’) Then
        Return Verb;
    Else If PrecededWord is Verb Then
        Return Noun;   // two verbs cannot follow each other
    Else If PrecededWord is ‘حرف جر’ Then
        Return Noun;
    Else If Word has one of the following prefixes attached: ‘ال’, ‘بال’, ‘كال’, ‘فال’ Then
        Return Noun;
    Else If Word rhymes with ‘فاعل’ or ‘مفعول’ Then
        Return Noun;
    Return Unknown;
}
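The clue-based decision can be sketched in Python as below. This is a simplified sketch, not the full procedure: the category sets are deliberately tiny samples chosen for the example (a real system would load the complete stop-list categories), the rule about a preceding verb is omitted, and all names are hypothetical.

```python
ROOT_LETTERS = {"ف", "ع", "ل"}

# Small illustrative samples of each stop-list category (assumed, not complete).
NASB_JAZM = {"أن", "لن", "كي", "لم"}       # particles that precede verbs
RELATIVE = {"الذي", "التي", "الذين"}       # relative pronouns
JARR = {"من", "إلى", "عن", "على", "في"}    # prepositions, precede nouns
NOUN_PREFIXES = ("ال", "بال", "كال", "فال")
VERB_RHYTHMS = ("يفعل", "إفعل")
NOUN_RHYTHMS = ("فاعل", "مفعول")


def rhymes(rhythm, word):
    """Shape comparison that treats the letters of 'فعل' as placeholders."""
    return len(rhythm) == len(word) and all(
        r in ROOT_LETTERS or r == w for r, w in zip(rhythm, word))


def decide_word_type(preceding, word):
    """Return 'verb', 'noun', or 'unknown' from the two clues."""
    # Clue 1: the category of the preceding stop-list term.
    if preceding in NASB_JAZM or preceding in RELATIVE:
        return "verb"
    if preceding in JARR:
        return "noun"
    # Clue 2: prefixes and the rhythm of the word itself.
    if word.startswith(NOUN_PREFIXES):
        return "noun"
    if any(rhymes(r, word) for r in VERB_RHYTHMS):
        return "verb"
    if any(rhymes(r, word) for r in NOUN_RHYTHMS):
        return "noun"
    return "unknown"
```

Returning "unknown" rather than guessing lets the caller try both the verb and the noun stemming techniques, as the stemming procedure does for undecided words.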
3.3 Extracting Stem Words from a Verb
Recall that the whole idea behind knowing the type of a certain word is to
know what stemming techniques ought to be used. Remember that a successful
algorithm that extracts stem words from verbs is the one that returns any verb to its
original three letter form.
3.3.1 Checking Attached Prefix/Suffix Pronouns
The first stemming technique applied is to check whether a word contains
attached pronouns. Pronouns in the Arabic language come in two forms: attached
pronouns (‘ضمائر متصلة’) and discrete pronouns (‘ضمائر منفصلة’). The discrete pronouns
are considered stop-list terms and are thus ignored by the algorithm. The
attached pronouns, however, are part of the word itself. Hence, the algorithm has to
spot and identify them in order to separate them from the verb. Attached
pronouns come at the beginning of the word, at the end of the word, or on both
sides. The list of all attached pronouns is a finite, well-defined set. The algorithm
loops over the whole set of attached pronouns and performs pattern matching in order
to check for the existence of any attached pronoun. If it matches a pronoun, it
removes it and returns the verb stripped of all prefix and suffix pronouns. The following
tables contain all possible attached pronouns that exist in the Arabic language. The
first table contains a list of four prefix pronouns, whereas the second and third tables
list all suffix pronouns and their possible combinations [7].
‫‪Table 1 – List of prefix pronouns.‬‬
‫األلف – للمتكلم‬
‫الياء – للغائب المذكر‬
‫النون – لجمع المتكلم‬
‫التاء – للغائب المؤنث‬
Table 2 – List of suffix pronouns.
‫ للمثنى الغائب‬- ‫األلف‬
‫الواو و األلف – لجمع المذكر‬
‫للمخاطب المثنى‬
‫للمخاطب جمع الذكر‬
‫التاء – للمؤنث الغائب‬
‫التاء و األلف – للمؤنث المثنى‬
‫للمخاطب جمع المؤنث‬
‫النون و األلف – نون الجمع المتكلم‬
‫النون – نون النسوة‬
‫التاء – تاء مفرد المتكلم‬
In addition to the pronouns listed in the second table, things may get even
more complicated when combinations of these pronouns occur together. Table 3,
which is an extension of Table 2, has three columns: the first is a list of
possible attached pronouns; the second is also a list of attached pronouns, but
ones that can precede the pronouns in the first column, so that the result is a
combination of more than one attached pronoun. The third column contains only two
examples per row out of 7 x 4 x 2 = 56 possible examples [7].
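The removal of suffix pronouns, including combinations of two pronouns, can be sketched as a short loop. The suffix list below is an abbreviated, assumed sample rather than the complete set from the tables, and the guard that keeps at least three letters is a heuristic added for the example.

```python
# Abbreviated sample of attached suffix pronouns (assumed for illustration),
# sorted so that longer suffixes are tried before shorter ones.
SUFFIX_PRONOUNS = sorted(
    ["هما", "كما", "هم", "كم", "هن", "كن", "نا", "ها", "ه", "ك", "ت"],
    key=len, reverse=True)


def strip_suffix_pronouns(verb, max_rounds=2, min_len=3):
    """Strip up to max_rounds attached suffix pronouns from verb."""
    for _ in range(max_rounds):
        for suffix in SUFFIX_PRONOUNS:
            # Never cut the word below min_len letters.
            if verb.endswith(suffix) and len(verb) - len(suffix) >= min_len:
                verb = verb[:-len(suffix)]
                break
        else:
            break  # no suffix matched in this round, nothing more to strip
    return verb
```

With this sample list, `strip_suffix_pronouns("جعلناه")` first removes ‘ه’ and then ‘نا’, leaving ‘جعل’. A root that genuinely ends in one of these letters can still be over-stripped, which is why the result is treated as a candidate stem only.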
Table 3 – List of combinations of suffix pronouns.
‫جعلناه – رأيتك‬
‫فعالهما – علموكما‬
‫أطعمتهم – داعبتكم‬
‫جعلناهن – أحببتكن‬
3.3.2 Checking Verb against Common Five Verbs
The common five verbs are known as ‘‫’األفعال الخمسة‬. These verbs come in a
special form and have special properties: They always come in the present tense, and
they always end with the letter ‘‫’ن‬. If, however, these verbs are preceded with either
‘‫ ’أدوات نصب‬or ‘‫ ’أدوات جزم‬then the ‘‫ ’ن‬letter must be removed [8]. The five verbs are
listed in table 4 with and without the ‘‫ ’ن‬letter at the end of the word.
Table 4 – The common five verbs ( ‫) األفعال الخمسة‬.
‫ الصيغة الثانية‬-‫األفعال الخمســـة‬
‫ الصيغة األولى‬-‫األفعال الخمســـة‬
‫تســتعمل مع جمع المذكر الغائب‬
‫تســتعمل مع جمع المذكر المخاطب‬
‫تســتعمل مع المثنى الغائب‬
‫تســتعمل مع المثنى المخاطب‬
‫تســتعمل مع المؤنث المفرد المخاطب‬
Unlike with the attached pronouns, the algorithm does not perform pattern
matching to detect whether a verb belongs to the five common verbs. Instead, it
rhymes the verb against each of the ten rhythms listed above. If the words rhyme,
then the letters shown in red are the letters to be discarded, and the stem word is
the word composed of the black letters only.
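Rhyming a verb against the ten rhythms and keeping only the root letters can be sketched as follows. The rhythm list is an assumption: it uses the standard forms of the five verbs with and without the final ‘ن’, which is presumed to correspond to Table 4. The function names are hypothetical.

```python
ROOT_LETTERS = {"ف", "ع", "ل"}

# Assumed rhythms of the five verbs: with the final 'ن', then without it.
FIVE_VERB_RHYTHMS = [
    "يفعلون", "تفعلون", "يفعلان", "تفعلان", "تفعلين",
    "يفعلوا", "تفعلوا", "يفعلا", "تفعلا", "تفعلي",
]


def rhymes(rhythm, word):
    return len(rhythm) == len(word) and all(
        r in ROOT_LETTERS or r == w for r, w in zip(rhythm, word))


def stem_five_verbs(word):
    """Return the three-letter stem if word is one of the five verbs, else None."""
    for rhythm in FIVE_VERB_RHYTHMS:
        if rhymes(rhythm, word):
            # Keep the letters aligned with 'ف', 'ع', 'ل'; drop the rest.
            return "".join(w for r, w in zip(rhythm, word) if r in ROOT_LETTERS)
    return None
```

For example, ‘يكتبون’ rhymes with ‘يفعلون’, and dropping the letters aligned with ‘ي’, ‘و’ and ‘ن’ leaves ‘كتب’.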
3.3.3 Checking Verb against the “10-Verb-Additions”
In the Arabic language, every verb consists of only three letters. Verbs
consisting of more than three letters are merely derivations of their original
three-letter verb. The derivations of any verb occur in ten different formats. Three of
these formats are obtained by adding a single letter to the original verb, five of them are
obtained by adding two letters, and the other two formats are obtained by adding three
letters. These ten formats, also called derivations, are known in Arabic
grammar as ‘الزيادات العشرة’ [9]. The derivations, together with an example of each,
are presented in Table 5.
Table 5 – List of the ten derivations ( ‫) الزيادات العشرة‬.
‫أصل الفعل‬
‫أصل الفعل‬
‫إنهزم األعداء‬
‫أضرم النيران‬
ً ‫إقترف خطا ً فادحا‬
‫سرع البحث‬
‫زهر الورد‬
َ ‫إ‬
‫إفع َل‬
‫قاتل األعداء‬
‫إغرورقت عيناه‬
‫تس َبب في وفاته‬
‫تف َعل‬
‫إستخرج النفط‬
‫تعاطف مع صديقه‬
Like the five common verbs, the algorithm detects one of the ten derivations
by rhyming it with all the ten rhythms mentioned in the above table. If the algorithm
detects that a verb is in the form of one of those derivations, it extracts the stem word
by removing the letters colored in red.
3.4 Extracting Stem Words from a Noun
Extracting a stem word from a noun is a more complicated process than
stemming a verb. The difficulty of stemming a noun results from many factors. One
reason is that a noun may appear in the singular, dual, or plural form. Additionally,
each of these three forms differs depending on whether the noun is masculine or feminine.
Furthermore, there are many exceptions for the dual and plural forms. Besides, each
noun may have many derivations that follow no specific format. In summary, stemming
a noun is not as easy as stemming a verb. The process of extracting a stem word from a
noun is not described in this paper for brevity. Interested readers, however, can refer
to [10] for more details.
Footnote 4: The stressing character, known as ‘شدة’, is considered a letter by itself in the Arabic language.
3.5 Algorithm for Arabic Word Extraction
The algorithm contains the WordBelongsToList function, which accepts
two parameters: The first is an Arabic word, and the second is a list of words (array of
strings) that holds a certain set of rhythms. For instance, such set of rhythms could be
the list of the common nouns. The function searches for the corresponding rhythm of
the passed word within the passed list of rhythms. If it does find the corresponding
rhythm, then the latter would be the returned value of the function. Else, the function
returns empty string to indicate that the word does not rhyme with any item in the list.
The algorithm also uses the GetStemWord function. As its name indicates,
GetStemWord gets the stem word from the initial word. It accepts two parameters:
The first parameter is the word itself that has to be stemmed. The other parameter is
the rhythm that was returned by the function WordBelongsToList. According
to the second parameter, the function performs the stemming appropriately, and thus
returning the stem word.
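The interplay of the two helper functions can be sketched as follows. This is a hedged sketch: the Python names `word_belongs_to_list` and `get_stem_word` mirror the functions described above, but the bodies are assumptions, and the rhythm list covers only six of the ten verb derivations, spelled the way the examples of Table 5 are spelled.

```python
ROOT_LETTERS = {"ف", "ع", "ل"}

# Partial, assumed list of derivation rhythms (six of the ten).
TEN_DERIVATIONS_SAMPLE = ["أفعل", "فاعل", "إنفعل", "إفتعل", "تفاعل", "إستفعل"]


def rhymes(rhythm, word):
    return len(rhythm) == len(word) and all(
        r in ROOT_LETTERS or r == w for r, w in zip(rhythm, word))


def word_belongs_to_list(word, rhythms):
    """Return the first rhythm that word rhymes with, or '' if none does."""
    for rhythm in rhythms:
        if rhymes(rhythm, word):
            return rhythm
    return ""


def get_stem_word(word, rhythm):
    """Keep the letters of word aligned with 'ف', 'ع', 'ل' in rhythm."""
    return "".join(w for r, w in zip(rhythm, word) if r in ROOT_LETTERS)


def stem_derivation(word):
    rhythm = word_belongs_to_list(word, TEN_DERIVATIONS_SAMPLE)
    return get_stem_word(word, rhythm) if rhythm else word
```

With this sample list, `stem_derivation("إستخرج")` yields ‘خرج’ and `stem_derivation("قاتل")` yields ‘قتل’. Rhythms that repeat a root letter would need extra handling that this sketch does not attempt.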
The algorithm for Arabic word extraction follows:
String ExtractStemWord (Word) {
// wordType holds the type of the word: noun, verb, or unknown
WordType wordType;
v_StemWord, n_StemWord, temp_StemWord;
// The function WordIsStopListTerm returns false if a word is not a
// stop-list term. Otherwise it returns true, together with the type of
// the word that follows (since stop-list terms usually indicate the
// type of the following word)
If WordIsStopListTerm (Word, WordType) Then return;
If wordType == Unknown then
// Try to guess the type of word
wordType = DecideWordType (Word);
If (wordType == Verb Or wordType == Unknown) Then {
v_StemWord = Word ;
// The following while loop iterates at most twice since a verb
// may have none, one, or a combination of 2 attached pronouns
While ThereAreSuffixPronouns (v_StemWord)
// Now check if the verb is in the future tense
If (v_StemWord starts with letter ‘‫ )’س‬And
(Second letter is in [‘‫’ت‬,’‫’ي‬,’‫’ن‬,’‫ )]’أ‬then
// if length of v_stemword <= 3 then we’re done
If Length(v_StemWord) <= 3 then return v_StemWord;
// Check if the verb is one of common five verbs
wordRhythm = wordBelongsToList(v_StemWord, lstFiveVerbs[]);
If (wordRhythm != “”) Then
v_StemWord = getStemWord (v_StemWord, wordRhythm);
If Length(v_StemWord) <= 3 then return v_StemWord;
Word_is_ten_derivations: // a label that is referenced by a goto stmt
// Check if the verb is one of ten derivations
wordRhythm = wordBelongsToList(v_StemWord, lstTenDerivs[]);
If (wordRhythm != “”) Then
v_StemWord = getStemWord (v_StemWord, wordRhythm);
If Length(v_StemWord) <= 3 then return v_StemWord;
// Check if the verb is one of ten derivations but in the present
// tense
If (FirstChar(v_StemWord) in [‘‫’ت‬,’‫’ي‬,’‫’ن‬,’‫ )]’أ‬Then {
Temp_StemWord = v_StemWord;
v_StemWord = v_StemWord - FirstChar(v_StemWord);
goto Word_is_ten_derivations;
v_StemWord = Temp_StemWord;
} // End of “Word is verb” block
Else If (wordType == noun Or wordType == Unknown) Then {
// If Exist, remove prefixes
n_StemWord = RemovePrefixes (word);
n_StemWord = RemoveSuffixes (n_StemWord);
If n_StemWord is in its regular plural form Then
// get stem word by performing pattern matching
n_StemWord = getStemWord (n_StemWord);
// Word might be in its irregular plural form
WordRhythm = wordBelongsToList (n_StemWord,
If (WordRhythm != “”) Then
n_StemWord = getStemWord (n_StemWord, WordRhythm);
// At this point, the noun is its singular form
// Check if word is one of five common nouns
WordRhythm = wordBelongsToList (n_StemWord,
If (WordRhythm != “”) Then
n_StemWord = getStemWord (n_StemWord, WordRhythm);
Else {
// Check if word belongs to M, T, or Miscellaneous Derivations
FirstChar = getFirstChar (n_StemWord);
Switch (FirstChar) {
Case is = ‘‫’م‬:
WordRhythm = wordBelongsToList
(n_StemWord, lst_M_Derivations[]);
If (WordRhythm != “”) Then
n_StemWord = getStemWord
(n_StemWord, WordRhythm);
Case is = ‘‫’ت‬:
WordRhythm = wordBelongsToList
(n_StemWord, lst_T_Derivations[]);
If (WordRhythm != “”) Then
n_StemWord = getStemWord
(n_StemWord, WordRhythm);
Case Else:
WordRhythm = wordBelongsToList
(n_StemWord, lst_Misc_Derivations[]);
If (WordRhythm != “”) Then
n_StemWord = getStemWord
(n_StemWord, WordRhythm );
} // End of “Word is noun” block
// Supposedly, at this point a word should have undergone the
// process and is in its stem form
// Now return the stem word according to whether the word is
// noun, verb, or unknown
Switch (WordType) {
Case is = noun
Return n_StemWord;
Case is = verb
Return v_StemWord;
Case is = unknown
If (v_StemWord == n_StemWord) Then
//* return either words
return v_StemWord; // Or n_StemWord
If Length(v_StemWord) == 3 Then
return v_StemWord;
Else If Length(n_StemWord) == 3 Then
return n_StemWord;
If (length (v_StemWord) < length (n_StemWord)) Then
return v_StemWord;
return n_StemWord;
} // End of Switch Block
} // End of Function
4. Weight Calculation
After the word stemming stage is over, it is time to calculate the weights of these
words relative to their document [11]. The weight of a term relies basically on three factors.
4.1 Factors Affecting Weight
Three factors affect the significance of a certain word to a document. The first
factor is the count of that term in its containing document [4][12][13][14].
The second factor is the count of the stem word of that word. The third factor, which
is our contribution to the field of auto-indexing, is the spread of the word over
the document. It is based on the assumption that a word concentrated in a specific
part of a document is less likely to reflect the document than a word that is spread
more evenly throughout it.
The weight of a term is directly proportional to its count [4] [12] as
well as to the count of its stem word: as either count increases, the weight should
increase correspondingly. The only thing missing is a factor that
measures how much the term is spread within the document. This factor should
increase as the term spreads evenly over all parts of the document.
Likewise, it should decrease as the term concentrates in a certain part of the document.
4.2 Formula Verification
Based on what has been stated above, the weight of a certain word with
respect to its document becomes:
$\text{weight} = \text{Word\_count} \times \text{Stem\_word\_count} \times \text{Spread\_factor}$
The counts of a word and of its stem word are obtained simply by counting
their occurrences in the document. The calculation of the spread factor, however, is
not as easy.
Consider the following terms, which are used in the weight formulas:
N: count of all terms in the document.
m: count of a certain word in the document.
sm: count of stem words for a certain word in the document.
f: a factor that indicates how much a word is spread within the document
(remember: the more a term is spread, the larger its factor becomes).
Therefore, the weight w of a certain word becomes:
$w = m \times sm \times f$
The next step is to find a formula for that factor such that it increases as the
term spreads over the document, and decreases as the term concentrates in a specific
section. Again, consider the following terms for spread calculation:
d (distance): the distance of a term is simply its position in the document
[15]. In other words, it is one more than the count of words preceding it. For
example, the distance of the very first term in the document is one. Likewise, the
distance of the very last term is N.
ad (average distance): the average of all distances for a stem word:
$ad = \frac{\sum_{i=1}^{sm} d_i}{sm}$, where $d_i$ is the distance of the $i$-th term.
id (ideal distance): the ideal distance between every two consecutive occurrences
of a stem word. Ideally, this distance is the same between every two consecutive
occurrences. If the distance between every two consecutive occurrences equals the
ideal distance, the term is perfectly spread over the document. The ideal distance
for a specific term is:
$id = \frac{N}{sm + 1}$
aid (average ideal distance): the average of all ideal distances for a stem word:
$aid = \frac{\sum_{i=1}^{sm} i \times id}{sm}$
Notice, however, that $\sum_{i=1}^{sm} i = \frac{sm \times (sm + 1)}{2}$ and $id = \frac{N}{sm + 1}$.
Thus, aid becomes $\frac{N}{2}$. Notice that aid is independent of sm. As a result, we
can deduce that all stem words have the same average ideal distance, which
depends only on the number of terms in the document. Unlike all
previously defined terms, aid is an attribute of the document rather than of
the stem word: the distance, ideal distance, and average distance vary
between stem words, whereas aid remains constant.
 g (gap): is the difference between aid and ad.
g = aid – ad
(Notice that g may have a positive or a negative
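The definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in this work: the function name `spread_terms` and its argument names are hypothetical, and the positions in the usage lines are made-up values.

```python
from statistics import mean


def spread_terms(positions, n_terms):
    """Return (ad, id, aid, g) for one stem word.

    positions: 1-based positions (distances) of the stem word's occurrences.
    n_terms:   N, the count of all terms in the document.
    Both names are hypothetical; they do not come from the original system.
    """
    sm = len(positions)                  # number of occurrences of the stem word
    ad = mean(positions)                 # average distance
    ideal = n_terms / (sm + 1)           # ideal distance id = N / (sm + 1)
    # average of the ideal positions id, 2*id, ..., sm*id (always N / 2)
    aid = mean(i * ideal for i in range(1, sm + 1))
    gap = aid - ad                       # g = aid - ad, may be negative
    return ad, ideal, aid, gap


# An evenly spread term versus one packed into the opening of the document.
print(spread_terms([20, 40, 60, 80, 100], 120))
print(spread_terms([1, 2, 3, 4, 5], 120))
```

In the first call the occurrences sit exactly on the ideal positions, so the gap is zero; in the second call the average distance is far from N/2 and the gap is large.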
4.3 Weight Calculation
Notice that a decrease in g indicates that the term is closer to being perfectly spread over the
document. Hence, this should affect the weight of that term positively. The converse
also holds. An increase in g indicates that the term is concentrated in a specific
part (or parts) of the document; perhaps only in the first paragraph, the last paragraph, or
even in one sentence. This means that the term weakly reflects the content
of the document. Thus, as g increases, the weight of the term should decrease accordingly.
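Performing the weight calculation requires f to be a function of g, written f(g),
such that:
(1) $\alpha < f(g) \le \beta$; $\alpha, \beta \in \mathbb{R}$; $\alpha < \beta$; $\alpha \ge 1$; $\beta \le N$
(2) $\lim_{g \to 0} f(g) = \beta$ (the maximum value) and $\lim_{g \to \infty} f(g) = \alpha$ (the minimum value)
Assumption: f(g) may be defined as:
$$ f(g) = \alpha + \frac{\beta - \alpha}{K^{g}} $$
where $K \in \mathbb{R}$ and $K > 1$. Note that K is a constant. The goal of the function is to obtain a
factor that is a function of g and lies between $\alpha$ and $\beta$.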
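The behavior required of f(g) can be checked numerically with a short Python sketch. It assumes the form $f(g) = \alpha + (\beta - \alpha)/K^{g}$ for $g \ge 0$; the function name `spread_factor` and the sample values of alpha, beta, and k are illustrative assumptions only.

```python
def spread_factor(g, alpha, beta, k):
    """f(g) = alpha + (beta - alpha) / k**g for g >= 0, alpha < beta, k > 1."""
    if not (alpha < beta and k > 1):
        raise ValueError("require alpha < beta and k > 1")
    if g < 0:
        raise ValueError("this sketch covers g >= 0 only")
    return alpha + (beta - alpha) / k ** g


# Illustrative values: the factor starts at beta for g = 0 and
# approaches alpha as g grows.
for g in (0, 1, 5, 20):
    print(g, spread_factor(g, alpha=1.0, beta=120.0, k=2.0))
```

The printed values decrease monotonically from beta toward alpha, which is the behavior that conditions (1) and (2) ask for.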
5. Index Selection
At this stage, we are ready to select the indexes from the candidate terms.
What qualifies a term to be an index is basically its weight. The index selection
mechanism varies according to the task that the overall auto-indexing system is
assigned to do. For example, an auto-indexing system that is part of a general
newspaper archiving system may behave differently from one that is
part of an Internet search engine. This difference in behavior between
one auto-indexing system and another is embodied in the very last stage of the
system, namely the “index selection” stage.
Index selection is the final stage that a document undergoes in auto-indexing.
One interesting mechanism is the one used to create an index for a book [4].
The index of a book, usually found at the end of the book, is composed of
keywords that are alphabetically sorted, grouped by their first letter, and listed
together with the page(s) where they occur within the book. One slight
modification to the whole system is to add a page number field and fill it from the
very first stage, upon reading the whole document and putting it into an array of
strings. Once the words within a document are stemmed and the weights are
calculated, the user may set a threshold on the weight to select the indexes. For
example, the user may decide that not all words that occur in a book need to be
mentioned in the index; only terms with a weight greater than or equal to a certain
threshold are selected. Accordingly, an appropriate index for the book in question
is obtained.
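The threshold-based book index described above can be sketched as follows. This is a minimal illustration under assumptions: the function name `book_index`, the two input dictionaries, and the sample English terms, pages, and weights are all hypothetical and are not taken from the system.

```python
from collections import defaultdict


def book_index(term_pages, weights, threshold):
    """Build a book-style index from weighted terms.

    term_pages: term -> set of page numbers where the term occurs.
    weights:    term -> weight of the term in the book.
    Terms with weight >= threshold are kept, sorted alphabetically,
    and grouped by their first letter.
    """
    index = defaultdict(list)
    for term in sorted(term_pages):
        if weights.get(term, 0.0) >= threshold:
            index[term[0]].append((term, sorted(term_pages[term])))
    return dict(index)


# Hypothetical data for illustration only.
pages = {"lava": {12, 3}, "basalt": {7}, "granite": {9, 2}, "boat": {1}}
weights = {"lava": 40.0, "basalt": 55.0, "granite": 10.0, "boat": 60.0}
print(book_index(pages, weights, threshold=40.0))
```

Raising or lowering the threshold is the only knob here; a real system would presumably tune it per book.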
Another index selection technique is the one used by some search
engines, such as Internet search engines. In such information retrieval systems, all
terms that were extracted from the document are selected as indexes. The
whole system works as follows: the document under consideration is assigned a
unique ID, which is stored in the system’s database along with the
document’s name and its physical path (i.e., where the document physically exists
on the hard disk drive); alternatively, some systems may save the whole document in
the database rather than the document’s path. Additionally, the system assigns each new word a unique ID
and inserts it into its database as well. All extracted terms, whether
new or already present in the database, are then assigned an entry in which the term ID
is stored with the document ID along with the term’s weight. In other words, the
system stores the corresponding weight of each term for each
document in which it occurs [15]. The idea is illustrated in Figure 4.
Figure 4. Organization of terms versus documents.
Notice in the figure above how the database is structured. Obviously, each
document contains many terms. On the other hand, each term may appear in one or
more documents. Hence, the relationship between documents and terms is many to
many. However, the weight of a single term varies from one document to another.
That is why it is put in the middle table. Some researchers claim that in order to make
the best use of indexes to enhance retrieval time, it is best to put the documents versus
terms in a matrix form, and the weights would reside within the matrix to signify the
relevance of each term to a document [8].
Indexes are placed in the middle table to link documents with terms. An end user
usually types a few words to search for certain documents. The information retrieval
system reads these words and fetches their IDs from the database. Then, it uses these
term IDs to look up the corresponding document IDs in the index table labeled
“Doc-Term” in Figure 4. Once the document IDs are known, the system
fetches the documents’ names, together with their paths, from the document table.
The weight signifies how relevant a document is to the user’s request. In other
words, the information retrieval system should respond to the user’s query by listing
the relevant documents sorted in descending order of weight [12] [16].
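The three-table organization and the lookup steps described above can be sketched with Python's built-in `sqlite3` module. This is only an illustration under assumptions: the table and column names, the helper names `make_db` and `search`, and the choice to sum the weights of all matching query words per document are hypothetical and are not taken from any particular system.

```python
import sqlite3


def make_db():
    """Create an in-memory database with document, term, and doc_term tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE document (doc_id INTEGER PRIMARY KEY, name TEXT, path TEXT);
        CREATE TABLE term (term_id INTEGER PRIMARY KEY, word TEXT UNIQUE);
        CREATE TABLE doc_term (
            doc_id  INTEGER REFERENCES document(doc_id),
            term_id INTEGER REFERENCES term(term_id),
            weight  REAL,
            PRIMARY KEY (doc_id, term_id)
        );
    """)
    return conn


def search(conn, words):
    """Return (name, path, score) rows, highest score first.

    The score is the sum of the stored weights of the query words
    found in each document (an assumption made for this sketch).
    """
    words = list(words)
    if not words:
        return []
    placeholders = ",".join("?" for _ in words)
    sql = f"""
        SELECT d.name, d.path, SUM(dt.weight) AS score
        FROM term AS t
        JOIN doc_term AS dt ON dt.term_id = t.term_id
        JOIN document AS d ON d.doc_id = dt.doc_id
        WHERE t.word IN ({placeholders})
        GROUP BY d.doc_id
        ORDER BY score DESC
    """
    return conn.execute(sql, words).fetchall()
```

The composite primary key on `doc_term` reflects the many-to-many relationship: one row, and hence one weight, per (document, term) pair.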
The index selection mechanism that is used by the auto-indexing system in
this work is different from both techniques discussed earlier. The index
selection stage basically lists all keywords for the documenter. It is like preparing
the ingredients for the chef in a kitchen. The auto-indexing system reads the whole
document, performs word stemming, calculates term weights, and finally produces a
list of keywords sorted in descending order of weight. In other words, the whole
auto-indexing system is there to carry out what a documenter has to do manually while
preparing subject headings for a document.
6. Conclusion
The idea of auto-indexing varies in difficulty between one language and
another since the module that reduces words to their original stem depends entirely on
the grammar of the language under consideration. Our contribution to this field involves
presenting a model for auto-indexing that is composed of four interdependent layers.
This model provides flexibility for the overall system since the system is not
bound to a specific language. In this report, the Arabic language was chosen;
hence, the overall system performs auto-indexing on Arabic documents. We also
presented a new criterion to consider when calculating the weight, or relevance, of a
certain term with respect to its container document. This new dimension is
the level of spreading of a term in the document. It is based on the assumption
that the more a word is spread in the document, the more likely it is to signify the
document. This assumption is mathematically proven in the report. We have also
presented new ideas in stem word extraction for Arabic words. This step is vital for Arabic, yet
of little significance in languages that do not have sophisticated grammatical rules.
Future work involves enhancing the stem word extraction algorithm. The
Arabic language is rich in all aspects: the grammatical rules are numerous,
with many exceptions to some of these rules; the Arabic thesaurus
is huge; and the letters within each word are attached to each other, so that
each letter in Arabic may be written in
several forms according to its location in a word. For example, the letter ‘ت’ is written
differently depending on whether it appears at the beginning, middle, or end of a word.
Appendix A - Verification of the weight calculation formula.
Consider the following:
$$ \lim_{g \to 0} f(g) = \alpha + \frac{\beta - \alpha}{K^{0}} = \alpha + \beta - \alpha = \beta $$
$$ \lim_{g \to \infty} f(g) = \alpha + 0 = \alpha $$
Therefore, the formula is correct. Figure A.1 shows a plot of the values of f(g)
as g varies.
Figure A.1 Graph of f(g).
However, the value of g could be positive or negative. Notice that the factor
f(g) should have the same value whether g is a positive value x or a negative value -x.
This is because the average distance ad could be aid - x or aid + x, and the size of the gap
would still be x. Thus, the formula becomes:
$$ f(g) = \alpha + \frac{\beta - \alpha}{K^{|g|}} $$
and the graph becomes as follows:
Figure A.2. Enhanced graph of f(g).
As a result, the weight of the term becomes:
$$ w = m \times \mathit{sm} \times \left( \alpha + \frac{\beta - \alpha}{K^{|g|}} \right) $$
Practically speaking, in the implementation of this formula, $\alpha$ and $\beta$ have been
assigned the values 1 and N, respectively. In other words, $\alpha$ and $\beta$ are assigned their
minimum and maximum values, respectively. Additionally, K is assigned a value of 2.
It was noticed that the larger K is chosen, the less
significant the value produced by the weight formula becomes. The final shape of the formula would be:
$$ w = m \times \mathit{sm} \times \left( 1 + \frac{N - 1}{2^{|g|}} \right) $$
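The final formula can be sketched in Python as shown below. The sketch assumes the settings stated above (the lower bound 1, the upper bound N, and K = 2 by default) and uses the absolute value of the gap; the function name `term_weight` and the sample numbers in the usage lines are hypothetical.

```python
def term_weight(m, sm, n_terms, gap, k=2.0):
    """w = m * sm * (1 + (N - 1) / k**|g|).

    m:       count of the word in the document.
    sm:      count of stem words for that word.
    n_terms: N, the count of all terms in the document.
    gap:     g = aid - ad; only its absolute value matters.
    """
    if k <= 1:
        raise ValueError("k must be greater than 1")
    spread = 1 + (n_terms - 1) / k ** abs(gap)
    return m * sm * spread


# Illustrative values: a zero gap gives the largest weight, and the sign
# of the gap does not change the result.
print(term_weight(2, 4, 100, 0))    # 800.0
print(term_weight(2, 4, 100, 3))    # 107.0
print(term_weight(2, 4, 100, -3))   # 107.0
```

With a zero gap the spread factor reaches N, so the weight is m × sm × N; as |g| grows the factor falls toward 1 and the weight toward m × sm.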
Appendix B – An example of weight calculation
Suppose that a document contains 120 words. Suppose also that the count of
some word in that document is 3, and the number of stem words for that word is 5.
Thus, m = 3 and sm = 5. Applying the above formulas:
$$ \mathit{id} = \frac{N}{\mathit{sm} + 1} = \frac{120}{5 + 1} = 20 $$
This means that the word W would be perfectly spread within the document if
W occurs once every 20 terms. Thus, 20 is the ideal distance between the occurrences
of W. aid, on the other hand, is the average of the distances of these ideally
spaced occurrences (see Figure B.1).
$$ \mathit{aid} = \frac{N}{2} = \frac{120}{2} = 60 $$
Figure B.1 The scale.
The distance of a term is defined to be the position at which the term
occurs with respect to the beginning of the document. Now suppose that the first term
occurs at location 17, the second at 34, etc. The distances for the five terms are given
as follows:
$$ d_1 = 17,\quad d_2 = 34,\quad d_3 = 51,\quad d_4 = 68,\quad d_5 = 102 $$
(Note that these distances are randomly chosen to illustrate the example.)
$$ \mathit{ad} = \frac{\sum_{i=1}^{\mathit{sm}} d_i}{\mathit{sm}} = \frac{17 + 34 + 51 + 68 + 102}{5} = \frac{272}{5} = 54.4 $$
which is rounded up to 55.
With ad = 55, consider Figure 6, which is an extension of Figure B.1.
Figure 6. Detailed scale.
In this example, there is a difference of 5 between aid and ad. Accordingly, the gap g is:
$$ g = \mathit{aid} - \mathit{ad} = 60 - 55 = 5 $$
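Therefore, the weight for the term in consideration is:
$$ w = m \times \mathit{sm} \times \left( 1 + \frac{N - 1}{2^{|g|}} \right) = 3 \times 5 \times \left( 1 + \frac{120 - 1}{2^{5}} \right) = 15 \times 4.71875 \approx 70.781 $$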
Appendix C – An example of auto-indexing
In this appendix, we present an example to illustrate the operation of our
proposed techniques on the document shown in Figure C.1.
‫أقدم الصخور‬
‫يعتقد البعض أن الصخور أشياء صلبة يصعب كسرها ‪ .‬نعم ‪ ,‬إن بعض الصخور كذلك ‪ ،‬لكن ال‬
‫كلها ‪ .‬إن الرمل و الحصى و الطباشير و الصلصال صخور في نظر العالم الجيولوجي ‪ .‬ولو أنك‬
‫تناولت قبضة رمل على الشاطئ و تركتها تنساب بين أصابعك ‪ ,‬فإنك تكون قد تناولت صخورا‬
‫شبيهة بالحجر الرملي الصلب الذي يصلح للبناء ‪ .‬و هنالك أنواع عديدة من الصخور تكونت بطرق‬
‫مختلفة ‪ .‬و يقسمها علماء الجيولوجيا إلى ثالث مجموعات ‪ .‬الصخور النارية ‪ ,‬و الصخور الرسوبية‬
‫‪ ,‬و الصخور المتحولة ‪ .‬و يجب علينا أن نحفظ هذه األسماء ألنها األسماء التي يستعملها‬
‫الجيولوجيون دائما ‪ .‬و صخور المجموعة الثالثة أو الصخور المتحولة هي الصخور التي بدأت‬
‫نارية أو رسوبية ثم تغيرت بفعل الحرارة الشديدة أو الضغط أو اإللتواء تغيرا كبيرا حتى صارت‬
‫شيئا آخر ‪ .‬و على سبيل المثال نجد أن بعض الحجارة الرملية تغيرت إلى حجارة أشد صالبة هي‬
‫المرو أو الكوارتز ‪ ,‬و أن حجر الجير تحول إلى رخام ‪ ,‬و أن صخر سهل التفتت يدعى الطين‬
‫الصفحي قد تحول إلى إردواز أو الحجر المشقق ‪ ،‬وهو صخر قاس سريع اإلنكسار‪.‬‬
‫حين نشأت األرض قبل أربع ماليين سنة ‪ ,‬كانت مكونة من مادة ذائبة شديدة الحرارة ‪ ,‬ثم أخذت‬
‫تبرد ‪ .‬و تحولت المادة الخارجة بصورة تدريجية إلى صخر صلب شديد الشبه بالصخور النارية‬
‫التي نراها في ا لوقت الحاضر ‪ .‬و قد كانت آنذاك هي النوع الوحيد من الصخور ‪ .‬و في خالل‬
‫ماليين السنين كانت المادة الذائبة التي على سطح األرض تبرد بسرعة معقولة و هي التي ندعوها‬
‫بالصخور البركانية ألنها تشبه الصخور التي تقذفها البراكين الحية في الوقت الحاضر ‪ .‬لكن‬
‫الصخور الغائرة في األرض كانت تبرد ببطئ أشد ‪ ,‬و تعرف بالصخور الجوفية أو البلوتونية نسبة‬
‫إلى بلوتو ‪ .‬و يمكن لعلماء الجيولوجيا أن يميزوا بسهولة بين الصخور البركانية و الصخور‬
‫الجوفية و ذلك من طريقة تكوينها ‪ .‬فالصخور التي بردت بسرعة معقولة تتكون من بلورات‬
‫صغيرة ذات شكل منتظم ‪ ,‬لكن تلك التي بردت ببطئ تتكون من بلورات أكبر حجما ‪.‬‬
‫لقد عرف علماء الجيولوجيا كيف كان الصخور البركانية تبرد في الماضي بمراقبة ما يحدث أحيانا‬
‫عندما ينفجر بركان في الوقت الحاضر ‪ .‬فقد يقذف من فوهته جداول من صخور مذابة تدعى البة‪.‬‬
‫و إذا بردت الالبة بسرعة جرى ذلك بدون أن تتكون فيها بلورات ‪ ,‬و ظهرت قاسية و زجاجية ‪ .‬أما‬
‫إذا بردت بسرعة أقل ‪ ,‬فإن بلورات صغيرة تتكون فيها ‪ .‬و قد تخرج الالبة من السطح أحيانا مليئة‬
‫بفقاقيع من الغاز ثم تبرد و تتكون منها صخر إسفنجي الشكل يدعى الخفان و هو الحجر الذي‬
‫يستعمل إلزالة األوساخ العالقة باأليدي‪.‬‬
‫و أكثر الصخور انتشارا هي البازلت ‪ .‬و تتكون من بلورات صغيرة ‪ .‬و قد تكون القسم األكبر منها‬
‫قبل ماليين السنين عندما تجمعت الصخور المذابة في شقوق طويلة بقشرة األرض و اندفعت‬
‫انهارا من الالبة بلغت عرضها كيلومترات عديدة في بعض األحيان ‪ .‬و يمكننا أن نرى ذلك في‬
‫الوقت الحاضر في أمكنة كثيرة ‪ .‬ففي الساحل الشمالي من إيرلندا ‪ ,‬بردت الصخور المذابة بسرعة‬
‫شديدة فتشققت و كونت أعمدة هائلة ألكثرها ستة جوانب‪ .‬و بدت و كأنها درجات يستخدمها مارد‬
‫للنزول إلى البحر ‪ ,‬أو كأنها أنابيب أرغن ضخم على قمم الصخور ‪ .‬و في ستافا و هي جزيرة على‬
‫الساحل الغربي من إسكتلندة فتحت مغاور في األعمدة ‪.‬‬
‫أما الصخر الجوفي البلوتوني األكثر إنتشارا فهو الغرانيت ‪ .‬و مع أن هذا الصخر قد تكون في‬
‫البداية ببطئ على عمق تحت األرض ‪ ,‬فإنه موجود في الوقت الحاضر على سطح األرض في‬
‫أمكنة كثيرة على السواح ل الصخرية في الشمال الشرقي من إسكتلندة ‪ .‬و قد حدث أحيانا أن اندفع‬
‫هذا الصخر إلى سطح األرض بفعل تحركات كبيرة في األرض ‪ ,‬أو لعل الصخور التي كانت‬
‫تغطيه منذ ماليين السنين قد تفتت بصورة تدريجية ‪ .‬و الغرانيت صخر قاس أبيض اللون أو‬
‫رمادي أو وردي فيه بقع سوداء المعة مما يجعله بريق ‪ .‬و كثيرا ما يكون في الغرانيت قطع‬
‫صغيرة شبيهة بالحجارة الكريمة مثل العقيق أو التوباز‪.‬‬
Figure C.1 Document considered for indexing.
After reading the whole document, the algorithm ignores stop list terms and
stems the candidate words. The output is two arrays shown in Figure C.2. Note that
only the top 50 terms are shown due to limited space.
Figure C.2 Output of the proposed technique (column heading: أصل الكلمة, "word stem").
Next, we calculate the weights and all the terms that are needed for weight
calculation. These are given in Figure C.3.
Figure C.3 Terms and weights (column headings: أصل الكلمة, "word stem"; معدل مسافة الكلمة, "average word distance"; قيمة الفراغ, "gap value"; قيمة الإنتشار, "spread factor").
The list of the top 35 indexes is given in Figure C.4.
Figure C.4 Indexes selected and their corresponding weight.
Recall that these numbers are calculated according to the formulas given
in Section 4.