CHAPTER 2
THEORETICAL FOUNDATION

2.1 Indonesian Language

Indonesian is an important language for the Southeast Asian region. According to Sneddon (2003, p.225), even though Indonesian is not used worldwide, it is the language of the fourth most populous nation in the world and is also used in some of Indonesia's neighboring countries. It is therefore worthwhile to develop a lemmatization method for this language.

2.1.1 Nature of Indonesian Language

According to Tucker (2009, p.75), most languages can be morphologically classified into three categories, based on the nature of the language and listed here in ascending order of development. The first category is the monosyllabic, isolating, or radical languages, such as Chinese. Languages in this category do not modify words with suffixes, prefixes, or other affixes; their words are simple roots which stand by themselves independently. The second category is the agglutinating languages, such as Turkish and Japanese. Agglutinating means that the morphemes of a word can be attached and detached at will: affixes may be added, but they do not change the form of the word; they are simply glued on (agglutinated). Not only affixes but even whole words can be glued to each other. Tucker (2009, p.78) gives an example from Greenlandic: aulisariartorasuarpok means 'he hastens to go fishing'. A sentence can thus be constructed from a single word consisting of several agglutinated elements, which is only possible in a highly agglutinative language. The last category is the inflexional, organic, or amalgamating languages, such as the Semitic languages and most European languages. In this category, words may change their form to show a more specific function or role in a sentence, such as the irregular verbs and past participles of English. Indonesian is in transition between the agglutinating and the inflexional state.
Indonesian words cannot be glued together as freely as in Greenlandic, but they can still take affixes. Whereas in a highly agglutinative language added affixes do not affect the form of the word, in Indonesian some affixes change the form of the word, which is a characteristic of an inflexional language. Indonesian is therefore both inflexional and agglutinative, without being an extreme case of either stage. Tucker (2009, p.89) likewise notes that most Indo-European languages in their modern shape are semi-inflexional in character.

2.1.2 Importance of Indonesian Language in the World

Indonesia has faced many social and political problems and developments since 1997. These upheavals have attracted the interest of many people, including academics in fields such as history, politics and sociology, journalists, and those with an interest in international affairs (Sneddon, 2003, p.1). As the national language, Indonesian is intimately linked with the nation and in some unique ways reflects it, which is also of interest to many people internationally. Although Indonesian is natively used only in Indonesia, it is nevertheless among the languages with the most speakers and users in the world (Sneddon, 2003, p.1), because Indonesia is the fourth most populous country in the world. The language matters to the world not only because of the number of its users, but also because of aspects bound up with the nation and its language that are globally significant: Indonesia has the largest Muslim population in the world (Sneddon, 2003, p.2), and its political history is tied to the Netherlands and other countries. These aspects make Indonesian of considerable importance to the world.
2.2 Algorithm

According to Cormen (2009, p.3), an algorithm is "a well-defined computational procedure that takes some value, or set of values, as input and produces some value, or set of values, as output". An algorithm is thus a sequence of computational steps that transform the input data into the desired output data. An algorithm can also be viewed as a tool for solving a well-specified computational problem. In general, the statement of the problem specifies the desired relationship between input and output, and the algorithm describes a specific computational procedure for achieving that relationship.

An example is sorting a sequence of numbers from smallest to largest. This problem arises frequently in the real world and provides fertile ground for introducing a variety of standard design techniques and analysis tools. Formally, the sorting problem can be defined as follows:

Input: a sequence of n numbers {a1, a2, ..., an}.
Output: a permutation (reordering) {a'1, a'2, ..., a'n} of the input such that a'1 ≤ a'2 ≤ ... ≤ a'n.

For example, given the input sequence {31, 41, 59, 26, 41, 58}, a sorting algorithm returns the output sequence {26, 31, 41, 41, 58, 59}. Such an input sequence is called an instance of the sorting problem. In general, an instance of a problem consists of the input (satisfying whatever constraints are imposed in the problem statement) needed to compute a solution to the problem.

An algorithm is correct if, for every input instance, it always gives the correct output. We can then say that a correct algorithm solves the given computational problem. An incorrect algorithm may give no answer at all, or may give an answer that is not appropriate for some, or all, of the inputs.
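The sorting instance above can be illustrated with a minimal sketch. Insertion sort is used here only as one standard approach; the function name is ours, not taken from the cited text.

```python
def insertion_sort(seq):
    """Sort a sequence of numbers in non-decreasing order.

    A direct illustration of the sorting problem defined above: the
    output is a permutation a'1 <= a'2 <= ... <= a'n of the input.
    """
    result = list(seq)  # work on a copy; the input instance is untouched
    for i in range(1, len(result)):
        key = result[i]
        j = i - 1
        # shift elements larger than key one position to the right
        while j >= 0 and result[j] > key:
            result[j + 1] = result[j]
            j -= 1
        result[j + 1] = key
    return result

# The instance from the text:
print(insertion_sort([31, 41, 59, 26, 41, 58]))  # [26, 31, 41, 41, 58, 59]
```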
But keep in mind that an incorrect algorithm can sometimes be useful, provided its error rate can be controlled. In practice, an incorrect algorithm may still be used when its performance outweighs its error rate.

2.2.1 Algorithm Measurement

According to Atallah and Blanton (2010, p. 1-1), it is convenient to measure an algorithm using Big-O notation. Sharod (2010, p. 15) adds that "big-O notation is used to describe the theoretical performance of an algorithm". Big-O notation is usually used to measure the time or memory consumed by an algorithm; its purpose is to give a representation of an algorithm's performance that can be compared against others. Table 2.1 lists the orders of complexity in common use.

Table 2.1 Order of complexity (Reingold, 2010, pp. 1-2)

Rate of Growth | Description | Examples
O(1) | The time required is constant, independent of problem size | Expected time for hash searching
O(log log n) | Very slow growth of time required | Expected time of interpolation search of n elements
O(log n) | Logarithmic growth of time required: doubling the problem size increases the time by only a constant amount | Computing x^n; binary search of an array of n elements
O(n) | Time grows linearly with problem size: doubling the problem size doubles the time required | Adding/subtracting n-digit numbers; linear search of an n-element array
O(n log n) | Time grows worse than linearly, but not much worse: doubling the problem size somewhat more than doubles the time required | Merge sort or heap sort of n elements; lower bound on comparison-based sorting of n elements
O(n^2) | Time grows quadratically: doubling the problem size quadruples the time required | Simple-minded sorting algorithms
O(n^3) | Time grows cubically: doubling the problem size results in an eightfold increase in the time required | Ordinary matrix multiplication
O(c^n) | Time grows exponentially: increasing the problem size by 1 results in a c-fold increase in the time required; doubling the problem size squares the time required | Some traveling salesman problem algorithms based on exhaustive search

2.3 Artificial Intelligence

According to Poole and Mackworth (2010, p.3), artificial intelligence, or AI, is "a field that studies the synthesis and analysis of computational agents that act intelligently". An agent is something that acts in an environment; it does something. Examples of agents include worms, dogs, thermostats, airplanes, robots, humans, companies, and countries. An agent acts intelligently when:
a. what it does is appropriate for its circumstances and its goals;
b. it is flexible to changing environments and changing goals;
c. it learns from experience; and
d. it makes appropriate choices given its perceptual and computational limitations. An agent typically cannot observe the state of the world directly; it has only finite memory and does not have unlimited time to act.

A computational agent is an agent whose decisions about its actions can be explained in terms of computation. That is, the decisions can be broken down into primitive operations that can be implemented in a physical device. This computation can take many forms: in humans it is carried out in "wetware", in computers in "hardware". Although there are some agents that are arguably not computational, such as the wind and rain eroding a landscape, it is an open question whether all intelligent agents are computational.

2.3.1 History of Artificial Intelligence

According to Poole and Mackworth (2010, p.6), people started to write about the nature of thought and reason about 400 years ago. Hobbes (1588–1679), who has been described by Haugeland as the "Grandfather of AI", espoused the position that thinking was symbolic reasoning, like talking out loud or working out an answer with pen and paper.
The idea of symbolic reasoning was further developed by Descartes (1596–1650), Pascal (1623–1662), Spinoza (1632–1677), Leibniz (1646–1716), and others who were pioneers in the philosophy of mind. The idea of symbolic operations became more concrete with the development of computers. The first general-purpose computer to be designed (though not built until 1991, at the Science Museum of London) was the Analytical Engine of Babbage (1792–1871). In the early part of the 20th century, much work was done on understanding computation. Several models of computation were proposed, including the Turing machine of Alan Turing (1912–1954), a theoretical machine that writes symbols on an infinitely long tape, and the lambda calculus of Church (1903–1995), a mathematical formalism for rewriting formulas. It can be shown that these very different formalisms are equivalent, in that any function computable by one is computable by the others. This leads to the Church–Turing thesis: "Any effectively computable function can be carried out on a Turing machine (and so also in the lambda calculus or any of the other equivalent formalisms)."

Once real computers were built, some of their first applications were AI programs. For example, Samuel built a checkers program in 1952 and implemented a program that learned to play checkers in the late 1950s. In 1956, Newell and Simon built the Logic Theorist, a program that discovers proofs in propositional logic. These early programs concentrated on learning and search as the foundations of the field. It became apparent early on that one of the main problems was how to represent the knowledge needed to solve a problem: before learning, an agent must have an appropriate target language for the learned knowledge. During the 1960s and 1970s, successful natural language understanding systems were built for limited domains.
For example, the STUDENT program of Daniel Bobrow (1967) could solve high school algebra problems expressed in natural language. Winograd's (1972) SHRDLU system could, using restricted natural language, discuss and carry out tasks in a simulated blocks world. CHAT-80, developed by Warren and Pereira in 1982, could answer geographical questions put to it in natural language. During the 1970s and 1980s there was a large body of work on expert systems, where the aim was to capture the knowledge of an expert in some domain so that a computer could carry out expert tasks. For example, DENDRAL, developed by Buchanan and Feigenbaum between 1965 and 1983 in the field of organic chemistry, proposed plausible structures for new organic compounds. In 1984, Buchanan and Shortliffe developed MYCIN, which diagnosed infectious diseases of the blood, prescribed antimicrobial therapy, and explained its reasoning. The 1970s and 1980s were also the period when AI reasoning became widespread in languages such as Prolog, developed in 1972 by Colmerauer and Roussel.

2.3.2 Artificial Intelligence Application Areas

According to Russell & Norvig (2010, p.28), artificial intelligence application areas include, but are not limited to:
a. Robotics. Robots are mechanical devices that can act by themselves and take the place of human activity; they can reduce the time and effort required of humans.
b. Speech Recognition. Speech recognition is a computer's ability to analyze human voices and interpret them into text, also known as "speech to text".
c. Autonomous Planning and Scheduling. A computer's ability to carry out automated planning and scheduling.
d. Game Playing. Computers can be programmed to behave like a human player in games, allowing people to play games that need human interaction without another human.
e. Spam Fighting. Spam fighting is a computer's ability to automatically delete messages classified as spam.
f.
Machine Translation. Machine translation is a computer's ability to translate from one language to another.

2.3.3 Natural Language Processing

According to Pustejovsky and Stubbs (2012, p.4), natural language processing (NLP) is a field of computer science and engineering that has developed from the study of language and computational linguistics within the field of artificial intelligence. The goal of NLP is to design and build applications that facilitate human interaction with machines and other devices through the use of natural language. Some of the major areas of NLP include:
a. Question Answering Systems (QAS). Question answering is a computer's ability to answer questions asked by a human. Rather than typing keywords into a search browser window, we could simply ask in our own natural language, whether it is English, Mandarin, or Indonesian.
b. Summarization. This area includes applications that can take a collection of documents or emails and produce a coherent summary of their content; such applications can also turn the summary into slide presentations.
c. Machine Translation. This was the first major area of research and engineering in the field. It aims to create applications that understand human languages and translate them into another language. Examples include Google Translate, which keeps getting better, and BabelFish, which translates in real time.
d. Speech Recognition. This is one of the most difficult problems in NLP. There has been great progress in building models that can be used on phones or computers to recognize spoken language utterances that are questions and commands. Unfortunately, while these Automatic Speech Recognition (ASR) systems are ubiquitous, they work best in narrowly defined domains and do not allow the speaker to stray from the expected scripted input ("Please say or type your card number now").
e.
Document classification. This is one of the most successful areas of NLP, where the task is to identify the category in which a document should be placed. This has proved enormously useful for applications such as spam filtering, news article classification, and movie reviews, among others. One reason for its big impact is the relative simplicity of the learning models needed to train the classification algorithms.

The development of natural language processing opens the possibility of natural language interfaces to knowledge bases and of natural language translation. According to Poole and Mackworth (2010, p.520), there are three major aspects of any natural language understanding theory:
1. Syntax. The syntax describes the form of the language. It is usually specified by a grammar. Natural language is much more complicated than the formal languages used for the artificial languages of logics and computer programs.
2. Semantics. The semantics provides the meaning of the utterances or sentences of the language. Although general semantic theories exist, when we build a natural language understanding system for a particular application, we try to use the simplest representation we can. For example, in the development that follows, there is a fixed mapping between words and concepts in the knowledge base, which is inappropriate for many domains but simplifies development.
3. Pragmatics. The pragmatic component explains how the utterances relate to the world. To understand language, an agent should consider more than the sentence; it has to take into account the context of the sentence, the state of the world, the goals of the speaker and the listener, special conventions, and the like.

To understand the difference among these aspects, consider the following sentences, which might appear at the start of an artificial intelligence textbook:
a. This book is about artificial intelligence.
b. The green frogs sleep soundly.
c.
Colorless green ideas sleep furiously.
d. Furiously sleep ideas green colorless.

The first sentence would be quite appropriate at the start of such a book; it is syntactically, semantically, and pragmatically well formed. The second sentence is syntactically and semantically well formed, but it would appear very strange at the start of an AI book; it is thus not pragmatically well formed for that context. The third sentence is syntactically well formed but semantically nonsensical. The fourth sentence makes no sense syntactically, semantically, or pragmatically.

2.3.3.1 Stemming

According to Kowalski (2011, p.76), stemming is a process that aims to reduce the number of variant representations of a concept to a standard morphological, or canonical, representation. The risk of stemming is that information about the concept may be lost in the process, decreasing accuracy or precision and degrading ranking performance; the advantage is that stemming increases recall. The original purpose of stemming was to improve performance and reduce the use of system resources by reducing the number of unique words the system must accommodate. In general, then, stemming algorithms transform a word into a standard morphological representation (known as a stem). For example, the stem "comput" associates "computable, computability, computation, computational, computed, computing, compute, computerize" with a single representation.

2.3.3.2 Lemmatization

"Lemmatization is a process to find the base (entry) of a given word form" (Ingason, 2008, p.1). Nirenburg (2009, p.31) reinforces this by explaining that lemmatization is a process aimed at normalizing the text by associating each word form with its lemma. Normalization in this context is a process that identifies and removes the prefixes and suffixes of a word.
The general case of morphological analysis includes derivational processes, which are especially relevant for agglutinative languages. In addition, a prefixed and/or suffixed form may have many interpretations, so a lemmatization algorithm must determine the context of the form and choose the analysis appropriate to that context. Manning, Raghavan, and Schütze (2009, p.32) explain that, for grammatical reasons, documents use different forms of a word, for example (in English) organize, organizes, and organizing. In addition, there are families of derivationally related words with similar meanings, such as democracy, democratic, and democratization. The goal of both stemming and lemmatization is to reduce inflectional forms, and sometimes derivationally related forms, to a common base form. For example:

Input: "The boy's cars are different colors"
Transformation: am, is, are => be
Transformation: car, cars, car's, cars' => car
Result: "The boy car be differ color"

Stemming usually refers to a heuristic process that removes inflectional affixes, and sometimes derivational affixes, in the hope that the removal succeeds with high precision. Lemmatization, on the other hand, usually refers to the use of a vocabulary and a morphological analysis of the word in question; it aims to remove inflectional endings only and to return the base or dictionary form of the word, which is known as the lemma. For example, for the English word "saw", stemming might return just "s", whereas lemmatization would return either "see" or "saw" depending on the context (whether the token is a verb or a noun). Another difference between the two methods lies in their handling of derivational forms: a stemmer will usually also cut off derivational affixes, whereas lemmatization typically removes only the inflectional forms of a lemma.
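The stemming/lemmatization contrast above can be sketched in a few lines. The suffix rules and the tiny lemma dictionary below are invented for this illustration and are not taken from any real stemmer or lexicon.

```python
def crude_stem(word):
    """Heuristically chop common English suffixes (stemming)."""
    for suffix in ("ization", "ational", "ations", "ing", "es", "s"):
        # only strip when a reasonably long stub remains
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Dictionary-based lemmatization: the lemma depends on the part of
# speech of the token, as in the "saw" example above.
LEMMA_DICT = {
    ("saw", "verb"): "see",   # past tense of "see"
    ("saw", "noun"): "saw",   # the tool
    ("are", "verb"): "be",
    ("cars", "noun"): "car",
}

def lemmatize(word, pos):
    """Return the dictionary lemma for (word, part of speech)."""
    return LEMMA_DICT.get((word, pos), word)

print(crude_stem("organizing"))   # "organiz"
print(lemmatize("saw", "verb"))   # "see"
print(lemmatize("saw", "noun"))   # "saw"
```

The stemmer blindly cuts endings and may produce non-words ("organiz"), while the lemmatizer consults a vocabulary and the token's context, exactly the distinction drawn by Manning, Raghavan, and Schütze.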
2.4 State of the Art

2.4.1 Lemmatization for Foreign Languages

This section is divided into two sub-sections: the first briefly explains some lemmatization methods for English, and the second gives an overview of lemmatization for other foreign languages.

2.4.1.1 Lemmatization for English

Loponen & Järvelin (2010) constructed a statistical, dictionary- and corpus-independent lemmatizer for low-resource languages (p. 15), called StaLe. In their publication, Loponen and Järvelin describe StaLe lemmatizing four languages: Finnish, Swedish, German, and English.

Figure 2.1 StaLe lemmatization process (Loponen & Järvelin, 2010, p. 5)

StaLe accepts a word form as input, which is processed against a set of rules. In some cases more than one rule can be applied to the input form, yielding several "candidate lemmas". The candidates are sorted by their confidence factor values and then go through a checking phase, based on set parameters, which eliminates unsatisfactory candidates and thus produces the result lemmas. The size of the training list for StaLe depends on the complexity of the target language: the more morphological variation a language has, the bigger the training list. The test set for StaLe was taken from 54 full-text English language collections from CLEF 2003, and the lemmatizer achieved 91.09% accuracy in English lemmatization.

Minnen, Carroll, & Pearce (2001) developed a program for English inflectional morphological analysis and generation, based on finite-state techniques (p. 1). The analyzer accepts a word form paired with its PoS tag as input, and outputs the lemma and suffix of the input, using a rule set that represents English morphology, including irregular forms.
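The StaLe-style pipeline described above (rules produce candidate lemmas, candidates are ranked by confidence and filtered by a parameter) can be sketched as follows. The rules, confidence values, and threshold are invented for illustration and are not StaLe's actual rules.

```python
RULES = [
    # (suffix to match, replacement, confidence factor) -- illustrative only
    ("ies", "y", 0.9),
    ("s", "", 0.6),
    ("es", "", 0.5),
]

def candidate_lemmas(word_form, min_confidence=0.55):
    """Apply every matching rule, rank candidates by confidence,
    and drop candidates below the threshold parameter."""
    candidates = []
    for suffix, replacement, confidence in RULES:
        if word_form.endswith(suffix):
            candidates.append((word_form[: -len(suffix)] + replacement, confidence))
    # sort by confidence factor, then filter by the set parameter
    candidates.sort(key=lambda c: c[1], reverse=True)
    return [lemma for lemma, conf in candidates if conf >= min_confidence]

print(candidate_lemmas("parties"))  # ['party', 'partie']
```

Note how a single form yields several candidates ("party" from the -ies rule, "partie" from the -s rule); the confidence ranking is what lets a dictionary-independent lemmatizer prefer the right one.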
Minnen, Carroll, and Pearce's rule set is based on FLEX and was acquired semi-automatically from several large corpora and dictionaries (p. 1), but the analysis process itself depends entirely on the rule set (i.e., it is dictionary-independent). Take one rule from (p. 4) as an example:

{A}+{C}"ied" {return(lemma(3,"y","ed"));}

The rule is divided into two parts: the left part serves as the condition, and the right part serves as the action to be executed when the condition is fulfilled. The condition is a regular expression: {A}+ means a sequence of one or more upper- or lower-case letters, {C} stands for a consonant, and the double quotes represent an exact string match. The lemma function takes three arguments: how many characters to remove from the end, the suffix to attach, and the inflectional suffix that transformed the lemma into the word form (the input). For irregular forms, the analyzer adds specific rules/exceptions, for example exact string matches for irregular word forms. The test set for the analyzer was obtained from the CELEX lexical database of English; it contains 38,882 unique word forms, and the analyzer achieved 99.94% type accuracy and 99.93% token accuracy (p. 9).

Northwestern University (2009) published a morphological adornment tool for English language texts, called MorphAdorner. One of its features is an English lemmatizer that uses a rule set, including exceptions, to lemmatize all kinds of word forms. The rule set contains about 150 rules and 1,600 irregular forms. MorphAdorner accepts input that follows the NUPOS tag set, in the form of a (spelling, part of speech) pair. The lemmatizer is based on a list of irregular forms and grammar rules.
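The Minnen-style rule shown above, {A}+{C}"ied" with action lemma(3,"y","ed"), can be re-expressed as an ordinary regular expression. The original analyzer is FLEX-based; this Python translation is only an illustrative sketch (simplified to lower-case input).

```python
import re

# {A}+ = one or more letters, {C} = a consonant, then the literal "ied".
RULE_IED = re.compile(r"^[a-z]+[bcdfghjklmnpqrstvwxz]ied$")

def lemma(word, n_remove, attach, inflectional_suffix):
    """lemma(n, s, i): drop n characters from the end of the word form,
    attach s; i is the inflectional suffix that produced the form."""
    return word[:-n_remove] + attach, inflectional_suffix

def analyze(word):
    """Apply the single -ied rule; a real analyzer has many such rules
    plus exact-match exceptions for irregular forms."""
    if RULE_IED.match(word):
        return lemma(word, 3, "y", "ed")
    return word, ""  # no rule matched: return the form unchanged

print(analyze("tried"))    # ('try', 'ed')
print(analyze("carried"))  # ('carry', 'ed')
```

The action drops the three characters "ied", attaches "y", and records "ed" as the inflectional suffix, recovering "try" from "tried" exactly as the rule's semantics describe.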
Instead of focusing on a specific part-of-speech set, the MorphAdorner lemmatizer categorizes irregular forms and rules using the major part-of-speech classes: adjective, adverb, compound, conjunction, infinitive-to, noun plural, noun possessive, preposition, verb, and pronoun.

2.4.1.2 Lemmatization for Other Languages

Plisson, Lavrac, & Mladenic (2004) developed a lemmatizer for the Slovenian language. It uses the Ripple Down Rules induction algorithm (henceforth RDR) as the basis of its lemmatization algorithm (p. 83). RDR is a learning algorithm built on the idea of if-then-else branches/rules; each rule can grow further depending on the given training set. The lemmatizer was tested on five datasets of various sizes, collected randomly from the Slovenian MULTEXT-East lexicon, which contains about 20,000 normalized words and 500,000 surface forms. The lemmatizer achieved 77.0% accuracy (p. 85).

Szopa (2007) developed a rule-based Dutch lemmatizer in Lisp, named LRBL (p. 1). Unlike RDR, LRBL's rules are hand-tuned, which avoids repeated training for every rule change. However, LRBL needs "POS" (part-of-speech) and "Features" tags in order to lemmatize an input word correctly. The output of LRBL's lemmatization process depends on a heuristic algorithm that selects which rules should be executed. The test set consists of 145,829 tagged and lemmatized Dutch tokens, on which LRBL achieved 86% accuracy (p. 13).

Ingason et al. (2008) describe a lemmatizer for Icelandic, called Lemmald. The lemmatization process relies on IceTagger to tag the input before lemmatization is performed. The Icelandic Frequency Dictionary (IFD) corpus, which contains about 590,000 words and 700 different POS tags, is used for training. Furthermore, Lemmald can also be run with the Database of Modern Icelandic Inflections (DMII) as an add-on for improved accuracy, although this impacts lemmatizer performance (p. 208).
Lemmald also uses a Hierarchy of Linguistic Identities (HOLI) to organize the features and feature structures for its linguistically informed machine learning (p. 205). The test set was taken from the IFD corpus, which contains about 530,000 tokens; evaluation shows that, given correct tagging, Lemmald lemmatizes with 98.54% accuracy. Using the DMII add-on improves accuracy further, to 99.55% (p. 214).

Chrupala (2006) constructed a lemmatization method for languages with rich inflectional morphology. In this work, lemmatization is treated as a classification task for machine learning, with automatic induction of the class labels (p. 121). The approach is based on the Shortest Edit Script (SES) between the reversed input and output strings; by computing the SES, the optimal transformation from a word form to its lemma can be obtained. The test was performed on eight languages: Spanish, Catalan, Portuguese, French, Polish, Dutch, German, and Japanese. The test and training sets were taken from lemma-annotated corpora/treebanks specific to each language, 10,000 tokens each. The lemmatizer achieved 88.42% baseline accuracy (p. 123).

2.4.2 Stemming for Indonesian Language

Frakes (1992) initiated the development of Indonesian stemming algorithms by porting Porter's algorithm to fit the morphological rules of Indonesian. Since then, improvements over existing algorithms and new approaches have been developed to improve the accuracy of Indonesian stemming. According to Asian (2009), the accuracy of a stemming algorithm is measured by the correctness of transforming word forms into their common root forms, given a test set. Ideally, a good stemmer will stem all words from the same semantic group to the same stem. But due to the irregularities that are prominent in all natural languages, all stemmers unavoidably make mistakes, including the ones that use vocabulary lists (Tala, 2003, p. 11).
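The class-label idea behind Chrupala's approach can be sketched as follows: the edit needed to turn a word form into its lemma is computed over the ends of the strings (hence the reversed inputs in the original), and that edit serves as a class label for a classifier to predict. The classifier itself is omitted here; this sketch, with invented examples, only shows label induction and application.

```python
def edit_class(form, lemma):
    """Induce a (remove_suffix, add_suffix) label from a training pair
    by splitting off everything after the longest common prefix."""
    i = 0
    while i < min(len(form), len(lemma)) and form[i] == lemma[i]:
        i += 1
    return (form[i:], lemma[i:])  # e.g. ("ied", "y")

def apply_class(form, label):
    """Apply a learned suffix-edit class to a new word form."""
    remove, add = label
    if remove and not form.endswith(remove):
        return form  # label not applicable to this form
    return (form[: -len(remove)] if remove else form) + add

label = edit_class("tried", "try")   # ("ied", "y")
print(apply_class("carried", label)) # "carry"
```

Because the label is a suffix edit rather than a whole lemma, one class generalizes across many words, which is what makes the classification formulation workable for richly inflected languages.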
Furthermore, the Indonesian language is still in a period of development and transition, which means new words can be added, edited, and/or removed at any time, adding to the difficulty of stemming Indonesian. The sections below describe the Indonesian stemming algorithms in chronological order.

2.4.2.1 Nazief and Adriani's Algorithm

Nazief and Adriani's algorithm (henceforth referred to as NAZIEF) was developed in 1996, using a confix (combination of prefixes and suffixes) stripping approach with a dictionary lookup over a list of lemmas (Tala, 2003, p. 1). NAZIEF is based on comprehensive morphological rules that group together and encapsulate allowed and disallowed affixes, including prefixes, suffixes, and confixes (combinations of prefix and suffix), also known as circumfixes (Asian, 2005, p. 308). The logical flow of NAZIEF, as described in (Asian, 2005, pp. 308-309), is:
1. The input word is checked against the dictionary. If found, the input word is returned as the output and the algorithm ends. If the input word does not exist in the dictionary, it is stored in a temporary variable (e.g. CURRENT_WORD). After each stemming rule, CURRENT_WORD is checked against the dictionary.
2. Removal of inflectional suffixes, the suffixes that cannot alter the meaning of a word. This set is divided into two types: inflectional particle suffixes (-lah, -kah), as in "merekalah" and "apakah", and inflectional possessive pronoun suffixes (-ku, -mu, -nya), as in "sepatuku", "sepedamu", and "bajunya". If the algorithm successfully removes an inflectional particle suffix from CURRENT_WORD, this step is repeated to try to remove an inflectional possessive pronoun suffix as well.
3. Removal of derivational suffixes (-i, -an), the suffixes that may alter the meaning of a word, as in "dinikmati" and "makanan". The removed suffix is stored temporarily for a possible later recoding step.
4.
Removal of derivational prefixes, which may alter the meaning of a word. This set is divided into two types: simple prefixes (di-, ke-, se-), as in "dimakan", "kemakan", "sejalan", which can be removed immediately without further changes to CURRENT_WORD, and complex prefixes (te-, be-, me-, pe-), as in "penahan", "menyelam", "bersimpati", and "terbuai", which need an additional process, using a specific rule table, to determine what change is required. If CURRENT_WORD is still not found in the dictionary, step 4 is attempted again to check for stacked prefixes.
5. If CURRENT_WORD is not found after the recursive attempts of step 4, recoding is performed.
6. If CURRENT_WORD is still not found after step 5, the suffix removed at step 3 is reconsidered. If "-an" was removed and the last letter is "k", the "k" is removed and step 4 is attempted again. If CURRENT_WORD is still not found, the suffix removed at step 3 is restored and step 4 re-attempted. If the word is still not found, the algorithm returns the original input word.

NAZIEF achieved 92.1% baseline accuracy on a unique-words test set (p. 312).

2.4.2.2 Arifin and Setiono's Algorithm

In 2002, Arifin and Setiono constructed an algorithm (henceforth referred to as ARIF) that is a variant of NAZIEF. The algorithm preserves dictionary checking, prefix and suffix stripping, and recoding, but adds affix restoration functionality (Asian, 2009, p. 64). ARIF limits affix removal to two prefixes and three suffixes (p. 64). The logical flow of ARIF is as follows:
1. The input word is checked against the dictionary. If it exists, the input word is returned as the result lemma; otherwise it is saved temporarily in a variable, e.g. CURRENT_WORD.
2. Prefix removal is done recursively, with a loop limit of two. Each pass detects whether CURRENT_WORD begins with a known prefix (be-, di-, ke-, me-, pe-, se-, te-).
After each removal, CURRENT_WORD is checked against the dictionary; the process continues if it does not exist there. Recoding is performed if possible.

3. Suffix removal is done recursively, with a loop limit of three. Each pass detects whether CURRENT_WORD ends with a known suffix (-i, -kan, -an, -kah, -lah, -tah, -pun, -ku, -mu, -nya). After each removal, CURRENT_WORD is checked against the dictionary.

4. This step runs the affix restoration part in a specific order; after each sub-step, the resulting candidate is checked against the dictionary, and recoded if possible. Assuming two prefixes and three suffixes were removed in the previous steps, the order is as follows:

a. Restore all prefixes to CURRENT_WORD.
b. Restore the second prefix to CURRENT_WORD.
c. Restore all prefixes and the third suffix to CURRENT_WORD.
d. Restore the second prefix and the third suffix to CURRENT_WORD.
e. Restore all prefixes, the third suffix, and the second suffix to CURRENT_WORD.
f. Restore the second prefix, the third suffix, and the second suffix to CURRENT_WORD.

5. If CURRENT_WORD still remains unknown to the dictionary after step 4, the process returns the original input word form.

Asian identified two shortcomings of this algorithm (p. 65). First, it may remove a suffix that is identical to the previously removed suffix (i.e. identical affix removal). Second, ARIF is sensitive to the order of prefix and suffix removal (p. 66). ARIF achieved 88.0% accuracy on the unique words test set.

2.4.2.3 Vega's Algorithm

Berlian Vega (2001) proposed a stemmer with a different approach: a purely rule-based algorithm (henceforth referred to as VEGA). VEGA is dictionary-independent, although it still uses an external list of exceptions to handle exceptional forms in Indonesian (e.g. 'pelajar' is stemmed to 'ajar'). This means that the completeness of the rule set determines the accuracy of VEGA (i.e. it relies heavily on the rule set).
The order of the rules is also important, because each rule is checked sequentially, and some rules may invoke other rules in order to break a given word form down into smaller parts. The algorithm proceeds to the next rule only when the current rule fails. Consider an example rule set for VEGA taken from (Asian, 2009, p. 66):

Word form: 'keselamatan'

Rule set:
Rule 1: word(Root) → circumfix(Root)
Rule 2: word(Root) → StemWord
Rule 3: circumfix(Root) → ber-(Root), (-kan | -an)
Rule 4: circumfix(Root) → ke-(Root), -an
Rule 5: ber-(Root) → ber-, stem(Root)
Rule 6: ke-(Root) → ke-, stem(Root)

The algorithm starts from the first rule, with 'Root' containing the input word form. The 'circumfix' function passes 'Root' to Rule 3. Rule 3 then calls the ber-(Root) function (i.e. Rule 5), which fails because the prefix 'ber-' is not found in the current word. This failure causes Rule 3 to fail, and the algorithm proceeds to Rule 4. Rule 4 calls the ke-(Root) rule (i.e. Rule 6), which correctly returns 'selamatan' to Rule 4, and Rule 4 then successfully returns 'selamat'. VEGA handles exceptional cases by creating specific rules with a literal match condition, such as 'megawati', to prevent 'me-...-i' affix removal (p. 66). VEGA achieved 69.4% accuracy on the unique words test set (p. 73).

2.4.2.4 Ahmad, Yusoff, and Sembok's Algorithm

Ahmad, Yusoff, and Sembok's stemming algorithm for the Malaysian language (henceforth referred to as AHMAD) was proposed in 1996. Malaysian is similar to Indonesian, but there are still some differences in terms of rules and affix usage. AHMAD uses a root word dictionary and a set of valid affix rules to correctly stem the word form. When an input word form is entered, it is immediately checked against the dictionary. If it is unknown to the dictionary, the input is checked against all the affix rules, one by one. If, in the end, not a single rule is satisfied, the input is returned unchanged. AHMAD also performs recoding when the requirements for it are met.
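The lookup-then-rules loop just described for AHMAD can be sketched in Python. This is a hypothetical simplification: the root set, the rule list, and the recoding table below are illustrative stand-ins, and real AHMAD recoding is tied to individual rules rather than a lookup table.

```python
# Rough sketch of AHMAD's flow: a root-word dictionary, an ordered
# list of affix rules tried one by one, and a simple recoding table.
# All names and data here are illustrative, not the published rules.
ROOTS = {"kalah", "ajar"}
AFFIX_RULES = [("meng", "kan"), ("bel", "")]   # (prefix, suffix) pairs
RECODING = {"alah": "kalah"}  # restore an initial letter lost to stripping

def ahmad_stem(word):
    if word in ROOTS:                      # immediate dictionary check
        return word
    for prefix, suffix in AFFIX_RULES:     # try affix rules in order
        candidate = word
        if candidate.startswith(prefix):
            candidate = candidate[len(prefix):]
        else:
            continue
        if suffix:
            if candidate.endswith(suffix):
                candidate = candidate[: -len(suffix)]
            else:
                continue
        if candidate in ROOTS:
            return candidate
        if candidate in RECODING:          # recoding when requirements met
            return RECODING[candidate]
    return word                            # no rule satisfied

print(ahmad_stem("mengalahkan"))  # kalah
```

The example input mirrors the "mengalahkan" case discussed in the text: stripping "meng-...-kan" yields the invalid form "alah", which recoding repairs to "kalah".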
Take for example "mengalahkan" and the affix rule "meng-...-kan". The first removal produces "alah", which is not in the dictionary. However, because recoding is possible here, it yields "kalah", which is the correct root word. Overall, AHMAD's accuracy depends on the ordering of the affix rules, because if they are ordered wrongly AHMAD can mistakenly stem certain word forms (Asian, 2009, p. 68). AHMAD achieved 88.3% accuracy on the unique words test set (p. 73).

2.4.2.5 Idris's Algorithm

In 2001, Idris expanded AHMAD by adding alternation between prefix and suffix removal. Idris constructed two types of affix removal. The first type assumes that the word form may have a different prefix segmentation; for example, for the word form "mematahkan", the result can be either "mem-atah-kan" or "me-matah-kan". The second type prioritizes recoding first; with the same example, the result can be either "mem-atah-kan" or (with recoding) "me-patah-kan". Asian (p. 69) found that the second type performs slightly better than the first. Like AHMAD, Idris's algorithm is sensitive to rule ordering. Idris's second type achieved 88.8% accuracy on the unique words test set (p. 73).

2.4.2.6 Confix-Stripping Stemmer

In 2005, Jelita Asian attempted to improve the accuracy of NAZIEF (1996), since NAZIEF had both the best approach to Indonesian stemming and the best stemming accuracy. According to her analysis, NAZIEF's errors are mainly caused by several factors: non-root words in the dictionary, an incomplete dictionary, and hyphenated words, while the rest are caused by ineffective rules and rule ordering (Asian, 2005, p. 312). In 2007, Asian, Nazief, et al. collaborated on a paper presenting the 'Confix-Stripping Stemmer', an improved version of NAZIEF. The modified rules and algorithm changes were presented in a detailed, algorithmic explanation:

1.
The input is first checked against the dictionary. If the input exists in the dictionary, it is returned as the result lemma.

2. Inflectional particle suffixes (-kah, -lah, -tah, -pun) are removed from the input, and the remainder is kept in a string variable (CURRENT_WORD) and checked against the dictionary. If it exists, the process terminates.

3. Inflectional possessive pronoun suffixes (-ku, -mu, -nya) are removed from CURRENT_WORD, which is then checked against the dictionary. If it exists, the process terminates.

4. Derivational suffixes (-i, -kan, -an) are removed from CURRENT_WORD, which is then checked against the dictionary. If it exists, the process terminates.

5. This step removes derivational prefixes (beN-, di-, ke-, meN-, peN-, se-, teN-) from CURRENT_WORD. The step is recursive because, in Indonesian morphology, derivational prefixes can be stacked. Some prefixes (di-, ke-, se-) are considered simple, because in practice they do not change the lemma. The other prefixes (beN-, meN-, peN-, teN-) do change the lemma, in ways that differ according to the first letter of the lemma. These transformations and variants are listed in Table 2.2 below.

Table 2.2 Derivational prefix stripping rule set (Asian, et al., 2007, p. 13)

Rule  Construct            Return
1     berV...              ber-V... | be-rV...
2     berCAP...            ber-CAP... where C != 'r' and P != 'er'
3     berCAerV...          ber-CAerV... where C != 'r'
4     belajar...           bel-ajar...
5     beC1erC2...          be-C1erC2... where C1 != {'r' | 'l'}
6     terV...              ter-V... | te-rV...
7     terCerV...           ter-CerV... where C != 'r'
8     terCP...             ter-CP... where C != 'r' and P != 'er'
9     teC1erC2...          te-C1erC2... where C1 != 'r'
10    me{l|r|w|y}V...      me-{l|r|w|y}V...
11    mem{b|f|v}...        mem-{b|f|v}...
12    mempe{r|l}...        mem-pe...
13    mem{rV|V}...         me-m{rV|V}... | me-p{rV|V}...
14    men{c|d|j|z}...      men-{c|d|j|z}...
15    menV...              me-nV... | me-tV...
16    meng{g|h|q|k}...     meng-{g|h|q|k}...
17    mengV...             meng-V... | meng-kV...
18    menyV...             meny-sV...
19    mempV...             mem-pV... where V != 'e'
20    pe{w|y}V...          pe-{w|y}V...
21    perV...              per-V... | pe-rV...
22    perCAP...            per-CAP... where C != 'r' and P != 'er'
23    perCAerV...          per-CAerV... where C != 'r'
24    pem{b|f|v}...        pem-{b|f|v}...
25    pem{rV|V}...         pe-m{rV|V}... | pe-p{rV|V}...
26    pen{c|d|j|z}...      pen-{c|d|j|z}...
27    penV...              pe-nV... | pe-tV...
28    peng{g|h|q}...       peng-{g|h|q}...
29    pengV...             peng-V... | peng-kV...
30    penyV...             peny-sV...
31    pelV...              pe-lV... (except "pelajar", which returns "ajar")
32    peCP...              pe-CP... where C != {r|w|y|l|m|n} and P != 'er'
33    peCerV...            per-CerV... where C != {r|w|y|l|m|n}

There are several termination conditions for this step:

a. The prefix and the removed suffix form a pair listed in the invalid affix pair table below (Table 2.3).
b. The removed prefix is identical to a previously removed prefix.
c. The recursion limit for this step (three) is reached.

Table 2.3 Disallowed prefix and suffix pairs; the only exception is the "ke-" and "-i" affix pair for the root word "tahu" (Asian et al., 2007, p. 6)

Prefix  Disallowed suffixes
ber-    -i
di-     -an
ke-     -i and -kan
me-     -an
ter-    -an
per-    -an

The removed prefix is recorded, and CURRENT_WORD is checked against the dictionary. If CURRENT_WORD does not exist in the dictionary and no termination condition is satisfied, step 5 is repeated with CURRENT_WORD as input.

6. If CURRENT_WORD is still not found after step 5, Table 2.2 is examined to determine whether recoding (p. 63) is possible. In the rule set, several rules have more than one output. Take rule 17 for example: mengV... has two outputs, meng-V... or meng-kV.... In step 5, the first (leftmost) output is always picked first, which can cause errors. Recoding undoes this kind of error by going back to the point in step 5 where the output selection happened and selecting the other output instead.

7. If CURRENT_WORD still remains unknown to the dictionary, the original input word is returned.

In order to solve the major error causes stated above (i.e.
non-root words in the lookup dictionary, an incomplete dictionary, and hyphenated words), Asian suggested three approaches:

1. Improve dictionary quality by using different dictionary sources, and compare the resulting accuracy against the previous dictionary.

2. Add extra rules to handle hyphenated words. The main idea behind this rule is that if a hyphenated word consists of an identical word pair (e.g. "bulir-bulir"), it is stemmed to the single word ("bulir"). This also applies to hyphenated words with affixes (e.g. "seindah-indahnya"): the affix is removed first, and the word is then checked to see whether the remaining pair is stemmable.

3. Modification of rules, prefixes, and suffixes:

a. Rule alterations for the prefixes "ter-", "pe-", "mem-", and "meng-", which have already been applied to Table 2.2 above. In detail, rules 9 and 33 were added, while rules 12 and 16 are modified versions of the previous rules.

b. Prefix removal is performed before suffix removal if a given word has an affix pair from the list below:

"be-" and "-lah"
"be-" and "-an"
"me-" and "-i"
"di-" and "-i"
"pe-" and "-i"
"ter-" and "-i"

Compared against NAZIEF on the same dataset, the modified NAZIEF achieves around 2-3% higher accuracy (approximately 95%).

2.4.2.7 Enhanced Confix Stripping

Arifin, Mahendra, & Ciptaningtyas (2009) extended the Confix-Stripping Stemmer, improving its accuracy by solving unhandled cases with specific prefix types (p. 151), listed below:

1. "mem-p", as in "mempromosikan";
2. "men-s", as in "mensyukuri";
3. "menge-", as in "mengerem";
4. "penge-", as in "pengeboman";
5. "peng-k", as in "pengkajian";
6. incorrect affix removal order, resulting in incorrectly stemmed input. For example, the word "pelaku" is overstemmed because the "-ku" at the end is treated as an inflectional possessive pronoun suffix. Another example is "pelanggan", which is overstemmed because the final "-an" is treated as a derivational suffix.

In order to solve the cases above, Arifin, et al.
suggested two improvements:

1. Rule modifications and additions to Table 2.2, to fit the specific unhandled cases above.

2. An extra stemming step, called loopPengembalianAkhiran (p. 151), henceforth referred to as LPA. This step is appended after the last step of the CS Stemmer's stemming process, specifically after the recoding attempt has failed (i.e. step 8 in the CS Stemmer). After each step, a dictionary lookup is performed to check whether the processed input is listed in the dictionary. The detailed flow of LPA is as follows:

a. Return CURRENT_WORD to its state before recoding, restore all prefixes removed during the prefix removal process, and perform a dictionary lookup.
b. Redo the prefix removal process.
c. Restore the previously removed suffixes in order: derivational, possessive pronoun, and particle suffixes. After each suffix restoration, steps d and e are performed. An important exception is made for the derivational suffix "-kan": first, only the "-k" is restored and steps d and e are performed; if this fails, the rest (i.e. "-an") is restored, and steps d and e are performed again.
d. Redo the prefix removal process, and perform recoding if possible.
e. If the dictionary lookup fails, execute step a and restore the next suffix according to step c.

2.5 State of the Art Overview

The progress of Indonesian stemming methods differs in many ways. Some works proposed a different approach, some attempted to upgrade existing methods, and some modified an existing approach.
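Returning briefly to LPA's step c above, its suffix-restoration order, including the special two-stage handling of "-kan", can be sketched as a Python generator. The function name and the example call are hypothetical; the real procedure also interleaves prefix removal, recoding, and dictionary lookups between candidates.

```python
def lpa_suffix_restorations(stem, removed_suffixes):
    """Yield word forms in LPA's restoration order (derivational,
    then possessive pronoun, then particle suffixes). For '-kan',
    only '-k' is restored first; '-an' follows only on failure."""
    current = stem
    for suffix in removed_suffixes:
        if suffix == "kan":
            yield current + "k"     # first attempt: partial restoration
            current += "kan"
            yield current           # fallback: restore '-an' as well
        else:
            current += suffix
            yield current

# Hypothetical stem 'tunjuk' with removed suffixes ['kan', 'lah']:
print(list(lpa_suffix_restorations("tunjuk", ["kan", "lah"])))
# ['tunjukk', 'tunjukkan', 'tunjukkanlah']
```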
In summary, the development progress is presented in the table below:

Table 2.4 Stemming progress overview

Year  Work           Method                      Stemming Approach                                        Acc. (%)
1996  AHMAD (Malay)  Dictionary- and Rule-based  Affix removal rules, with dictionary lookup, for Malay   88.3
1996  NAZIEF         Dictionary- and Rule-based  Affix and confix removal rules, with dictionary lookup   92.1
2001  VEGA           Pure Rule-based             Affix removal rules, with hardcoded exceptions           69.4
2001  IDRIS          Dictionary- and Rule-based  Improvement on AHMAD; uses 2 dictionaries                88.8
2002  ARIF           Dictionary- and Rule-based  Variant of NAZIEF; always removes prefixes first         88.0
2007  CS STEMMER     Dictionary- and Rule-based  Improvement on NAZIEF; added and modified rules          94
2009  ENHANCED CS    Dictionary- and Rule-based  Improvement on CS STEMMER; added extra process           -
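For context on how the accuracy column above is computed, a toy evaluation harness for such measurements might look like the following. The stemmer and the gold pairs here are placeholders, not the published experiments or datasets.

```python
def accuracy(stemmer, gold_pairs):
    """Percentage of (word, lemma) pairs the stemmer gets right."""
    correct = sum(1 for word, lemma in gold_pairs if stemmer(word) == lemma)
    return 100.0 * correct / len(gold_pairs)

# Placeholder unique-words test set and a trivial identity baseline
# (the baseline returns each word unchanged, so it scores 0 here).
gold = [("dimakan", "makan"), ("sepatuku", "sepatu"), ("berlari", "lari")]
identity = lambda w: w
print(f"{accuracy(identity, gold):.1f}%")  # 0.0% on this toy set
```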