CHAPTER 2
THEORETICAL FOUNDATION
2.1 Indonesian Language
The Indonesian language is an important language for the Southeast Asian region. According to Sneddon (2003, p.225), even though Indonesian is not used worldwide, it is the language of the fourth most populous nation in the world and of some of Indonesia's neighboring countries. It is therefore significant to develop a lemmatization method for this language.
2.1.1 Nature of Indonesian Language
According to Tucker (2009, p.75), most languages can be morphologically classified into three categories, based on the nature of the language and listed here in ascending order of development. The first category comprises the monosyllabic, isolating, or radical languages, such as Chinese. Languages in this category do not accommodate any modification of words, such as the use of suffixes or prefixes; words in such a language are simple roots that stand by themselves independently.
The second category comprises the agglutinating languages, such as Turkish and Japanese. Agglutinating means that the parts of a word can be attached and detached at will. Affixes may be added, but they do not change the form of the word; they are simply glued on (agglutinated). Not only affixes but even whole words can be glued to each other. Tucker (2009, p.78) gives an example: Aulisariartorasuarpok means 'he hastens to go fishing' in Greenlandic. A sentence can be constructed from a single word consisting of several agglutinated elements. This construction is only possible in a highly agglutinative language.
The last category comprises the inflexional, organic, or amalgamating languages, such as the Semitic languages and most of the European ones. In this category, words may change their form to show a more specific function or role in a sentence, as with irregular verbs or past participles in English.
Indonesian is in transition between the agglutinating and inflexional states. Indonesian words cannot be glued together like those of Greenlandic, but they can still be glued to affixes. Whereas in a highly agglutinative language the added affixes do not affect the form of the word, in Indonesian some affixes do change the form of the word, which is a characteristic of an inflexional language. Indonesian is thus both inflexional and agglutinative, but in neither case to an extreme degree. Tucker (2009, p.89) likewise notes that most Indo-European languages in their modern shape are semi-inflexional in character.
2.1.2 Importance of Indonesian Language in the World
The Indonesian language has faced many social and political problems and developments since 1997. These upheavals have attracted the interest of many people, including academics in fields such as history, politics, and sociology, journalists, and those with an interest in international affairs (Sneddon, 2003, p.1). As the national language, Indonesian is intimately linked with the nation and in some unique ways reflects it, which is also interesting to many people internationally.
Although Indonesian is natively used only in Indonesia, it is still among the most widely spoken languages in the world (Sneddon, 2003, p.1), because Indonesia is the fourth most populous country in the world. The language is important to the world not only because of the number of its users, but also because several significant aspects are bound up with the nation and its language: Indonesia has the largest Muslim population in the world (Sneddon, 2003, p.2), and its political history is connected to the Netherlands and other countries. These aspects make the Indonesian language important to the world.
2.2 Algorithm
According to Cormen (2009, p.3), an algorithm is defined as "a well-defined computational procedure that takes some value, or set of values, as input and produces some value, or set of values, as output". An algorithm is therefore a sequence of computational steps that transform the input data into the desired output data.
An algorithm can also be viewed as a tool for solving a well-specified computational problem. In general, the statement of the problem specifies the desired relationship between input and output, and the algorithm describes a specific computational procedure for achieving that relationship.
An example is sorting a sequence of numbers into order from smallest to largest. This problem arises frequently in the real world and provides fertile ground for introducing standard design techniques and analysis tools. Formally, the sorting problem can be defined as follows:
Input: a sequence of n numbers {a1, a2, ..., an}.
Output: a permutation (reordering) {a'1, a'2, ..., a'n} of the input sequence such that a'1 <= a'2 <= ... <= a'n.
For example, given the input sequence {31, 41, 59, 26, 41, 58}, a sorting algorithm returns the output sequence {26, 31, 41, 41, 58, 59}. Such an input sequence is called an instance of the sorting problem. In general, an instance of a problem consists of the input (satisfying whatever constraints are imposed in the problem statement) needed to compute a solution to the problem.
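To make this concrete, the following minimal Python sketch (an illustration added here, not taken from the cited source) solves the instance above with insertion sort:

    def insertion_sort(seq):
        # Sort a sequence of numbers in non-decreasing order.
        result = list(seq)
        for i in range(1, len(result)):
            key = result[i]
            j = i - 1
            # Shift elements larger than `key` one position to the right.
            while j >= 0 and result[j] > key:
                result[j + 1] = result[j]
                j -= 1
            result[j + 1] = key
        return result

    print(insertion_sort([31, 41, 59, 26, 41, 58]))  # [26, 31, 41, 41, 58, 59]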
An algorithm is correct if, for every input instance, it always gives the correct output; we say that a correct algorithm solves the given computational problem. An incorrect algorithm is one that does not give an answer at all, or gives inappropriate answers for some or all of the inputs. Keep in mind, however, that an incorrect algorithm can sometimes be useful, provided its error rate can be controlled: an incorrect algorithm is often still used when its performance advantage outweighs its error rate.
2.2.1 Algorithm Measurement
According to Atallah and Blanton (2010, p. 1-1), it is convenient to measure an algorithm using Big-O notation. Sharod (2010, p. 15) adds that "big-O notation is used to describe the theoretical performance of an algorithm". Big-O notation is usually used to measure the time or memory consumed by an algorithm; its purpose is to give a representation of an algorithm's performance that can be compared with that of others. Table 2.1 below lists the commonly used orders of complexity.
Table 2.1 Order of complexity
(Reingold, 2010, pp. 1-2)

Rate of Growth | Description | Examples
O(1) | The time required is constant, independent of problem size | Expected time for hash searching
O(log log n) | Very slow growth of time required | Expected time of interpolation search of n elements
O(log n) | Logarithmic growth of time required; doubling the problem size increases the time by only a constant amount | Computing x^n; binary search of an array of n elements
O(n) | Time grows linearly with problem size; doubling the problem size doubles the time required | Adding/subtracting n-digit numbers; linear search of an n-element array
O(n log n) | Time grows worse than linearly, but not much worse; doubling the problem size somewhat more than doubles the time required | Merge sort or heap sort of n elements; lower bound on comparison-based sorting of n elements
O(n^2) | Time grows quadratically; doubling the problem size quadruples the time required | Simple-minded sorting algorithms
O(n^3) | Time grows cubically; doubling the problem size results in an eightfold increase in the time required | Ordinary matrix multiplication
O(c^n) | Time grows exponentially; increasing the problem size by 1 results in a c-fold increase in the time required; doubling the problem size squares the time required | Some traveling salesman problem algorithms based on exhaustive search
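As a small illustration of two rows of the table (an example added here, not part of the cited source), the Python sketch below contrasts O(n) linear search with O(log n) binary search on a sorted array:

    import bisect

    def linear_search(arr, target):
        # O(n): examine elements one by one.
        for i, value in enumerate(arr):
            if value == target:
                return i
        return -1

    def binary_search(sorted_arr, target):
        # O(log n): halve the search range at every step.
        i = bisect.bisect_left(sorted_arr, target)
        if i < len(sorted_arr) and sorted_arr[i] == target:
            return i
        return -1

    data = [26, 31, 41, 41, 58, 59]
    print(linear_search(data, 58), binary_search(data, 58))  # 4 4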
2.3 Artificial Intelligence
According to Poole and Mackworth (2010, p.3), artificial intelligence, or AI, is "a field that studies the synthesis and analysis of computational agents that act intelligently".
An agent is something that acts in an environment. Some examples of agents are worms, dogs, thermostats, airplanes, robots, humans, companies, and countries.
An agent acts intelligently when:
a. what it does is appropriate for its circumstances and its goals;
b. it is flexible to changing environments and changing goals;
c. it learns from experience; and
d. it makes appropriate choices given its perceptual and computational limitations. An agent typically cannot observe the state of the world directly; it has only finite memory and it does not have unlimited time to act.
A computational agent is an agent whose decisions about its actions can be explained in terms of computation. That is, the decisions can be broken down into primitive operations that can be implemented in a physical device. This computation can take many forms: in humans it is carried out in "wetware", in computers in "hardware". Although there are some agents that are arguably not computational, such as the wind and rain eroding a landscape, it is an open question whether all intelligent agents are computational.
2.3.1 History of Artificial Intelligence
According to Poole and Mackworth (2010, p.6), about 400 years ago people started to write about the nature of thought and reason. Hobbes (1588–1679), who has been described by Haugeland as the "Grandfather of AI," espoused the position that thinking was symbolic reasoning, like talking out loud or working out an answer with pen and paper. The idea of symbolic reasoning was further developed by Descartes (1596–1650), Pascal (1623–1662), Spinoza (1632–1677), Leibniz (1646–1716), and others who were pioneers in the philosophy of mind.
The idea of symbolic operations became more concrete with the development of
computers. The first general-purpose computer designed (but not built until 1991, at the
Science Museum of London) was the Analytical Engine by Babbage (1792–1871). In the
early part of the 20th century, there was much work done on understanding computation.
Several models of computation were proposed, including the Turing machine by Alan
Turing (1912–1954), a theoretical machine that writes symbols on an infinitely long
tape, and the lambda calculus of Church (1903–1995), which is a mathematical
formalism for rewriting formulas. It can be shown that these very different formalisms
are equivalent in that any function computable by one is computable by the others. This
leads to the Church–Turing thesis:
“Any effectively computable function can be carried out on a Turing
machine (and so also in the lambda calculus or any of the other
equivalent formalisms).”
Once real computers were built, some of the first applications of computers were
AI programs. For example, Samuel built a checkers program in 1952 and implemented a
program that learns to play checkers in the late 1950s. In 1956, Newell and Simon built a program called Logic Theorist, which discovers proofs in propositional logic.
These early programs concentrated on learning and search as the foundations of
the field. It became apparent early that one of the main problems was how to represent
the knowledge needed to solve a problem. Before learning, an agent must have an
appropriate target language for the learned knowledge.
During the 1960s and 1970s, there were successes in building natural language understanding systems in limited domains. For example, the STUDENT program of Daniel Bobrow (1967) could solve high school algebra problems expressed in natural language. Winograd's (1972) SHRDLU system could, using restricted natural language, discuss and carry out tasks in a simulated blocks world. CHAT-80, developed by Warren and Pereira in 1982, could answer geographical questions put to it in natural language.
During the 1970s and 1980s, there was a large body of work on expert systems, where the aim was to capture the knowledge of an expert in some domain so that a computer could carry out expert tasks. For example, DENDRAL, developed by Buchanan and Feigenbaum from 1965 to 1983 in the field of organic chemistry, proposed plausible structures for new organic compounds. In 1984, Buchanan and Shortliffe developed MYCIN, which diagnosed infectious diseases of the blood, prescribed antimicrobial therapy, and explained its reasoning. The 1970s and 1980s were also a period when AI reasoning became widespread in languages such as Prolog, developed in 1972 by Colmerauer and Roussel.
2.3.2 Artificial Intelligence Application Areas
According to Russell & Norvig (2010, p.28), artificial intelligence has several application areas, including but not limited to:
a. Robotics
Robots are mechanical devices that can act by themselves and take the place of human activity. Robotics can reduce the time and effort that tasks would otherwise require from humans.
b. Speech Recognition
Speech recognition is a computer's ability to analyze human voices and interpret them into text, also known as "speech to text".
c. Autonomous Planning and Scheduling
This is a computer's ability to do automated planning and scheduling.
d. Game Playing
Computers can be programmed to behave like a human player in games, allowing people to play games that need human interaction without another human.
e. Spam Fighting
Spam fighting is a computer's ability to automatically delete messages classified as spam.
f. Machine Translation
Machine translation is a computer's ability to translate from one language to another.
2.3.3 Natural Language Processing
According to Pustejovsky and Stubbs (2012, p.4), natural language processing
(NLP) is a field of computer science and engineering that has developed from the study
of language and computational linguistics within the field of Artificial Intelligence. The
goal of NLP is to design and build applications that facilitate human interaction with
machines and other devices through the use of natural language. Some of the major areas
of NLP include:
a. Question Answering Systems (QAS)
Question answering systems give a computer the ability to answer questions asked by humans. Rather than typing keywords into a search browser window, we could simply ask in our own natural language, whether English, Mandarin, or Indonesian.
b. Summarization
This area includes applications that can take a collection of documents or emails and produce a coherent summary of their content. These applications could also turn them into slide presentations.
c. Machine Translation
This was the first major area of research and engineering in the field. It aims to create applications that understand human languages and translate them into another language. Examples include Google Translate, which keeps getting better, and BabelFish, which translates in real time.
d. Speech Recognition
This is one of the most difficult problems in NLP. There has been great progress in building models that can be used on phones or computers to recognize spoken language utterances that are questions and commands. Unfortunately, while these Automatic Speech Recognition (ASR) systems are ubiquitous, they work best in narrowly defined domains and don't allow the speaker to stray from the expected scripted input ("Please say or type your card number now").
e. Document classification
This is one of the most successful areas of NLP, where the task is to identify in
which category a document should be placed. This has proved to be enormously
useful for applications such as spam filtering, news article classification, and
movie reviews, among others. One reason this has had such a big impact is the
relative simplicity of the learning models needed for training the algorithms that
do the classification.
The development of natural language processing provides the possibility of
natural language interfaces to knowledge bases and natural language translation.
According to Poole and Mackworth (2010, p.520), there are three major aspects of any
natural language understanding theory:
1. Syntax. The syntax describes the form of the language. It is usually specified by
a grammar. Natural language is much more complicated than the formal
languages used for the artificial languages of logics and computer programs.
2. Semantics. The semantics provides the meaning of the utterances or sentences of
the language. Although general semantic theories exist, when we build a natural
language understanding system for a particular application, we try to use the
simplest representation we can. For example, in the development that follows,
there is a fixed mapping between words and concepts in the knowledge base,
which is inappropriate for many domains but simplifies development.
3. Pragmatics. The pragmatic component explains how the utterances relate to the world. To understand language, an agent should consider more than the sentence; it has to take into account the context of the sentence, the state of the world, the goals of the speaker and the listener, special conventions, and the like.
To understand the difference among these aspects, consider the following
sentences which might appear at the start of an artificial intelligence textbook:
a. This book is about artificial intelligence.
b. The green frogs sleep soundly.
c. Colorless green ideas sleep furiously.
d. Furiously sleep ideas green colorless.
The first sentence would be quite appropriate at the start of such a book; it is syntactically, semantically, and pragmatically well formed. The second sentence is syntactically and semantically well formed, but it would appear very strange at the start of an AI book; it is thus not pragmatically well formed for that context. The third sentence is syntactically well formed, but it is semantically nonsensical. The fourth sentence does not make any sense syntactically, semantically, or pragmatically.
2.3.3.1 Stemming
According to Kowalski (2011, p.76), stemming is a process that aims to reduce the number of variant representations of a concept to a standard morphological or canonical representation. The risk of stemming is that information about the concept might be lost in the process, decreasing accuracy or precision and reducing ranking performance. The advantage is that stemming increases recall. The original purpose of stemming was to improve performance and reduce the use of system resources by reducing the number of unique words that the system must accommodate. In general, then, stemming algorithms transform a word into a standard morphological representation (called a stem). For example, the stem "comput" associates "computable, computability, computation, computational, computed, computing, compute, computerize" with one term.
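The idea can be illustrated with a deliberately simplified Python sketch (not Kowalski's method, and far cruder than real stemmers such as Porter's, which use ordered rule sets with conditions on the remaining stem):

    # Illustrative suffix list only; real stemmers use ordered rules.
    SUFFIXES = ["ational", "ation", "able", "ing", "ed", "es", "s", "e"]

    def naive_stem(word):
        word = word.lower()
        for suffix in SUFFIXES:  # longer suffixes are tried first
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                return word[:-len(suffix)]
        return word

    variants = ["computable", "computation", "computational",
                "computed", "computing", "compute"]
    print({naive_stem(w) for w in variants})  # {'comput'}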
2.3.3.2 Lemmatization
"Lemmatization is a process to find the base (entry) of a given word form" (Ingason, 2008, p.1). S. Nirenburg (2009, p.31) reinforces this by explaining that lemmatization is a process aimed at normalizing text by associating each word form with its own lemma. Normalization in this context is a process that identifies and removes the prefixes and suffixes of a word. The general case of morphological analysis also includes derivational processes, which are especially relevant for agglutinative languages. In addition, a prefixed and/or suffixed word form may have many interpretations, so a lemmatization algorithm must determine the context of the word form and analyze which interpretation is possible or appropriate in that context.
Manning, Raghavan, and Schütze (2009, p.32) explain that, for grammatical reasons, documents use different forms of a word, for example (in English) organize, organizes, and organizing. In addition, there are families of derivationally related words with similar meanings, such as democracy, democratic, and democratization. The goal of both stemming and lemmatization is to reduce inflectional forms, and sometimes derivationally related forms, to a common base form. For example:
- Input: "The boy's cars are different colors"
- Transformation: am, is, are => be
- Transformation: car, cars, car's, cars' => car
- Result: "The boy car be differ color"
Stemming usually refers to a heuristic process that removes inflectional affixes, and sometimes derivational affixes, in the hope that the removal achieves high precision. Lemmatization, on the other hand, usually refers to the use of a vocabulary and a morphological analysis of the word in question; it aims to remove only inflectional forms and to return the base or dictionary form of the word, known as the lemma. For example, for the English word "saw", stemming might return just "s", whereas lemmatization would return see or saw depending on the context (whether the token is used as a noun or a verb). Another difference between the two methods lies in derivational forms: stemming usually also cuts derivationally related words, whereas lemmatization only removes the inflectional forms of a lemma.
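The contrast can be sketched in Python; the toy lemmatizer below (an illustration only, with a hypothetical two-entry vocabulary) consults the word's part of speech, which pure suffix stripping cannot do:

    # Hypothetical vocabulary of irregular (word, POS) -> lemma mappings.
    IRREGULAR = {("saw", "VERB"): "see", ("saw", "NOUN"): "saw"}

    def toy_lemmatize(word, pos):
        if (word, pos) in IRREGULAR:
            return IRREGULAR[(word, pos)]
        if pos == "NOUN" and word.endswith("s"):
            return word[:-1]  # cars -> car
        return word

    print(toy_lemmatize("saw", "VERB"), toy_lemmatize("cars", "NOUN"))  # see car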
2.4 State of the Art
2.4.1 Lemmatization for Foreign Languages
This section is divided into two subsections: the first focuses on a brief explanation of some lemmatization methods for English, and the second gives an overview of some lemmatization processes for other foreign languages.
Loponen & Järvelin (2010) constructed a statistical, dictionary- and corpus-independent lemmatizer for low-resource languages (p. 15), called StaLe. In their publication, they describe StaLe lemmatizing four languages: Finnish, Swedish, German, and English.
Figure 2.1 StaLe lemmatization process (Loponen & Järvelin, 2010, p. 5)
StaLe accepts a word form as input, which is processed based on rules taken from a rule set. In some cases more than one rule can be applied to the input form, producing several candidate lemmas. The candidates are sorted by their confidence-factor values; then, according to the set parameters, they go through a checking phase that eliminates unfulfilling candidates, producing the result lemmas. The size of the training list for StaLe depends on the complexity of the target language: the more morphological variation a language has, the bigger the training list. The test set for StaLe was taken from 54 full-text English-language collections from CLEF 2003, and the lemmatizer achieved 91.09% accuracy in English lemmatization.
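The general shape of this candidate-and-confidence scheme can be sketched in Python; the rule format below is hypothetical, not StaLe's actual one:

    # Hypothetical rule format: (word-form suffix, lemma suffix, confidence).
    RULES = [("ies", "y", 0.9), ("es", "", 0.6), ("s", "", 0.8)]

    def candidate_lemmas(word, threshold=0.5):
        candidates = []
        for wf_suffix, lemma_suffix, confidence in RULES:
            if word.endswith(wf_suffix) and confidence >= threshold:
                candidates.append((word[:-len(wf_suffix)] + lemma_suffix,
                                   confidence))
        # Rank candidate lemmas by confidence, best first.
        return sorted(candidates, key=lambda c: c[1], reverse=True)

    print(candidate_lemmas("ponies"))  # [('pony', 0.9), ('ponie', 0.8), ('poni', 0.6)]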
Minnen, Carroll, & Pearce (2001) developed a program for English inflectional morphological analysis, and a morphological generator, based on finite-state techniques (p. 1). The analyzer accepts a pair of a word form and its PoS tag as input, and outputs the lemma and suffix of the input, using a rule set that represents English morphology, including irregular forms. The rule set itself is based on FLEX and was acquired semi-automatically from several large corpora and dictionaries (p. 1), but the process itself depends only on the rule set (i.e. it is dictionary-independent). Take one rule from (p. 4) for example:

    {A}+{C}"ied"    {return(lemma(3,"y","ed"));}
The rule is divided into two parts: the left part serves as the condition, and the right part serves as the action to be executed when the condition is fulfilled. The condition itself is a regular expression: {A}+ means a sequence of one or more upper- or lower-case letters, {C} stands for a consonant, and the double quotes represent an exact string match. The lemma function takes three arguments: how many characters to remove from the end, the suffix to attach, and the inflectional suffix that caused the lemma to transform into the word form (the input). For irregular forms, the analyzer adds specific rules/exceptions, for example exact string matches for irregular word forms. The test set for the analyzer was obtained from the CELEX lexical database of English. The overall test set contains 38,882 unique word forms, and the analyzer achieved 99.94% type accuracy and 99.93% token accuracy (p. 9).
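For illustration, that single rule could be re-encoded in Python roughly as follows (an approximation of the FLEX rule above, not Minnen et al.'s code):

    import re

    # {A}+{C}"ied": one or more letters, then a consonant, then "ied".
    RULE = re.compile(r"^[A-Za-z]+[bcdfghjklmnpqrstvwxz]ied$")

    def apply_rule(word):
        if RULE.match(word):
            # lemma(3, "y", "ed"): drop 3 chars, append "y", record "ed".
            return word[:-3] + "y", "ed"
        return None

    print(apply_rule("carried"))  # ('carry', 'ed')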
Northwestern University (2009) published a morphological adornment tool for English-language texts, called MorphAdorner. Listed among its features is an English lemmatizer that uses a rule set, including exceptions, to lemmatize all kinds of word forms. The rule set contains about 150 rules and 1,600 irregular forms. MorphAdorner accepts input that follows the NUPOS tag set, in the form of a (spelling, part of speech) pair. The lemmatizer is based on a list of irregular forms and grammar rules. Instead of focusing on a specific part-of-speech set, it categorizes irregular forms and rules using major part-of-speech classes: adjective, adverb, compound, conjunction, infinitive-to, plural noun, possessive noun, preposition, verb, and pronoun.
2.4.1.2 Lemmatization for Other Languages
Plisson, Lavrac, & Mladenic (2004) developed a lemmatizer for the Slovenian language. It uses the Ripple Down Rules induction algorithm (henceforth RDR) as the basis of its lemmatization algorithm (p. 83). Basically, RDR is a learning algorithm built on nested if-then-else branches/rules, and each rule can grow further depending on the given training set. The lemmatizer was tested on five datasets of various sizes, collected randomly from the Slovenian lexicon dictionary MULTEXT-East, which contains about 20,000 normalized words and 500,000 surface forms. The lemmatizer achieved 77.0% accuracy (p. 85).
Szopa (2007) developed a rule-based Dutch lemmatizer in Lisp, named LRBL (p. 1). Unlike RDR, LRBL's rules are hand-tuned, which avoids repeated training after every rule change. However, LRBL needs POS (part-of-speech) and feature tags in order to lemmatize an input word correctly. The output of LRBL's lemmatization process depends on a heuristic algorithm that selects which rules should be executed. The test set consists of 145,829 tagged and lemmatized Dutch tokens, on which LRBL achieved 86% accuracy (p. 13).
Ingason et al. (2008) described a lemmatizer for Icelandic, called Lemmald. The lemmatization process relies on IceTagger to tag the input before lemmatization is performed. The Icelandic Frequency Dictionary (IFD) corpus, which contains about 590K words and 700 different POS tags, is used for training. Furthermore, Lemmald can also be run with the Database of Modern Icelandic Inflections (DMII) as an add-on for improved accuracy; however, this impacts lemmatizer performance (p. 208). A Hierarchy of Linguistic Identities (HOLI) is also used to organize the features and feature structures for machine learning based on linguistic knowledge (p. 205). The test set was taken from the IFD corpus, which contains about 530,000 tokens, and evaluation shows that, given correct tagging, Lemmald lemmatizes with 98.54% accuracy. Using DMII as an add-on improves accuracy further, to 99.55% (p. 214).
Chrupala (2006) constructed a lemmatization method for languages with rich inflectional morphology. In this work, lemmatization is treated as a machine-learning classification task with automatic induction of class labels (p. 121). The approach is based on the Shortest Edit Script (SES) between the reversed input and output strings: computing the SES yields the optimal transformation of a word form into its lemma. The test was performed on eight languages: Spanish, Catalan, Portuguese, French, Polish, Dutch, German, and Japanese. The test and training sets were taken from lemma-annotated corpora/treebanks specific to each language, 10,000 tokens each. The lemmatizer achieved 88.42% baseline accuracy (p. 123).
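The edit-script idea can be approximated in Python with the standard difflib module (an illustration only; difflib's script is not guaranteed to be the shortest one used by Chrupala):

    import difflib

    def edit_script(word, lemma):
        # Reversing both strings clusters the edits near the start,
        # i.e. at the word ending where inflection happens, so the
        # same script generalizes across words sharing an inflection.
        a, b = word[::-1], lemma[::-1]
        matcher = difflib.SequenceMatcher(a=a, b=b)
        return [(op, a[i1:i2], b[j1:j2])
                for op, i1, i2, j1, j2 in matcher.get_opcodes()
                if op != "equal"]

    print(edit_script("pidieron", "pedir"))  # Spanish word form -> lemma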
2.4.2 Stemming for Indonesian Language
Frakes (1992) initiated the development of Indonesian stemming algorithms, with the approach of porting Porter's algorithm to fit the morphological rules of the Indonesian language. Since then, improvements to existing algorithms and new algorithmic approaches have been developed to improve the accuracy of Indonesian stemming. According to Asian (2009), the accuracy of a stemming algorithm is measured by the correctness of transforming a word form into its common root form, given a test set. Ideally, a good stemmer stems all words from the same semantic group to the same stem, but due to the irregularities that are prominent in all natural languages, all stemmers unavoidably make mistakes, including those that use vocabulary lists (Tala, 2003, p. 11). Furthermore, the Indonesian language is still in a period of development/transition, which means new words can be added, edited, and/or removed at any time, adding more difficulty to stemming Indonesian. The sections below describe each Indonesian stemming algorithm chronologically.
2.4.2.1 Nazief and Adriani's Algorithm
Nazief and Adriani's algorithm (henceforth referred to as NAZIEF) was developed in 1996, using a confix (combination of prefixes and suffixes) stripping approach with a dictionary lookup against a list of lemmas (Tala, 2003, p. 1). NAZIEF is based on comprehensive morphological rules that group together and encapsulate allowed and disallowed affixes, including prefixes, suffixes, and confixes (combinations of a prefix and a suffix), which are also known as circumfixes (Asian, 2005, p. 308). The logical flow of NAZIEF, as described in Asian (2005, pp. 308-309), is:
1. The input word is checked against the dictionary. If found, the input word is returned as the output and the algorithm ends. If the input word does not exist in the dictionary, it is stored in a temporary variable (e.g. CURRENT_WORD). After each stemming rule, CURRENT_WORD is checked against the dictionary.
2. Removal of inflectional suffixes, the set of suffixes that cannot alter the meaning of a word. This set is divided into two types: inflectional particle suffixes (-lah, -kah), for example "merekalah" and "apakah", and inflectional possessive pronoun suffixes (-ku, -mu, -nya), for example "sepatuku", "sepedamu", and "bajunya". If the algorithm successfully removes an inflectional particle suffix from CURRENT_WORD, this step is repeated to try to remove an inflectional possessive pronoun suffix.
3. Removal of derivational suffixes (-i, -an), the set of suffixes that may alter the meaning of a word, for example "dinikmati" and "makanan". The removed suffix is stored temporarily for a possible later recoding step.
4. Removal of derivational prefixes, which may alter the meaning of a word. This set is divided into two types: simple prefixes (di-, ke-, se-), for example "dimakan", "kemakan", and "sejalan", which can be removed immediately without otherwise affecting CURRENT_WORD, and complex prefixes (te-, be-, me-, pe-), for example "penahan", "menyelam", "bersimpati", and "terbuai", which need an additional process, using a specific rule table, to determine what change is needed. If CURRENT_WORD is still not found in the dictionary, step 4 is attempted again to check for stacked prefixes.
5. If CURRENT_WORD is not found after the recursive attempts of step 4, recoding is performed.
6. If CURRENT_WORD is still not found after step 5, the suffix removed at step 3 is checked. If "-an" was removed and the last letter is "-k", the "-k" is removed and step 4 is attempted again. If CURRENT_WORD is still not found, the suffix removed at step 3 is restored and step 4 is re-attempted. If CURRENT_WORD is still not found, the algorithm returns the original input word.
NAZIEF achieved 92.1% baseline accuracy on a unique-words test set (p. 312).
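The check-the-dictionary-after-every-rule control flow of steps 1-3 can be condensed into the Python sketch below (the prefix rules, recoding, and restoration of steps 4-6 are omitted, and a `dictionary` set of lemmas is assumed):

    PARTICLES = ("lah", "kah")
    POSSESSIVES = ("ku", "mu", "nya")
    DERIVATIONAL = ("i", "an")

    def nazief_sketch(word, dictionary):
        if word in dictionary:  # step 1
            return word
        current = word
        for group in (PARTICLES, POSSESSIVES, DERIVATIONAL):  # steps 2-4
            for suffix in group:
                if current.endswith(suffix):
                    current = current[:-len(suffix)]
                    break
            if current in dictionary:
                return current
        # Steps 4-6 (prefix removal, recoding, restoration) omitted.
        return word

    print(nazief_sketch("bajunya", {"baju"}))  # baju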
2.4.2.2 Arifin and Setiono's Algorithm
In 2002, Arifin and Setiono constructed an algorithm (henceforth referred to as ARIF) that is a variant of NAZIEF. Overall, the algorithm still preserves dictionary checking, prefix and suffix stripping, and recoding, but adds affix restoration functionality (Asian, 2009, p. 64). ARIF limits affix removal to two prefixes and three suffixes (p. 64). The logical flow of ARIF is described below:
1. First, the input word is checked against the dictionary. If it exists, the input word is returned as the result lemma. Otherwise, the input word is saved temporarily in a variable, e.g. CURRENT_WORD.
2. Prefix removal is done recursively, with a loop limit of two. Each pass detects whether CURRENT_WORD contains a known prefix (be-, di-, ke-, me-, pe-, se-, te-). After each removal, CURRENT_WORD is checked against the dictionary, and the process continues if it does not exist there. Recoding is performed if possible.
3. Suffix removal is done recursively, with a loop limit of three. Each pass detects whether CURRENT_WORD ends with a known suffix (-i, -kan, -an, -kah, -lah, -tah, -pun, -ku, -mu, -nya). After each removal, CURRENT_WORD is checked against the dictionary.
4. This step runs the affix restoration in a certain order; after each substep, the candidate result is checked against the dictionary and recoded if possible.
Assuming two prefixes and three suffixes were removed in the previous steps, the order is as follows:
a. Restore all prefixes to CURRENT_WORD.
b. Restore the second prefix to CURRENT_WORD.
c. Restore all prefixes and the third suffix to CURRENT_WORD.
d. Restore the second prefix and the third suffix to CURRENT_WORD.
e. Restore all prefixes, the third suffix, and the second suffix to CURRENT_WORD.
f. Restore the second prefix, the third suffix, and the second suffix to CURRENT_WORD.
5. If after step 4 CURRENT_WORD still remains unknown to the dictionary, the process returns the original input word form.
Asian identified two shortcomings of this algorithm (pp. 65-66). First, it may remove a suffix that is identical to a previously removed suffix (identical affix removal). Second, ARIF is sensitive to the order of prefix and suffix removal. ARIF achieved 88.0% accuracy on a unique-words test set.
2.4.2.3 Vega's Algorithm
Berlian Vega (2001) proposed a stemmer with a different approach: a purely rule-based algorithm (henceforth referred to as VEGA). VEGA is dictionary-independent; however, it still uses an external list of exceptions to handle exceptional forms in Indonesian (e.g. 'pelajar' is stemmed to 'ajar'). This means that the completeness of the rules determines the accuracy of VEGA (i.e. it relies heavily on its rule set). The order of the rules is also important, because each rule is checked sequentially, and some rules may use other rules in their process in order to break a given word form down into smaller parts. The algorithm proceeds to the next rule only when the current rule fails. Consider an example rule set for VEGA, taken from Asian (2009, p. 66):
Word form: 'keselamatan'
(Example) rule set:
Rule 1: word(Root) -> circumfix(Root)
Rule 2: word(Root) -> StemWord
Rule 3: circumfix(Root) -> ber-(Root), (-kan | -an)
Rule 4: circumfix(Root) -> ke-(Root), -an
Rule 5: ber-(Root) -> ber-, stem(Root)
Rule 6: ke-(Root) -> ke-, stem(Root)
The algorithm starts from the first rule, with 'Root' containing the input word form. The 'circumfix' function passes 'Root' to Rule 3. Rule 3 then calls the ber-(Root) function (i.e. Rule 5), which fails because the prefix 'ber-' is not found in the current word. This failure causes Rule 3 to fail, so the algorithm proceeds to Rule 4, which calls the ke-(Root) rule (i.e. Rule 6); this correctly returns 'selamatan' to Rule 4, and Rule 4 successfully returns 'selamat'. VEGA handles exceptional cases by creating specific rules with literal match conditions, such as one for 'megawati' to prevent 'me-...-i' affix removal (p. 66). VEGA achieved 69.4% accuracy on a unique-words test set (p. 73).
2.4.2.4 Ahmad, Yusoff, and Sembok’s Algorithm
Ahmad, Yusoff, and Sembok's stemming algorithm for the Malaysian language (AHMAD) was proposed in 1996. Malaysian is similar to Indonesian, but there are still some differences in terms of rules and affix usage. AHMAD uses a root-word dictionary and a set of valid affix rules to stem the word form correctly. When an input word form is entered, it is immediately checked against the dictionary. If it is unknown to the dictionary, the input is checked against all the affix rules, one by one. If, in the end, not even one single rule is satisfied, the input is returned unchanged. AHMAD also performs recoding when the correct requirements are met. Take for example "mengalahkan" and an example affix rule "meng-...-kan". The first removal produces "alah", which is not in the dictionary. However, because recoding is possible, it gives the result "kalah", which is the correct root word. Overall, AHMAD's accuracy depends on the ordering of the affix rules, because if they are ordered wrongly, AHMAD can mistakenly stem certain word forms (Asian, 2009, p.68). AHMAD achieved 88.3% accuracy on a unique-words test set (p. 73).
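The recoding step for this example can be sketched in Python as follows (only the single "meng-...-kan" rule is shown, with an assumed lemma `dictionary`):

    def strip_meng_kan(word, dictionary):
        if word.startswith("meng") and word.endswith("kan"):
            candidate = word[4:-3]
            if candidate in dictionary:
                return candidate
            recoded = "k" + candidate  # recoding: meng- may elide an initial 'k'
            if recoded in dictionary:
                return recoded
        return word

    print(strip_meng_kan("mengalahkan", {"kalah"}))  # kalah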
2.4.2.5 Idris's Algorithm
In 2001, Idris extended AHMAD by adding alternation between prefix and suffix removal. Idris constructed two types of affix removal. The first type assumes that the word form may have different prefixes; for the word form "mematahkan", the result can be either "mem-atah-kan" or "me-matah-kan". The second type prioritizes recoding first; with the same example, the result can be either "mem-atah-kan" or (with recoding) "me-patah-kan". Asian (p. 69) analyzed that the second type performs slightly better than the first. Like AHMAD, Idris's algorithm is sensitive to rule ordering. Idris's second algorithm type achieved 88.8% accuracy on a unique-words test set (p. 73).
2.4.2.6 Confix-Stripping Stemmer
In 2005, Jelita Asian attempted to improve the accuracy of NAZIEF (1996), because NAZIEF had both the best scheme/approach to Indonesian stemming and the best stemming accuracy. According to her analysis, NAZIEF's errors are mostly caused by several aspects: non-root words in the dictionary, an incomplete dictionary, and hyphenated words, while the rest are caused by ineffective rules and rule ordering (Asian, 2005, p.312). In 2007, Asian, Nazief, et al. collaborated on a paper presenting the 'Confix-Stripping Stemmer', an improved version of NAZIEF. The modified rules and algorithm changes were presented in a detailed, algorithmic explanation:
1. The input is first checked against the dictionary. If the input exists in the dictionary, it is returned as the result lemma.
2. Inflectional particle suffixes (-kah, -lah, -tah, -pun) are removed from the current input; the remainder is kept in a string variable (CURRENT_WORD) and checked against the dictionary. If it exists, the process terminates.
3. Inflectional possessive pronouns (-ku, -mu, -nya) are removed from CURRENT_WORD, which is then checked against the dictionary. If it exists, the process terminates.
4. Derivational suffixes (-i, -kan, -an) are removed from CURRENT_WORD, which is then checked against the dictionary. If it exists, the process terminates.
5. This step focuses on removing derivational prefixes (beN-, di-, ke-, meN-, peN-, se-, teN-) from CURRENT_WORD. This step is recursive, because in Indonesian morphology derivational prefixes can be stacked. Some prefixes (di-, ke-, se-) are considered simple, because in practice they do not change the lemma. The other prefixes (beN-, meN-, peN-, teN-) do change the lemma, and the change differs according to the first letter of the lemma. These transformations and variants are listed in Table 2.2 below.
Table 2.2 Derivational prefix stripping rule set
(Asian, et al., 2007, p. 13)

Rule 1: berV... -> ber-V... | be-rV...
Rule 2: berCAP... -> ber-CAP..., where C != 'r' and P != 'er'
Rule 3: berCAerV... -> ber-CAerV..., where C != 'r'
Rule 4: belajar... -> bel-ajar...
Rule 5: beC1erC2... -> be-C1erC2..., where C1 != {'r' | 'l'}
Rule 6: terV... -> ter-V... | te-rV...
Rule 7: terCerV... -> ter-CerV..., where C != 'r'
Rule 8: terCP... -> ter-CP..., where C != 'r' and P != 'er'
Rule 9: teC1erC2... -> te-C1erC2..., where C1 != 'r'
Rule 10: me{l|r|w|y}V... -> me-{l|r|w|y}V...
Rule 11: mem{b|f|v}... -> mem-{b|f|v}...
Rule 12: mempe{r|l}... -> mem-pe...
Rule 13: mem{rV|V}... -> me-m{rV|V}... | me-p{rV|V}...
Rule 14: men{c|d|j|z}... -> men-{c|d|j|z}...
Rule 15: menV... -> me-nV... | me-tV...
Rule 16: meng{g|h|q|k}... -> meng-{g|h|q|k}...
Rule 17: mengV... -> meng-V... | meng-kV...
Rule 18: menyV... -> meny-sV...
Rule 19: mempV... -> mem-pV..., where V != 'e'
Rule 20: pe{w|y}V... -> pe-{w|y}V...
Rule 21: perV... -> per-V... | pe-rV...
Rule 22: perCAP... -> per-CAP..., where C != 'r' and P != 'er'
Rule 23: perCAerV... -> per-CAerV..., where C != 'r'
Rule 24: pem{b|f|v}... -> pem-{b|f|v}...
Rule 25: pem{rV|V}... -> pe-m{rV|V}... | pe-p{rV|V}...
Rule 26: pen{c|d|j|z}... -> pen-{c|d|j|z}...
Rule 27: penV... -> pe-nV... | pe-tV...
Rule 28: peng{g|h|q}... -> peng-{g|h|q}...
Rule 29: pengV... -> peng-V... | peng-kV...
Rule 30: penyV... -> peny-sV...
Rule 31: pelV... -> pe-lV..., except "pelajar", which returns "ajar"
Rule 32: peCP... -> pe-CP..., where C != {r|w|y|l|m|n} and P != 'er'
Rule 33: peCerV... -> per-CerV..., where C != {r|w|y|l|m|n}
There are several termination conditions for this step:
a. The prefix and the previously removed suffix form a pair listed in the disallowed affix pair table below (Table 2.3).
b. The removed prefix is identical to a previously removed prefix.
c. The recursion limit of three for this step has been reached.
Table 2.3 Disallowed prefix and suffix pairs; the only exception is the "ke-" and "-i" affix pair for the root word "tahu". (Asian et al., 2007, p. 6)

Prefix | Disallowed suffixes
ber- | -i
di- | -an
ke- | -i and -kan
me- | -an
ter- | -an
per- | -an
The removed prefix is recorded, and CURRENT_WORD is checked against the dictionary. If CURRENT_WORD does not exist in the dictionary and no termination condition is satisfied, then step 5 is repeated with CURRENT_WORD as input.
6. If CURRENT_WORD is still not found after step 5, Table 2.2 is examined to see whether recoding (p. 63) is possible. In the rule set, several rules have more than one output. Take rule 17 for example: mengV has two outputs, meng-V or meng-kV. In step 5, the first (left) output is always picked first, and this can cause errors. Recoding undoes this kind of error by going back to the point in step 5 where the output selection happened and selecting the other output instead (a sketch of this case follows the steps below).
7. If CURRENT_WORD still remains unknown to the dictionary, the original input word is returned.
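The rule-17 recoding mentioned in step 6 can be sketched in Python (an illustration with an assumed lemma `dictionary`, not the published implementation):

    def apply_rule_17(word, dictionary):
        # Rule 17: mengV... returns meng-V... first, recodes to meng-kV...
        vowels = "aiueo"
        if word.startswith("meng") and len(word) > 4 and word[4] in vowels:
            first = word[4:]  # try meng-V... first
            if first in dictionary:
                return first
            recoded = "k" + word[4:]  # recoding: meng-kV...
            if recoded in dictionary:
                return recoded
        return word

    print(apply_rule_17("mengukur", {"ukur"}))   # ukur
    print(apply_rule_17("mengirim", {"kirim"}))  # kirim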
In order to solve the major error causes stated above (i.e. non-root words in the lookup dictionary, an incomplete dictionary, and hyphenated words), Asian suggested three approaches:
1. Improve dictionary quality by using different dictionary sources, and compare their accuracy with that of the previous dictionary.
2. Add extra rules to handle hyphenated words. The main idea behind these rules is that if a hyphenated word contains an exactly repeated pair word (e.g. bulir-bulir), it is stemmed to the repeated word (bulir). This also applies to hyphenated words with affixes (e.g. seindah-indahnya): the affixes are removed first, and then the pair word is checked for stemmability (a sketch of this idea appears at the end of this subsection).
3. Modify the rules, prefixes, and suffixes:
a. Rule alterations for the prefixes "ter-", "pe-", "mem-", and "meng-", which have already been applied to Table 2.2 above. In detail, rules 9 and 33 were added, and rules 12 and 16 are modified versions of the previous rules.
b. Prefix removal is performed before suffix removal if a given word has one of the affix pairs below:
- "be-" and "-lah"
- "be-" and "-an"
- "me-" and "-i"
- "di-" and "-i"
- "pe-" and "-i"
- "ter-" and "-i"
Compared against NAZIEF on the same dataset, the modified NAZIEF achieves around 2-3% higher accuracy (approximately 95%).
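The hyphenated-word handling of approach 2 can be sketched in Python (illustrative only; `stem` stands for the base confix-stripping stemmer):

    def stem_hyphenated(word, stem):
        if "-" in word:
            left, _, right = word.partition("-")
            if left == right:  # bulir-bulir -> bulir
                return left
            left_stem, right_stem = stem(left), stem(right)
            if left_stem == right_stem:  # seindah-indahnya -> indah
                return left_stem
        return stem(word)

    print(stem_hyphenated("bulir-bulir", lambda w: w))  # bulir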
2.4.2.7 Enhanced Confix Stripping
Arifin, Mahendra, & Ciptaningtyas (2009) extended the Confix-Stripping Stemmer, improving its accuracy, by solving unhandled cases involving specific prefix types (p. 151), listed below:
1. "mem-p", as in mempromosikan;
2. "men-s", as in mensyukuri;
3. "menge-", as in mengerem;
4. "penge-", as in pengeboman;
5. "peng-k", as in pengkajian;
6. incorrect affix removal order, resulting in wrongly stemmed input. For example, the word pelaku is overstemmed because the "-ku" at the end of the word is treated as an inflectional possessive pronoun suffix. Another example is the word pelanggan, which is overstemmed because its final "-an" is treated as a derivational suffix.
To solve the cases above, Arifin et al. suggested two improvements:
1. Rule modifications and additions to Table 2.2, to fit the specific unhandled cases above.
2. An extra stemming process called loopPengembalianAkhiran (p. 151), henceforth referred to as LPA. This extra step is appended after the last step of the CS Stemmer's stemming process, specifically after the recoding attempt has failed (i.e. Step 8 in the CS Stemmer). After each step below, a dictionary lookup is performed to check whether the processed input is listed in the dictionary. The detailed flow of LPA is as follows:
a. Return CURRENT_WORD to its state before recoding, restore all prefixes that were removed in the prefix removal process, and perform a dictionary lookup.
b. Redo the prefix removal process.
c. Restore the previously removed suffixes in order: derivational, possessive pronoun, and particle suffixes. For each suffix restored, steps d and e are performed. An important exception is made for the derivational suffix "-kan": first only the "-k" is restored and steps d and e are performed; if this fails, the rest ("-an") is restored, and steps d and e are performed again (a sketch of this try order follows these steps).
d. Redo the prefix removal process, and perform recoding if possible.
e. If the dictionary lookup fails, execute step a and restore the next suffix in the order given in step c.
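The "-kan" exception in step c amounts to a fixed try order, sketched below (an illustration of the order only, with an assumed lemma `dictionary`):

    def restore_kan(current, dictionary):
        # Restore only '-k' first; if the lookup fails, restore '-an' too.
        for candidate in (current + "k", current + "kan"):
            if candidate in dictionary:
                return candidate
        return None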
2.5 State of the Art Overview
The Indonesian stemming methods above progressed in different ways: some proposed a different approach, some developed or attempted to upgrade existing methods, and some modified an existing approach. In summary, the development progress is presented in Table 2.4 below:
Table 2.4 Stemming progress overview

Year | Method | Stemming Approach | Work | Acc. (%)
1996 | AHMAD (Malay) | Dictionary- and rule-based | Affix removal rules, with dictionary lookup for Malay | 88.3
1996 | NAZIEF | Dictionary- and rule-based | Affix and confix removal rules, with dictionary lookup | 92.1
2001 | VEGA | Pure rule-based | Affix removal rules, with hardcoded exceptions | 69.4
2001 | IDRIS | Dictionary- and rule-based | Improvement on AHMAD, uses two dictionaries | 88.8
2002 | ARIF | Dictionary- and rule-based | Variant of NAZIEF, always removes prefixes first | 88.0
2007 | CS STEMMER | Dictionary- and rule-based | Improvement on NAZIEF, with added and modified rules | 94
2009 | ENHANCED CS | Dictionary- and rule-based | Improvement on CS STEMMER, with an added process | -