CHAPTER 2
LITERATURE REVIEW
2.1 Fundamental Theory
2.1.1 Artificial Intelligence
Poole and Mackworth (2010:3) say “Artificial intelligence, or AI, is the field that studies the synthesis and analysis of computational agents that act intelligently.” One of the branches of Artificial Intelligence is Natural Language Processing. Rajesh and Reddy (2009:421) say “Natural Language Processing is a subfield of computational linguistics that aims at designing and building software that will analyze, understand and generate natural human language, so that in the future we will be able to interface with computers, in both written and spoken contexts, using natural human languages instead of computer languages.”
2.1.1.1 Natural Language Processing
Akerkar and Sajja (2010:323) say “Natural Language Processing (NLP) is a
theoretically motivated variety of computational techniques for analyzing and
representing naturally occurring texts at one or more levels of linguistic analysis for
the purpose of achieving human-like language processing for a range of tasks or
applications.”
There are seven levels of linguistic analysis, as follows:
2.1.1.1.1 Phonological
Akerkar and Joshi (2008:71) say “Phonetics is the interpretation of speech sounds within and across words. It is the study of language in terms of the relationships between phonemes, whereas phonemes are the smallest distinct sound units in a given language. Phonetic knowledge is used, for example, for building speech-recognizing systems.”
2.1.1.1.2 Morphology
Akerkar and Sajja (2010:71) say “It is the study of the meaningful parts of
words. It deals with componential nature of words, which are composed of
morphemes. Morphemes are the smallest elements of meaning in a language.
Morphological knowledge is used, for example, for automatic stemming, truncation
or masking of words.”
2.1.1.1.3 Lexicology
Akerkar and Sajja (2010:71) say “Lexicology is the study of words. This
level refers to parts-of-speech tagging or the use of lexicons. Lexicons are utilized in
IR system to ensure that a common vocabulary is used in selecting appropriate
indexing or searching terms or phrases.”
2.1.1.1.4 Syntactic
Akerkar and Sajja (2010:71) say “The syntactic level of linguistic analysis is
concerned with how words arrange themselves in construction. Syntax is the study of
the rules, or patterned relations, that govern the way the words in a sentence are
arranged. Syntactic rules are used in parsing algorithms. Meaning can be derived
from word’s position and role in sentence. The structure of a sentence conveys
meaning and relationship between words, even if we do not know what their
dictionary meanings are. All this is conveyed by the syntax of the sentence.”
2.1.1.1.5 Semantics
Akerkar and Sajja (2010:71) say “Semantics involves the study of the
meaning of word. This is more complex level of linguistic analysis. The study of the
meaning of isolated words may be termed lexical semantics. The study of meaning is
also related to syntax at the level of the sentence and to discourse at the level of text.
By using both syntactic and semantic levels of analysis, Natural Language
Processing systems can identify automatically phrases of two or more words that
when looked at separately have quite different meanings.”
2.1.1.1.6 Discourse Analysis
Akerkar and Sajja (2010:71) say “Although syntax and semantics work with sentence-length units, the discourse level of NLP works with units of text longer than a sentence.” This level relies on the concept of predictability. It uses document structure to further analyze the text as a whole. By understanding the structure of a document, Natural Language Processing systems can make certain assumptions. Examples from information science are the resolving of anaphora and ellipsis and the examination of the effect on proximity searching.
2.1.1.1.7 Pragmatics
Akerkar and Sajja (2010:72) say “Pragmatics is often understood as the study
of how the context (or world knowledge) influences meaning. This level is in some
ways far more complex and work intensive than all the other levels. This level
depends on a body of knowledge about the world that comes from outside the
document. Though it is easy for people to choose the right sense of the word, it is
extremely difficult to program a computer with all the world knowledge necessary to
do the same.”
2.1.1.2 Discourse
Mitkov (2010:599) says “According to the Longman dictionary, discourse is
(1) a serious speech or piece of writing on a particular subject, (2) serious
conversation or discussion between people, or (3) the language used in particular
types of speech or writing.” The term ‘serious’ here means that the text produced is
not a random collection of symbols or words but related and meaningful sentences
which have to do with the fact that discourse is expected to be both cohesive and
coherent. Tofiloski (2009:11) also defines discourse in two ways: as a unit of language above the sentence level (simply, a group of sentences), and as having a particular focus. Discourse can be represented as paragraphs, sections, chapters, parts, or stories (Tofiloski, 2009:15).
2.1.1.3 Paragraph
Oshima and Hogue (2007:3) define a paragraph as a group of related statements that a writer develops about a subject, where the first sentence states the
specific point or idea of the topic and the rest of the sentences in the paragraph
support that point. In accordance with Oshima and Hogue (2007), Rustipa (2010:3) emphasizes that the number of sentences in a paragraph is unimportant; however, the paragraph should be long enough to develop the main idea through its organizational pattern. A paragraph has three major structural parts: the topic sentence, which uses the Theme as the topic and the Rheme as the controlling idea; the supporting sentences, which explain the topic sentence; and the concluding sentence, which signals the end of the paragraph (Rustipa, 2010:4).
2.1.1.4 Cohesion
A cohesive device is a linguistic ‘device’ that helps to establish links among the sentences of a text, while cohesive links are the types of link that these devices construct (Mitkov, 2010:600).
Mitkov (2010:600) says “Cohesion in texts is more about linking sentences
or, more generally, textual units through cohesive devices such as anaphors and
lexical repetitions.”
2.1.1.4.1 Anaphora Resolution
Anaphora resolution is the process of determining the antecedent of an anaphor. Mitkov (2010:611-614) defines “anaphora as the linguistic phenomenon of pointing back to a previously mentioned item in text.” The anaphor is the entity that points back to the antecedent, where the antecedent is the entity to which the anaphor refers or stands for.
To implement anaphora resolution, the CogNIAC rules are used.
2.1.1.4.1.1 CogNIAC Rules
Clark, Fox, and Lappin (2010:619) say that “CogNIAC employs a set of
‘high-confidence’ rules which are successively applied to the pronoun under
consideration. The processing of a pronoun terminates after the application of the
first relevant rule.”
In order to increase the accuracy of CogNIAC rules, agreement constraints
are applied. The constraints are:
i. Gender Agreement
Gender agreement compares the anaphor’s gender to the candidate’s gender. If the genders are not the same, then the candidate will not be considered.
ii. Number Agreement
Number agreement classifies each entity as singular or plural. If the numbers are not the same, then the candidate will no longer be considered.
After the agreement constraints have been applied, the CogNIAC rules are applied to resolve the anaphora. CogNIAC consists of a set of rules which are applied to the pronouns of a text. Baldwin (1996:3) explains the core rules of CogNIAC, which are given below and illustrated by the sketch that follows the list:
1. Unique in Discourse: If there is a single possible antecedent i in the read-in
portion of the entire discourse, then pick i as the antecedent.
2. Reflexive: Pick nearest possible antecedent in read-in portion of current sentence
if the anaphor is a reflexive pronoun.
3. Unique in Current + Prior: If there is a single possible antecedent i in the prior
sentence and the read-in portion of the current sentence, then pick i as the
antecedent.
4. Possessive Pro: If the anaphor is a possessive pronoun and there is a single exact
string match i of the possessive in the prior sentence, then pick i as the
antecedent.
5. Unique Current Sentence: If there is a single possible antecedent in the read-in
portion of the current sentence, then pick i as the antecedent.
6. Unique Subject/ Subject Pronoun: If the subject of the prior sentence contains
a single possible antecedent i, and the anaphor is the subject of its sentence, then
pick i as the antecedent.
7. Cb-Picking: If there is a Cb i in the current finite clause that is also a candidate
antecedent, then pick i as the antecedent.
8. Pick Most Recent: Pick the most recent potential antecedent in the text.
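To make the cascade concrete, below is a minimal Python sketch of how the agreement constraints and a few of the CogNIAC rules could be chained. The Mention data structure, its fields, and the rule subset are illustrative assumptions for this context, not Baldwin's original implementation.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Mention:
    text: str
    gender: str            # 'male', 'female', 'neutral', or 'unknown'
    number: str            # 'singular' or 'plural'
    sentence: int          # index of the sentence the mention occurs in
    is_reflexive: bool = False

def agrees(anaphor: Mention, candidate: Mention) -> bool:
    # Agreement constraints: discard candidates whose gender or number differs.
    gender_ok = 'unknown' in (anaphor.gender, candidate.gender) or anaphor.gender == candidate.gender
    number_ok = anaphor.number == candidate.number
    return gender_ok and number_ok

def resolve(anaphor: Mention, antecedent_candidates: List[Mention]) -> Optional[Mention]:
    # A simplified CogNIAC-style cascade: rules are tried in order and
    # processing stops at the first rule that yields a decision.
    candidates = [m for m in antecedent_candidates if agrees(anaphor, m)]

    # Rule 1: unique in discourse.
    if len(candidates) == 1:
        return candidates[0]

    # Rule 2: reflexive -- pick the nearest candidate in the current sentence.
    if anaphor.is_reflexive:
        same_sentence = [m for m in candidates if m.sentence == anaphor.sentence]
        if same_sentence:
            return same_sentence[-1]

    # Rule 3: unique in current + prior sentence.
    recent = [m for m in candidates if m.sentence >= anaphor.sentence - 1]
    if len(recent) == 1:
        return recent[0]

    # Rule 8 (fallback): pick the most recent potential antecedent.
    return candidates[-1] if candidates else None

A full implementation would add rules 4 to 7 (possessive pronoun, unique current sentence, unique subject/subject pronoun, and Cb-picking) as further steps of the same cascade.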
2.1.1.5 Coherence
Coherence in writing means the sentences must hold together to form smooth
movement between sentences. One way to achieve coherence in a paragraph is to use
the same nouns and pronouns consistently (Oshima and Hogue, 2007:79).
Thornbury (2005:38) explains that “In English, sentences (and the clauses of
which they are composed) have a simple two-way division between what the
sentence is about (its topic) and what the writer or speaker wants to tell you about
that topic (the comment). The topic and comment are also called Theme and Rheme.”
Previous research used Centering Theory and the Entity Transition Value.
2.1.1.5.1 Thematic Development
Thornbury (2005:38) explains that the Theme is the subject of the sentence and is typically realized by a noun phrase. The Rheme is the new information of the sentence and is used to explain the topic or Theme. This definition is supported by Rustipa (2010:4), who says that the topic or Theme is the subject of each sentence in the paragraph, while the Rheme is the controlling idea that limits the topic in every sentence.
The way the Theme of a clause is developed is known as Thematic Development. The Theme of a clause is taken from the Theme or Rheme of previous sentences (Rustipa, 2010:7). There are three types of Thematic Development patterns, as follows:

• Theme Reiteration or Constant Theme Pattern
This pattern shows that the first Theme is picked up and repeated at the beginning of the next clause, as shown in Figure 2.1.
Figure 2.1 Theme Reiteration Pattern
Source: (Rustipa, 2010:7)

• Zig Zag Linear Theme Pattern
In this pattern, the subject matter in the Rheme of one clause is taken up as the Theme of the following clause, as shown in Figure 2.2.
Figure 2.2 Zig Zag Linear Theme Pattern
Source: (Rustipa, 2010:8)

• Multiple Theme / Split Rheme Pattern
In this pattern, a Rheme may include a number of different pieces of information, each of which may be taken up as the Theme in a number of subsequent clauses, as shown in Figure 2.3.
Figure 2.3 Multiple Theme / Split Rheme Pattern
Source: (Rustipa, 2010:8)
2.1.1.5.2 Centering Theory
Centering Theory is a theory about local discourse coherence, which occurs between adjacent utterances or sentences. The idea of Centering Theory is that each sentence (utterance) has a focused entity as the center of the sentence (utterance) (Clark, Fox, and Lappin, 2010:607). Centering Theory tracks the movement of a single entity or noun as it changes in salience from sentence to sentence. Salience is determined by the highest rank that an entity has in one sentence. The subject has a higher rank than the object, which in turn has a higher rank than entities in other grammatical roles. The order of preferred grammatical roles is shown as follows (Tofiloski, 2009:33):
Subject > Indirect Object > Direct Object > Others
Figure 2.4 Grammatical Role Ordering
Source: (Tofiloski, 2009:33)
In an utterance U, the list of entities in U is called the Forward Looking Center, abbreviated as CF. The most salient entity in U is called the Preferred Center, or simply CP. The Backward Looking Center, abbreviated as CB, is the most salient entity in the previous utterance (Un-1) that is realized in the current utterance U. CB can be NULL when only one utterance has been introduced, or when the highest-ranked entity in the previous utterance (Un-1) is not realized in the current utterance (Un).
The main objective of Centering Theory is to classify the level of coherence between utterances. An utterance that continues the topic of the preceding utterance is more coherent than an utterance that features a topic shift. The level of coherence is obtained from a comparison between the preceding topic and the current topic of the utterances. The table below is used to classify the local coherence of utterances (Miltsakaki and Kukich, 2004:7).
Table 2.1 Standard Centering Theory Transitions
Source: (Miltsakaki and Kukich, 2004:7)
                | CB(Un) = CB(Un-1) | CB(Un) ≠ CB(Un-1)
CB(Un) = CP(Un) | Continue          | Smooth-shift
CB(Un) ≠ CP(Un) | Retain            | Rough-shift
Centering Theory classifies the degree of coherence based on sequences of utterance transitions. The degree of coherence classifies the utterance transitions into four basic types, illustrated by the sketch after the list:
1. Continue: This occurs when an entity has been the Preferred Center for three consecutive utterances, including the current utterance.
2. Retain: This occurs when there is a new Preferred Center yet the
previous Preferred Center occurs in the current utterance.
3. Smooth-shift: This occurs when the Preferred Center of the current
utterance was the highest ranked entity in the previous utterance that is
realized in the current utterance but the Backward Looking Centers of the
previous two utterances are not the same entity.
4. Rough-shift: This occurs when the Preferred Center was not the highest
ranked entity in the previous utterance that also appears in the current
utterance, as well as not having the same Backward Looking Centers.
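As an illustration of Table 2.1 and the four transition types, the following minimal sketch classifies a transition from the Backward Looking Centers of the two utterances and the Preferred Center of the current utterance; the entity names in the example call are assumptions.

from typing import Optional

def classify_transition(cb_current: Optional[str],
                        cb_previous: Optional[str],
                        cp_current: Optional[str]) -> str:
    # Compare CB(Un) with CB(Un-1) and CB(Un) with CP(Un), as in Table 2.1.
    if cb_current == cb_previous:
        return 'Continue' if cb_current == cp_current else 'Retain'
    return 'Smooth-shift' if cb_current == cp_current else 'Rough-shift'

# Example: both utterances keep 'Merida' as CB, and 'Merida' is also CP of the
# current utterance, so the transition is classified as a Continue.
print(classify_transition(cb_current='Merida', cb_previous='Merida', cp_current='Merida'))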
2.1.1.5.3 Entity Transition Value
The Entity Transition Value was proposed by Milan Tofiloski in 2009 as a project to extend Centering Theory. The focus of the study is to track all entities realized in an utterance and to show the benefits of such a model, whereas traditional Centering Theory tracks only the most salient entity. The main idea of his research is to establish a hard distinction between coherence and co-reference (anaphora resolution) and to identify how coherent the text is after all the referents have been resolved (Tofiloski, 2009:34).
In measuring entity coherence, sentences are represented as vectors whose elements correspond to the words of the sentence. Consider the example text “Flint likes Rapunzel. Rapunzel is a singer.”, which is illustrated as a matrix (a series of vectors) in the form of an entity grid:
Table 2.2 A Representation of the Text Incorporating Linear Order
     Words in Sentence
     Flint | likes | Rapunzel | is | a | singer
S1   X     | X     | X        | -  | - | -
S2   0     | 0     | X        | X  | X | X
In Table 2.2, sentence 1 (S1) has a vector of (X, X, X, -, -, -), where ‘X’ means the entity is realized in the sentence, while ‘–’ means the entity is not realized in the sentence. The next step in measuring entity coherence is to remove non-entity words from the vector. The table below shows the representation of the text with the non-entity words omitted. Thus, only words that are entities are considered for representation within the entity coherence model.
Table 2.3 A Representation of the Text with Non-Entity Words Removed from the Representation Vector
            Words in Sentence
Sentences   Flint | Rapunzel | singer
S1          X     | X        | 0
S2          0     | X        | X
The next step is to define the maximum salience value. Salience describes a way to arrange the entities realized in an utterance. The arrangement is ranked according to topic or prominence at any point in the text. Entity salience ranks are considered according to linear order (Tofiloski, 2009:44). Tofiloski (2009:47) explains “Maximum salience is determined by the utterance that realizes the most number of entity and using that value as ceiling for salience.” From the example above, the first utterance has 2 entities and the second utterance has 2 entities, so the maximum salience is 2. Each subsequent entity that occurs in an utterance receives a salience value obtained from the maximum salience decreased by the number of previously occurring entities in that utterance.
Table 2.4 Example of Vector and Each Entity’s Salience
            Words in Sentence
Sentences   Flint | Rapunzel | singer
S1          2     | 1        | 0
S2          0     | 2        | 1
The table above shows the representation of the text as a series of vectors that includes the salience of each word in the sentence. A larger value corresponds to a higher salience. Each row corresponds to a sentence, while each column corresponds to the salience of that word in the sentence.
The vector representation is a straightforward way to track the salience and realization of all the entities in a text, and how their values change over the text. The main benefit of a vector representation is that the mathematical structure is well-studied and common. Vectors and their operations are used in a variety of disciplines that create simulations and models of real-world processes (Tofiloski, 2009:41).
The next step in measuring entity coherence is to track salience change
between sentences by comparing entity salience values. Consider the following five
sentences as a sample text:
Merida was a student in computer science.
China is a large country.
Merida went to a private school in Jakarta.
Beijing is the capital of China.
She likes computer science.
Each sentence is then represented as a vector of entity salience values.
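The construction of these vectors can be sketched in Python as follows; the entity lists for the five sample sentences are illustrative assumptions (with ‘She’ resolved to Merida), not the exact grid used by Tofiloski (2009).

# A minimal sketch of the salience-vector construction described above.
sentences = [
    ['Merida', 'student', 'computer science'],
    ['China', 'country'],
    ['Merida', 'school', 'Jakarta'],
    ['Beijing', 'capital', 'China'],
    ['Merida', 'computer science'],     # 'She' resolved to Merida
]

# The entity vocabulary, in order of first mention.
vocabulary = []
for entities in sentences:
    for entity in entities:
        if entity not in vocabulary:
            vocabulary.append(entity)

# Maximum salience = number of entities in the utterance realizing the most entities.
max_salience = max(len(entities) for entities in sentences)

# Each realized entity receives max_salience minus the number of entities that
# occurred before it in the same utterance; unrealized entities receive 0.
vectors = []
for entities in sentences:
    vector = {entity: 0 for entity in vocabulary}
    for position, entity in enumerate(entities):
        vector[entity] = max_salience - position
    vectors.append(vector)

for index, vector in enumerate(vectors, start=1):
    print('S%d' % index, [vector[entity] for entity in vocabulary])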
The next step is to measure the entity transition using the formulas below. Equation (1) computes the transition value of an individual entity (ei) between two adjacent utterances (Un-1 and Un).
The next step is to define the transition weight. The purpose of weighting the entity transitions is to prefer utterances which have their most salient entities in common over utterances that have their lowest salient entities in common. The idea is to add a ‘mass’ to each entity according to its salience. Equation (2) defines the transition weight (Wtrans) between two utterances, favoring utterance transitions that have more entities in common. Its numerator is the sum of the number of entities in Un-1 that are realized in Un and the number of entities in Un that are realized in Un-1. This is designed to capture the idea that two utterances that realize the same number of entities should be considered more similar (Tofiloski, 2009:48).
The final transition metric, the Transition between Utterances in Equation (3), computes the average transition between all realized entities multiplied by Wtrans, the transition weight that corrects for the sparsity of the vector in the number of realized entities in an utterance (Tofiloski, 2009:49).
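Since the formulas themselves are not reproduced above, only the transition weight is sketched here, following the description of the numerator and the ‘total of non-redundant entities’ denominator given in Table 2.20; this is one reading of Equation (2), not Tofiloski's exact formulation, and Equations (1) and (3) are not implemented.

def transition_weight(prev_vector: dict, curr_vector: dict) -> float:
    # Wtrans: entities of U(n-1) realized in U(n) plus entities of U(n) realized
    # in U(n-1), divided by the total number of non-redundant entities.
    realized_prev = {e for e, salience in prev_vector.items() if salience > 0}
    realized_curr = {e for e, salience in curr_vector.items() if salience > 0}
    numerator = (len(realized_prev & realized_curr)      # entities of U(n-1) realized in U(n)
                 + len(realized_curr & realized_prev))   # entities of U(n) realized in U(n-1)
    denominator = len(realized_prev | realized_curr)     # non-redundant entities
    return numerator / denominator if denominator else 0.0

# Example with the salience vectors of Table 2.4 ("Flint likes Rapunzel. Rapunzel is a singer."):
s1 = {'Flint': 2, 'Rapunzel': 1, 'singer': 0}
s2 = {'Flint': 0, 'Rapunzel': 2, 'singer': 1}
print(transition_weight(s1, s2))   # (1 + 1) / 3, i.e. roughly 0.67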
2.1.2 Supported Library
2.1.2.1 Python Programming Language
Python is a popular open source programming language used for both standalone programs and scripting applications in a wide variety of domains. It is free, portable, powerful, and remarkably easy and fun to use. Python has many benefits for its users. Lutz (2009:3-4) summarizes the primary factors cited by Python users as follows:
1. Software quality
For many, Python’s focus on readability, coherence, and software quality in general sets it apart from other tools in the scripting world. Python code is designed to be readable, and hence more reusable and maintainable than traditional scripting languages. The uniformity of Python code makes it easy to understand, even if you did not write it. In addition, Python has deep support for more advanced software reuse mechanisms, such as object-oriented programming (OOP).
2. Developer productivity
Python boosts developer productivity many times beyond compiled or
statically typed languages such as C, C++, and Java. Python code is typically
one-third to one-fifth the size of equivalent C++ or Java code. That means
there is less to type, less to debug and less to maintain after the fact. Python
programs also run immediately, without the lengthy compile and link steps
required by some other tools, further boosting programmer speed.
3. Program portability
Most Python programs run unchanged on all major computer platforms. Porting Python code between Linux and Windows, for example, is usually just a matter of copying a script’s code between machines. Moreover, Python offers multiple options for coding portable graphical user interfaces, database access programs, web-based systems, and more. Even operating system interfaces, including program launches and directory processing, are as portable in Python as they can possibly be.
4. Support libraries
Python comes with a large collection of prebuilt and portable functionality, known as the standard library. This library supports an array of application-level programming tasks, from text pattern matching to network scripting. In addition, Python can be extended with both homegrown libraries and a vast collection of third-party application support software. Python’s third-party domain offers tools for website construction, numeric programming, serial port access, game development, and much more. The NumPy extension, for instance, has been described as a free and more powerful equivalent to the Matlab numeric programming system.
5. Component integration
Python scripts can easily communicate with other parts of an application, using a variety of integration mechanisms. Such integrations allow Python to be used as a product customization and extension tool. Today, Python code can invoke C and C++ libraries, can be called from C and C++ programs, can integrate with Java and .NET components, can communicate over frameworks such as COM, can interface with devices over serial ports, and can interact over networks with interfaces like SOAP, XML-RPC, and CORBA. It is not a standalone tool.
6. Enjoyment
Because of Python’s ease of use and built-in toolset, it can make the act of programming more a pleasure than a chore. Although this may be an intangible benefit, its effect on productivity is an important asset.
2.1.2.2 Django Framework
Django is a web framework based on the Model-View-Controller (MVC) design pattern. The models are Python classes the developer uses to interact with the database layer, the controllers are the layer that handles application logic and sends responses to requests, and the views are what the user sees and interacts with. Django is written in the Python programming language. It is extremely easy to learn and use, and is very lightweight and straightforward (Holovaty, 2014).
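As a small illustration of the model layer, below is a minimal sketch of a Django model class; the model name and fields are hypothetical and would live in the models.py of a configured Django project.

from django.db import models

class Paragraph(models.Model):
    # Each class attribute maps to a column in the database table.
    title = models.CharField(max_length=200)
    body = models.TextField()
    created_at = models.DateTimeField(auto_now_add=True)

    def __str__(self):
        return self.title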
2.1.2.3 Natural Language Toolkit
NLTK is a suite of Python modules distributed under the GPL open source
license via nltk.org. NLTK comes with a large collection of corpora, extensive
documentation, and hundreds of exercises, making NLTK unique in providing a
comprehensive framework for students to develop a computational understanding of
language. NLTK’s code base of 100,000 lines of Python code includes support for
corpus access, tokenizing, stemming, tagging, chunking, parsing, clustering,
classification, language modeling, semantic interpretation, unification, and much else
besides (Bird, Klein, and Loper, 2009:62).
Table 2.5 Language processing tasks and corresponding NLTK modules with examples of functionality
Source: (Loper, Klein, and Bird, 2014)
Language processing task   | NLTK modules                 | Functionality
Accessing corpora          | nltk.corpus                  | Standardized interfaces to corpora and lexicons
String processing          | nltk.tokenize, nltk.stem     | Tokenizers, sentence tokenizers, stemmers
Collocation discovery      | nltk.collocations            | t-test, chi-squared, point-wise mutual information
Part-of-speech tagging     | nltk.tag                     | n-gram, backoff, Brill, HMM, TnT
Classification             | nltk.classify, nltk.cluster  | Decision tree, maximum entropy, naive Bayes, EM, k-means
Chunking                   | nltk.chunk                   | Regular expression, n-gram, named entity
Parsing                    | nltk.parse                   | Chart, feature-based, unification, probabilistic, dependency
Semantic interpretation    | nltk.sem, nltk.inference     | Lambda calculus, first-order logic, model checking
Evaluation metrics         | nltk.metrics                 | Precision, recall, agreement coefficients
Probability and estimation | nltk.probability             | Frequency distributions, smoothed probability distributions
Applications               | nltk.app, nltk.chat          | Graphical concordancer, parsers, WordNet browser, chatbots
Linguistic fieldwork       | nltk.toolbox                 | Manipulate data in SIL Toolbox format
NLTK was designed with four primary goals, which are as follows (Loper, Klein, and Bird, 2014):
1. Simplicity
To provide an intuitive framework along with substantial building blocks,
giving users a practical knowledge of Natural Language Processing without getting
stuck in the tedious house-keeping usually associated with processing annotated
language data.
2. Consistency
To provide a uniform framework with consistent interfaces and data
structures, and easily guessable method names.
3. Extensibility
To provide a structure into which new software modules can be easily
accommodated, including alternative implementations and competing approaches to
the same task.
4. Modularity
To provide components that can be used independently without needing to
understand the rest of the toolkit.
2.1.2.4 Part-of-speech Tags
In corpus linguistics, part-of-speech tagging (POS tagging), also called grammatical tagging or word-category disambiguation, is the process of marking up a word in a text as corresponding to a particular part of speech, based on both its definition and its context.
The NLTK Project (2014) explains, “This package contains classes and interfaces for part of speech tagging. The result after using this package is a tag and a token. A tag is a case-sensitive string that specifies some property of a token, such as its part of speech. For example, for the word ‘fly’, this package will give information (tag, token) such as (fly, NN). This package defines several taggers, which take a token list (typically a sentence), assign a tag to each token, and then return the resulting tagged tokens. This package uses a unigram tagger to tag each word *w* by checking the most frequent tag for *w* in a training corpus.” A method pos_tag() is called to get the tag and token information. For example:
>>> from nltk import pos_tag, word_tokenize
>>> pos_tag(word_tokenize("John's big idea isn't all that bad."))
[('John', 'NNP'), ("'s", 'POS'), ('big', 'JJ'), ('idea', 'NN'), ('is', 'VBZ'), ("n't", 'RB'), ('all', 'DT'), ('that', 'DT'), ('bad', 'JJ'), ('.', '.')]
Natural Language Toolkit provides the documentation for each tag. The most
common part of speech tag schemes are those developed for the Penn Treebank and
Brown Corpus (Marsden, 2014).
Table 2.6 Part of Speech Tags
Source: (Marsden, 2014)
POS Tag | Description                            | Example
CC      | coordinating conjunction               | and
CD      | cardinal number                        | 1, third
DT      | determiner                             | the
EX      | existential there                      | there is
FW      | foreign word                           | d’hoevre
IN      | preposition/subordinating conjunction  | in, of, like
JJ      | adjective                              | big
JJR     | adjective, comparative                 | bigger
JJS     | adjective, superlative                 | biggest
LS      | list marker                            | 1)
MD      | modal                                  | could, will
NN      | noun, singular or mass                 | door
NNS     | noun, plural                           | doors
NNP     | proper noun, singular                  | John
NNPS    | proper noun, plural                    | Vikings
PDT     | predeterminer                          | both the boys
POS     | possessive ending                      | friend’s
PRP     | personal pronoun                       | I, he, it
PRP$    | possessive pronoun                     | my, his
RB      | adverb                                 | however, usually, naturally, here, good
RBR     | adverb, comparative                    | better
RBS     | adverb, superlative                    | best
RP      | particle                               | give up
TO      | to                                     | to go, to him
UH      | interjection                           | uhhuhhuhh
VB      | verb, base form                        | take
VBD     | verb, past tense                       | took
VBG     | verb, gerund/present participle        | taking
VBN     | verb, past participle                  | taken
VBP     | verb, sing. present, non-3d            | take
VBZ     | verb, 3rd person sing. present         | takes
WDT     | wh-determiner                          | which
WP      | wh-pronoun                             | who, what
WP$     | possessive wh-pronoun                  | whose
WRB     | wh-adverb                              | where, when
2.1.2.5 Pattern
Pattern is a package for Python version 2.4 or above with a focus on ease of use for natural language processing (tagger/chunker, n-gram search, sentiment analysis, WordNet), machine learning (vector space model, k-means clustering, Naïve Bayes + k-NN + SVM classifiers), network analysis (graph centrality and visualization), and web mining (Google + Twitter + Wikipedia, web spider, HTML DOM parser).
A practical application usually requires several independent toolkits chained together. Pattern is most closely related to toolkits such as NLTK, in that it is geared towards integration in the user’s own programs. In addition, it does not specialize in one domain but provides general cross-domain functionality.
Smedt and Daelemans (2012:1-2) note that Pattern has several advantages. First, the syntax is straightforward: function names and parameters are chosen to make the commands self-explanatory. Second, the documentation assumes no prior knowledge (except for a background in Python programming), so it is valuable as a learning environment for students as well as in research projects with a short development cycle.
Figure 2.5 Example workflow of Pattern
As shown in the figure above, Pattern is organized in separate modules that can be chained together. For example, text from Wikipedia (pattern.web) can be parsed for part-of-speech tags (pattern.en), queried by syntax and semantics (pattern.search), and used to train a classifier (pattern.vector).
Table 2.7 Language processing task
Source: (Smedt and Daelemans, Pattern For Python, 2012:2064)
Language processing task | Pattern modules | Functionality
Web data mining          | pattern.web     | Asynchronous requests, a uniform API for web services (Google, Bing, Twitter, Facebook, Wikipedia, Wiktionary, Flickr, RSS), a HTML DOM parser, HTML tag stripping functions, a web crawler, webmail, caching, Unicode support
POS tagging (English)    | pattern.en      | A fast part-of-speech tagger for English (identifies nouns, adjectives, verbs, etc. in a sentence), tools for sentiment analysis, English verb conjugation and noun singularization & pluralization, and a WordNet interface
POS tagging (Dutch)      | pattern.nl      | A fast part-of-speech tagger for Dutch (identifies nouns, adjectives, verbs, etc. in a sentence), and tools for sentiment analysis, Dutch verb conjugation and noun singularization & pluralization
N-gram pattern matching  | pattern.search  | Search queries can include a mixture of words, phrases, part-of-speech-tags, taxonomy terms, and control characters to extract relevant information
Vector space modeling (using a Document and a Corpus class) | pattern.vector | Compute TF-IDF, distance metrics (cosine, Euclidean, Manhattan, Hamming) and dimension reduction (Latent Semantic Analysis)
Graph data structuring (using Node, Edge and Graph classes) | pattern.graph | Modeling semantic networks, shortest path finding, subgraph partitioning, eigenvector centrality and betweenness centrality
Descriptive statistics functions | pattern.metrics | Evaluation metrics including a code profiler, functions for accuracy, precision and recall, confusion matrix, inter-rater agreement (Fleiss’ kappa), string similarity (Levenshtein, Dice) and readability (Flesch)
Database                 | pattern.db      | Wrappers for CSV files and SQLite and MySQL databases
The API usage example of the pattern.en library is:
>>> from pattern.en import parse
>>> print(parse('I ate pizza.', relations=True, lemmata=False, chunks=True))
I/PRP/B-NP/O/NP-SBJ-1 ate/VBD/B-VP/O/VP-1 pizza/NN/B-NP/O/NP-OBJ-1 ././O/O/O
The word “I” is identified with PRP (personal pronoun) tag, NP (Noun Phrase)
chunk, and SBJ (subject) role. The word “ate” is identified with VBD (verb, past
tense) tag and VP (Verb Phrase) chunk. Furthermore, the word “pizza” is identified
with NN (noun, singular or mass) tag, NP (Noun Phrase) chunk, and OBJ (object)
role.
2.1.2.6 Inflect
According to the Python Software Foundation (2014), inflect is a package to correctly generate plurals, singular nouns, ordinals, and indefinite articles, and to convert numbers to words. API usage of inflect is shown below:
>>> import inflect
>>> p = inflect.engine()
>>> print(p.singular_noun("men"))
The printed result from the code above will be “man”.
2.1.2.7 Naive Bayes Classifier
The Naive Bayes Classifier is a probabilistic classifier based on Bayes’ theorem with the assumption that attributes or features are conditionally independent of each other, given the class (Russell & Norvig, 2009:808). The full joint distribution of the Naive Bayes model can be written as (Russell & Norvig, 2009:499):
P(C, x1, ..., xn) = P(C) Π P(xi | C)  (4)
The formula above can be simplified for classification as:
P(C | x1, ..., xn) ∝ P(C) Π P(xi | C)  (5)
Where:
P(C | x1, ..., xn) is the conditional or posterior probability of the class given the attributes,
P(C) is the prior probability of the class, and
P(xi | C) is the probability of attribute xi given class C.
For example, assume that there are two classes: male and female. A person named Drew needs to be classified (as either male or female) based on a small database. Drew has the characteristics of height below 170 cm and short hair. The database is shown below.
Table 2.8 Example of Data
Name    | Over 170 cm | Hair Length | Sex
Drew    | No          | Short       | Male
Claudia | Yes         | Long        | Female
Drew    | No          | Long        | Female
Drew    | No          | Long        | Female
Alberto | Yes         | Short       | Male
Karin   | No          | Long        | Female
Nina    | Yes         | Short       | Female
Sergio  | Yes         | Long        | Male
Based on the database, the probability of a male person given the characteristics of Drew is
P(male) × P(below 170 cm | male) × P(short hair | male) = 3/8 × 1/3 × 2/3 ≈ 0.083
The probability of a female person given the characteristics of Drew is
P(female) × P(below 170 cm | female) × P(short hair | female) = 5/8 × 3/5 × 1/5 ≈ 0.075
Because the probability of Drew being male is higher than the probability of Drew being female, Drew is classified as male given the characteristics he possesses.
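The calculation above can be reproduced with a short sketch that estimates P(class) · Π P(attribute | class) directly from Table 2.8 by counting; this is a plain re-implementation of the worked example, not a library classifier.

# Rows of Table 2.8: (name, over_170_cm, hair_length, sex).
data = [
    ('Drew',    'No',  'Short', 'Male'),
    ('Claudia', 'Yes', 'Long',  'Female'),
    ('Drew',    'No',  'Long',  'Female'),
    ('Drew',    'No',  'Long',  'Female'),
    ('Alberto', 'Yes', 'Short', 'Male'),
    ('Karin',   'No',  'Long',  'Female'),
    ('Nina',    'Yes', 'Short', 'Female'),
    ('Sergio',  'Yes', 'Long',  'Male'),
]

def score(sex, over_170, hair):
    # P(sex) * P(over_170 | sex) * P(hair | sex), estimated by counting rows.
    rows = [row for row in data if row[3] == sex]
    prior = len(rows) / len(data)
    p_height = sum(1 for row in rows if row[1] == over_170) / len(rows)
    p_hair = sum(1 for row in rows if row[2] == hair) / len(rows)
    return prior * p_height * p_hair

# Drew: not over 170 cm, short hair.
print(score('Male', 'No', 'Short'))     # 3/8 * 1/3 * 2/3, roughly 0.083
print(score('Female', 'No', 'Short'))   # 5/8 * 3/5 * 1/5, roughly 0.075 -> classified as Male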
2.1.2.8 Synset and Lemmatization Process
A synset is a set of words that are interchangeable in some context without changing the truth value of the proposition in which they are embedded. For example, in English, ‘swap’ has the same meaning as ‘barter’, ‘swop’, and ‘trade’ (Princeton University, 2014). Examples of synset usage are:
>>from nltk.corpus import wordnet as wn
>>>wn.synsets('dog')
[Synset('dog.n.01'), Synset('frump.n.01'),
Synset('dog.n.03'), Synset('cad.n.01'),
Synset('frank.n.02'), Synset('pawl.n.01'),
Synset('andiron.n.01'), Synset('chase.v.01')]
A synset is identified with a 3-part name of the form word.pos.nn. From the example above, it can be seen that the word ‘dog’ is interchangeable in some contexts with the words given above, such as ‘dog’ with definition 1, ‘frump’, ‘dog’ with definition 3, ‘cad’, ‘frank’, ‘pawl’, ‘andiron’, and ‘chase’.
>>> from nltk.corpus import wordnet as wn
>>> wn.synset('kin.n.01').lowest_common_hypernyms(wn.synset('mother.n.01'))
[Synset('relative.n.01')]
From the example above, it can be seen that the words ‘kin’ and ‘mother’ have as their lowest common hypernym the word ‘relative’.
Lemmatization is an algorithmic process to determine the lemma of given words. A lemma is the lower-case ASCII text of a word as found in the WordNet database index files. The process involves finding lemmas and disambiguating them. Finding a lemma means determining the other words to which each word in a sentence could be mapped. For example, in English, ‘saw’ is lemmatized to the noun lemma ‘saw’ and the verb lemma ‘see’. Disambiguation consults the output of a part-of-speech tagger. For example, in English, once ‘saw’ is identified as a past-tense verb, lemmatization concludes that forms such as ‘saw’, ‘seen’, and ‘seeing’ map to the lemma ‘see’ (Smedt, Asch, and Daelemans, 2010:18).
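The ‘saw’ example can be reproduced with NLTK's WordNet lemmatizer, which takes a part-of-speech hint for disambiguation. This is a minimal sketch and assumes the WordNet corpus has already been downloaded with nltk.download('wordnet').

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# The part-of-speech hint disambiguates between the noun and the verb reading.
print(lemmatizer.lemmatize('saw', pos='n'))   # 'saw' (the noun)
print(lemmatizer.lemmatize('saw', pos='v'))   # 'see' (lemma of the past-tense verb)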
2.1.2.9 HTML5
MIT, ERCIM, and Keio (2011) explain “HTML5 defines the fifth major
revision of the core language of the World Wide Web: the Hypertext Markup
Language (HTML).” Moreover, MIT, ERCIM, Keio, and Beihang (2015) say
“HTML5 brings to the Web video and audio tracks without needing plugins;
programmatic access to a resolution-dependent bitmap canvas, which is useful for
rendering graphs, game graphics, or other visual images on the fly; native support for
scalable vector graphics (SVG) and math (MathML); annotations important for East
Asian typography (Ruby); features to enable accessibility of rich applications; and
much more.”
2.1.2.10 CSS
MIT, ERCIM, Koie, and Beihang (2015) explain “CSS is the language for
describing the presentation of Web pages, including colors, layout, and fonts. It
allows one to adapt the presentation to different types of devices, such as large
screens, small screens, or printers. CSS is independent of HTML and can be used
with any XML-based markup language. The separation of HTML from CSS makes it
easier to maintain sites, share style sheets across pages, and tailor pages to different
environments. This is referred to as the separation of structure (or: content) from
presentation.”
2.1.2.11 jQuery
The jQuery Foundation (2014) explains “jQuery is a fast, small, and feature-rich JavaScript library. It makes things like HTML document traversal and manipulation, event handling, animation, and Ajax much simpler with an easy-to-use API that works across a multitude of browsers.” For example, the function $("p").hide() is used to hide all <p> elements.
2.1.2.12 Ajax
Ajax stands for Asynchronous JavaScript and XML. Ullman and Dykes (2007:2) explain “Ajax is a set of programming techniques or a particular approach to Web programming. These programming techniques involve being able to seamlessly update a Web page or a section of a Web application with input from the server, but without the need for an immediate page refresh.” For example, jQuery’s $.ajax() method is used to send JSON data to, or receive it from, the controller.
2.1.2.13 JavaScript
Flanagan (2011:1) defines JavaScript as a high-level, dynamic, untyped, interpreted programming language that is well suited to object-oriented and functional programming styles. JavaScript derives its syntax from Java, its first-class functions from Scheme, and its prototype-based inheritance from Self. It is most commonly used as part of web browsers, whose implementations allow client-side scripts to interact with the user, control the browser, communicate asynchronously, and alter the document content that is displayed. It is also used in server-side network programming with runtime environments such as Node.js, in game development, and in the creation of desktop and mobile applications.
2.1.3 Software Development Process
2.1.3.1 Extreme Programming (XP)
Extreme Programming (XP) is the most widely used approach to agile software
development. Extreme Programming uses an object-oriented approach as its
preferred development paradigm and encompasses a set of rules and practices that
occur within the context of four framework activities: planning, design, coding, and
testing (Pressman, 2010:73).
Figure 2.6 The Extreme Programming Process
Key XP activities are summarized as follows (Pressman, 2010: 73-77):
1. Planning. The planning activity begins with listening to a requirements
gathering activity that enables the technical members of the XP team to
understand the business context for the software and to get a broad feel
for required output and major features and functionality. It consists of the
creation of user stories that describe required output, features, and
functionality for software to be built.
2. Design. XP design rigorously follows the KIS (keep it simple) principle.
A simple design is always preferred over a more complex representation.
XP encourages the use of CRC cards as an effective mechanism for
thinking about the software in an object-oriented context and also
refactoring which is the process of changing a software system in such a
way that it does not alter the external behavior of the code yet improves
the internal structure.
3. Coding. After stories are developed and preliminary design work is done,
the team develops a series of unit tests that will exercise each of the
stories that is to be included in the current release (software increment). A
key concept during the coding activity (and one of the most talked about
aspects of XP) is pair programming. XP recommends that two people
work together at one computer workstation to create code for a story. This
provides a mechanism for real time problem solving and real-time quality
assurance (the code is reviewed as it is created).
4. Testing. The unit tests that are created should be implemented using a
framework that enables them to be automated (hence, they can be
executed easily and repeatedly). XP acceptance tests, also called customer
tests, are specified by the customer and focus on overall system features
and functionality that are visible and reviewable by the customer.
2.2 Related Works
2.2.1 Journal Reference
2.2.1.1 Anaphora Resolution by Singh, Lakhmani, Mathur, Morwal
In 2014, research on anaphora resolution was conducted by Singh, Lakhmani, Mathur, and Morwal to compare two computational models based on recency and animistic knowledge. The first model uses the Centering Theory approach for implementing the recency factor. The second model uses the Lappin and Leass concept for the recency factor. Both computational models use the Gazetteer method for providing knowledge of the animistic factor. These approaches were tested with 3 data sets of different genres (Singh, Lakhmani, Mathur, and Morwal, 2014:51-57). The results of the experiments are shown in the tables below:
Genre 1 (Short Stories)
Table 2.9 Result of Experiment Performed by Model 1 on Short Stories
Source: (Singh, Lakhmani, Mathur, and Morwal, 2014)
Data Set | Total Sentences | Total Words | Total Anaphors | Correctly Resolved Anaphors | Accuracy
Story 1  | 16              | 165         | 12             | 12                          | 100%
Story 2  | 28              | 364         | 50             | 33                          | 66%
Overall Accuracy: 83%
Table 2.10 Result of Experiment Performed by Model 2 on Short Stories
Source: (Singh, Lakhmani, Mathur, and Morwal, 2014)
Data Set | Total Sentences | Total Words | Total Anaphors | Correctly Resolved Anaphors | Accuracy
Story 1  | 16              | 165         | 12             | 12                          | 100%
Story 2  | 28              | 364         | 50             | 36                          | 72%
Overall Accuracy: 86%
Genre 2 (News Articles)
Table 2.11 Result of Experiment Performed by Model 1 on News Articles
Source: (Singh, Lakhmani, Mathur, and Morwal, 2014)
Data Set | Total Sentences | Total Words | Total Anaphors | Correctly Resolved Anaphors | Accuracy
News 1   | 31              | 372         | 29             | 15                          | 52%
News 2   | 20              | 387         | 42             | 29                          | 69%
Overall Accuracy: 61%
Table 2.12 Result of Experiment Performed by Model 2 on News Articles
Source: (Singh, Lakhmani, Mathur, and Morwal, 2014)
Data Set | Total Sentences | Total Words | Total Anaphors | Correctly Resolved Anaphors | Accuracy
News 1   | 31              | 372         | 29             | 21                          | 73%
News 2   | 20              | 387         | 42             | 26                          | 62%
Overall Accuracy: 68%
Genre 3 (Biography)
Table 2.13 Result of Experiment Performed by Model 1 on Biography
Source: (Singh, Lakhmani, Mathur, and Morwal, 2014)
Data Set    | Total Sentences | Total Words | Total Anaphors | Correctly Resolved Anaphors | Accuracy
Biography 1 | 18              | 368         | 9              | 7                           | 78%
Biography 2 | 10              | 280         | 13             | 11                          | 85%
Overall Accuracy: 82%
Table 2.14 Result of Experiment Performed by Model 2 on Biography
Source: (Singh, Lakhmani, Mathur, and Morwal, 2014)
Data Set    | Total Sentences | Total Words | Total Anaphors | Correctly Resolved Anaphors | Accuracy
Biography 1 | 18              | 368         | 9              | 7                           | 78%
Biography 2 | 10              | 280         | 13             | 9                           | 70%
Overall Accuracy: 74%
From the above results, it is concluded that the accuracy of the computational model depends on the genre of the input data. Moreover, the accuracy is affected by several factors, such as recency, animistic knowledge (which are used in the experiment), number agreement, and gender agreement.
2.2.1.2 CogNIAC
CogNIAC (COGnition eNIAC) is used to resolve pronouns with limited knowledge and linguistic resources. CogNIAC presents a high-precision anaphora resolution approach that is capable of greater than 90% precision with 60% recall for some pronouns (Tyne and Wu, 2004:3). CogNIAC resolves anaphora without the requirement of general world knowledge. This is done by being very sensitive to ambiguity: it requires a unique antecedent and only resolves pronouns when the rules are satisfied. To measure the accuracy of the algorithm, three experiments were performed, with the results as follows:
Table 2.15 Performance of Individual Rules in Experiment 1 (Narrative Text)
Source: (Baldwin, 1996:7)
Rule                              | Recall (correct/actual) | Precision (#correct/#guessed)
1) Unique in Discourse            | 11% (32/298)            | 100% (32/32)
2) Reflexive                      | 3% (10/298)             | 100% (10/10)
3) Unique in Current + Prior      | 35% (104/298)           | 96% (104/110)
4) Possessive Pro                 | 1% (2/298)              | 100% (2/2)
5) Unique Current Sentence        | 6% (18/298)             | 81% (18/22)
6) Unique Subject/Subject Pronoun | 8% (24/298)             | 80% (24/30)
7) Cb-Picking                     | 4% (13/298)             | 4% (13/298)
8) Pick Most Recent               | 10% (29/298)            | 10% (29/298)
Table 2.16 Performance of CogNIAC on 42 Wall Street Journal Articles Resolving 3rd Person Pronominal Reference to People
Source: (Baldwin, 1996:8)
                                    | Recall        | Precision
CogNIAC on Wall Street Journal     | 78% (126/162) | 89% (126/142)
CogNIAC on WSJ corrected for gender | 89% (144/162) | 91% (144/158)
Table 2.17 The Results for CogNIAC for All Pronouns in the First 15 Articles of the MUC-6 Evaluations
Source: (Baldwin, 1996:9)
Rule                  | Recall (for just pros) | Precision
Quoted Speech         | 11% (13/114)           | 87% (13/15)
1) Uniq in Discourse  | 4% (5/114)             | 100% (5/5)
3) Uniq Curr + Prior  | 50% (57/114)           | 72% (57/79)
Search Back           | 1% (1/114)             | 33% (1/3)
2) Reflexive          | 0% (0/114)             | 0/0
5) Uniq Curr Sentence | 4% (5/114)             | 70% (5/7)
Subject Same Clause   | 4% (4/114)             | 57% (4/7)
In conclusion, the rules of CogNIAC performed quite well, with a precision of around 90% and good recall for the first two experiments. In the third experiment, the precision of several of the CogNIAC rules began to decrease.
2.2.1.3 Centering Theory
Hasler (2008:1-8) also conducted a study about Centering Theory, but in a different application. This study aims to investigate a new method for assessing the coherence of computer-aided summaries. To do this, a metric based on Centering Theory, which is known as a theory of local coherence and salience, was developed. This metric is applied to obtain a score for each summary, and the summary with the higher score is considered more coherent. The role of Centering Theory in this study is to check two consecutive sentences to find out their relation, which in this case is called a transition. In each sentence or utterance, the Backward Center (CB), the Forward Center (CF), and also the Preferred Center (CP) are determined. Then the CB and CP are compared in order to obtain the Centering Theory transition.
Furthermore, the author formulated a metric to accurately reflect the relation
between Centering Theory transitions and coherence. The metric represents the
positive and negative effects of the presence of certain transitions in summaries.
Table 2.18 Transition Weight for Summary Evaluation
Source: (Hasler, 2008:5)
Transition               | Weight
Continue                 | 3
Retain                   | 2
No Transition (Indirect) | 1
Smooth Shift             | -1
Rough Shift              | -2
No Transition (No CB)    | -5
Each transition is multiplied by its weight, and the weighted transitions are summed. The total is then divided by the number of transitions present, which is the total number of utterances minus 1.
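A minimal sketch of this scoring scheme using the weights of Table 2.18 is shown below; the example transition sequence is an assumption chosen only for illustration.

# Transition weights from Table 2.18 (Hasler, 2008:5).
WEIGHTS = {
    'Continue': 3,
    'Retain': 2,
    'No Transition (Indirect)': 1,
    'Smooth Shift': -1,
    'Rough Shift': -2,
    'No Transition (No CB)': -5,
}

def coherence_score(transitions):
    # Sum of the weighted transitions divided by the number of transitions,
    # which equals the total number of utterances minus 1.
    return sum(WEIGHTS[t] for t in transitions) / len(transitions)

# Example: a summary of four utterances yields three transitions.
print(coherence_score(['Continue', 'Retain', 'Rough Shift']))   # (3 + 2 - 2) / 3 = 1.0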
According to the study, this kind of discourse analysis, which studies the relations between sentences, is suitable for supporting the learning process of students in writing paragraphs. It is probable that this method would be appropriate for helping students learn to write paragraphs well through the implementation of coherence and cohesion.
2.2.1.4 Entity Transition Value
Tofiloski (2009:6-50) developed the Entity Transition Value method to determine the coherence of a text by tracking all entities and utterances. Each utterance is represented using a vector, while each noun phrase is represented as an entity. The values in the vector represent the salience of each entity in one utterance (sentence). To define the salience of each entity, the maximum salience must be determined first. The maximum salience is obtained from the maximum total number of entities appearing in one utterance. Then, each entity that appears is given a salience value obtained from the maximum salience decreased by the number of previously appearing entities in that utterance. After that, the transition value between sentences is defined. The transition value lies in the range 0-1: a value closer to 0 represents a coherent entity transition, while a value closer to 1 represents an incoherent entity transition.
2.2.2 Comparison
A comparison of the cohesion and coherence methods described above is shown below:
2.2.2.1 Cohesion
Table 2.19 Comparison of Cohesion Methods
Criteria | Singh, Lakhmani, Mathur, Morwal Anaphora Resolution | CogNIAC (Proposed System)
World Knowledge Requirement | No | No
Process | Resolves anaphora based on the recency and animistic factors; the recency factor uses the Centering or Lappin Leass concept, while the animistic factor uses the Gazetteer method. | Uses 8 rules to resolve anaphora: (1) unique in discourse; (2) reflexive; (3) unique in current + prior; (4) possessive pronoun; (5) unique current sentence; (6) unique subject/subject pronoun; (7) Cb-picking; (8) pick most recent.
2.2.2.2 Coherence
Table 2.20 Comparison of Coherence Methods
Criteria | Centering Theory | Entity Transition Value
Entity | Needed | Needed
Read-in portion | Current sentence and previous sentence | All of the sentences
Salience | Obtained from grammatical role: Subject > Indirect Object > Direct Object > Others | Obtained from the maximum salience decreased by the count of previously appearing entities in a sentence
Order | Based on grammatical role | Linear ordering
Process | (1) CF is obtained from the current sentence's entities; (2) CB is obtained from the entities in the previous sentence that are also realized in the current sentence; (3) CP is defined as the most salient entity in the current sentence. | (1) Maximum salience is obtained from the total number of entities in the sentence that has the most entities; (2) the transition value of a single entity (ei) between adjacent sentences or utterances (Ui) is computed; (3) the transition weight (Wtrans) is obtained from the entities realized in adjacent sentences or utterances over the total number of non-redundant entities; (4) the transition value between two adjacent utterances is computed.
Result | Level of coherence: Continue > Retain > Smooth Shift > Rough Shift | Value range is from 0 to 1