CHAPTER 2
LITERATURE REVIEW

2.1 Fundamental Theory

2.1.1 Artificial Intelligence
Poole and Mackworth (2010:3) say, "Artificial intelligence, or AI, is the field that studies the synthesis and analysis of computational agents that act intelligently." One of the branches of Artificial Intelligence is Natural Language Processing. Rajesh and Reddy (2009:421) say, "Natural Language Processing is a subfield of computational linguistics, aims at designing and building software that will analyze, understand and generate natural human language, so that in the future we will be able to interface with computers, in both written and spoken contexts using natural human languages, instead of computer languages."

2.1.1.1 Natural Language Processing
Akerkar and Sajja (2010:323) say, "Natural Language Processing (NLP) is a theoretically motivated variety of computational techniques for analyzing and representing naturally occurring texts at one or more levels of linguistic analysis for the purpose of achieving human-like language processing for a range of tasks or applications." The seven levels of linguistic analysis are as follows:

2.1.1.1.1 Phonological
Akerkar and Joshi (2008:71) say, "Phonetics is the interpretation of speech sounds within and across words. It is the study of language in terms of the relationships between phonemes, whereas phonemes are the smallest distinct sound units in a given language. Phonetic knowledge is used, for example, for building speech-recognizing systems."

2.1.1.1.2 Morphology
Akerkar and Sajja (2010:71) say, "It is the study of the meaningful parts of words. It deals with the componential nature of words, which are composed of morphemes. Morphemes are the smallest elements of meaning in a language. Morphological knowledge is used, for example, for automatic stemming, truncation or masking of words."

2.1.1.1.3 Lexicology
Akerkar and Sajja (2010:71) say, "Lexicology is the study of words. This level refers to parts-of-speech tagging or the use of lexicons. Lexicons are utilized in IR systems to ensure that a common vocabulary is used in selecting appropriate indexing or searching terms or phrases."

2.1.1.1.4 Syntactic
Akerkar and Sajja (2010:71) say, "The syntactic level of linguistic analysis is concerned with how words arrange themselves in construction. Syntax is the study of the rules, or patterned relations, that govern the way the words in a sentence are arranged. Syntactic rules are used in parsing algorithms. Meaning can be derived from a word's position and role in a sentence. The structure of a sentence conveys meaning and relationships between words, even if we do not know what their dictionary meanings are. All this is conveyed by the syntax of the sentence."

2.1.1.1.5 Semantics
Akerkar and Sajja (2010:71) say, "Semantics involves the study of the meaning of words. This is a more complex level of linguistic analysis. The study of the meaning of isolated words may be termed lexical semantics. The study of meaning is also related to syntax at the level of the sentence and to discourse at the level of text. By using both syntactic and semantic levels of analysis, Natural Language Processing systems can automatically identify phrases of two or more words that, when looked at separately, have quite different meanings."

2.1.1.1.6 Discourse Analysis
Akerkar and Sajja (2010:71) say, "Although syntax and semantics work with sentence-length units, the discourse level of NLP works with units of text longer than a sentence. This level relies on the concept of predictability.
It uses document structure to further analyze the text as a whole. By understanding the structure of a document, Natural Language Processing systems can make certain assumptions. Examples from information science are the resolving of anaphora and ellipsis and the examination of the effect on proximity searching."

2.1.1.1.7 Pragmatics
Akerkar and Sajja (2010:72) say, "Pragmatics is often understood as the study of how the context (or world knowledge) influences meaning. This level is in some ways far more complex and work intensive than all the other levels. This level depends on a body of knowledge about the world that comes from outside the document. Though it is easy for people to choose the right sense of a word, it is extremely difficult to program a computer with all the world knowledge necessary to do the same."

2.1.1.2 Discourse
Mitkov (2010:599) says, "According to the Longman dictionary, discourse is (1) a serious speech or piece of writing on a particular subject, (2) serious conversation or discussion between people, or (3) the language used in particular types of speech or writing." The term 'serious' here means that the text produced is not a random collection of symbols or words but related and meaningful sentences, which has to do with the fact that discourse is expected to be both cohesive and coherent. Tofiloski (2009:11) also defines discourse in two ways: as a unit of language which is above the sentence, or simply a group of sentences, and as having a particular focus. Discourse can be represented as paragraphs, sections, chapters, parts, or stories (Tofiloski, 2009:15).

2.1.1.3 Paragraph
Oshima and Hogue (2007:3) define a paragraph as a group of related statements that a writer develops about a subject, where the first sentence states the specific point or idea of the topic and the rest of the sentences in the paragraph support that point. In accordance with Oshima and Hogue (2007), Rustipa (2010:3) emphasizes that the number of sentences in a paragraph is unimportant; however, the paragraph should be long enough to develop the main idea through its organizational pattern. A paragraph has three major structural parts: the topic sentence, which uses the Theme as the topic and the Rheme as the controlling idea; the supporting sentences, which explain the topic sentence; and the concluding sentence, which signals the end of the paragraph (Rustipa, 2010:4).

2.1.1.4 Cohesion
Cohesive devices are linguistic 'devices' that help to establish links among the sentences of a text, while cohesive links are the types of link that these devices construct (Mitkov, 2010:600). Mitkov (2010:600) says, "Cohesion in texts is more about linking sentences or, more generally, textual units through cohesive devices such as anaphors and lexical repetitions."

2.1.1.4.1 Anaphora Resolution
Anaphora resolution is the process of determining the antecedent of an anaphor. Mitkov (2010:611-614) defines anaphora as "the linguistic phenomenon of pointing back to a previously mentioned item in text." The anaphor is the entity that points back to the antecedent, where the antecedent is the entity to which the anaphor refers or stands for. To implement the anaphora resolution, CogNIAC rules are used.

2.1.1.4.1.1 CogNIAC Rules
Clark, Fox, and Lappin (2010:619) say that "CogNIAC employs a set of 'high-confidence' rules which are successively applied to the pronoun under consideration. The processing of a pronoun terminates after the application of the first relevant rule." In order to increase the accuracy of the CogNIAC rules, agreement constraints are applied.
The constraints are:
i. Gender Agreement
Gender agreement compares the anaphor's gender to the candidate's gender. If the two genders are not the same, then the candidate will not be considered.
ii. Number Agreement
Number agreement classifies the entity into singular and plural. If the anaphor and the candidate do not agree in number, then the candidate will no longer be considered.
After the agreement constraints have been applied, the CogNIAC rules are applied to help resolve anaphora. CogNIAC consists of a set of rules which are applied to the pronouns of a text. Baldwin (1996:3) explains the core rules of CogNIAC, which are given below:
1. Unique in Discourse: If there is a single possible antecedent i in the read-in portion of the entire discourse, then pick i as the antecedent.
2. Reflexive: Pick the nearest possible antecedent in the read-in portion of the current sentence if the anaphor is a reflexive pronoun.
3. Unique in Current + Prior: If there is a single possible antecedent i in the prior sentence and the read-in portion of the current sentence, then pick i as the antecedent.
4. Possessive Pro: If the anaphor is a possessive pronoun and there is a single exact string match i of the possessive in the prior sentence, then pick i as the antecedent.
5. Unique Current Sentence: If there is a single possible antecedent i in the read-in portion of the current sentence, then pick i as the antecedent.
6. Unique Subject/Subject Pronoun: If the subject of the prior sentence contains a single possible antecedent i, and the anaphor is the subject of its sentence, then pick i as the antecedent.
7. Cb-Picking: If there is a Cb i in the current finite clause that is also a candidate antecedent, then pick i as the antecedent.
8. Pick Most Recent: Pick the most recent potential antecedent in the text.

2.1.1.5 Coherence
Coherence in writing means that the sentences must hold together to form a smooth movement between sentences. One way to achieve coherence in a paragraph is to use the same nouns and pronouns consistently (Oshima and Hogue, 2007:79). Thornbury (2005:38) explains that "In English, sentences (and the clauses of which they are composed) have a simple two-way division between what the sentence is about (its topic) and what the writer or speaker wants to tell you about that topic (the comment). The topic and comment are also called Theme and Rheme." Previous research has used Centering Theory and the Entity Transition Value.

2.1.1.5.1 Thematic Development
Thornbury (2005:38) explains that the Theme is the subject of the sentence and is typically realized by a noun phrase. The Rheme is the new information of the sentence and is used to explain the topic or Theme. This definition is supported by Rustipa (2010:4), who says that the topic or Theme is the subject of each sentence in the paragraph, while the Rheme is the controlling idea that limits the topic in every sentence. The way the Theme of a clause is developed is known as Thematic Development. The Theme of a clause is taken from the Theme or Rheme of previous sentences (Rustipa, 2010:7). There are three types of Thematic Development pattern, as follows:

Theme Reiteration or Constant Theme Pattern
This pattern shows that the first Theme is picked up and repeated at the beginning of the next clause. The figure is as follows:
Figure 2.1 Theme Reiteration Pattern
Source: (Rustipa, 2010:7)

Zig Zag Linear Theme Pattern
It is a pattern in which the subject matter in the Rheme of one clause is taken up as the Theme of the following clause.
The figure is as follows:
Figure 2.2 Zig Zag Linear Theme Pattern
Source: (Rustipa, 2010:8)

Multiple Theme / Split Rheme Pattern
In this pattern, a Rheme may include a number of different pieces of information, each of which may be taken up as the Theme in a number of subsequent clauses. The figure of this pattern is as follows:
Figure 2.3 Multiple Theme / Split Rheme Pattern
Source: (Rustipa, 2010:8)

2.1.1.5.2 Centering Theory
Centering Theory is a theory about local discourse coherence, which occurs between adjacent utterances or sentences. The idea of Centering Theory is that each sentence (utterance) has a focused entity as the center of the sentence (utterance) (Clark, Fox, and Lappin, 2010:607). Centering Theory tracks the movement of a single entity or noun as it changes in salience from sentence to sentence. Salience is determined by the highest rank that an entity has in one sentence. In terms of grammatical role, the subject has a higher rank than the object, which in turn has a higher rank than other entities. The order of preferred grammatical roles is shown below (Tofiloski, 2009:33):

Subject > Indirect Object > Direct Object > Others
Figure 2.4 Grammatical Role Ordering
Source: (Tofiloski, 2009:33)

In an utterance U, the list of entities in U is called the Forward Looking Centers, abbreviated as CF. The most salient entity in U is called the Preferred Center, or simply CP. The Backward Looking Center, abbreviated as CB, is the most salient entity in the previous utterance (Un-1) that is realized in the current utterance U. CB can be NULL when only one utterance has been introduced, or when the highest-ranked entity in the previous utterance (Un-1) is not realized in the current utterance (Un).

The main objective of Centering Theory is to classify the level of coherence between utterances. An utterance that continues the topic of the preceding utterance is more coherent than an utterance that features a topic shift. The level of coherence is obtained from a comparison between the preceding topic and the current topic of the utterances. The table below is used to classify the local coherence of an utterance (Miltsakaki and Kukich, 2004:7).

Table 2.1 Standard Centering Theory Transitions
Source: (Miltsakaki and Kukich, 2004:7)

                     CB(Un) = CB(Un-1)    CB(Un) ≠ CB(Un-1)
CB(Un) = CP(Un)      Continue             Smooth-shift
CB(Un) ≠ CP(Un)      Retain               Rough-shift

Centering Theory classifies the degree of coherence based on sequences of utterance transitions. The degree of coherence classifies the utterance transitions into four basic types:
1. Continue: This occurs when an entity has been the Preferred Center for three consecutive utterances, including the current utterance.
2. Retain: This occurs when there is a new Preferred Center, yet the previous Preferred Center occurs in the current utterance.
3. Smooth-shift: This occurs when the Preferred Center of the current utterance was the highest-ranked entity in the previous utterance that is realized in the current utterance, but the Backward Looking Centers of the previous two utterances are not the same entity.
4. Rough-shift: This occurs when the Preferred Center was not the highest-ranked entity in the previous utterance that also appears in the current utterance, and the Backward Looking Centers are not the same either.
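To make the transition definitions concrete, Table 2.1 can be read as a small decision procedure. The following minimal sketch in Python (the CB and CP values are assumed to have been identified already, for example from the grammatical-role ranking above) classifies the transition between two adjacent utterances:

def classify_transition(cb_current, cb_previous, cp_current):
    # Classify a Centering Theory transition according to Table 2.1.
    # cb_current  : Backward Looking Center of the current utterance
    # cb_previous : Backward Looking Center of the previous utterance
    # cp_current  : Preferred Center (most salient entity) of the current utterance
    if cb_current == cb_previous:
        # The Backward Looking Center is preserved across the two utterances.
        return "Continue" if cb_current == cp_current else "Retain"
    else:
        # The Backward Looking Center has changed.
        return "Smooth-shift" if cb_current == cp_current else "Rough-shift"

# If both utterances have 'Merida' as their Backward Looking Center and 'Merida'
# is also the Preferred Center of the current utterance, the transition is Continue.
print(classify_transition("Merida", "Merida", "Merida"))   # Continue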
2.1.1.5.3 Entity Transition Value
The Entity Transition Value was proposed by Milan Tofiloski in 2009 as a project to extend Centering Theory. The focus of the study is to track all entities realized in an utterance and to show the benefits of such a model, whereas traditional Centering Theory tracks only the most salient entity. The main idea of his research is to establish a hard distinction between coherence and co-reference or anaphora resolution, and to identify how coherent the text is after all the referents have been resolved (Tofiloski, 2009:34).

In measuring entity coherence, sentences are represented as vectors where the elements of the vector correspond to the words of the sentence. Consider the example text "Flint likes Rapunzel. Rapunzel is a singer.", which is illustrated as a matrix (a series of vectors) in the form of an entity grid:

Table 2.2 A Representation of the Text Incorporating Linear Order

Words in Sentence    Flint    likes    Rapunzel    is    a    singer
S1                   X        X        X           -     -    -
S2                   0        0        X           X     X    X

In Table 2.2, sentence 1 (S1) has a vector of (X, X, X, -, -, -), where 'X' means the entity is realized in the sentence, while '-' means the entity is not realized in the sentence. The next step in measuring entity coherence is to remove non-entity words from the vector. The table below shows the representation of the text with the non-entity words omitted. Thus, only words which are entities are considered for representation within the entity coherence model.

Table 2.3 A Representation of the Text with Non-Entity Words Removed from the Representation Vector

Words in Sentence    Flint    Rapunzel    singer
S1                   X        X           0
S2                   0        X           X

The next step is to define the maximum salience value. Salience describes a way to arrange the entities realized in an utterance. The arrangement is ranked according to topic or prominence at any point in the text. Entity salience ranks are assigned according to linear order (Tofiloski, 2009:44). Tofiloski (2009:47) explains, "Maximum salience is determined by the utterance that realizes the most number of entities and using that value as ceiling for salience." In the example above, the first utterance has 2 entities and the second utterance has 2 entities, so the maximum salience is 2. Each subsequently occurring entity in an utterance receives a salience value obtained by decreasing the maximum salience by the number of previously occurring entities.

Table 2.4 Example of Vector and Each Entity's Salience

Words in Sentence    Flint    Rapunzel    singer
S1                   2        1           0
S2                   0        2           1

The table above shows the representation of the text as a series of vectors that include the salience of each word in the sentence. A larger value corresponds to a higher salience. Each row corresponds to a sentence, while each column corresponds to the salience of that word in the sentence. The vector representation is a straightforward way to track the salience and realization of all the entities in a text, and how their values change over the text. The main benefit of a vector representation is that the mathematical structure is well-studied and common. Vectors and their operations are used in a variety of disciplines that create simulations and models of real-world processes (Tofiloski, 2009:41).

The next step in measuring entity coherence is to track salience change between sentences by comparing entity salience values. Consider the following five sentences as a sample text:

Merida was a student in computer science. China is a large country. Merida went to a private school in Jakarta. Beijing is the capital of China. She likes computer science.

The vector representation for each sentence is built in the same way, assigning salience by the linear order of the realized entities.
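As a rough illustration, the minimal Python sketch below builds such salience vectors for the sample text. The per-sentence entity lists are hand-chosen simplifications (a full system would extract noun phrases automatically and resolve "She" to "Merida"), so the columns and values are illustrative only, not Tofiloski's own figures:

# Hand-listed entities per sentence of the sample text (an assumption for illustration).
sentences_entities = [
    ["Merida", "student", "computer science"],   # S1
    ["China", "country"],                        # S2
    ["Merida", "school", "Jakarta"],             # S3
    ["Beijing", "capital", "China"],             # S4
    ["Merida", "computer science"],              # S5 ("She" resolved to Merida)
]

# Maximum salience = largest number of entities realized in any one sentence.
max_salience = max(len(entities) for entities in sentences_entities)

# All distinct entities, in order of first appearance, form the grid columns.
columns = []
for entities in sentences_entities:
    for entity in entities:
        if entity not in columns:
            columns.append(entity)

# Each sentence becomes a vector: the first realized entity gets max_salience,
# the next one max_salience - 1, and so on; unrealized entities get 0.
grid = []
for entities in sentences_entities:
    row = {column: 0 for column in columns}
    for position, entity in enumerate(entities):
        row[entity] = max_salience - position
    grid.append([row[column] for column in columns])

for label, vector in zip(["S1", "S2", "S3", "S4", "S5"], grid):
    print(label, vector)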
The next step is to measure the entity transition. The first formula computes the transition value of an individual entity (ei) between two adjacent utterances, as given in Equation (1). The next step is to define the transition weight. The purpose behind weighting the entity transitions is to prefer utterances which have their most salient entities in common over utterances that have their least salient entities in common. The idea is to add a 'mass' to each entity according to its salience. Equation (2) defines the transition weight (Wtrans) between two utterances and favors utterance transitions having more entities in common. Its numerator is the sum of the number of entities in the current utterance that are realized in the previous utterance and the number of entities in the previous utterance that are realized in the current utterance. This is designed to capture the idea that two utterances that realize the same number of entities should be considered more similar (Tofiloski, 2009:48). The final transition metric, the Transition between Utterances in Equation (3), computes the average transition between all realized entities multiplied by Wtrans, the transition weight, which corrects for the sparsity of the vector in the number of realized entities in an utterance (Tofiloski, 2009:49).

2.1.2 Supported Library

2.1.2.1 Python Programming Language
Python is a popular open source programming language used for both standalone programs and scripting applications in a wide variety of domains. It is free, portable, powerful, and remarkably easy and fun to use. Python has many benefits for its users. Lutz (2009:3-4) summarizes the primary factors cited by Python users as follows:
1. Software quality
For many, Python's focus on readability, coherence, and software quality in general sets it apart from other tools in the scripting world. Python code is designed to be readable, and hence more reusable and maintainable than traditional scripting languages. The uniformity of Python code makes it easy to understand, even if you did not write it. In addition, Python has deep support for more advanced software reuse mechanisms, such as object-oriented programming (OOP).
2. Developer productivity
Python boosts developer productivity many times beyond compiled or statically typed languages such as C, C++, and Java. Python code is typically one-third to one-fifth the size of equivalent C++ or Java code. That means there is less to type, less to debug, and less to maintain after the fact. Python programs also run immediately, without the lengthy compile and link steps required by some other tools, further boosting programmer speed.
3. Program portability
Most Python programs run unchanged on all major computer platforms. Porting Python code between Linux and Windows, for example, is usually just a matter of copying a script's code between machines. Moreover, Python offers multiple options for coding portable graphical user interfaces, database access programs, web-based systems, and more. Even operating system interfaces, including program launches and directory processing, are as portable in Python as they can possibly be.
4. Support libraries
Python comes with a large collection of prebuilt and portable functionality, known as the standard library. This library supports an array of application-level programming tasks from text pattern matching to network scripting. In addition, Python can be extended with both homegrown libraries and a vast collection of third-party application support software.
Python's third-party domain offers tools for website construction, numeric programming, serial port access, game development, and much more. The NumPy extension, for instance, has been described as a free and more powerful equivalent to the Matlab numeric programming system.
5. Component integration
Python scripts can easily communicate with other parts of an application, using a variety of integration mechanisms. Such integrations allow Python to be used as a product customization and extension tool. Today, Python code can invoke C and C++ libraries, can be called from C and C++ programs, can integrate with Java and .NET components, can communicate over frameworks such as COM, can interface with devices over serial ports, and can interact over networks with interfaces like SOAP, XML-RPC, and CORBA. It is not a standalone tool.
6. Enjoyment
Because of Python's ease of use and built-in toolset, it can make the act of programming more a pleasure than a chore. Although this may be an intangible benefit, its effect on productivity is an important asset.

2.1.2.2 Django Framework
Django is a web framework based on the Model-View-Controller (MVC) design pattern. The models are Python classes the developer uses to interact with the database layer, controllers are the layer that handles application logic and sends responses to requests, and views are what the user sees and interacts with. Django is written in the Python programming language. It is extremely easy to learn and use, and is very lightweight and straightforward (Holovaty, 2014).

2.1.2.3 Natural Language Toolkit
NLTK is a suite of Python modules distributed under the GPL open source license via nltk.org. NLTK comes with a large collection of corpora, extensive documentation, and hundreds of exercises, making NLTK unique in providing a comprehensive framework for students to develop a computational understanding of language. NLTK's code base of 100,000 lines of Python code includes support for corpus access, tokenizing, stemming, tagging, chunking, parsing, clustering, classification, language modeling, semantic interpretation, unification, and much else besides (Bird, Klein, and Loper, 2009:62).

Table 2.5 Language processing tasks and corresponding NLTK modules with examples of functionality
Source: (Loper, Klein, and Bird, 2014)

Language processing task      NLTK modules                Functionality
Accessing corpora             nltk.corpus                 Standardized interfaces to corpora and lexicons
String processing             nltk.tokenize, nltk.stem    Tokenizers, sentence tokenizers, stemmers
Collocation discovery         nltk.collocations           t-test, chi-squared, point-wise mutual information
Part-of-speech tagging        nltk.tag                    n-gram, backoff, Brill, HMM, TnT
Classification                nltk.classify, nltk.cluster    Decision tree, maximum entropy, naive Bayes, EM, k-means
Chunking                      nltk.chunk                     Regular expression, n-gram, named entity
Parsing                       nltk.parse                     Chart, feature-based, unification, probabilistic, dependency
Semantic interpretation       nltk.sem, nltk.inference       Lambda calculus, first-order logic, model checking
Evaluation metrics            nltk.metrics                   Precision, recall, agreement coefficients
Probability and estimation    nltk.probability               Frequency distributions, smoothed probability distributions
Applications                  nltk.app, nltk.chat            Graphical concordancer, parsers, WordNet browser, chatbots
Linguistic fieldwork          nltk.toolbox                   Manipulate data in SIL Toolbox format

NLTK was designed with four primary goals, which are as follows (Loper, Klein, and Bird, 2014):
1. Simplicity
To provide an intuitive framework along with substantial building blocks, giving users a practical knowledge of Natural Language Processing without getting stuck in the tedious house-keeping usually associated with processing annotated language data.
2. Consistency
To provide a uniform framework with consistent interfaces and data structures, and easily guessable method names.
3. Extensibility
To provide a structure into which new software modules can be easily accommodated, including alternative implementations and competing approaches to the same task.
4. Modularity
To provide components that can be used independently without needing to understand the rest of the toolkit.

2.1.2.4 Part-of-speech Tags
In corpus linguistics, part-of-speech tagging (POS tagging), also called grammatical tagging or word-category disambiguation, is the process of marking up a word in a text as corresponding to a particular part of speech, based on both its definition and its context. The NLTK Project (2014) explains, "This package contains classes and interfaces for part of speech tagging. The result after using this package is a tag and a token. A tag is a case-sensitive string that specifies some property of a token, such as its part of speech. For example, for the word 'fly' this package will give information (tag, token) such as (fly, NN). This package defines several taggers, which take a token list (typically a sentence), assign a tag to each token, then return the resulting tagged tokens. This package uses a unigram tagger to tag each word *w* by checking what the most frequent tag for *w* is in a training corpus." A method pos_tag() is called to get tag and token information. For example:

pos_tag(word_tokenize("John's big idea isn't all that bad."))

The result from the example is:

[('John', 'NNP'), ("'s", 'POS'), ('big', 'JJ'), ('idea', 'NN'), ('is', 'VBZ'), ("n't", 'RB'), ('all', 'DT'), ('that', 'DT'), ('bad', 'JJ'), ('.', '.')]

The Natural Language Toolkit provides documentation for each tag. The most common part-of-speech tag schemes are those developed for the Penn Treebank and the Brown Corpus (Marsden, 2014).
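As a brief illustration of the unigram-tagger idea mentioned above, the sketch below trains an NLTK UnigramTagger on a small part of the Brown corpus (this assumes the Brown corpus data has been downloaded, e.g. via nltk.download('brown')); words never seen during training are left untagged unless a backoff tagger is supplied:

import nltk
from nltk.corpus import brown

# Tagged sentences from the news portion of the Brown corpus serve as training data.
train_sents = brown.tagged_sents(categories='news')[:500]

# A unigram tagger assigns each word the tag it most frequently received in training.
unigram_tagger = nltk.UnigramTagger(train_sents)
print(unigram_tagger.tag(['The', 'jury', 'said', 'the', 'election', 'was', 'fair']))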
Table 2.6 Part of Speech Tags
Source: (Marsden, 2014)

POS Tag    Description                                Example
CC         coordinating conjunction                   and
CD         cardinal number                            1, third
DT         determiner                                 the
EX         existential there                          there is
FW         foreign word                               d'hoevre
IN         preposition/subordinating conjunction      in, of, like
JJ         adjective                                  big
JJR        adjective, comparative                     bigger
JJS        adjective, superlative                     biggest
LS         list marker                                1)
MD         modal                                      could, will
NN         noun, singular or mass                     door
NNS        noun, plural                               doors
NNP        proper noun, singular                      John
NNPS       proper noun, plural                        Vikings
PDT        predeterminer                              both the boys
POS        possessive ending                          friend's
PRP        personal pronoun                           I, he, it
PRP$       possessive pronoun                         my, his
RB         adverb                                     however, usually, naturally, here, good
RBR        adverb, comparative                        better
RBS        adverb, superlative                        best
RP         particle                                   give up
TO         to                                         to go, to him
UH         interjection                               uhhuhhuhh
VB         verb, base form                            take
VBD        verb, past tense                           took
VBG        verb, gerund/present participle            taking
VBN        verb, past participle                      taken
VBP        verb, sing. present, non-3d                take
VBZ        verb, 3rd person sing. present             takes
WDT        wh-determiner                              which
WP         wh-pronoun                                 who, what
WP$        possessive wh-pronoun                      whose
WRB        wh-adverb                                  where, when

2.1.2.5 Pattern
Pattern is a package for Python version 2.4 or above with a focus on ease of use for natural language processing (tagger/chunker, n-gram search, sentiment analysis, WordNet), machine learning (vector space model, k-means clustering, Naïve Bayes + k-NN + SVM classifiers), network analysis (graph centrality and visualization), and web mining (Google + Twitter + Wikipedia, web spider, HTML DOM parser). Such work usually requires several independent toolkits chained together in a practical application. Pattern is most related to toolkits such as NLTK in that it is geared towards integration in the user's own programs. In addition, it does not specialize in one domain but provides general cross-domain functionality. Smedt and Daelemans (2012:1-2) have revealed that Pattern has several advantages. First, the syntax is straightforward; function names and parameters are chosen to make the commands self-explanatory. Second, the documentation assumes no prior knowledge (except for a background in Python programming), so that it is valuable as a learning environment for students as well as in research projects with a short development cycle.

Figure 2.5 Example workflow of Pattern

As shown in the figure above, Pattern is organized in separate modules that can be chained together. For example, text from Wikipedia (pattern.web) can be parsed for part-of-speech tags (pattern.en), queried by syntax and semantics (pattern.search), and used to train a classifier (pattern.vector).

Table 2.7 Language processing tasks and corresponding Pattern modules
Source: (Smedt and Daelemans, Pattern for Python, 2012:2064)

Language processing task    Pattern modules    Functionality
Web data mining             pattern.web        Asynchronous requests, a uniform API for web services (Google, Bing, Twitter, Facebook, Wikipedia, Wiktionary, Flickr, RSS), an HTML DOM parser, HTML tag stripping functions, a web crawler, webmail, caching, Unicode support
POS tagging (English)       pattern.en         A fast part-of-speech tagger for English (identifies nouns, adjectives, verbs, etc. in a sentence), tools for sentiment analysis, English verb conjugation and noun singularization & pluralization, and a WordNet interface
POS tagging (Dutch)         pattern.nl         A fast part-of-speech tagger for Dutch (identifies nouns, adjectives, verbs, etc. in a sentence), tools for sentiment analysis, and Dutch verb conjugation and noun singularization & pluralization
N-gram pattern matching     pattern.search     Search queries can include a mixture of words, phrases, part-of-speech tags, taxonomy terms, and control characters to extract relevant information
Vector space modeling       pattern.vector     TF-IDF, distance metrics (cosine, Euclidean, Manhattan, Hamming) and dimension reduction (Latent Semantic Analysis), using a Document and a Corpus class
Graph data structuring      pattern.graph      Modeling of semantic networks, shortest path finding, subgraph partitioning, eigenvector centrality and betweenness centrality, using Node, Edge and Graph classes
Descriptive statistics      pattern.metrics    Functions for descriptive statistics, a code profiler, and evaluation metrics including accuracy, precision and recall, confusion matrix, inter-rater agreement (Fleiss' kappa), string similarity (Levenshtein, Dice) and readability (Flesch)
Database                    pattern.db         Wrappers for CSV files and SQLite and MySQL databases

An API usage example of the pattern.en library is:

>> from pattern.en import parse
>> print(parse('I ate pizza.', relations=True, lemmata=False, chunks=True))
>> I/PRP/B-NP/O/NP-SBJ-1 ate/VBD/B-VP/O/VP-1 pizza/NN/B-NP/O/NP-OBJ-1 ././O/O/O

The word "I" is identified with the PRP (personal pronoun) tag, NP (Noun Phrase) chunk, and SBJ (subject) role. The word "ate" is identified with the VBD (verb, past tense) tag and VP (Verb Phrase) chunk. Furthermore, the word "pizza" is identified with the NN (noun, singular or mass) tag, NP (Noun Phrase) chunk, and OBJ (object) role.

2.1.2.6 Inflect
According to the Python Software Foundation (2014), inflect is a package to correctly generate plurals, singular nouns, ordinals, and indefinite articles, and to convert numbers to words. API usage of inflect is shown below:

>> import inflect
>> p = inflect.engine()
>> print(p.singular_noun("men"))

The printed result from the code above will be "man".

2.1.2.7 Naive Bayes Classifier
The Naive Bayes Classifier is a probabilistic classifier based on Bayes' theorem with the assumption that attributes or features are conditionally independent of each other, given the class (Russell & Norvig, 2009:808). The full joint distribution of the Naive Bayes model can be written as (Russell & Norvig, 2009:499):

P(C, x1, ..., xn) = P(C) * P(x1 | C) * ... * P(xn | C)    (4)

The formula above can be simplified as:

P(C | x1, ..., xn) ∝ P(C) * P(x1 | C) * ... * P(xn | C)    (5)

Where:
P(C | x1, ..., xn) is the conditional or posterior probability of class C given the attributes x1, ..., xn
P(C) is the prior probability of class C
P(xi | C) is the probability of attribute xi given class C

For example, assume that there are two classes: male and female. Then, a person named Drew needs to be classified (as either male or female) based on a small database. Drew has the characteristics of height below 170 cm and short hair. The database is shown below.

Table 2.8 Example of Data

Name       Over 170 cm    Hair Length    Sex
Drew       No             Short          Male
Claudia    Yes            Long           Female
Drew       No             Long           Female
Drew       No             Long           Female
Alberto    Yes            Short          Male
Karin      No             Long           Female
Nina       Yes            Short          Female
Sergio     Yes            Long           Male

Based on the database, the posterior probability of the male class given Drew's characteristics is compared with the posterior probability of the female class given Drew's characteristics. Because the probability of Drew being male is higher than the probability of Drew being female, Drew is classified as male with the characteristics he possesses.
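As an illustration of Equation (5) applied to Table 2.8, the following minimal sketch computes the two unnormalized class scores for Drew's characteristics (Over 170 cm = No, Hair Length = Short); the counts are taken directly from the table:

# Records from Table 2.8: (over_170, hair_length, sex)
data = [
    ("No", "Short", "Male"), ("Yes", "Long", "Female"),
    ("No", "Long", "Female"), ("No", "Long", "Female"),
    ("Yes", "Short", "Male"), ("No", "Long", "Female"),
    ("Yes", "Short", "Female"), ("Yes", "Long", "Male"),
]

def score(sex, over_170, hair):
    # Unnormalized Naive Bayes score: P(sex) * P(over_170 | sex) * P(hair | sex)
    rows = [r for r in data if r[2] == sex]
    prior = len(rows) / len(data)
    p_over = sum(1 for r in rows if r[0] == over_170) / len(rows)
    p_hair = sum(1 for r in rows if r[1] == hair) / len(rows)
    return prior * p_over * p_hair

male_score = score("Male", "No", "Short")      # (3/8) * (1/3) * (2/3) ≈ 0.083
female_score = score("Female", "No", "Short")  # (5/8) * (3/5) * (1/5) = 0.075
print("Male" if male_score > female_score else "Female")   # Male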
2.1.2.8 Synset and Lemmatization Process
A synset is a set of words that are interchangeable in some context without changing the truth value of the proposition in which they are embedded. For example, in English, 'swap' has the same meaning as 'barter' and 'trade' (Princeton University, 2014). Examples of synset usage are:

>> from nltk.corpus import wordnet as wn
>> wn.synsets('dog')
[Synset('dog.n.01'), Synset('frump.n.01'), Synset('dog.n.03'), Synset('cad.n.01'), Synset('frank.n.02'), Synset('pawl.n.01'), Synset('andiron.n.01'), Synset('chase.v.01')]

A synset is identified with a 3-part name of the form word.pos.nn. From the example above, it can be seen that the word 'dog' is interchangeable in some context with the words given above, such as 'dog' with definition 1, 'frump', 'dog' with definition 3, 'cad', 'frank', 'pawl', 'andiron', and 'chase'.

>> from nltk.corpus import wordnet as wn
>> wn.synset('kin.n.01').lowest_common_hypernyms(wn.synset('mother.n.01'))
[Synset('relative.n.01')]

From the example above, it can be seen that the word 'kin' and the word 'mother' have a single lowest common hypernym, which is the word 'relative'.

Lemmatization is an algorithmic process to determine the lemma of a given word. A lemma is the lower-case ASCII text of a word as found in the WordNet database index files. The process involves finding lemmas and disambiguating them. Finding a lemma means determining which other words each word in a sentence could be mapped to; for example, in English, 'saw' is lemmatized to the noun lemma 'saw' and the verb lemma 'see'. Disambiguation consults the output of a part-of-speech tagger; for example, once 'saw' is tagged as a past-tense verb, lemmatization concludes that it belongs with 'see', 'seen', and 'seeing' (Smedt, Asch, and Daelemans, 2010:18).

2.1.2.9 HTML5
MIT, ERCIM, and Keio (2011) explain, "HTML5 defines the fifth major revision of the core language of the World Wide Web: the Hypertext Markup Language (HTML)." Moreover, MIT, ERCIM, Keio, and Beihang (2015) say, "HTML5 brings to the Web video and audio tracks without needing plugins; programmatic access to a resolution-dependent bitmap canvas, which is useful for rendering graphs, game graphics, or other visual images on the fly; native support for scalable vector graphics (SVG) and math (MathML); annotations important for East Asian typography (Ruby); features to enable accessibility of rich applications; and much more."

2.1.2.10 CSS
MIT, ERCIM, Keio, and Beihang (2015) explain, "CSS is the language for describing the presentation of Web pages, including colors, layout, and fonts. It allows one to adapt the presentation to different types of devices, such as large screens, small screens, or printers. CSS is independent of HTML and can be used with any XML-based markup language. The separation of HTML from CSS makes it easier to maintain sites, share style sheets across pages, and tailor pages to different environments. This is referred to as the separation of structure (or: content) from presentation."

2.1.2.11 jQuery
The jQuery Foundation (2014) explains, "jQuery is a fast, small, and feature-rich JavaScript library. It makes things like HTML document traversal and manipulation, event handling, animation, and Ajax much simpler with an easy-to-use API that works across a multitude of browsers." For example, the function $("p").hide() is used to hide all <p> elements.

2.1.2.12 Ajax
Ajax stands for Asynchronous JavaScript and XML. Ullman and Dykes (2007:2) explain, "Ajax is a set of programming techniques or a particular approach to Web programming.
These programming techniques involve being able to seamlessly update a Web page or a section of a Web application with input from the server, but without the need for an immediate page refresh." For example, the Ajax() method is used to send or receive JSON data to or from the controller.

2.1.2.13 JavaScript
Flanagan (2011:1) defines JavaScript as a high-level, dynamic, untyped interpreted programming language that is well-suited to object-oriented and functional programming styles. JavaScript derives its syntax from Java, its first-class functions from Scheme, and its prototype-based inheritance from Self. It is most commonly used as part of web browsers, whose implementations allow client-side scripts to interact with the user, control the browser, communicate asynchronously, and alter the document content that is displayed. It is also used in server-side network programming with runtime environments such as Node.js, game development, and the creation of desktop and mobile applications.

2.1.3 Software Development Process

2.1.3.1 Extreme Programming (XP)
Extreme Programming (XP) is the most widely used approach to agile software development. Extreme Programming uses an object-oriented approach as its preferred development paradigm and encompasses a set of rules and practices that occur within the context of four framework activities: planning, design, coding, and testing (Pressman, 2010:73).

Figure 2.6 The Extreme Programming Process

Key XP activities are summarized as follows (Pressman, 2010:73-77):
1. Planning. The planning activity begins with listening, a requirements gathering activity that enables the technical members of the XP team to understand the business context for the software and to get a broad feel for required output and major features and functionality. It consists of the creation of user stories that describe required output, features, and functionality for the software to be built.
2. Design. XP design rigorously follows the KIS (keep it simple) principle. A simple design is always preferred over a more complex representation. XP encourages the use of CRC cards as an effective mechanism for thinking about the software in an object-oriented context, and also refactoring, which is the process of changing a software system in such a way that it does not alter the external behavior of the code yet improves the internal structure.
3. Coding. After stories are developed and preliminary design work is done, the team develops a series of unit tests that will exercise each of the stories that is to be included in the current release (software increment). A key concept during the coding activity (and one of the most talked about aspects of XP) is pair programming. XP recommends that two people work together at one computer workstation to create the code for a story. This provides a mechanism for real-time problem solving and real-time quality assurance (the code is reviewed as it is created).
4. Testing. The unit tests that are created should be implemented using a framework that enables them to be automated (hence, they can be executed easily and repeatedly). XP acceptance tests, also called customer tests, are specified by the customer and focus on overall system features and functionality that are visible and reviewable by the customer.
2.2 Related Works

2.2.1 Journal Reference

2.2.1.1 Anaphora Resolution by Singh, Lakhmani, Mathur, Morwal
In 2014, research on anaphora resolution was carried out by Singh, Lakhmani, Mathur, and Morwal to compare two computational methods based on recency and animistic knowledge. The first model uses the concept of the Centering Theory approach for implementing the recency factor. The second model uses the concept of Lappin and Leass for the recency factor. Both computational models use the Gazetteer method for providing knowledge of the animistic factor. These approaches were tested with three data sets from different genres (Singh, Lakhmani, Mathur, and Morwal, 2014:51-57). The results of the experiments are shown in the tables below:

Genre 1 (Short Stories)

Table 2.9 Result of Experiment Performed by Model 1 on Short Stories

Data Set    Total Sentences    Total Words    Total Anaphors    Correctly Resolved Anaphors    Accuracy
Story 1     16                 165            12                12                             100%
Story 2     28                 364            50                33                             66%
Overall Accuracy: 83%

Table 2.10 Result of Experiment Performed by Model 2 on Short Stories

Data Set    Total Sentences    Total Words    Total Anaphors    Correctly Resolved Anaphors    Accuracy
Story 1     16                 165            12                12                             100%
Story 2     28                 364            50                36                             72%
Overall Accuracy: 86%

Genre 2 (News Articles)

Table 2.11 Result of Experiment Performed by Model 1 on News Articles

Data Set    Total Sentences    Total Words    Total Anaphors    Correctly Resolved Anaphors    Accuracy
News 1      31                 372            29                15                             52%
News 2      20                 387            42                29                             69%
Overall Accuracy: 61%

Table 2.12 Result of Experiment Performed by Model 2 on News Articles

Data Set    Total Sentences    Total Words    Total Anaphors    Correctly Resolved Anaphors    Accuracy
News 1      31                 372            29                21                             73%
News 2      20                 387            42                26                             62%
Overall Accuracy: 68%

Genre 3 (Biography)

Table 2.13 Result of Experiment Performed by Model 1 on Biography

Data Set       Total Sentences    Total Words    Total Anaphors    Correctly Resolved Anaphors    Accuracy
Biography 1    18                 368            9                 7                              78%
Biography 2    10                 280            13                11                             85%
Overall Accuracy: 82%

Table 2.14 Result of Experiment Performed by Model 2 on Biography

Data Set       Total Sentences    Total Words    Total Anaphors    Correctly Resolved Anaphors    Accuracy
Biography 1    18                 368            9                 7                              78%
Biography 2    10                 280            13                9                              70%
Overall Accuracy: 74%

From the above results it is concluded that the accuracy of the computational model depends on the genre of the input data. Moreover, the accuracy is affected by several factors such as recency and animistic knowledge (which are used in the experiment), as well as number agreement and gender agreement.

2.2.1.2 CogNIAC
CogNIAC (COGnition eNIAC) is used to resolve pronouns with limited knowledge and linguistic resources. CogNIAC presents a high-precision anaphora resolution that is capable of greater than 90% precision with 60% recall for some pronouns (Tyne and Wu, 2004:3). CogNIAC resolves anaphora without the requirement of general world knowledge. This is done by being very sensitive to ambiguity: it requires a unique antecedent and only resolves pronouns when the rules are satisfied. To measure the accuracy of the algorithm, three experiments have been carried out with the following results:

Table 2.15 Performance of Individual Rules in Experiment 1 (Narrative Text)
Source: (Baldwin, 1996:7)

Rule                                  Recall (correct/actual)    Precision (#correct/#guessed)
1) Unique in Discourse                11% (32/298)               100% (32/32)
2) Reflexive                          3% (10/298)                100% (10/10)
3) Unique Current + Prior Sentence    35% (104/298)              96% (104/110)
4) Possessive Pro                     1% (2/298)                 100% (2/2)
5) Unique Current                     6% (18/298)                81% (18/22)
6) Unique Subject/Subject Pro         8% (24/298)                80% (24/30)
7) Cb-Picking                         4% (13/298)                4% (13/298)
8) Pick Most Recent                   10% (29/298)               10% (29/298)

Table 2.16 Performance of CogNIAC on 42 Wall Street Journal Articles Resolving 3rd Person Pronominal Reference to People
Source: (Baldwin, 1996:8)

                                       Recall           Precision
CogNIAC on Wall Street Journal         78% (126/162)    89% (126/142)
CogNIAC on WSJ corrected for gender    89% (144/162)    91% (144/158)

Table 2.17 The Results for CogNIAC for All Pronouns in the First 15 Articles of the MUC-6 Evaluations
Source: (Baldwin, 1996:9)

Rule                       Recall (for just pros)    Precision
Quoted Speech              11% (13/114)              87% (13/15)
1) Unique in Discourse     4% (5/114)                100% (5/5)
3) Unique Curr + Prior     50% (57/114)              72% (57/79)
Search Back                1% (1/114)                33% (1/3)
2) Reflexive               0% (0/114)                0/0
5) Unique Curr Sentence    4% (5/114)                70% (5/7)
Subject Same Clause        4% (4/114)                57% (4/7)

In conclusion, the rules of CogNIAC performed quite well, with precision of around 90% and good recall in the first two experiments. In the third experiment, the performance of some of the CogNIAC rules began to decrease.

2.2.1.3 Centering Theory
Hasler (2008:1-8) also conducted a study about Centering Theory, but in a different application. This study aims to investigate a new method for assessing the coherence of computer-aided summaries. To do this, a metric based on Centering Theory,
To measure the accuracy of the algorithm three experiments have been done with the results as follows: Table 2.15 Performance of Individual Rules in Experiment 1 (Narrative Text) Source: (Baldwin, 1996:7) Recall Precision (correct/actual) (#correct/#guessed) 1) unique in Discourse 11% (32/298) 100% (32/32) 2) Reflexive 3% (10/298) 100% (10/10) 3) unique Current + 35% (104/298) 96% (104/110) 4) Possessive Pro 1% (2/298) 100% (2/2) 5) Unique Current 6% (18/298) 81% (18/22) 8% (24/298) 80% (24/30) 7) Cb-Picking 4% (13/298) 4% (13/298) 8) Pick Most Recent 10%(29/298) 10%(29/298) Rule Prior Sentence 6) unique Subject/ Subject Pro 40 Table 2.16 Performance of CogNIAC on 42 Wall Street Journal Articles Resolving 3rd Person Pronominal Reference to people Source: (Baldwin, 1996:8) Recall Precision 78% 89% (126/162) (126/142) CogNIAC on WSJ corrected for 89% 91% gender (144/162) (144/158) CogNIAC on Wall Street Journal Table 2.17 The Results for CogNIAC for All Pronouns in the First 15 Articles of the MUC-6 Evaluations Source: (Baldwin, 1996:9) Rule Recall (for just pros) Precision Quoted Speech 11% (13/114) (87%) 13/15 1) Uniq in Discourse 4% (5/114) (100%) 5/5 3) Uniq Curr + Prior 50% (57/114) (72%) 57/79 Search Back 1% (1/114) (33%) 1/3 2) Reflexive 0% (0/114) 0/0 5) Uniq Curr Sentence 4% (5/114) (70%) 5/7 Subject Same Clause 4% (4/114) (57%) 4/7 In conclusion, the rules of CogNIAC performed quite well with the precision of 90% with good recall for the first two experiments. In the third experiment, out of the rules of CogNIAC began to decrease. 2.2.1.3 Centering Theory Hasler (2008:1-8) also conducted a study about Centering Theory but in different application. This study aims to investigate new method for assessing the coherence of computer-aided summaries. To do this, a metric for Centering Theory, 41 which is known as a theory of local coherence and salience, was developed. This metric is applied to obtain a score for each summary. The summary with the higher score is considered more coherent. The role of Centering Theory in this study is to check two consecutive sentences to find out its relation or in this case is called transition. In each sentence or utterance, the Backward Center (CB), the Forward Center (CF), and also Preferred Center (CP) are determined. Then the CB and CP are compared in order to get the centering theory transition. Furthermore, the author formulated a metric to accurately reflect the relation between Centering Theory transitions and coherence. The metric represents the positive and negative effects of the presence of certain transitions in summaries. Table 2.18 Transition Weight for Summary Evaluation Source: (Hasler, 2008:5) Transition Weight Continue 3 Retain 2 No Transition (Indirect) 1 Smooth Shift -1 Rough Shift -2 No Transition (No CB) -5 Each transition is multiplied by its weight then the total of transition is sum up. After that, it is divided by the number of transitions present, which is the total number of utterances – 1. According to the study, the analysis of discourse, study the relation between sentences is right in help the learning process of students in writing paragraph. It is probable that this method would be appropriate in helping students to learn to write paragraph well with the implementation of coherence and cohesion. 42 2.2.1.4 Entity Transition Value Tofiloski (2009:6-50) develop Entity Transition Value method to determine coherence of text through tracking all entities and utterances. 
Utterance will be represented using a vector while noun phrase will be represented entity. The value in the vector represents the salience of each entity in one utterance (sentence). To define the salience of each entity, maximum salience should be determined first. Maximum salience is obtained from maximum total entity appear in one utterance. Then, each appeared entity will have salience value that is given from maximum salience subtracted by iteration of appeared entity in one utterance. After that, transition value of sentences will be defined. The value of that transition will be in range of 0-1. The value which is closer to 0 represents coherent entity transition while the value which is closer to 1 represents incoherent entity transition. 2.2.2 Comparison Cohesion and coherence comparison of previous methods will be shown below: 2.2.2.1 Cohesion Table 2.19 Comparison of Cohesion Methods Singh, Lakhmani, Mathur, Criteria Morwal Anaphora Resolution CogNIAC P r o p o s e d S ys t e m World No No Resolve anaphora based on Using 8 rules to resolve Knowledge Requirement Process recency and animistic factor. Receny factor uses Centering or Lappin Leass anaphora: 1. Unique in discourse 43 Singh, Lakhmani, Mathur, Criteria Morwal Anaphora Resolution CogNIAC P r o p o s e d S ys t e m concept while animistic 2. Reflexive factor uses Gazetteer 3. Unique in current method. + prior 4. Possesive pronoun 5. Unique current sentence 6. Unique subject/subject pronoun 7. Cb-picking 8. Pick most recent 2.2.2.2 Coherence Table 2.20 Comparison of Coherence Methods Criteria Centering Theory Entity Transition Value Entity Needed Needed Read in Current Sentence and All of the sentences portion Previous Sentence Salience Is obtained from Is obtained from Max grammatical role: Salience decrease by Subject > Indirect count of appeared entity Object > Direct Object > in a sentence Others Order Based on grammatical Linear ordering role Process 1. CF is obtained from 1. Maximum salience is 44 Criteria Centering Theory Entity Transition Value current sentence’s obtained from the total entities. of entities in one 2. CB is obtained from entities in previous sentence that also sentence which has the most entities. 2. Transition value of a realized in current single entity (ei) between sentence. adjacent sentences or 3. CP is defined from the most salient entity in current sentences. utterances (Ui). 3. Transition weight (Wtrans) is obtained from realized entities in adjacent sentences or utterances over total of non-redundant entities. 4. Transition value between two adjacent utterances. Result Level of Coherence: Continue > Retain > Smooth Shift > Rough Shift Value range is from 0 – 1