A Link Grammar for an Agglutinative Language
Ozlem Istek & Ilyas Cicekli
Bilkent University, TURKEY
RANLP-2007

Outline
  Link Grammar Formalism
  Some Distinctive Features of Turkish Syntax
  The System Architecture of Turkish Parser and Our Adapted Link Grammar Formalism
  Method for Handling the Syntactic Roles of the Words with Derivations
  Evaluation
  Concluding Remarks

Link Grammar
Link grammar is a formal grammatical system developed by Sleator and Temperley.
The syntax of a language is defined by a grammar that includes the words and their linking requirements.
The grammar is defined in a dictionary file, and each of the linking requirements of words is expressed in terms of connectors.
A given sentence is accepted by the system if:
  the linking requirements of all the words are satisfied (connectivity),
  none of the links between the words cross each other (planarity), and
  there is at most one link between any pair of words (exclusion).

Link Grammar – Example
The linkage requirements of three Turkish words:
  yedi      : O- & S-;   - ate
  kadın     : S+;        - the woman
  portakalı : O+;        - the orange
A linkage for a sentence containing these three words:

    +-------------S------------+
    |             +------O-----+
    |             |            |
  Kadın       portakalı      yedi
  The woman   the orange     ate

  (The woman ate the orange)

Turkish Syntax
The basic word order is SOV, but order of constituents may change according to the discourse context.
Turkish is head-final: modifiers precede the modified item.
  An adjective (modifier) precedes the head noun (modified item) in a noun phrase.
  In the basic word order of the sentence, the subject and the object (modifiers) precede the verb (modified item).
Although the head-final property can be violated at major constituent levels (SOV) of a sentence, it is preserved at sub-clause levels and smaller syntactic structures.

  kırmızı   şapkalı    kız
  red       with hat   girl
  (the girl with the red hat)

Turkish Syntax (cont.)
Turkish is agglutinative.
Words can take many derivational suffixes, and each of these derivations can take its own inflectional suffixes.
Inflectional suffixes have important grammatical roles.
There is a significant amount of interaction between syntax and morphotactics.

  uygarlaştı (He got civilized.)
  uygar-laş-tı
  uygar+Noun+A3sg+Pnon+Nom^DB+Verb+Become+Pos+Past+A3sg

Motivation for New Formalism
In the standard link grammar formalism, linking requirements are defined for words.
When we consider all possible derivations and inflections for Turkish words, the number of possible words is huge.
Words in the same category behave similarly at the syntactic level.
We therefore preferred to define linking requirements based on the classes of words and their inflections, with derivations treated as separate words.

System Architecture of Turkish Parser
  Input Sentence
    -> Morphological Analysis
    -> Stripping Lexical Parts
    -> Separating Derivation Boundaries
    -> Create Sentence List
    -> Parse Sentences with Link Grammar (using the Linking Requirements for Turkish Word Classes and Derivations)
    -> All possible linkages

System Architecture (cont.)
Morphological Analysis: All the words in the input sentence are analyzed by a fully functional Turkish morphological analyzer.

  oku (read)
    oku+Verb+Pos+Past+A2sg
  uygarlaşmak (to get civilized)
    uygar+Noun+A3sg+Pnon+Nom^DB+Verb+Become+Pos^DB+Noun+Inf1+A3sg+Pnon+Nom

Stripping Lexical Parts: The lexical parts of the words are removed for all types of words except conjunctions, because the Turkish link grammar is designed for the classes of word types and their feature structures.

  oku+Verb+Pos+Past+A2sg  ->  Verb+Pos+Past+A2sg

System Architecture (cont.)
Separating Derivation Boundaries: The words are separated at derivational boundaries, and the part of speech tag of each derived form is marked in order to indicate its position in that word.
Each token starts with a part of speech tag together with a position mark, and continues with inflectional feature structures.

  Noun+A3sg+P1pl+Loc ^DB+Adj+Rel ^DB+Noun+Zero+A3sg+Pnon+Gen
    ->  NounRoot+A3sg+P1pl+Loc   AdjDB   NounDBEnd+A3sg+Pnon+Gen

System Architecture (cont.)
Parsing Sentences:
Each representation of the sentence is fed into the parser.
A sentence is parsed with respect to the designed Turkish link grammar.
Turkish link grammar contains linking requirements for:
  each part of speech tag, and
  each part of speech tag followed by one of the strings “Root”, “DB”, or “DBEnd”.
The linking requirement for a token depends on the part of speech tag of the token and the inflectional suffixes in that token.

Turkish Link Grammar
Linking requirements are defined for a part of speech tag and inflectional suffixes.
  Noun+A3sg+Pnon+Nom : linking requirements for nouns with +A3sg+Pnon+Nom inflections
  Noun+A3sg+Pnon+Acc : linking requirements for nouns with +A3sg+Pnon+Acc inflections
  Verb+Pos+Past+A1sg : linking requirements for verbs with +Pos+Past+A1sg inflections
  Verb+Pos+Past+A2sg : linking requirements for verbs with +Pos+Past+A2sg inflections

Linking Requirements for Derivations
In order to preserve the syntactic roles that the intermediate derived forms of a word play, they are treated as separate words in the grammar.
In order to indicate that they are the intermediate derivations of the same word, all of them are linked with the special “DB” (derivational boundary) connector.

  Noun+A3sg+P1pl+Loc ^DB+Adj+Rel ^DB+Noun+Zero+A3sg+Pnon+Gen

            +-------DB-------+-------DB-------+
            |                |                |
  NounRoot+A3sg+P1pl+Loc   AdjDB   NounDBEnd+A3sg+Pnon+Gen

Linking Requirements for Derivations (cont.)
A derived word consists of a root word, intermediate derived forms, and a last derived form.
  The Root Word contributes only the left linking requirements of that word, and it is connected to the right with a DB connector.
  Intermediate Derived Forms also contribute only the left linking requirements of that word, and they are connected to the left and right with DB connectors.
  The Last Derived Form contributes both the left and right linking requirements of that word, and it is connected to the left with a DB connector.

Linking Requirements for Derivations (cont.)
For each part of speech tag, we need three more linking requirements for the three positions in derived words (root, intermediate, and last).
Example:
  Noun Inflections      : LeftLinkingRs & RightLinkingRs
  NounRoot Inflections  : LeftLinkingRs & DB+
  NounDB Inflections    : LeftLinkingRs & DB- & DB+
  NounDBEnd Inflections : LeftLinkingRs & RightLinkingRs & DB-

Evaluation
We tested the developed Turkish parser with a set of 250 sentences.
  The average number of words in the sentences is 5.19.
  The average number of parses per sentence is 7.49.
  For 84.31% of the sentences, the result set contains the correct parse.
  The average rank of the correct parse in the result set is 1.78.
  For 62.39% of the sentences, the first parse is the correct parse.
  For 80.94% of the sentences, one of the first three parses is correct.

Conclusions
A Turkish grammar is developed in the link grammar formalism.
The developed Turkish link grammar is not a lexical grammar: we used the morphological feature structures and the word classes.
We preserved the syntactic roles of the intermediate derived forms of words in our system by separating the derived words at their derivational boundaries and treating each intermediate form as a distinct word.
Our linking requirements are defined for morphological categories.
Our current system does not use a POS tagger; adding one is expected to improve performance in terms of both time and precision.
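
Sketch: preprocessing a morphological analysis
The "Stripping Lexical Parts" and "Separating Derivation Boundaries" steps described in the System Architecture slides can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function name is hypothetical, the root "X" in the usage example is a placeholder because the slide omits the root, and the rule that the first feature after the part of speech tag in a derived segment (such as Rel or Zero) is dropped is an assumption inferred from the single example on the slides. The special handling of conjunctions is not covered.

```python
from __future__ import annotations


def split_at_derivation_boundaries(analysis: str) -> list[str]:
    """Turn one morphological analysis into link grammar tokens.

    Hypothetical helper, assuming analyses of the form
    root+POS+features^DB+POS+DerivationType+features...
    """
    segments = analysis.split("^DB+")

    # Stripping lexical parts: drop the root that precedes the first "+".
    _root, _, first_segment = segments[0].partition("+")
    segments[0] = first_segment

    # An underived word keeps its plain part of speech tag (no position mark).
    if len(segments) == 1:
        return [first_segment]

    tokens = []
    last_index = len(segments) - 1
    for index, segment in enumerate(segments):
        pos, *features = segment.split("+")
        if index == 0:
            mark = "Root"
        else:
            # Assumption: the first feature of a derived segment names the
            # derivation type (e.g. Rel, Zero) and is not kept in the token.
            features = features[1:]
            mark = "DB" if index < last_index else "DBEnd"
        tokens.append("+".join([pos + mark, *features]))
    return tokens


if __name__ == "__main__":
    print(split_at_derivation_boundaries("oku+Verb+Pos+Past+A2sg"))
    # "X" is a placeholder root; the slide shows this analysis without one.
    print(split_at_derivation_boundaries(
        "X+Noun+A3sg+P1pl+Loc^DB+Adj+Rel^DB+Noun+Zero+A3sg+Pnon+Gen"
    ))
```

Under these assumptions the second call yields the three tokens shown on the slides (NounRoot+A3sg+P1pl+Loc, AdjDB, NounDBEnd+A3sg+Pnon+Gen), each of which would then be looked up in the grammar by its position-marked part of speech tag and inflections.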