* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download lexc
Chinese grammar wikipedia , lookup
Georgian grammar wikipedia , lookup
Kannada grammar wikipedia , lookup
Interpretation (logic) wikipedia , lookup
Swedish grammar wikipedia , lookup
Lexical semantics wikipedia , lookup
Spanish grammar wikipedia , lookup
Context-free grammar wikipedia , lookup
Regular expression wikipedia , lookup
Old Irish grammar wikipedia , lookup
Latin syntax wikipedia , lookup
Distributed morphology wikipedia , lookup
French grammar wikipedia , lookup
Compound (linguistics) wikipedia , lookup
Serbo-Croatian grammar wikipedia , lookup
Arabic grammar wikipedia , lookup
Portuguese grammar wikipedia , lookup
Agglutination wikipedia , lookup
Junction Grammar wikipedia , lookup
Malay grammar wikipedia , lookup
Ancient Greek grammar wikipedia , lookup
Romanian nouns wikipedia , lookup
Scottish Gaelic grammar wikipedia , lookup
Vietnamese grammar wikipedia , lookup
Turkish grammar wikipedia , lookup
Zulu grammar wikipedia , lookup
Polish grammar wikipedia , lookup
Yiddish grammar wikipedia , lookup
The lexc Language Prepare to partition your brain to learn a whole new formalism. The lexc language “lexc” stands for “LEXicon Compiler” lexc is a high-level, declarative programming language lexc is different from regular expressions and xfst the syntax is different the assumptions are different the special characters are different This is all fertile ground for confusion! BUT, the lexc compiler produces STANDARD Xerox networks these networks are fully compatible with networks from xfst you can sometimes choose to use lexc or xfst for building a network Why a Separate lexc Language? Lexc is intended for use by lexicographers. Regular expressions in xfst are often hard to read, especially big ones Typing spaces between all the letters , e.g. e l e p h a n t , to be concatenated in xfst is a nuisance, especially if you need to type 40,000 words You can also write {elephant} for concatenation in xfst regular expressions, but that’s a nuisance too. Lexc is more efficient for compiling large natural-language lexicons (it optimizes the union operation) Lexc has better error messages But remember: lexc is just another formalism for defining finite-state languages and relations you can (and will) use lexc and xfst together in building significant applications Lexc descriptions are written in a file xfst[]: read regex < myfile.regex Regex File xfst[]: read lexc < myfile-lexc.txt lexc File • Recall that a regular-expression file must contain just a regular expression, terminated with a semicolon and a newline • A Lexc file can be compiled inside xfst, using “read lexc”. The resulting network is pushed on The Stack. • A Lexc file must contain just a lexc description, written according to the syntax of lexc The lexc Source File: Multichar_Symbols The lexc compiler and the xfst regular-expression compiler have completely opposite assumptions about multicharacter symbols: In xfst Regular Expressions, the default is to treat a string of symbols written together, e.g. %+Noun or cat or “[Noun]” , as a single symbol. Concatenation of separate symbols is indicated by manually separating symbols with white space, e.g. [ c a t ], or by using the curly-brace notation, e.g. {cat}. In lexc, in contrast, the default is to treat strings, e.g. cat, as a concatenation of three symbols. Any multicharacter symbols must be explicitly declared at the top of the source file. Multichar_Symbols declaration Multichar_Symbols [Noun] [Verb] [Adj] [Adv] [Sg] [Pl] [1P] [2P] [3P] ^FEAT1 ^FEAT2 The Multichar_Symbols statement is formally optional and is placed at the top of your lexc source file. You can declare as many multicharacter symbols as you find necessary or useful. The compiler uses this declaration to separate the strings of your lexc program into symbols. You are strongly encouraged to include a non-alphabetic character in the spelling of each multicharacter symbol to help them stand out visually. Multichar_Symbols declaration Multichar_Symbols +Noun +Verb +Adj +Adv +Sg +Pl +1P +2P +3P ^FEAT1 ^FEAT2 Try to use a consistent convention for spelling multicharacter symbols. Symbols to be seen by users might be spelled +Noun or [Noun]. In several Xerox systems, feature-like used internally in grammars but which do not appear in output strings are often spelled with an initial circumflex ^ (“hat”) sign. Multichar_Symbols declaration Multichar_Symbols [Noun] [Verb] [Adj] [Adv] [Sg] [Pl] ! Number tags [1P] [2P] [3P] ! Person tags ^FEAT1 ^FEAT2 ! “Feature” tags, used internally The Multichar_Symbols declaration can extend over multiple lines and has no terminator. The Body of your lexc Program The body of a lexc program is composed of LEXICONs. There should always be one LEXICON named Root. It corresponds to the Start State in the resulting Network. LEXICON Root dog N; cat N; bird N; If you don’t define a LEXICON Root, lexc will try to use the first LEXICON in the file as the Start State. Done confuse “LEXICON Root” with the roots of your natural language. Entries in a LEXICON Each defined LEXICON must have at least one entry. An entry consists of two parts and is terminated with a semicolon data continuation-class ; The data part has to fit one of four formats: string e.g. upper:lower < regular-expression > e.g. empty e.g. e.g. dog swim:swam < a b* c > upper:lower Entries The upper:lower entries are the simplest way to specify portions of the network where the upper-side and lower-side differ. They are especially useful for irregularies/suppletions. Multichar_Symbols +Verb +Past +Noun +Sg +Pl LEXICON Root swim+Verb+Past:swam #; go+Verb+Past:went #; child+Noun+Pl:children #; ox+Noun+Pl:oxen #; upper:lower Entries In upper:lower entries, you can overtly indicate where the epsilons should go. Multichar_Symbols +Verb +Past +Noun +Sg +Pl +Nom LEXICON Root poder+Verb:pod0r FutCond ; Don’t confuse the lexc upper:lower notation with the regular-expression colon notation. Regular Expressions in lexc Any data written as a regular expression must be surrounded with angle brackets, e.g. <elephant> CC ; < a b* c+> CC ; Inside angle brackets, you revert to all the assumptions suitable for xfst regular expressions, including the treatment of multicharacter symbols vs. concatenation of symbols. This is fertile ground for errors. Continuation Classes The Continuation Class is just the name of a defined LEXICON or #, indicating end-of-word (a final state). Multichar_Symbols [Noun] [Sg] [Pl] LEXICON Root dog N; cat N; LEXICON N [Noun][Sg]:0 [Noun][Pl]:s #; #; Thinking About lexc LEXICONS A LEXICON should hold a coherent class of morphemes The entries in a lexc LEXICON are unioned together by the compiler; the order of the entries in a LEXICON is not significant. Think of each LEXICONs as a potential “target” Entries “point at” a LEXICON via the ContinuationClass But each entry in a LEXICON could itself point to a different ContinuationClass During development, you may have to subdivide lexicons Avoid having copies of the same material (if possible) You may change an entry in one place and forget to change the copy Thinking About lexc LEXICONS II Some developers adopt a more restricted lexc style in which all the entries in the same LEXICON share the same continuation class. They Flag believe that it makes their grammars easier to maintain. Diacritics (to be discussed later in the week) help facilitate this style of programming. Formally Speaking . . . Lexc syntax is a kind of right-recursive context-free phrasestructure grammar. Context-free phrase-structure grammars can in general describe context-free languages (i.e. languages beyond finite-state power), including languages with balanced parentheses. But with the right-recursive limitation, a phrase-structure grammar can define only finite-state languages. Lexc can describe only finite-state languages. Lexc descriptions compile into finite-state networks. Optional Morphemes via By-Pass LEXICON Root Vroot ; LEXICON Vroot kant dir don pens V; V; V; V; LEXICON V AdLex ; Vend ; ! bypass LEXICON AdLex ad Vend ; LEXICON Vend as is os us u i #; #; #; #; #; #; Lexc Idiom: Optional Morphemes via “Escape” Entries LEXICON Vroot kant AdLex ; dir AdLex ; don AdLex ; pens AdLex ; LEXICON AdLex ad Vend ; Vend ; ! escape LEXICON Vend as is os us u i #; #; #; #; #; #; Lexc Idiom: Loops LEXICON Nroot LEXICON Plur kat N; j hund N; elefant N; Case ; LEXICON N LEXICON Case eg N; ! loop et N; ! loop in N; ! loop Nend ; LEXICON Nend o Case ; ! Opt. plural ending Plur ; n #; #; ! Opt. case ending Stem compounding (loops) in lexc LEXICON Nroot LEXICON Plur kat N; j hund N; elefant N; LEXICON N Case ; LEXICON Case Nroot ; Nend ; LEXICON Nend o Case ; ! Opt. plural ending Plur ; n #; #; Special Characters in lexc Overall, there are far fewer special characters in lexc than in regular expressions. In lexc, the following are special: Special Literalized : used in upper:lower notation %: ; terminates an entry %; < begins a regular expression %< > ends a regular expression 0 denotes the empty string (epsilon) %0 ! introduces a comment line %! # continuation-class for end-of-word %# % literalizing prefix %% %> Lexc source files Lexc sources files were traditionally ISO-8859-1 (Latin1) text files, typically edited with xemacs, vi, Notepad, etc. In the latest versions of lexc, your source files can be in Unicode (UTF-8). Lexc programs for natural language can get very large Typically 8000 to 12000 entries for verbs Tens of thousands of entries for nouns and proper nouns Using lexc and xfst together Write a lexc source file (e.g. mysrc-lexc.txt) using xemacs or a similar text editor Write suitable alternation rules (e.g. a regex file mysrc-rul.regex) Then, from the xfst interface xfst[]: read regex < mysrc-rul.regex xfst[]: define Rules xfst[]: read lexc < mysrc-lexc.txt xfst[]: define Lexicon xfst[]: read regex Lexicon .o. Rules ; xfst[]: apply up testword There is a separate lexc interface Until recently, it was necessary to use a separate “lexc interface” to compile lexc files. Now lexc files can be compiled easily from inside xfst, using the ‘read lexc’ command. The resulting network is pushed onto The Stack. Most users do not need to know about the separate lexc interface. Review: Using lexc and xfst together Lexc lexicons build “words” (strings) using union and concatenation Entries within a LEXICON are unioned (the order of entries is not significant) The LEXICON Root corresponds to the start state The special # continuation class corresponds to final states Other continuation classes translate into concatenation By Xerox convention, upper-side strings consist of a baseform and “tags” By convention, a surface (or more surfacy) form appears on the lower-side The surfacy forms generated by lexc may still be rather abstract, hyper-regular, or “morphophonemic”. They may sometimes contain multicharacter symbols. Replace Rules (perhaps a whole cascade of them) map from the surfacy strings produced by lexc to real surface strings; rules are applied using composition. Composition can also be used to “filter” out various kinds of overgeneration. A Typical Finite-State System Filters (xfst) .o. Core Lexicon (lexc) .o. Orthographical or Phonological Alternation Rules (xfst) A System may be a Union of Subsystems Nouns (lexc) Verbs (lexc) .o. .o. Noun Rules Verb Rules Adjs (lexc) Numbers (lexc) Then, in xfst: define Final NounFST | VerbFST | AdjFST | NumberFST ; Review: Outputs and Inputs Unix Pipe: cat wordlist.in | sort | uniq -c | sort -rnb > myfile.out The output of one routine is the input to the next NOT reversible Cascade of Replace Rules: read regex [ N -> m || _ p ] .o. [ p -> m || m _ ] ; Reversible/bidirectional relation apply down: the output of the first rule is the input to the second the lower side of the top rule is the upper side of the bottom rule Review: Up and Down In xfst (regular expressions): In lexc (data) a:b swim[Verb][Pres]:swam [Pl]:s upper:lower [ a .x. b ] a -> b a <- b Review: Xerox Conventions The upper-side language, and all intermediate languages, have to be defined by the linguist writing the grammar. Upper (lexical) language: baseform[Tag][Tag][Tag] (mapping via rules) Lower (surface) language: orthographical-string The surface language is usually determined for you by the standard orthography. Review: Up and Down with Composition A rule that refers to tags on its lower side, e.g. “[Sg]” <- “+Sing” .o. baseform+Tag+Tag+Sing surfacy-form .o. A rule that refers to a surfacy form on its upper side, e.g. s -> z || Vowel _ Vowel An FST defined using lexc Review: Think in terms of Languages and Relations Lexical Language Core Lexicon FST Surfacy Language Rule1 Intermediate Language 1 Rule2 Intermediate Language 2 … Rule n Final Surface Language Exercises Write a simple lexc grammar that handles regular English noun plurals. Use the tags +Noun, +Sg and +Pl or, probably better, [Noun], [Sg] and [Pl] on the upper (lexical) side. Write a lexc grammar for Esperanto Nouns (4.2.8) Write a lexc grammar for Esperanto Adjectives (p. 222) Then combine the nouns and adjectives into one lexc description Advanced: try to do the Bambona example (3.5.4) but using lexc rather than xfst for the morphotactics. Old Slides on The lexc Interface Most users do not need the lexc interface anymore Most users find it confusing to use both the xfst and the lexc interfaces. This lexc interface can still be needed If you are using twolc rules If you are maintaining a legacy system that uses twolc rules If you want to use the lexc ‘check-all’ feature 99% of all users will be better off using xfst, and compiling their lexc files with the ‘read lexc’ command xfst[]: read lexc < myfile-lexc.txt The lexc interface Invoke the lexc interface by simply entering ‘lexc’ at the UNIX prompt. prompt% lexc You communicate with the interface using lexc commands. Type ‘?’ to see all the possible commands. Invoke ‘help commandname’ to see some terse online documentation. Enter ‘quit’ to leave lexc and return to the operating system. lexc: quit The Three lexc Registers To understand lexc commands, you must understand that they refer to and operate on networks held in three registers, visualized as SOURCE Typically used to store a lexicon FST. RULES Typically used to store a rule FST or FSTs. RESULT Typically used to store the result of composing the rule FST(s) under the source FST. Basic lexc commands compile-source filename SOURCE compiles the lexc source code in filename and stores the resulting network in SOURCE read-rules filename RULES reads the binary file filename and stores the network(s) in RULES. The binary file may be from xfst or twolc. compose-result RESULT composes the RULE FST(s) under the SOURCE FST and stores the resulting FST in RESULT Some Other lexc Commands read-source load a pre-compiled binary network into the SOURCE register save-source filename store network in SOURCE to file save-result filename store network in RESULT to file lookup word (equivalent to the xfst ‘apply up’) lookdown word (equivalent to the xfst ‘apply down’) result-to-source move the network from the RESULT register to SOURCE