* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download lexc
Chinese grammar wikipedia , lookup
Georgian grammar wikipedia , lookup
Kannada grammar wikipedia , lookup
Interpretation (logic) wikipedia , lookup
Swedish grammar wikipedia , lookup
Lexical semantics wikipedia , lookup
Spanish grammar wikipedia , lookup
Context-free grammar wikipedia , lookup
Regular expression wikipedia , lookup
Old Irish grammar wikipedia , lookup
Latin syntax wikipedia , lookup
Distributed morphology wikipedia , lookup
French grammar wikipedia , lookup
Compound (linguistics) wikipedia , lookup
Serbo-Croatian grammar wikipedia , lookup
Arabic grammar wikipedia , lookup
Portuguese grammar wikipedia , lookup
Agglutination wikipedia , lookup
Junction Grammar wikipedia , lookup
Malay grammar wikipedia , lookup
Ancient Greek grammar wikipedia , lookup
Romanian nouns wikipedia , lookup
Scottish Gaelic grammar wikipedia , lookup
Vietnamese grammar wikipedia , lookup
Turkish grammar wikipedia , lookup
Zulu grammar wikipedia , lookup
Polish grammar wikipedia , lookup
Yiddish grammar wikipedia , lookup
The lexc Language
Prepare to partition your brain to learn a whole
new formalism.
The lexc language
“lexc” stands for “LEXicon Compiler”
lexc is a high-level, declarative programming language
lexc is different from regular expressions and xfst
the syntax is different
the assumptions are different
the special characters are different
This is all
fertile ground
for confusion!
BUT, the lexc compiler produces STANDARD Xerox networks
these networks are fully compatible with networks from xfst
you can sometimes choose to use lexc or xfst for building a network
Why a Separate lexc Language?
Lexc is intended for use by lexicographers.
Regular expressions in xfst are often hard to read, especially big ones
Typing spaces between all the letters , e.g. e l e p h a n t , to be concatenated in xfst is a
nuisance, especially if you need to type 40,000 words
You can also write {elephant} for concatenation in xfst regular expressions, but that’s a
nuisance too.
Lexc is more efficient for compiling large natural-language lexicons (it optimizes the
union operation)
Lexc has better error messages
But remember:
lexc is just another formalism for defining finite-state languages and relations
you can (and will) use lexc and xfst together in building significant applications
Lexc descriptions are written in a file
xfst[]: read regex < myfile.regex
Regex File
xfst[]: read lexc < myfile-lexc.txt
lexc File
• Recall that a regular-expression file must contain just a regular
expression, terminated with a semicolon and a newline
• A Lexc file can be compiled inside xfst, using “read lexc”. The
resulting network is pushed on The Stack.
• A Lexc file must contain just a lexc description, written
according to the syntax of lexc
The lexc Source File: Multichar_Symbols
The lexc compiler and the xfst regular-expression compiler have
completely opposite assumptions about multicharacter symbols:
In xfst Regular Expressions, the default is to treat a string of symbols
written together, e.g. %+Noun or cat or “[Noun]” , as a single
symbol. Concatenation of separate symbols is indicated by
manually separating symbols with white space, e.g. [ c a t ], or by
using the curly-brace notation, e.g. {cat}.
In lexc, in contrast, the default is to treat strings, e.g. cat, as a
concatenation of three symbols. Any multicharacter symbols must
be explicitly declared at the top of the source file.
Multichar_Symbols declaration
Multichar_Symbols [Noun] [Verb] [Adj] [Adv]
[Sg] [Pl]
[1P] [2P] [3P] ^FEAT1 ^FEAT2
The Multichar_Symbols statement is formally optional and is
placed at the top of your lexc source file. You can declare as many
multicharacter symbols as you find necessary or useful. The
compiler uses this declaration to separate the strings of your lexc
program into symbols. You are strongly encouraged to include a
non-alphabetic character in the spelling of each multicharacter
symbol to help them stand out visually.
Multichar_Symbols declaration
Multichar_Symbols
+Noun +Verb +Adj +Adv +Sg
+Pl +1P +2P +3P ^FEAT1 ^FEAT2
Try to use a consistent convention for spelling multicharacter
symbols. Symbols to be seen by users might be spelled +Noun or
[Noun]. In several Xerox systems, feature-like used internally in
grammars but which do not appear in output strings are often
spelled with an initial circumflex ^ (“hat”) sign.
Multichar_Symbols declaration
Multichar_Symbols
[Noun]
[Verb]
[Adj]
[Adv]
[Sg] [Pl]
! Number tags
[1P] [2P] [3P]
! Person tags
^FEAT1 ^FEAT2
! “Feature” tags, used internally
The Multichar_Symbols declaration can extend over multiple lines
and has no terminator.
The Body of your lexc Program
The body of a lexc program is composed of LEXICONs.
There should always be one LEXICON named Root. It
corresponds to the Start State in the resulting Network.
LEXICON Root
dog
N;
cat
N;
bird
N;
If you don’t define a LEXICON Root, lexc will try to use the first
LEXICON in the file as the Start State.
Done confuse “LEXICON Root” with the roots of your natural
language.
Entries in a LEXICON
Each defined LEXICON must have at least one entry.
An entry consists of two parts and is terminated with a semicolon
data
continuation-class ;
The data part has to fit one of four formats:
string
e.g.
upper:lower
< regular-expression >
e.g.
empty
e.g.
e.g.
dog
swim:swam
< a b* c >
upper:lower Entries
The upper:lower entries are the simplest way to specify portions of
the network where the upper-side and lower-side differ. They are
especially useful for irregularies/suppletions.
Multichar_Symbols +Verb +Past +Noun +Sg +Pl
LEXICON Root
swim+Verb+Past:swam
#;
go+Verb+Past:went
#;
child+Noun+Pl:children
#;
ox+Noun+Pl:oxen
#;
upper:lower Entries
In
upper:lower entries, you can overtly indicate where the
epsilons should go.
Multichar_Symbols +Verb +Past +Noun +Sg +Pl +Nom
LEXICON Root
poder+Verb:pod0r
FutCond ;
Don’t confuse the lexc upper:lower notation with
the regular-expression colon notation.
Regular Expressions in lexc
Any data written as a regular expression must be surrounded with
angle brackets, e.g.
<elephant>
CC ;
< a b* c+>
CC ;
Inside angle brackets, you revert to all the assumptions suitable for
xfst regular expressions, including the treatment of multicharacter
symbols vs. concatenation of symbols.
This is fertile ground for errors.
Continuation Classes
The
Continuation Class is just the name of a defined
LEXICON or #, indicating end-of-word (a final state).
Multichar_Symbols [Noun] [Sg] [Pl]
LEXICON Root
dog
N;
cat
N;
LEXICON N
[Noun][Sg]:0
[Noun][Pl]:s
#;
#;
Thinking About lexc LEXICONS
A LEXICON should hold a coherent class of morphemes
The entries in a lexc LEXICON are unioned together by the compiler; the
order of the entries in a LEXICON is not significant.
Think of each LEXICONs as a potential “target”
Entries “point at” a LEXICON via the ContinuationClass
But each entry in a LEXICON could itself point to a different
ContinuationClass
During development, you may have to subdivide lexicons
Avoid having copies of the same material (if possible)
You may change an entry in one place and forget to change the copy
Thinking About lexc LEXICONS II
Some
developers adopt a more restricted lexc style in which all
the entries in the same LEXICON share the same continuation
class.
They
Flag
believe that it makes their grammars easier to maintain.
Diacritics (to be discussed later in the week) help facilitate
this style of programming.
Formally Speaking . . .
Lexc syntax is a kind of right-recursive context-free phrasestructure grammar.
Context-free phrase-structure grammars can in general describe
context-free languages (i.e. languages beyond finite-state power),
including languages with balanced parentheses.
But with the right-recursive limitation, a phrase-structure grammar
can define only finite-state languages.
Lexc can describe only finite-state languages.
Lexc descriptions compile into finite-state networks.
Optional Morphemes via By-Pass
LEXICON Root
Vroot ;
LEXICON Vroot
kant
dir
don
pens
V;
V;
V;
V;
LEXICON V
AdLex ;
Vend ; ! bypass
LEXICON AdLex
ad
Vend ;
LEXICON Vend
as
is
os
us
u
i
#;
#;
#;
#;
#;
#;
Lexc Idiom: Optional Morphemes via
“Escape” Entries
LEXICON Vroot
kant
AdLex ;
dir
AdLex ;
don
AdLex ;
pens
AdLex ;
LEXICON AdLex
ad
Vend ;
Vend ; ! escape
LEXICON Vend
as
is
os
us
u
i
#;
#;
#;
#;
#;
#;
Lexc Idiom: Loops
LEXICON Nroot
LEXICON Plur
kat
N;
j
hund
N;
elefant
N;
Case ;
LEXICON N
LEXICON Case
eg
N;
! loop
et
N;
! loop
in
N;
! loop
Nend ;
LEXICON Nend
o
Case ; ! Opt. plural ending
Plur ;
n
#;
#;
! Opt. case ending
Stem compounding (loops) in lexc
LEXICON Nroot
LEXICON Plur
kat
N;
j
hund
N;
elefant
N;
LEXICON N
Case ;
LEXICON Case
Nroot ;
Nend ;
LEXICON Nend
o
Case ; ! Opt. plural ending
Plur ;
n
#;
#;
Special Characters in lexc
Overall, there are far fewer special characters in lexc than in regular expressions.
In lexc, the following are special:
Special
Literalized
:
used in upper:lower notation
%:
;
terminates an entry
%;
<
begins a regular expression
%<
>
ends a regular expression
0
denotes the empty string (epsilon)
%0
!
introduces a comment line
%!
#
continuation-class for end-of-word
%#
%
literalizing prefix
%%
%>
Lexc source files
Lexc
sources files were traditionally ISO-8859-1 (Latin1) text files, typically edited with xemacs, vi, Notepad,
etc.
In the latest versions of lexc, your source files can be in
Unicode (UTF-8).
Lexc
programs for natural language can get very large
Typically
8000 to 12000 entries for verbs
Tens of thousands of entries for nouns and proper nouns
Using lexc and xfst together
Write a lexc source file (e.g. mysrc-lexc.txt) using xemacs or a
similar text editor
Write suitable alternation rules (e.g. a regex file mysrc-rul.regex)
Then, from the xfst interface
xfst[]:
read regex < mysrc-rul.regex
xfst[]:
define Rules
xfst[]:
read lexc < mysrc-lexc.txt
xfst[]:
define Lexicon
xfst[]:
read regex Lexicon .o. Rules ;
xfst[]:
apply up testword
There is a separate lexc interface
Until
recently, it was necessary to use a separate “lexc
interface” to compile lexc files.
Now
lexc files can be compiled easily from inside xfst,
using the ‘read lexc’ command. The resulting network is
pushed onto The Stack.
Most
users do not need to know about the separate lexc
interface.
Review: Using lexc and xfst together
Lexc lexicons build “words” (strings) using union and concatenation
Entries within a LEXICON are unioned (the order of entries is not significant)
The
LEXICON Root
corresponds to the start state
The special # continuation class corresponds to final states
Other continuation classes translate into concatenation
By Xerox convention, upper-side strings consist of a baseform and “tags”
By convention, a surface (or more surfacy) form appears on the lower-side
The surfacy forms generated by lexc may still be rather abstract, hyper-regular, or
“morphophonemic”. They may sometimes contain multicharacter symbols.
Replace Rules (perhaps a whole cascade of them) map from the surfacy strings
produced by lexc to real surface strings; rules are applied using composition.
Composition can also be used to “filter” out various kinds of overgeneration.
A Typical Finite-State System
Filters (xfst)
.o.
Core Lexicon
(lexc)
.o.
Orthographical or
Phonological
Alternation Rules
(xfst)
A System may be a Union of Subsystems
Nouns (lexc)
Verbs (lexc)
.o.
.o.
Noun
Rules
Verb
Rules
Adjs (lexc)
Numbers (lexc)
Then, in xfst:
define Final NounFST | VerbFST | AdjFST | NumberFST ;
Review: Outputs and Inputs
Unix Pipe:
cat wordlist.in | sort | uniq -c | sort -rnb > myfile.out
The output of one routine is the input to the next
NOT reversible
Cascade of Replace Rules:
read regex [ N -> m || _ p ] .o. [ p -> m || m _ ] ;
Reversible/bidirectional relation
apply down: the output of the first rule is the input to the second
the lower side of the top rule is the upper side of the bottom rule
Review: Up and Down
In xfst (regular expressions):
In lexc (data)
a:b
swim[Verb][Pres]:swam
[Pl]:s
upper:lower
[ a .x. b ]
a -> b
a <- b
Review: Xerox Conventions
The upper-side language, and all intermediate
languages, have to be defined by the linguist
writing the grammar.
Upper
(lexical) language:
baseform[Tag][Tag][Tag]
(mapping via rules)
Lower
(surface) language:
orthographical-string
The surface language is usually determined
for you by the standard orthography.
Review: Up and Down with Composition
A rule that refers to tags on
its lower side, e.g.
“[Sg]” <- “+Sing”
.o.
baseform+Tag+Tag+Sing
surfacy-form
.o.
A rule that refers to a surfacy
form on its upper side, e.g.
s -> z || Vowel _ Vowel
An FST defined
using lexc
Review: Think in terms of Languages and
Relations
Lexical Language
Core Lexicon FST
Surfacy Language
Rule1
Intermediate Language 1
Rule2
Intermediate Language 2
…
Rule n
Final Surface Language
Exercises
Write a simple lexc grammar that handles regular English noun
plurals. Use the tags +Noun, +Sg and +Pl or, probably better,
[Noun], [Sg] and [Pl] on the upper (lexical) side.
Write a lexc grammar for Esperanto Nouns (4.2.8)
Write a lexc grammar for Esperanto Adjectives (p. 222)
Then combine the nouns and adjectives into one lexc description
Advanced: try to do the Bambona example (3.5.4) but using lexc
rather than xfst for the morphotactics.
Old Slides on The lexc Interface
Most users do not need the lexc interface anymore
Most users find it confusing to use both the xfst and the lexc
interfaces.
This lexc interface can still be needed
If you are using twolc rules
If you are maintaining a legacy system that uses twolc rules
If you want to use the lexc ‘check-all’ feature
99% of all users will be better off using xfst, and compiling their
lexc files with the ‘read lexc’ command
xfst[]: read lexc < myfile-lexc.txt
The lexc interface
Invoke the lexc interface by simply entering ‘lexc’ at the UNIX
prompt.
prompt% lexc
You communicate with the interface using lexc commands. Type ‘?’
to see all the possible commands. Invoke ‘help commandname’ to
see some terse online documentation.
Enter ‘quit’ to leave lexc and return to the operating system.
lexc: quit
The Three lexc Registers
To understand lexc commands, you must understand that they refer
to and operate on networks held in three registers, visualized as
SOURCE
Typically used to store a lexicon FST.
RULES
Typically used to store a rule FST or FSTs.
RESULT
Typically used to store the result of composing
the rule FST(s) under the source FST.
Basic lexc commands
compile-source filename
SOURCE
compiles the lexc source code in filename and
stores the resulting network in SOURCE
read-rules filename
RULES
reads the binary file filename and stores the
network(s) in RULES. The binary file may be
from xfst or twolc.
compose-result
RESULT
composes the RULE FST(s) under the SOURCE
FST and stores the resulting FST in RESULT
Some Other lexc Commands
read-source
load a pre-compiled binary network into
the SOURCE register
save-source filename
store network in SOURCE to file
save-result filename
store network in RESULT to file
lookup word
(equivalent to the xfst ‘apply up’)
lookdown word
(equivalent to the xfst ‘apply down’)
result-to-source
move the network from the RESULT
register to SOURCE