Download Document

Document related concepts

Japanese grammar wikipedia , lookup

Portuguese grammar wikipedia , lookup

Arabic grammar wikipedia , lookup

Latin syntax wikipedia , lookup

Old Irish grammar wikipedia , lookup

Untranslatability wikipedia , lookup

Lithuanian grammar wikipedia , lookup

Modern Greek grammar wikipedia , lookup

Junction Grammar wikipedia , lookup

Zulu grammar wikipedia , lookup

Spanish grammar wikipedia , lookup

Distributed morphology wikipedia , lookup

Comparison (grammar) wikipedia , lookup

Old Norse morphology wikipedia , lookup

Compound (linguistics) wikipedia , lookup

Ojibwe grammar wikipedia , lookup

Swedish grammar wikipedia , lookup

Yiddish grammar wikipedia , lookup

Symbol grounding problem wikipedia , lookup

Ancient Greek grammar wikipedia , lookup

Serbo-Croatian grammar wikipedia , lookup

Pleonasm wikipedia , lookup

Old English grammar wikipedia , lookup

Contraction (grammar) wikipedia , lookup

Esperanto grammar wikipedia , lookup

Scottish Gaelic grammar wikipedia , lookup

Russian declension wikipedia , lookup

Turkish grammar wikipedia , lookup

French grammar wikipedia , lookup

Inflection wikipedia , lookup

Parsing wikipedia , lookup

Polish grammar wikipedia , lookup

Agglutination wikipedia , lookup

Stemming wikipedia , lookup

Pipil grammar wikipedia , lookup

Malay grammar wikipedia , lookup

Morphology (linguistics) wikipedia , lookup

Transcript
Dictionaries and Grammar
Questions to Address
• Do we include all forms of a particular
word, or do we include only the base word
and derive its forms?
• How are the grammatical rules of a
language represented?
• How do we represent the parts of speech
that go with particular grammatical rules?
Morphology
Definitions
• Morphology
– The study of the patterns used to form words
– E.g. inflection, derivation, and compounds
• Morpheme - Minimal meaning-bearing unit
– Could be a stem or an affix
• Stem {“unthinkable” “realization” “distrust”}
– The part of a word that contains the root meaning (E.g. cat)
• Affixes {-s, un-, de-, -en, -able, -ize, -hood}
– a linguistic element added to a word modify the meaning
– E.g.: prefix (unbuckle), suffix (buckled), infix (absobloodylutely),
and circumfix (gesagt in German for said).
– Affixes can attach to other affixes (boyishness)
Knowing Words
•
When we know a word, we know its
1.
2.
3.
4.
•
Phonological sound sequences
Semantic meanings
Morphological relationships
Syntactic categories and proper structure of a sentence
Morphological relationships adjust word meanings
–
–
–
–
–
–
–
Person
Number
Case
Tense
Degree
Gender
Part of Speech
Jill waits.
Jill carried two buckets.
The chair’s leg is broken.
Jill is waiting there now.
Jill ran faster than Jack.
Jill is female
Jill is a proper noun
These are the kind of things we want our computers to figure out
Units of Meaning
• How many morphemes do each of the following
sentences have?
–
–
–
–
“I have two cats”
“She wants to leave soon”
“He walked across the room”
“Her behavior was unbelievable”
• Free Morphemes {eye, think, run, apple}
• Bound Morphemes {-able, un-, -s, -tion, -ly}
Affix Examples
• Prefixes from Karuk, a Hokan language of California
 [pasip]
 [nipasip]
 [/upasip]
“Shoot!”
“I shoot”
“She/he shoots”
• Suffixes from Mende spoken in Liberia and Sierra Leone




[pElE]
[pElEi]
[mEmE]
[mEmEi]
“house”
“the house”
“glass”
“the glass”
• Infixes from Bontoc spoken in the Phillipines




[fikas]
[fumikas]
[fusul]
[fumusal]
“strong”
“she is becoming strong”
“enemy”
“she is becoming an enemy”
Turkish Morpology
Uygarlastiramadiklarimizdanmissinizcasina
Meaning: `behaving as if you are among those
whom we could not civilize’
•
•
•
•
Uygar `civilized’ + las `become’ + tir `cause’
+ ama `not able’ + dik `past’
+ lar ‘plural’+ imiz ‘p1pl’ + dan ‘abl’
+ mis ‘past’ + siniz ‘2pl’ + casina ‘as if’
How does the Mind Store Meanings?
• Hypotheses
– Full listing: We store all words individually
– Minimum redundancy: We store morphemes and how they relate
• Analysis
– Determine if people understand new words based on root meanings
– Observe whether children have difficulty learning exceptions
– Regular form: government/govern, Irregular form: department/depart
• Evidence suggests
– The mind represents words and affix meanings separately
– Linguists observe that affixes were originally separate words that
speakers slur together over time
General Observations about
Lexicons
• Meanings are continually changing
• Roots and Morphemes do not have to occur in a fixed position
in relation to other elements.
• How many words do people know?
– Shakespeare uses 15,000 words
– A typical high school student knows 60,000
(learning 10 words a day from 12 months to 18 years)
• How many English words are there?
– Over 300,000 words without Morphemes in 1988
Computational Morphology
Speech recognition requires a language dictionary
How many words would it contain?
• Consider all of the morphemes of the word ‘true’
– true, truer, truest, truly, untrue, truth, truthful, truthfully, untruthfully,
untruthfulness
– Untruthfulness = un- + true + -th + -ful + -ness
• Productive morphemes
– An affix that at a point in time spread rapidly through the language
– Consider goose and geese versus cat and cats
• The former was an older way to indicate plurals
• The latter is a more recent way that spread throughout
• If we store morpheme rules, not all words, we can
– Reduce storage requirements and simplify creating entire dictionaries
– More closely mimic how the mind does it
– Be able to automatically understand newly encountered word forms
Morphology Rules
• There are rules used to form complex words from their roots
– ‘re-’ only precedes verbs (rerun, release, return)
– ‘-s’ indicates plurals
– ‘-ed’ indicates past tense
• Affix Rules
– Regular: follow productive affix rules
– Irregular: don’t follow productive affix rules
• Nouns
– Regular: (cat, thrush), (cats, thrushes), (cat’s thrushes’)
– Irregular: (mouse, ox), (mice, oxen)
Observation: More frequent words resist changes that result from
productive affixes and take irregular forms (E.g. am, is, are).
Exceptions: A singer sings, and a writer writes. Why doesn’t a
whisker whisk, a spider spid, or a finger fing?
Parsing
Identify components and underlying structure
• Morphological parsing
– Identifies stem and affixes and how they relate
– Example:
• fish  fish + Noun + Singular or goose + Verb
• fish  fish +Noun +Plural
• fish  fish +Verb +Singular
– Bracketing: indecipherable  [in [[de [cipher]] able]]
• Why do we parse?
–
–
–
–
–
spell-checking: Is muncheble a real word?
Identify a word’s part-of-speech (pos)
Sentence parsing and machine translation
Identify word stems for data mining search operations
Speech recognition and text to speech
Parsing Applications
• Lexicon
– Create a word list
– Include both stems and affixes (with the part of speech)
• Morphotactics
– Models how morphemes can be affixed to a stem.
– E.g., plural morpheme follows noun in English
• Orthographic rules
– Defines spelling modifications during affixation
– E.g. true  tru in context of true  truthfully
Grammatical Morphemes
• New forms are rarely added to closed
morpheme classes
• Examples
– prepositions
– articles
– conjunctions
at, for, by
a, the
and, but, or
Morphological Parsing (stemming)
• Goal: Break the surface input into morphemes
• foxes
– Fox is a noun stem
– It has -es as a plural suffix
• rewrites
– Write is the verb stem
– It has re- as a prefix meaning to do again
– It has a –s suffix indicating a continuing activity
Inflectional Morphology
Does not change the grammatical category
• Nouns
– plural marker: -s (dog + s = dogs)
– possessive marker: -’s (dog + ’s = dog’s)
• Verbs
–
–
–
–
3rd person present singular: -s (walk + s = walks)
past tense: -ed (walk + ed = walked)
progressive: -ing (walk + ing = walking)
past participle: -en or -ed (eat + en = eaten)
• Adjectives
– comparative: -er (fast + er = faster)
– superlative: -est (fast + est = fastest)
• In English
– Meaning transformations are predictable
– All inflectional affixes are suffices
– Inflectional affixes are attached after any derivational (next slide) affixes
• E.g. modern + ize + s = modernizes; not modern + s + ize
Concatenative and Non-concatenative
• Concatenative morphology combines by concatentation
– prefixes and suffixes
• Non-concatentative morphology combines in complex ways
– circumfixes and infixes
– templatic morphology
• words change by internal changes to the root
• E.g. (Arabic, Hebrew) ktb (write), kuttib (will have been written)
ktb
C V C C V C  kuttib
ui
Templative Example
Verbal Inflective Morphology
• Verbal inflection
– Main verbs (sleep, like, fear) are relatively regular
 Standard morphemes: -s, ing, ed
 These morphemes are productive: Emails, Emailing, Emailed
– Combination with nouns for syntactical agreement
 I am, we are, they were
• There are exceptions
–
–
–
–
Eat (will eat, eats, eating, ate)
Catch (will catch, catches, catching, caught)
Be (will be, is, being, was)
Have (will have, has, having, had)
• General Observations about English
– There are approximately 250 Irregular verbs that occur
– Other languages have more complex verbal inflection rules
Nominal Inflective Morphology
• Plural forms (s or es)
• Possessives (cat’s or cats’)
• Regular Nouns
– Singular (cat, bush)
– Plural (cats, bushes)
– Possessive (cat’s bushes’)
• Irregular Nouns
– Singular (mouse, ox)
– Plural (mice, oxen)
Derivational Morphology
• Word stem combines with grammatical morpheme
– Usually produces word of different class
– Complex rules that are less productive with many exceptions
– Sometimes meanings of derived terms are hard to predict (E.g. hapless)
• Examples: verbs to nouns
– generalize, realize  generalization, realization
– Murder, spell  murderer, speller
• Examples: verbs and nouns to adjectives
– embrace, pity embraceable, pitiable
– care, wit  careless, witless
• Example: adjectives  adverbs
– happy  happily
• More complicated to model than inflection
– Less productive: science-less, concern-less, go-able, sleep-able
Derivational Morphology Examples
Level 1
• Examples: ize, ization,
ity, ic, al, ity, ion, y,
ate, ous, ive, ation
• Observations
– Can attach to non-words
(e.g. fratern-al, paternal)
– Often changes stem’s
stress and vowel quality
Level 2
• Examples: hood, ness,
ly, s, ing, ish, ful, ly,
less, y (adj.)
• Observations
– Never precede Level 1
suffixes
– Never change stress or
vowel quality
– Almost always attach to
words that exists
Level 1 + Level 1: histor-ic-al, illumina-at-tion, indetermin-at-y;
Level 1 + Level 2: fratern-al-ly, transform-ate-ion-less;
Level 2 + Level 2: weight-less-ness
Big one: antidisestablishmenterrianism (if I spelled it right)
Adjective Morphology
• Standard Forms
–
–
–
–
–
–
–
Big, bigger, biggest
Cool, cooler, coolest, cooly
Red, redder, reddest
Clear, clearer, clearest, clearly, unclear, unclearly
Happy, happier, happiest, happily
Unhappy, unhappier, unhappiest, unhappily
Real, unreal, really
• Exceptions: unbig, redly, realest
Identify and Classify Morphemes
• In each group
– Two words have a different morphological structure
– One word has a different type of suffix
– One word has no suffix at all
• Perform the following tasks
– 1.Isolate the suffix that two of the words share.
– 2.Identify whether it is (i) free or bound; (ii) prefix, infix, suffix;
(iii) inflectional or derivational.
– 3.Give its function/meaning.
– 4.Identify the word that has no suffix
– 5.Identify the word that has a suffix which is different from the
others in each group.
a.
rider
colder
silver
actor
b.
tresses
melodies
Bess’s
guess
c.
running
foundling
handling
fling
d.
tables
lens
witches
calculates
Computational Techniques
• Regular Grammars
• Finite State Automata
• Finite State Transducer
• Parsing – Top down and bottom up
Regular Grammars
• Grammar: Rules that define legal characters strings
• A regular grammar accepts regular expressions
• A regular expression must satisfy the following:
–
–
–
–
The grammar with no strings is regular
The grammar that accepts the empty string is regular
A single character is a regular grammar
If r1 and r2 are regular grammars, then r1 union r2, and r1
concatenated with r2 are regular grammars
– If r is a regular grammar, then r* ( where * means zero or more
occurrences) is regular
Notations to Express Regular
Expressions
•
•
•
•
•
•
Conjunction: abc
Disjunction: [a-zA-Z], gupp(y|ies)
Counters: a*, a+, ?, a{5}, a{5,8}, a{5,}
Any character: a.b
Not: [^0-9]
Anchors: /^The dog\.$/
– Note: the backslash before the period is an escape character
– Other escape characters include \*, \?, \n, \t, \\, \[, \], etc.
• Operators
– \d equivalent to [0-9], \D equivalent to [^0-9]
– \w equivalent to [a-zA-z0-9 ], \W equivalent to [^\w]
– \s equivalent to [ \r\t\n\f], \S equivalent to [^s]
• Substitute one regular expression for another: s/regExp1/regExp2/
Examples of Regular Expressions
• All strings ending with two zeroes
• All strings containing three consecutive zeroes
• All strings that every block of five consecutive
symbols have at least two zeroes
• All strings that the tenth symbol from the right is a
one
• The set of all modular five numbers
Finite State Automata (FSA)
FSA’s recognize grammars that are regular
Definition: A FSA consists of
1.
2.
3.
4.
5.
a set of states (Σ)
a starting state (q0)
a set of final or accepting states (F  Q)
a finite set of symbols (Q)
a transition function ((q,i) ) that maps QxΣ to Q.
It switches from a from-state to a to-state, based
on one of the valid symbols
Synonyms: Finite Automata, Finite State Machine
Recognition
Determine if the machine accepts a particular string
i.e. Is a string in the language?
• Traditionally, Turing used a tape reader to depict a FSA
• Algorithm
–
–
–
–
–
–
Begin in the start state
Examine the current input character
Consult the table
Go to a new state and update the tape pointer.
Until you run out of tape.
The machine accepts the string processing stops in a final state
Graphs and State Transition Tables
• What can we can say about this machine?
–
–
–
–
–
It has 5 states
At least b,a, and ! are in its alphabet
q0 is the start state
q4 is an accept state
It has 5 transitions
• Questions
– Which strings does it accept? baaaa, aaabaaa, ba
– Is this the only FSA that can accept this language?
State Transition Table
Annotated Directed Graph
An FSA only can accept regular strings.
Question: Can you think of a string that is not regular?
Recognizer Implementation
index = beginning of tape
state = start state
DO
IF transition[index, tape[index]] is empty
RETURN false
state = transition[index, tape[index]]
index = index + 1
UNTIL end of tap is reached
IF state is a final state
RETURN true
ELSE RETURN false
Key Points Regarding FSAs
• This algorithm is a state-space search algorithm
– Implementation uses simple table lookups
– Success occurs when at the end of a string, we reach a final state
• The results are always deterministic
– There is one unique choice at each step
– The algorithm recognizes all regular languages
• Perl, Java, etc. use a regular expression algorithm
– Create a state transition table from the expression
– pass the table to the FSA interpreter
• FSA algorithms
– Recognizer: determines if a string is in the language
– Generator: Generates all strings in the language
Non-Deterministic FSA
• Deterministic: Given a state and symbol, only one transition is possible
• Nondeterministic:
– Given a state and a symbol, multiple transitions are possible
– Epsilon transitions: those which DO NOT examine or advance the tape
• The Nondeterministic FSA recognizes a string if:
– At least one transition sequence ends at a final state
– Note: all sequences DO NOT have to end at a final state
– Note: String rejection occurs only when NO sequence ends at a final state
Examples
ε
Concatenation
Closure Closure
Union
Using NFSAs
Input
State
0
1
2
3
4
b
1
0
0
0
0
a
0
2
2,3
0
0
!
0
0
0
4
0
e
0
0
0
0
0
NFSA Recognition of “baaa!”
Breadth-first Recognition
of “baaa!”
Nondeterministic FSA Example
b a
q0
q1 q2
a a
!
q2 q3
\
q4
Other FSA Examples
Dollars and Cents
Exercise: Create a FSA for the following regular expressions
(0|1)*
[a-f1-9]
abc{5}
Non Deterministic FSA Recognizer
Recognizer (index, state)
LOOP
IF end of tape THEN
IF state is final RETURN true
ELSE RETURN false
IF no possible transitions RETURN false
IF there is only one transition
state = transition[index, tape[index]]
IF not an epsilon transition THEN index++
ELSE
FOR each possible transition not considered
result = CALL recognizer(nextState,nextIndex)
IF result = true RETURN true
END LOOP
RETURN false
FSA’s and Morphology
• Apply an FSA to each word in the dictionary to capture the
morphological forms.
• Groups of words with common morphology can share FSAs
Building a Lexicon with a FSA
Derivational Rules
Simple Morphology Example
unq0
q1
adj-root
q2
e
  er 


(un)adj roots( est )
  ly 


Stop states: q2 and q3
-er –est -ly
q3
From
To
Output
0
1
un
0
1
NULL
1
2
adj-root-list
2
3
er;est;ly
An Extended Example
unq0
e
adj-root-1
-er –est -ly
q1
q2
q5
adj-root-1
q3
q4 -er –est
adj-root-2
Adj-root1: clear, happy, real
Adj-root2: big, red
  er 


(un)adj roots1( est )
  ly 


From To
Output
0
1
un
0
3
NULL
1
2
adj-root-list-1
2
5
er;est;ly
3
2
adj-root-list-1
3
4
adj-root-list-2
4
5
er;est
  er 
adj roots2 ( )
 est 
Representing Derivational Rules
nouns
noun
ity, ness
ize
ation
er
verbs
verb
ative
ive
able
adjectives
ful
adj
ly
ly
adverbs
adverb
Finite State Transducer (FST)
• Definition: A FST is a 5-tuple consisting of
– Q: set of states {q0,q1,q2,q3,q4}
– : an alphabet of complex symbols
•
•
•
•
Each complex symbol contains two simple symbols
The first symbol is from an input alphabet i  I
The second symbol is from an output alphabet o  O
 is in I x O, ε is the null character
– q0: a start state
– F: a set of final states in Q {q4}
– (q,i:o): a transition function mapping Q x  to Q
• Concept: Translates and writes to a second tape
a:o
b:m
a:o
a:o
!:?
q0
q1
q2
q3
Example: baaaamoooo
q4
Transition Example
c:c
a:a
t:t
+N:ε
+PL:s
• c:c means read a c on one tape and write a c on the
other
• +N:ε means read a +N symbol on one tape and write
nothing on the other
• +PL:s means read +PL and write an s
On-line demos
• Finite state automata demos
http://www.xrce.xerox.com/competencies/content
-analysis/fsCompiler/fsinput.html
• Finite state morphology
http://www.xrce.xerox.com/competencies/content
-analysis/demos/english
• Some other downloadable FSA tools:
http://www.research.att.com/sw/tools/fsm/
Lexicon for L0
Rule based languages
Top Down Parsing
Driven by the grammar, working down
S
NP
VP
NP
Nom
Pro
Verb
Det
Noun
Noun
I
prefer
a
morning
flight
S → NP VP, NP→Pro, Pro→I, VP→V NP, V→prefer, NP→Det Nom, Det→a,
Nom→Noun Nom, Noun→morning, Noun→flight
[S [NP [Pro I]] [VP [V prefer] [NP [Det a] [Nom [N morning] [N flight]]]]]
Bottom Up Parsing
Driven by the words, working up
The Bottom Up Parse
The Grammar
0) S  E $
1)E  E + T | E - T | T
2)T  T * F | T / F | F
3) F  num | id
1)id - num * id
2)F - num * id
3)T - num * id
4)E - num * id
5)E - F * id
6)E - T * id
7)E - T * F
8)E - T
9)E
10)S  correct sentence
Note: If there is no rule that applies, backtracking is necessary
Top-Down and Bottom-Up
• Top-down
– Advantage: Searches only trees that are legal
– Disadvantage: Tries trees that don’t match the words
• Bottom-up
– Advantage: Only forms trees matching the words
– Disadvantage: Tries trees that make no sense globally
• Efficient combined algorithms
– Link top-down expectations with bottom-up data
– Example: Top-down parsing with bottom-up filtering
Stochastic Language Models
A probabilistic view of language modeling
• Problems
– A Language model cannot cover all grammatical rules
– Spoken language is often ungrammatical
• Solution
– Constrain search space emphasizing likely word sequences
– Enhance the grammar to recognize intended sentences even
when the sequence doesn't satisfy the rules
Probabilistic Context-Free Grammars (PCFG)
Goal: Assist in discriminating among competing choices
• Definition: G = (VN, VT, S, P, p);
VN = non-terminal set of symbols
VT = terminal set of symbols
S = start symbol
p = set of rule probabilities
R = set of rules
P(S ->W |G): S is the start symbol, W = expression in grammar G
• Training the Grammar: Count rule occurrences in a training corpus
P(R | G) = Count(R) / ∑C(R)
PFSA (Probabilistic Finite State
Automata)
• A PFSA is a type of Probabilistic Context Free Grammar
–
–
–
–
The states are the non-terminals in a production rule
The output symbols are the observed outputs
The arcs represent a context-free rule
The path through the automata represent a parse tree
• A PCFG considers state transitions and the transition path
S1
a
a
S1
S2
b
S2
S3
b
S3
ε
Probabilistic Finite State Machines
• Probabilistic models
determine weights of the
transitions
• The sum of weights leaving a
state total to unity
• Operations
– Consider the weights to
compute the probability of a
given string or most likely path.
– The machine can ‘learn’ the
weights over time
.001
.01
Companion
Canine
.0035
Tooth
Another Example
Pronunciation decoding
[n iy]
Merging the machines together
[n iy]
Another Example