ICS 482
Natural Language Processing
Words & Transducers-Morphology - 1
Muhammed Al-Mulhem
March 1, 2009
7/6/2017
Steps of NLP
• Morphological Analysis: individual words are analyzed into their components.
• Syntactic Analysis: linear sequences of words are transformed into structures that show how the words relate to each other.
• Semantic Analysis: a transformation is made from the input text to an internal representation that reflects the meaning.
• Pragmatic Analysis: what was said is reinterpreted as what was actually meant.
• Discourse Analysis: references between sentences are resolved.
Morphology
• Morphology: the study of the way words are
built up from smaller meaning-bearing units,
morphemes.
• Morpheme: the minimal meaning-bearing
unit in a language.
• Example
– fox: One morpheme fox.
– cats: Two morphemes, cat and –s.
Morpheme Definitions
• There are two broad classes of morphemes:
• Stem
– The main morpheme of the word, supplying the main
meaning.
• Affix
– Adds additional meaning of various kinds. Affixes are further divided
into:
• Prefixes,
• Suffixes,
• Infixes, and
• Circumfixes.
Morpheme Definitions
• Prefixes:
– Prefixes precede the stem.
• Suffixes:
– Suffixes follow the stem
• Circumfixes:
– Circumfixes do both
• Infixes:
– Infixes are inserted inside the stem.
Examples
• Eats: composed of the stem eat and the suffix –s.
• Unbuckle: composed of the stem buckle and the prefix
un-.
• English doesn’t really have circumfixes, but many other
languages do.
• In German, for example, the past participle of some
verbs is formed by adding ge- to the beginning of the
stem and –t to the end.
• For example, the past participle of the verb sagen (to
say) is gesagt (said).
Morpheme Definitions
• A word can have more than one affix.
• For example: the word rewrites has the prefix re-, the
stem write, and the suffix –s.
• The word unbelievably has the stem believe plus
three affixes (un-, -able, and –ly).
• English words rarely stack more than four or five affixes.
• Other languages, like Turkish, can have words with ten
affixes.
Morpheme Definitions
• There are many ways to combine
morphemes to create words.
• Four of these are:
– Inflection
– Derivation
– Compounding
– Cliticization
Inflection
• Inflection is the combination of a word stem with a
grammatical morpheme, usually resulting in a
word of the same class as the original stem
and usually filling some syntactic function like
agreement.
• For example,
– English has the inflectional morpheme –s for marking
the plural on nouns.
– It has the inflectional morpheme –ed for marking the
past tense on verbs.
Derivation
• It is the combination of a word stem with
a grammatical morpheme, usually
resulting in a word of a different class,
often with a meaning hard to predict
exactly.
• For example,
– The verb computerize can take the
derivational suffix –ation to produce
the noun computerization.
Compounding
• It is the combination of multiple word
stems.
• For example,
– The noun doghouse is the
concatenation of the morpheme dog
with the morpheme house.
Cliticization
• It is the combination of a word stem with
a clitic.
• A clitic is a morpheme that acts
syntactically like a word but is reduced
in form and attached to another word.
• For example,
– The English morpheme ‘ve in the
word I’ve is a clitic.
Inflection Morphology
• English nouns have only two kinds of
inflection:
– an affix that marks plural and
– an affix that marks possessive
• Examples: Regular and irregular plurals.
Inflection Morphology
• While the regular plural is spelled -s after
most nouns, it is spelled -es after words
ending in:
– -s (ibis/ibises),
– -z (waltz/waltzes),
– -sh (thrush/thrushes),
– -ch (finch/finches), and sometimes
– -x (box/boxes).
• Nouns ending in -y preceded by a consonant
change the -y to -i (butterfly/butterflies).
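The spelling rules on this slide can be sketched as a small function. This is only an illustration: the name `pluralize` is hypothetical, irregular plurals (goose/geese) are not handled, and the "sometimes" attached to -x is ignored, so every -x noun gets -es.

```python
def pluralize(noun: str) -> str:
    """Spell the regular plural of an English noun using the slide's rules."""
    vowels = "aeiou"
    # -y preceded by a consonant changes to -i before -es (butterfly -> butterflies).
    if len(noun) > 1 and noun.endswith("y") and noun[-2] not in vowels:
        return noun[:-1] + "ies"
    # -es after -s, -z, -sh, -ch and -x (ibis -> ibises, box -> boxes).
    if noun.endswith(("s", "z", "sh", "ch", "x")):
        return noun + "es"
    # Default: plain -s.
    return noun + "s"


for word in ["ibis", "waltz", "thrush", "finch", "box", "butterfly", "cat"]:
    print(word, "->", pluralize(word))
```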
Inflection Morphology
• The possessive suffix is realized by:
– Apostrophe + -s for regular singular nouns
(llama’s) and plural nouns not ending in -s
(children’s) and
– A lone apostrophe after regular plural nouns
(llamas’) and some names ending in -s or -z
(Euripides’ comedies).
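A minimal sketch of the possessive rule follows. The function name and its two flags are assumptions made for illustration; in particular, the caller must say whether a name takes the lone apostrophe, because the slide only says "some names" do. The code uses the plain ASCII apostrophe.

```python
def possessive(noun: str, plural: bool = False, lone_apostrophe_name: bool = False) -> str:
    """Attach the English possessive suffix to a noun."""
    # Regular plurals ending in -s, and some names ending in -s or -z,
    # take a lone apostrophe.
    if (plural and noun.endswith("s")) or lone_apostrophe_name:
        return noun + "'"
    # Singular nouns and plurals not ending in -s take apostrophe + s.
    return noun + "'s"


print(possessive("llama"))                                 # singular noun
print(possessive("children", plural=True))                 # plural not ending in -s
print(possessive("llamas", plural=True))                   # regular plural
print(possessive("Euripides", lone_apostrophe_name=True))  # name ending in -s
```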
Inflection Morphology
• English has three kinds of verbs:
– Main verbs (eat, sleep, impeach),
– Modal verbs (can, will, should), and
– Primary verbs (be, have, do)
• We will mostly be concerned with the main
and primary verbs, because it is these that
have inflectional endings.
Inflection Morphology
• Regular verbs (e.g. walk or inspect)
have four morphological forms: the stem (walk), the -s form (walks), the -ing participle (walking), and the past form or -ed participle (walked).
Inflection Morphology
• Irregular verbs in English can have as many
as eight forms (e.g. be) or as few as three (e.g. cut or hit).
Derivational Morphology
• A common kind of derivation in English is the
formation of new nouns from verbs or adjectives
(nominalization).
• Adjectives can also be derived from nouns and verbs.
Cliticization Morphology
• English clitics include these auxiliary verbal
forms:
• Clitics can be ambiguous.
• Example: She’s can mean she is or she has.
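The ambiguity of clitics can be made concrete with a small lookup. This is a sketch under assumptions: the table contains only the two clitics mentioned in these slides, the name `expand_clitic` is hypothetical, and possessive 's is not distinguished from the auxiliary.

```python
# Clitic -> possible full forms (only the two clitics named in the slides).
CLITICS = {
    "'ve": ["have"],
    "'s": ["is", "has"],  # ambiguous: "she is" or "she has"
}


def expand_clitic(token: str) -> list[str]:
    """Return every reading of a host+clitic token, or the token itself."""
    for clitic, expansions in CLITICS.items():
        if token.endswith(clitic):
            host = token[: -len(clitic)]
            return [f"{host} {full}" for full in expansions]
    return [token]


print(expand_clitic("She's"))  # two readings
print(expand_clitic("I've"))   # one reading
```

Choosing between the readings needs context from the rest of the sentence, which a word-level lookup cannot supply.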
Compounding Morphology
• The kind of morphology we have
discussed so far, in which a word is composed of a
string of morphemes concatenated together, is often
called concatenative morphology.
• A number of languages have extensive non-concatenative morphology, in which morphemes
are combined in more complex ways.
• One kind of non-concatenative morphology is
called templatic morphology or root-and-pattern
morphology.
• Example: Read Chapter 3.
Agreement
• We say that the subject noun and the main
verb in English have to agree in number,
meaning that the two must either be both
singular or both plural.
• There are other kinds of agreement
processes. For example nouns, adjectives,
and sometimes verbs in many languages are
marked for gender.
• A gender is a kind of equivalence class that is
used by the language to categorize the
nouns; each noun falls into one class.
Agreement
• Many languages (for example Romance
languages like French, Spanish, or Italian)
have two genders, which are referred to as
masculine and feminine.
• Other languages (like most Germanic and
Slavic languages) have three (masculine,
feminine, neuter).
• Gender is sometimes marked explicitly on a
noun; for example Spanish masculine words
often end in -o and feminine words in -a.
Agreement
• But in many cases the gender is not marked
in the letters or phones of the noun itself.
Instead, it is a property of the word that must
be stored in a lexicon as in the table below:
Parsing
• Taking a surface input and identifying its
components and underlying structure
• Morphological parsing: parsing a word
into stem and affixes and identifying the
parts and their relationships
– Stem and features:
• goose → goose +N +SG or goose +V
• geese → goose +N +PL
• gooses → goose +V +3SG
– Bracketing: indecipherable →
[in [[de [cipher]] able]]
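The stem-and-features output can be illustrated with the simplest possible parser: a table that lists every word form. This is a sketch, not the method the later slides build; the name `parse` and the table layout are assumptions, and the entries are exactly the three examples above.

```python
# Surface form -> list of analyses (stem plus feature tags).
ANALYSES = {
    "goose": ["goose +N +SG", "goose +V"],
    "geese": ["goose +N +PL"],
    "gooses": ["goose +V +3SG"],
}


def parse(word: str) -> list[str]:
    """Return all stored analyses of a word; an empty list means unknown."""
    return ANALYSES.get(word.lower(), [])


for form in ["goose", "geese", "gooses", "cats"]:
    print(form, "->", parse(form))
```

The unknown word cats shows the weakness of listing: every form must be stored in advance.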
Why parse words?
• For spell-checking
– Is muncheble a legal word?
• To identify a word’s part-of-speech (POS)
– For sentence parsing, for machine translation, …
• To identify a word’s stem
– For information retrieval
• Why not just list all word forms in a
lexicon?
What do we need to build a morphological
parser?
• Lexicon: stems and affixes (w/
corresponding Part of Speech (POS))
• Morphotactics of the language: model of
how morphemes can be affixed to a stem
• Orthographic rules: spelling modifications
that occur when affixation occurs
– in- → il- in the context of l (in- + legal → illegal)
Syntax and Morphology
• Phrase-level agreement
– Subject-Verb
• Ali studies hard (STUDY+3SG)
• Sub-word phrasal structures
– ‫ولحاجاتنا‬
– ‫نا‬+‫حاجات‬+‫لـ‬+‫و‬
– and+for+need+PL+Poss:1PL
– And for our needs
Morphotactic Models
• English nominal inflection
– q0 → q1 on reg-n
– q1 → q2 on plural (-s)
– q0 → q2 on irreg-pl-n
– q0 → q2 on irreg-sg-n
• reg-n: regular noun
• irreg-pl-n: irregular plural noun
• irreg-sg-n: irregular singular noun
• Inputs: cats, goose, geese
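A morphotactic FSA of this kind can be sketched as a transition table over morpheme classes. The assumptions here are: q1 and q2 are both accepting states, the input is already split into morphemes, and the tiny lexicon (`CLASSES`) is illustrative only.

```python
# (state, morpheme class) -> next state
TRANSITIONS = {
    ("q0", "reg-n"): "q1",
    ("q1", "plural-s"): "q2",
    ("q0", "irreg-pl-n"): "q2",
    ("q0", "irreg-sg-n"): "q2",
}
FINAL_STATES = {"q1", "q2"}  # assumed accepting states

# Illustrative lexicon: morpheme -> class
CLASSES = {
    "cat": "reg-n",
    "goose": "irreg-sg-n",
    "geese": "irreg-pl-n",
    "-s": "plural-s",
}


def accepts(morphemes: list[str]) -> bool:
    """Run the FSA over a sequence of morphemes."""
    state = "q0"
    for morpheme in morphemes:
        state = TRANSITIONS.get((state, CLASSES.get(morpheme)))
        if state is None:  # no arc for this class from this state
            return False
    return state in FINAL_STATES


print(accepts(["cat", "-s"]))    # cats
print(accepts(["goose"]))        # goose
print(accepts(["geese"]))        # geese
print(accepts(["goose", "-s"]))  # rejected: no plural arc after an irregular noun
```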
• Derivational morphology: adjective
fragment
[Figure: FSA over states q0 to q5 with arcs labeled un-, ε, adj-root1 (twice), adj-root2, "-er, -ly, -est", and "-er, -est"]
• Adj-root1: clear, happy, real
• Adj-root2: big, red
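The adjective fragment can be sketched the same way. The exact arcs of the figure are not recoverable from the transcript, so the topology below is an assumption: un- is optional before adj-root1 (the ε arc is folded into a direct q0 arc), adj-root1 may take -er, -ly or -est, and adj-root2 may take only -er or -est. State names are likewise assumed.

```python
ADJ_ROOT1 = {"clear", "happy", "real"}
ADJ_ROOT2 = {"big", "red"}

# (state, label) -> next state; assumed topology, see the note above.
TRANSITIONS = {
    ("q0", "un-"): "q1",
    ("q0", "adj-root1"): "q2",  # epsilon move q0 -> q1 folded in
    ("q1", "adj-root1"): "q2",
    ("q2", "-er"): "q5", ("q2", "-ly"): "q5", ("q2", "-est"): "q5",
    ("q0", "adj-root2"): "q3",
    ("q3", "-er"): "q4", ("q3", "-est"): "q4",
}
FINAL_STATES = {"q2", "q3", "q4", "q5"}


def label(morpheme: str) -> str:
    """Map a stem to its class; affixes label their own arcs."""
    if morpheme in ADJ_ROOT1:
        return "adj-root1"
    if morpheme in ADJ_ROOT2:
        return "adj-root2"
    return morpheme


def accepts(morphemes: list[str]) -> bool:
    state = "q0"
    for morpheme in morphemes:
        state = TRANSITIONS.get((state, label(morpheme)))
        if state is None:
            return False
    return state in FINAL_STATES


print(accepts(["un-", "happy"]))  # unhappy
print(accepts(["clear", "-ly"]))  # clearly
print(accepts(["big", "-ly"]))    # rejected: adj-root2 does not take -ly
```

Splitting the roots into two classes is what blocks forms such as bigly while still allowing clearly. Spelling changes (happy + -er → happier) are outside this model.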
Using FSAs to Represent the Lexicon and Do
Morphological Recognition
• Lexicon: We can expand each non-terminal in our NFSA into each stem in its
class (e.g. adj_root2 = {big, red}) and
expand each such stem to the letters it
includes (e.g. red → r e d, big → b i g)
[Figure: letter-level FSA over states q0 to q7, with one path spelling r e d and one spelling b i g, followed by the arc -er, -est]
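Expanding a stem class into letter arcs can be sketched as building a trie-shaped automaton. The function names and the integer state numbering are assumptions, and no spelling rules are applied, so the automaton accepts plain concatenations such as bigest rather than biggest.

```python
def build_letter_fsa(stems, suffixes):
    """Build a letter-level FSA: each stem, optionally followed by one suffix."""
    transitions = {}  # (state, letter) -> state
    finals = set()
    next_state = 1    # state 0 is the start state

    def add_path(start, letters):
        nonlocal next_state
        state = start
        for ch in letters:
            if (state, ch) not in transitions:
                transitions[(state, ch)] = next_state
                next_state += 1
            state = transitions[(state, ch)]
        return state

    for stem in stems:
        stem_end = add_path(0, stem)
        finals.add(stem_end)  # the bare stem is a word
        for suffix in suffixes:
            finals.add(add_path(stem_end, suffix))
    return transitions, finals


def accepts(word, transitions, finals):
    state = 0
    for ch in word:
        state = transitions.get((state, ch))
        if state is None:
            return False
    return state in finals


fsa = build_letter_fsa(["big", "red"], ["er", "est"])
for w in ["red", "reder", "bigest", "bigly"]:
    print(w, accepts(w, *fsa))
```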
Limitations
• To cover all of English will require very
large FSAs with consequent search
problems
– Adding new items to the lexicon means recomputing the FSA
– Non-determinism
• FSAs can only tell us whether a word is in
the language or not – what if we want to
know more?
– What is the stem?
– What are the affixes?
– We used this information to build our FSA:
can we get it back?
Parsing with Finite State Transducers
• cats → cat +N +PL
• Kimmo Koskenniemi’s two-level morphology
– Words represented as correspondences between
lexical level (the morphemes) and surface level (the
orthographic word)
– Morphological parsing: building mappings between
the lexical and surface levels
Lexical: c a t +N +PL
Surface: c a t s
Finite State Transducers
• FSTs map between one set of symbols
and another using an FSA whose
alphabet Σ is composed of pairs of
symbols from input and output alphabets
• In general, FSTs can be used as a
– Translator (Hello:مرحبا)
– Parser/generator (Hello:How may I help you?)
– Mapping between the lexical and surface levels
of Kimmo’s 2-level morphology
• FST is a 5-tuple consisting of
– Q: set of states {q0,q1,q2,q3,q4}
– Σ: an alphabet of complex symbols, each an
i/o pair such that i ∈ I (an input alphabet) and o
∈ O (an output alphabet), so Σ ⊆ I × O
– q0: a start state
– F: a set of final states in Q {q4}
– δ(q,i:o): a transition function mapping Q × Σ to
Q
– Emphatic Sheep → Quizzical Cow
q0 –b:m→ q1 –a:o→ q2 –a:o→ q3 –!:?→ q4, with an a:o loop on q3
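The sheep-to-cow transducer is small enough to write out in full. This sketch stores δ as a dictionary from (state, input symbol) to (output symbol, next state); that representation and the name `transduce` are assumptions, and the arcs assume the usual sheep language (b, at least two a's, then !).

```python
# delta: (state, input symbol) -> (output symbol, next state)
DELTA = {
    ("q0", "b"): ("m", "q1"),
    ("q1", "a"): ("o", "q2"),
    ("q2", "a"): ("o", "q3"),
    ("q3", "a"): ("o", "q3"),  # loop: any number of further a's
    ("q3", "!"): ("?", "q4"),
}
START = "q0"
FINAL_STATES = {"q4"}


def transduce(text: str):
    """Map emphatic sheep to quizzical cow; return None if the input is rejected."""
    state, output = START, []
    for symbol in text:
        step = DELTA.get((state, symbol))
        if step is None:
            return None
        out_symbol, state = step
        output.append(out_symbol)
    return "".join(output) if state in FINAL_STATES else None


print(transduce("baa!"))
print(transduce("baaaa!"))
print(transduce("ba!"))  # rejected: too few a's
```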
FST for a 2-level Lexicon
• Example
– Reg-n: c a t
– Irreg-pl-n: g o:e o:e s e
– Irreg-sg-n: g o o s e
[Figure: letter-level transducers over states q0 to q5, one path per noun class]
FST for English Nominal Inflection
– q0 → q1 on reg-n; q1 → q4 on +N:ε; q4 → q7 on +PL:^s# or +SG:-#
– q0 → q2 on irreg-n-sg; q2 → q5 on +N:ε; q5 → q7 on +SG:-#
– q0 → q3 on irreg-n-pl; q3 → q6 on +N:ε; q6 → q7 on +PL:-s#
• Combining (by cascade or composition) this FST
with the FSTs for each noun type replaces e.g. reg-n
with every regular noun representation in the
lexicon
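The mapping that the combined transducer computes, from the lexical level to the intermediate level, can be sketched as a plain function. This is not a transducer implementation, only the input/output behavior under stated assumptions: ^ marks a morpheme boundary and # a word boundary, +N maps to nothing, singular adds only #, and the small lexicon is illustrative.

```python
REG_N = {"cat", "fox"}
IRREG_SG_N = {"goose"}
IRREG_PL_N = {"goose": "geese"}  # lexical stem -> plural spelling (o:e o:e)


def to_intermediate(stem: str, number: str):
    """Map e.g. ('fox', '+PL') to 'fox^s#'; return None if there is no path."""
    if stem in REG_N:
        # reg-n path: +N:epsilon, then +PL:^s# or +SG:#
        return stem + ("^s#" if number == "+PL" else "#")
    if number == "+SG" and stem in IRREG_SG_N:
        return stem + "#"
    if number == "+PL" and stem in IRREG_PL_N:
        return IRREG_PL_N[stem] + "#"
    return None


print(to_intermediate("fox", "+PL"))
print(to_intermediate("goose", "+PL"))
print(to_intermediate("goose", "+SG"))
```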
Orthographic Rules and FSTs
• Define additional FSTs to implement rules
such as consonant doubling (beg →
begging), ‘e’ deletion (make → making), ‘e’
insertion (watch → watches), etc.
Lexical: f o x +N +PL
Intermediate: f o x ^ s #
Surface: f o x e s
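The e-insertion rule, which takes the intermediate level to the surface level, can be approximated with a regular-expression rewrite. This is a stand-in for the rule transducer rather than a transducer itself; the set of triggering endings (s, z, x, ch, sh) is taken from the plural spelling rule earlier in these slides, and the function name is hypothetical.

```python
import re


def to_surface(intermediate: str) -> str:
    """Apply e-insertion, then erase the boundary symbols ^ and #."""
    # Insert e at the morpheme boundary when the stem ends in s, z, x, ch or sh
    # and the suffix is a word-final s.
    with_e = re.sub(r"(s|z|x|ch|sh)\^(?=s#)", r"\1e", intermediate)
    # Remaining boundary symbols have no surface realization.
    return with_e.replace("^", "").replace("#", "")


for form in ["fox^s#", "cat^s#", "watch^s#", "goose#"]:
    print(form, "->", to_surface(form))
```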
• Note: These FSTs can be used for
generation as well as recognition by
simply exchanging the input and output
alphabets (e.g. ^s#:+PL)
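Exchanging the input and output alphabets amounts to swapping the two symbols on every arc. The sketch below shows this on the sheep-to-cow transducer from the earlier slide, assuming a dictionary representation of the arcs. It works only when the swapped machine stays deterministic and has no ε arcs, which holds for this example but not in general.

```python
# (state, input symbol) -> (output symbol, next state)
SHEEP_TO_COW = {
    ("q0", "b"): ("m", "q1"),
    ("q1", "a"): ("o", "q2"),
    ("q2", "a"): ("o", "q3"),
    ("q3", "a"): ("o", "q3"),
    ("q3", "!"): ("?", "q4"),
}


def invert(delta):
    """Swap input and output on every arc (i:o becomes o:i)."""
    return {(state, out): (inp, nxt) for (state, inp), (out, nxt) in delta.items()}


def run(delta, text, start="q0", finals=frozenset({"q4"})):
    state, output = start, []
    for symbol in text:
        step = delta.get((state, symbol))
        if step is None:
            return None
        out_symbol, state = step
        output.append(out_symbol)
    return "".join(output) if state in finals else None


COW_TO_SHEEP = invert(SHEEP_TO_COW)
print(run(SHEEP_TO_COW, "baaa!"))
print(run(COW_TO_SHEEP, "mooo?"))
```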
Administration
• Next Sunday: Quiz 1, 20 minutes, in
class
• Assignment 2: What were your findings
about Python?
• New Assignment (3)
Assignment 3: Part 1
A genre for your Corpora
• Choose a Domain for your Corpora
– Technology and Computers
– Management
– Weather
– Sport
– Economics
– Politics
– Education
– Health care
– History
– Traditional Poems
– New Poems
– Other suggested fields
Assignment 3: Part 1
A genre for your Corpora
• Put your choice on the discussion list named 'My
Corpora'.
• Read the other selections before posting yours
• Avoid selecting a topic that has already been selected
• You may suggest an unlisted field
– by arrangement with the instructor
• Collect text files and keep them in one directory as
your corpora for future use
• Suggested total size (sum of sizes of all text files)
– larger than 10 MB of Arabic text
Assignment 3: Part 2
List text files in a chosen directory
• Write a program that allows the user to
browse and select a directory, then
lists the names of the text files in
that directory. This program will be
needed for future assignments and the course
project. You can use any language you have
mastered. However, Python might be a good
choice.
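One possible starting point for the listing step, in Python, is sketched below. It assumes that "text files" means files with a .txt extension and that the directory path is already known; a graphical picker such as `tkinter.filedialog.askdirectory` could supply the path for the browse-and-select part. The function name is hypothetical.

```python
from pathlib import Path


def list_text_files(directory, suffix=".txt"):
    """Return the sorted names of the files in `directory` that end in `suffix`."""
    root = Path(directory)
    if not root.is_dir():
        raise NotADirectoryError(str(directory))
    return sorted(
        p.name
        for p in root.iterdir()
        if p.is_file() and p.suffix.lower() == suffix
    )


if __name__ == "__main__":
    # List the text files in the current working directory.
    for name in list_text_files("."):
        print(name)
```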
Assignment 3: Part 3
The most used n words & their frequencies in
your corpora
• After building your corpora, you need to find
the 100 most used words in your corpora with
their frequencies. You might do that by writing
a program that lets the user choose the
directory of the corpora where the text files
are located and finds the n most used words,
where n could be 100.
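A minimal sketch of the counting step in Python follows. The assumptions are that the corpus files are UTF-8 .txt files in one directory and that a word is any run of characters matched by `\w+`. That tokenization is naive; Arabic diacritics, for instance, would split a word, so a real solution would need better tokenization. The function name is hypothetical.

```python
import re
from collections import Counter
from pathlib import Path


def most_common_words(directory, n=100, encoding="utf-8"):
    """Return the n most frequent words in all .txt files of a directory."""
    counts = Counter()
    for path in sorted(Path(directory).glob("*.txt")):
        text = path.read_text(encoding=encoding, errors="ignore")
        # Naive tokenization: lowercase, then take runs of word characters.
        counts.update(re.findall(r"\w+", text.lower()))
    return counts.most_common(n)


if __name__ == "__main__":
    for word, frequency in most_common_words(".", n=100):
        print(word, frequency)
```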
Deliverables
• A Report that contains the following headings:
– Introduction
– Description & Specification
– Corpora (genre) and the reason
– Design: High Level architecture
– Implementation Issues
– Accomplished Parts
– Unaccomplished Parts
– Problems faced
– Suggested Enhancements
– Test Cases and Screen Shots
– How to compile and run the source code
– Conclusion
– Recommendation
• The source code of the program and the executable version
of the same code. Please do not include the source code in
the report.
Thank you
Peace be upon you, and the mercy of God