ICS 482 Natural Language Processing
Words & Transducers-Morphology - 1
Muhammed Al-Mulhem
March 1, 2009
7/6/2017

Steps of NLP
• Morphological Analysis: Individual words are analyzed into their components.
• Syntactic Analysis: Linear sequences of words are transformed into structures that show how the words relate to each other.
• Semantic Analysis: A transformation is made from the input text to an internal representation that reflects the meaning.
• Discourse Analysis: Resolving references between sentences.
• Pragmatic Analysis: To reinterpret what was said to what was actually meant.

Morphology
• Morphology: the study of the way words are built up from smaller meaning-bearing units, morphemes.
• Morpheme: the minimal meaning-bearing unit in a language.
• Example
  – fox: one morpheme, fox.
  – cats: two morphemes, cat and -s.

Morpheme Definitions
• There are two broad classes of morphemes:
• Stem
  – The main morpheme of the word, supplying the main meaning.
• Affix
  – Adds additional meaning of various kinds. Affixes are further divided into:
    • Prefixes,
    • Suffixes,
    • Infixes, and
    • Circumfixes.

Morpheme Definitions
• Prefixes:
  – Prefixes precede the stem.
• Suffixes:
  – Suffixes follow the stem.
• Circumfixes:
  – Circumfixes do both.
• Infixes:
  – Infixes are inserted inside the stem.

Examples
• Eats: composed of the stem eat and the suffix -s.
• Unbuckle: composed of the stem buckle and the prefix un-.
• English doesn't really have circumfixes, but many other languages do.
• In German, for example, the past participle of some verbs is formed by adding ge- to the beginning of the stem and -t to the end.
• For example, the past participle of the verb sagen (to say) is gesagt (said).

Morpheme Definitions
• A word can have more than one affix.
• For example, the word rewrites has the prefix re-, the stem write, and the suffix -s.
• The word unbelievably has the stem believe plus three affixes (un-, -able, and -ly).
• English rarely stacks more than four or five affixes.
• Other languages, such as Turkish, can have words with nine or ten affixes.

Morpheme Definitions
• There are many ways to combine morphemes to create words.
• Four of these are:
  – Inflection
  – Derivation
  – Compounding
  – Cliticization

Inflection
• The combination of a word stem with a grammatical morpheme, usually resulting in a word of the same class as the original stem and usually filling some syntactic function like agreement.
• For example,
  – English has the inflectional morpheme -s for marking the plural on nouns.
  – English has the inflectional morpheme -ed for marking the past tense on verbs.

Derivation
• The combination of a word stem with a grammatical morpheme, usually resulting in a word of a different class, often with a meaning hard to predict exactly.
• For example,
  – The verb computerize can take the derivational suffix -ation to produce the noun computerization.

Compounding
• The combination of multiple word stems.
• For example,
  – The noun doghouse is the concatenation of the morpheme dog with the morpheme house.

Cliticization
• The combination of a word stem with a clitic.
• A clitic is a morpheme that acts syntactically like a word but is reduced in form and attached to another word.
• For example,
  – The English morpheme 've in the word I've is a clitic.

Inflection Morphology
• English nouns have only two kinds of inflection:
  – an affix that marks plural and
  – an affix that marks possessive.
• Examples: regular plurals (cat/cats, thrush/thrushes) and irregular plurals (mouse/mice, ox/oxen).

Inflection Morphology
• While the regular plural is spelled -s after most nouns, it is spelled -es after words ending in:
  • -s (ibis/ibises),
  • -z (waltz/waltzes),
  • -sh (thrush/thrushes),
  • -ch (finch/finches), and
  • sometimes -x (box/boxes).
• Nouns ending in -y preceded by a consonant change the -y to -i (butterfly/butterflies).
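The spelling rules for the regular plural listed above can be sketched as a small function. This is only an illustration under stated assumptions: the name `pluralize` is hypothetical, irregular nouns (mouse/mice) are out of scope, and the "sometimes -x" case is simplified to "always -x".

```python
def pluralize(noun: str) -> str:
    """Spell the regular plural of an English noun (irregular nouns not handled)."""
    vowels = "aeiou"
    # -y preceded by a consonant: change -y to -i, then add -es
    # (butterfly -> butterflies, but toy -> toys).
    if len(noun) > 1 and noun.endswith("y") and noun[-2] not in vowels:
        return noun[:-1] + "ies"
    # -s, -z, -sh, -ch, and (assumed here: always) -x take -es.
    if noun.endswith(("s", "z", "sh", "ch", "x")):
        return noun + "es"
    # Everything else takes a plain -s.
    return noun + "s"


for word in ["cat", "ibis", "waltz", "thrush", "finch", "box", "butterfly"]:
    print(word, "->", pluralize(word))
```

A rule-only function like this overgenerates for irregular nouns (it would produce gooses for goose), which is one reason a lexicon is needed alongside spelling rules.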
Inflection Morphology
• The possessive suffix is realized by:
  – Apostrophe + -s for regular singular nouns (llama's) and plural nouns not ending in -s (children's), and
  – A lone apostrophe after regular plural nouns (llamas') and some names ending in -s or -z (Euripides' comedies).

Inflection Morphology
• English has three kinds of verbs:
  – Main verbs (eat, sleep, impeach),
  – Modal verbs (can, will, should), and
  – Primary verbs (be, have, do).
• We will mostly be concerned with the main and primary verbs, because it is these that have inflectional endings.

Inflection Morphology
• Regular verbs (e.g. walk or inspect) have four morphological forms: the stem (walk), the -s form (walks), the -ing participle (walking), and the past form or -ed participle (walked).

Inflection Morphology
• Irregular verbs in English can have as many as eight forms (e.g. be) or as few as three (e.g. cut, hit).

Derivational Morphology
• A common kind of derivation in English is the formation of new nouns from verbs or adjectives (nominalization).
• Adjectives can also be derived from nouns and verbs.

Cliticization Morphology
• English clitics include reduced auxiliary verb forms such as 'm (am), 're (are), 've (have), 'll (will), 'd (had or would), and 's (is or has).
• Some clitics are ambiguous.
• Example: She's can mean she is or she has.

Non-Concatenative Morphology
• The kind of morphology we have discussed so far, in which a word is composed of a string of morphemes concatenated together, is often called concatenative morphology.
• A number of languages have extensive non-concatenative morphology, in which morphemes are combined in more complex ways.
• One kind of non-concatenative morphology is called templatic morphology or root-and-pattern morphology.
• Example: Read Chapter 3.

Agreement
• We say that the subject noun and the main verb in English have to agree in number, meaning that the two must either be both singular or both plural.
• There are other kinds of agreement processes. For example, nouns, adjectives, and sometimes verbs in many languages are marked for gender.
• A gender is a kind of equivalence class that is used by the language to categorize the nouns; each noun falls into one class.

Agreement
• Many languages (for example Romance languages like French, Spanish, or Italian) have two genders, which are referred to as masculine and feminine.
• Other languages (like most Germanic and Slavic languages) have three (masculine, feminine, neuter).
• Gender is sometimes marked explicitly on a noun; for example, Spanish masculine words often end in -o and feminine words in -a.

Agreement
• But in many cases the gender is not marked in the letters or phones of the noun itself. Instead, it is a property of the word that must be stored in a lexicon, as in the table below:

Parsing
• Taking a surface input and identifying its components and underlying structure.
• Morphological parsing: parsing a word into stem and affixes and identifying the parts and their relationships.
  – Stem and features:
    • goose → goose +N +SG or goose +V
    • geese → goose +N +PL
    • gooses → goose +V +3SG
  – Bracketing: indecipherable → [in [[de [cipher]] able]]

Why parse words?
• For spell-checking
  – Is muncheble a legal word?
• To identify a word's part-of-speech (POS)
  – For sentence parsing, for machine translation, …
• To identify a word's stem
  – For information retrieval
• Why not just list all word forms in a lexicon?

What do we need to build a morphological parser?
• Lexicon: stems and affixes (with their corresponding part of speech (POS)).
• Morphotactics of the language: a model of how morphemes can be affixed to a stem.
• Orthographic rules: spelling modifications that occur when affixation occurs.
  – in- → il- in the context of l (in- + legal → illegal)

Syntax and Morphology
• Phrase-level agreement
  – Subject-Verb
    • Ali studies hard (STUDY+3SG)
• Sub-word phrasal structures
  – ولحاجاتنا
  – و+لـ+حاجات+نا
  – and+for+need+PL+Poss:1PL
  – And for our needs

Morphotactic Models
• English nominal inflection
  [Figure: FSA with states q0, q1, q2 and arcs labeled reg-n, irreg-sg-n, irreg-pl-n, and plural (-s)]
  • reg-n: regular noun
  • irreg-pl-n: irregular plural noun
  • irreg-sg-n: irregular singular noun
  • Inputs: cats, goose, geese
• Derivational morphology: adjective fragment
  [Figure: FSA with states q0 to q5 and arcs labeled un-, adj-root1, adj-root2, -er, -est, and -ly]
  • Adj-root1: clear, happy, real
  • Adj-root2: big, red

Using FSAs to Represent the Lexicon and Do Morphological Recognition
• Lexicon: We can expand each nonterminal in our NFSA into each stem in its class (e.g. adj_root2 = {big, red}) and expand each such stem to the letters it includes (e.g. red → r e d, big → b i g).
  [Figure: letter-level FSA with states q0 to q7 spelling out r e d and b i g, followed by an arc labeled -er, -est]

Limitations
• To cover all of English will require very large FSAs, with consequent search problems.
  – Adding new items to the lexicon means recomputing the FSA.
  – Non-determinism.
• FSAs can only tell us whether a word is in the language or not. What if we want to know more?
  – What is the stem?
  – What are the affixes?
  – We used this information to build our FSA: can we get it back?
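The lexicon-plus-morphotactics recognizer described above can be sketched in a few lines. This is a minimal illustration, not the slides' own implementation: the lexicon entries, the `recognize` function, and the exact arc layout (reg-n from q0 to q1, plural -s from q1 to q2, irregular nouns from q0 straight to q2, with q1 and q2 final) are assumptions, since the original diagram did not survive extraction.

```python
# Tiny illustrative lexicon: stem -> morpheme class (assumed entries).
LEXICON = {
    "cat": "reg-n",
    "fox": "reg-n",
    "goose": "irreg-sg-n",
    "geese": "irreg-pl-n",
}

# Morphotactics as an FSA over morpheme classes (assumed topology).
TRANSITIONS = {
    ("q0", "reg-n"): "q1",
    ("q0", "irreg-sg-n"): "q2",
    ("q0", "irreg-pl-n"): "q2",
    ("q1", "-s"): "q2",
}
FINAL_STATES = {"q1", "q2"}


def recognize(word: str) -> bool:
    """Return True if some split of `word` into stem (+ -s) is accepted by the FSA."""
    # Candidate segmentations: the whole word as a stem, or stem + plural -s.
    candidates = [(word,)]
    if word.endswith("s"):
        candidates.append((word[:-1], "-s"))
    for morphemes in candidates:
        state = "q0"
        for morpheme in morphemes:
            # Stems are looked up in the lexicon; the suffix is its own arc label.
            label = morpheme if morpheme == "-s" else LEXICON.get(morpheme)
            state = TRANSITIONS.get((state, label))
            if state is None:  # no arc: this segmentation is rejected
                break
        if state in FINAL_STATES:
            return True
    return False


for word in ["cats", "goose", "geese", "gooses", "foxs"]:
    print(word, recognize(word))
```

The function returns only a yes/no answer, which is exactly the limitation noted above. It also accepts foxs, because no orthographic rules are applied at this stage.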
Parsing with Finite State Transducers
• cats → cat +N +PL
• Kimmo Koskenniemi's two-level morphology
  – Words are represented as correspondences between the lexical level (the morphemes) and the surface level (the orthographic word).
  – Morphological parsing: building mappings between the lexical and surface levels.
    Lexical: c a t +N +PL
    Surface: c a t s

Finite State Transducers
• FSTs map between one set of symbols and another using an FSA whose alphabet is composed of pairs of symbols from input and output alphabets.
• In general, FSTs can be used as a
  – Translator (Hello:مرحبا)
  – Parser/generator (Hello:How may I help you?)
  – Mapping between the lexical and surface levels of Kimmo's 2-level morphology
• An FST is a 5-tuple consisting of
  – Q: a set of states {q0, q1, q2, q3, q4}
  – Σ: an alphabet of complex symbols, each an i/o pair i:o such that i ∈ I (an input alphabet) and o ∈ O (an output alphabet); Σ is in I × O
  – q0: a start state
  – F: a set of final states in Q {q4}
  – δ(q, i:o): a transition function mapping Q × Σ to Q
  – Example: Emphatic Sheep → Quizzical Cow
    [Figure: FST over states q0 to q4 with arcs labeled b:m, a:o, a:o, a:o, and !:?]

FST for a 2-level Lexicon
• Example
  [Figure: letter-level lexicon FSTs for Reg-n (c a t), Irreg-pl-n (g o:e o:e s e), and Irreg-sg-n (g o o s e)]

FST for English Nominal Inflection
[Figure: FST with states q0 to q7 and arcs labeled reg-n, irreg-n-sg, irreg-n-pl, +N:, +SG:-#, +PL:^s#, and +PL:-s#]
Combining this FST (by cascade or composition) with the FSTs for each noun type replaces, e.g., reg-n with every regular noun representation in the lexicon.

Orthographic Rules and FSTs
• Define additional FSTs to implement rules such as consonant doubling (beg → begging), 'e' deletion (make → making), 'e' insertion (watch → watches), etc.
    Lexical:      f o x +N +PL
    Intermediate: f o x ^ s #
    Surface:      f o x e s
• Note: These FSTs can be used for generation as well as recognition by simply exchanging the input and output alphabets (e.g. ^s#:+PL).

Administration
• Next Sunday: Quiz 1, 20 minutes, in class
• Assignment 2: What were your findings about Python?
• New Assignment (3)

Assignment 3: Part 1
A genre for your Corpora
• Choose a domain for your corpora
  – Technology and Computers
  – Management
  – Weather
  – Sport
  – Economics
  – Politics
  – Education
  – Health care
  – History
  – Traditional Poems
  – New Poems
  – Other suggested fields

Assignment 3: Part 1
A genre for your Corpora
• Put your choice on the discussion list named 'My Corpora'.
• Read the other selections before posting.
• Avoid selecting a topic that has already been selected.
• You may suggest an unlisted field
  – by arrangement with the instructor.
• Collect text files and keep them in one directory as your corpora for future use.
• Suggested total size (sum of sizes of all text files)
  – larger than 10 MB of Arabic text

Assignment 3: Part 2
List text files in a chosen directory
• Write a program that allows the user to browse and select a directory; the program then lists the names of the text files in that directory. This program will be needed for future assignments and the course project. You can use any language you have mastered. However, Python might be a good choice.

Assignment 3: Part 3
The most used n words & their frequencies in your corpora
• After building your corpora, you need to find the 100 most used words in your corpora with their frequencies. You might do that by writing a program that lets the user choose the directory where the text files of the corpora are located and finds the n most used words, where n could be 100.
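One possible shape for Parts 2 and 3 is sketched below. It is a starting point under stated assumptions rather than a reference solution: the function names are hypothetical, the directory is passed as a path instead of being chosen in a browse dialog, text files are assumed to be UTF-8 files ending in .txt, and tokens are runs of word characters.

```python
from collections import Counter
from pathlib import Path
import re


def list_text_files(directory: str) -> list[Path]:
    """Return the .txt files directly inside `directory`, sorted by name."""
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() == ".txt"
    )


def tokenize(text: str) -> list[str]:
    """Split text into lowercase word tokens."""
    # The pattern \w+ matches Unicode letters (including Arabic letters),
    # digits, and underscores; combining diacritics are not matched.
    return re.findall(r"\w+", text.lower())


def top_words(directory: str, n: int = 100) -> list[tuple[str, int]]:
    """Return the n most frequent tokens across all text files in `directory`."""
    counts = Counter()
    for path in list_text_files(directory):
        counts.update(tokenize(path.read_text(encoding="utf-8")))
    return counts.most_common(n)


if __name__ == "__main__":
    import sys

    # Usage: python top_words.py <corpus-directory> [n]
    if len(sys.argv) > 1:
        n = int(sys.argv[2]) if len(sys.argv) > 2 else 100
        for word, freq in top_words(sys.argv[1], n):
            print(f"{word}\t{freq}")
```

Because the tokenizer splits on anything that is not a word character, diacritized Arabic text would likely need a more careful tokenizer or a normalization step first.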
Deliverables
• A report that contains the following headings:
  – Introduction
  – Description & Specification
  – Corpora (genre) and the reason
  – Design: High Level architecture
  – Implementation Issues
  – Accomplished Parts
  – Unaccomplished Parts
  – Problems faced
  – Suggested Enhancements
  – Test Cases and Screen Shots
  – How to compile and run the source code
  – Conclusion
  – Recommendation
• The source code of the program and the executable version of the same code. Please do not include the source code in the report.

Thank you
السلام عليكم ورحمة الله (Peace be upon you, and the mercy of God)