Part-of-Speech tagging of Northern Sotho: Disambiguating polysemous function words
Gertrud Faaß [email protected]
Ulrich Heid [email protected]
Elsabé Taljard [email protected]
DJ Prinsloo [email protected]

This Talk
• Prologue
• Challenges for tagging Sotho texts
• Objectives
• Descriptive state of the art for tagging of Sotho texts
  – Tools
  – Tagsets
• The ambiguity problem
• Methodology
• Results
• Conclusions & future work

Nine Official Bantu Languages of SA
• Sotho Group
  – Northern Sotho / Sepedi
  – Tswana
  – Southern Sotho
• Nguni Group
  – Zulu
  – Swati
  – Xhosa
  – Ndebele
• Venda and Tsonga

Noun class system
Cl.No   CP         Example
1/2     mo-/ba-    mosadi 'woman' / basadi 'women'
1a/2b   Ø-/bo-     malome 'uncle' / bomalome 'uncle & co'
3       mo-        monwana 'finger'
4       me-        menwana 'fingers'
5       le-        lebone 'light'
6       ma-        mabone 'lights'
7       se-        selepe 'axe'
8       di-        dilepe 'axes'
9       N-/Ø-      mpša 'dog' / hlogo 'head'
10      diN-/di-   dimpša 'dogs' / dihlogo 'heads'
14      bo-        bodulo 'residence'; (6) ma- madulo 'residences'
15      go-        go ruta 'to learn'
16      fa-        fase 'below'
17      go-        godimo 'above'
18      mo-        morago 'behind'
N-      N-/Ø-      ntle 'outside', pele 'in front'
(24)    ga-        gare 'middle'

Concordial agreement – Northern Sotho
Taljard and Bosch (2005)

Challenges for tagging
• Ambiguity, for example:
  – function words: -a- being 9-ways ambiguous, -go- up to 30 (11, 6, 5, …)-ways
• Unknown words (N+V)
  – noun derivation: toropo 'town' -> toropong 'in/at/to town'
  – verb derivation: next slides

Challenges: unknown words
• Agglutinating languages: extensive use of affixes
  – Example: rekišeditšwe 'was / were sold for' < rek- 'buy' (verb root) + -iš- (causative) + -el- (applied) + -il- (past tense) + -w- (passive) + -e (inflectional ending)

Examples of suffixes and combinations for a single verb
• ROOTetšane, ROOTetšanwa, ROOTetšanwe, ROOTiša, ROOTišitše, ROOTišwa, ROOTišitšwe, ROOTišana, ROOTišane, ROOTišanwa, ROOTišanwe, ROOTišega, ROOTišegile, ROOTišetša, ROOTišeditše, ROOTišetšwa, ROOTišeditšwe, ROOTišetšana,
ROOTišetšane, ROOTišetšanwa, ROOTišetšanwe, ROOTišiša, ROOTišišitše, ROOTišišwa, ROOTišišitšwe, ROOTišišana, ROOTišišane, ROOTišišanwa, ROOTišišanwe, ROOToga, ROOTogile, ROOTogwa, ROOTogilwe, ROOTogana, ROOTogane, ROOToganwa, ROOToganwe, ROOTogela, ROOTogetše, ROOTogelwa, ROOTogetšwe, ROOTola, ROOTotše, ROOTolwa, ROOTotšwe, ROOTolana, ROOTolane, ROOTolanwa, ROOTolanwe, ROOTolega, ROOTolegile, ROOTolela, ROOToletše, ROOTolelwa, ROOToletšwe, ROOTolelana, ROOTolelane, ROOTolelanwa, ROOTolelanwe, ROOTolla, ROOTolotše, ROOTollwa, ROOTolotšwe, ROOTollana, ROOTollane, ROOTollanwa, ROOTollanwe, ROOTollega, ROOTollegile, ROOTollela, ROOTolletše, ROOTollelwa, ROOTolletšwe, ROOTollelana, ROOTollelane, ROOTollelanwa, ROOTollelanwe, ROOTolliša, ROOTollišitše, ROOTollišwa, ROOTollišitšwe, ROOTollišana, ROOTollišane, ROOTollišanwa, ROOTollišanwe, ROOTologa, ROOTologile, ROOTologana, ROOTologane, ROOTologanwa, ROOTologanwe, ROOTološa, ROOTološitše, ROOTološwa, ROOTološitšwe, ROOTološana, ROOTološane, ROOTološanwa, ROOTološanwe, ROOTološetša, ROOTološeditše, ROOTološetšwa, ROOTološeditšwe, ROOTološetšana, ROOTološetšane, ROOTološetšanwa, ROOTološetšanwe, ROOToša, ROOTošitše, ROOTošwa, ROOTošitšwe, ROOTošetša, ROOTošeditše, ROOTošetšwa, ROOTošeditšwe, ROOTošetšana, ROOTošetšane, ROOTošetšanwa, ROOTošetšanwe

Solution for unknown verbs and nouns
• Verb guesser: detection of
  – longest-match suffix combinations
  – occurrences in corpora
• Noun guesser: matching of
  – singular/plural forms
  – nominal suffixes
  – occurrences in corpora

Objectives
• Tagging with a detailed tagset: class numbers
  – Nouns, adjectives, pronouns, concords, demonstratives
• Disambiguation
• Motivation: tagging used as preprocessing for:
  – Chunking, parsing
  – Lexicography (tagging relatively large corpora, e.g. the PSC)
  – Detailed linguistic research (e.g.
grammar development)
  – Information extraction

State of the art for tagging: Sotho languages
• Comparison of tagsets and tools is hardly possible
  – Different applications of the tagged material (linguistic description, lexicography, parsing, etc.)
  – Different numbers of tags
  – Differences in granularity

Descriptive state of the art: tagsets and tools
Authors                          No. of tags   Noun class?   Tool?
Van Rooy and Pretorius (2003)    106           no            no
De Schryver and De Pauw (2007)   56            no            yes
Kotzé (several, e.g. 2008)       partial       no            yes
Taljard et al. (2008)            141/262       yes           no
This paper                       25/141        yes           yes

Descriptive state of the art for tagging: Sotho languages
Tools:
• Full
  – De Schryver and De Pauw (2007): Northern Sotho tagger (statistical)
• Partial
  – Kotzé (several publications, e.g. 2008): verbal and nominal segments (finite state)

Descriptive state of the art for tagging: Sotho languages
Applications of tagsets:
• De Schryver and De Pauw (2007): used for lexicography
• Van Rooy and Pretorius (2003): linguistic description of Setswana
• Taljard et al. (2008): morphosyntactic and general linguistic description

The ambiguity problem
• -a-, -go-: see handout for possible readings
• Local context may not identify the noun class of a subject concord:
  (Masogana) … A      nwa     bjalwa
               CS06   drink   beer
  (Young men) … 'They drink beer.'

The ambiguity problem: possible solutions
• Dependent on objectives:
  – flat tagset ignoring irrelevant details (cf. handout for -go-)
  – layered tagset: granularity

Tagset (cf. handout)
• Level 1
  – Noun (N)
  – Subject concord (CS), object concord (CO)
  – Pronouns (PRO)
• Level 2
  – emphatic (only for pronouns): EMP
  – possessive (dto.): POSS
• Level 3
  – classes -> N.01a, N.01, N.02, N.03, …, PERS, etc.
• Examples: noun of class 1 = N.01; possessive pronoun of class 6 = PRO.POSS.06

RF-tagger technology (cf. Schmid and Laws (2008))
• Hidden Markov Model (HMM) tagger
• Additional external lexicon
• Large, fine-grained tagsets
• Several levels of description: e.g.
German articles: ART.Definiteness.Case.Number.Gender
• Calculates joint (product) probabilities

Training corpus
• 45,000 manually annotated tokens (word forms) from two text types
• Not balanced (25,000 tokens from a novel, 2 × 10,000 tokens from dissertations)

Comparing taggers on manually annotated data
• Tree-Tagger (Schmid 1994)
• TnT tagger (Brants 2000)
• MBT tagger (Daelemans et al. 2007)
• RF-tagger (Schmid and Laws 2008)

Effects of the size of the training corpus
• Adding further training data is no longer necessary

Effects of highly polysemous function words
• Distribution problem
• Probability estimates for scarce labels become unreliable
  – -a-: PART (45) vs. CS.01 (1,182)
  – 91% incorrect labelling of PART
• Detailed discussion: see handout on -a-, pages 2 and 4

Alternative proposal: hybrid taggers (Spoustová et al. (2007))
• Combine rule-based tagging with statistical tagging
• For Northern Sotho:
  – contextual disambiguation works fine with the RF-tagger if unambiguous indicators are available
  – disambiguating macros (using the same indicators) hence have little effect
  – ambiguous contexts are hard to account for either way: need for parsing?

Results: 10-fold cross-validation
• Without guessers (to simulate similar conditions for TnT and MBT):
  – RF-tagger: 91.00%
  – TnT tagger: 91.01%
  – MBT: 87.68%
• With guessers (several thousand nouns and verbs part of the lexicon):
  – Tree-Tagger: 92.46%
  – RF-tagger: 94.16%

Conclusions
• Different intended uses lead to different tagsets (granularity, number of tags)
• Including noun class information is essential for general linguistic research, e.g.
grammar development, and for applications of chunking/parsing
• The RF-tagger performs well for our layered tagset with the existing amount of training data (45,000 tokens): over 94% correct
• Ambiguous contexts combined with the sparse-data problem lead to a high error rate for statistical tagging; this is not likely to be solvable with macros
  – chunking/parsing might lead to a more adequate solution for this problem

Future work
• Apply the RF-tagger to the PSC corpus
• Evaluate the results
• Instead of preprocessing rules, partial postprocessing (e.g. chunking, parsing) may make sense
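The longest-match suffix detection used by the verb guesser can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the suffix list below is a small hypothetical sample of the ROOT+suffix paradigm shown earlier, and the corpus-occurrence check is reduced to a simple lookup of the candidate root.

```python
# Minimal sketch of a longest-match verb guesser (illustrative assumptions only).
# Toy sample of verbal suffix combinations, e.g.:
# -išeditšwe = -iš- (causative) + -el- (applied) + -il- (past) + -w- (passive) + -e
SUFFIX_COMBINATIONS = [
    "išeditšwe", "išitšwe", "išetša", "išitše", "etšwe",
    "iša", "ile", "wa", "e", "a",
]

def guess_verb(word, known_roots, suffixes=SUFFIX_COMBINATIONS):
    """Analyse an unknown word as verb root + suffix combination,
    trying the longest suffixes first (longest match)."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix):
            root = word[: len(word) - len(suffix)]
            # The real guesser also checks occurrences in corpora;
            # here we only require that the candidate root is attested.
            if root in known_roots:
                return root, suffix
    return None

print(guess_verb("rekišeditšwe", {"rek", "rut"}))  # ('rek', 'išeditšwe')
```

Trying suffixes from longest to shortest matters: without it, `rekišeditšwe` would be wrongly segmented as root `rekišeditšw-` plus the inflectional ending `-e`.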