Integrating Semantic Dictionaries for English, French and Bulgarian into the NooJ System for the Purposes of Information Retrieval
Svetla Koeva, Max Silberztein
8th INTEX / NooJ Workshop, 30 May, 2005

Main research goals
• To provide a sufficient methodology for implementing natural language semantic relations in the NooJ system:
  – to create specialized Semantic Dictionaries for English, French and Bulgarian based on WordNet semantic relations;
  – to provide a complete formalization of the inflection of the simple and compound words included in the WordNet structure.

History
• The integration of semantic relations into the INTEX system was initially proposed at the sixth INTEX workshop.
• Later, the idea was developed into the joint RILA research project "Information retrieval based on semantic relations":
  – LASELDI, Université de Franche-Comté
  – Department of Computational Linguistics, IBL, Bulgarian Academy of Sciences.

Language resources
• Bulgarian grammatical dictionary (BGD) – over 83 000 lemmas and 1 100 000 word forms;
• English WordNet 2.0 – 115 424 synonymous sets;
• Bulgarian WordNet (BalkaNet project) – 22 867 synonymous sets;
• French WordNet (EuroWordNet project) – 33 512 synonymous sets;
• English dictionary – over 30 000 lemmas (not inflected);
• French dictionary – extracted with INTEX.

Implementation tasks
• To transform the format of the BGD into the NooJ standard;
• To create semantic dictionaries for Bulgarian and English;
• To associate lemmas from the Bulgarian semantic dictionaries with the corresponding inflection types;
• To add missing lemmas and inflection types to the BGD, if any;
• To create extensive dictionaries and corresponding inflection types for compounds.

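
The first implementation task, converting BGD entries to the NooJ format, can be sketched in a few lines. This is only an illustration, not the Perl script used in the project: the function name `bgd_to_nooj` and the `LABEL_MAP` table are hypothetical. The label correspondences are limited to those visible in the examples on the "Transforming BGD" slide, and the entry layout `lemma, labels, type number` is assumed from the same examples.

```python
# Label correspondences read off the slide examples only (an assumption);
# the real transliteration table covers many more BGD label sets.
LABEL_MAP = {
    "ПРИ": "A",       # adjective
    "С+М": "N+M",     # masculine noun
    "С+Ж": "N+F",     # feminine noun
    "Г+Н+Т": "V+IT",  # verb subclass, transliterated as on the slide
}


def bgd_to_nooj(entry: str) -> str:
    """Convert one BGD entry 'lemma, labels, type' to a NooJ dictionary line."""
    lemma, labels, type_no = (field.strip() for field in entry.split(","))
    lemma = lemma.replace("`", "")       # drop the stress mark
    nooj_labels = LABEL_MAP[labels]      # KeyError for labels not in the table
    # Inflection type name: labels joined by '_' plus the BGD type number.
    flx = nooj_labels.replace("+", "_") + "-" + type_no
    return f"{lemma},{nooj_labels}+FLX={flx}"


if __name__ == "__main__":
    for line in ["абсол`ютен, ПРИ, 7", "`август, С+М, 10", "адрес`ирам,Г+Н+Т,4"]:
        print(bgd_to_nooj(line))
```

Entries whose labels are missing from the table (for example `а,ЧА,0`) raise a `KeyError` in this sketch, which makes gaps in the mapping visible instead of silently producing wrong output.
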
BGD – Information structure design
• Category information – 6 classes: Noun, Verb, Adjective, Pronoun, Numeral, Others (Adverb, Preposition, Conjunction, Particle, Interjection);
• Paradigmatic information – Personal, Transitive, Perfective, Common, …;
• Grammatical information – Inflection, Conjugation, Sound alternations, ….

BGD – Grammatical subclasses
• Nouns – 22 subclasses with respect to their Type (Common, Proper, Singularia tantum, Pluralia tantum) and Gender;
• Verbs – 32 subclasses with respect to Transitivity, Perfectiveness, and Personality;
• Adjectives – 2 subclasses;
• Pronouns – 26 subclasses with respect to their Type and Possessor;
• Numerals – 6 subclasses.

BGD – Grammatical types
• Noun – Number, Definiteness, Counting form, Case, Optional forms – 266 types;
• Verb – Person, Number, Tense, Mood, Voice, Participles, Gender, Definiteness – 257 types;
• Adjective – Gender, Number, Definiteness – 30 types;
• Pronoun – Gender, Person, Number, Definiteness, Case, Clitic, Possessing – 28 types;
• Numeral – Gender, Number, Definiteness, Approximate form, Male form – 20 types.

BGD – Dictionary format
Dictionary entries:
а,ЧА,0
абсол`ютен, ПРИ, 7
`август, С+М, 10
авиокомп`ания, С+Ж, 1
австр`ийски, ПРИ, 3
автоб`ус, С+М, 11
автомат`ичен, ПРИ, 7
адрес`ирам, Г+Н+Т, 4
агит`ирам, Г+Н+Т, 4

Grammatical type ПРИ, 7:
sm0, Ok, ''
smh, Ok, '2RCия'
sml, Ok, '2RCият'
sf0, Ok, '2RCа'
sfd, Ok, '2RCата'
sn0, Ok, '2RCо'
snd, Ok, '2RCото'
p0, Ok, '2RCи'
pd, Ok, '2RCите'

Transforming BGD
Dictionary and grammatical types → Perl script (transliteration of labels) → NooJ dictionary

абсол`ютен, ПРИ, 7       →  абсолютен,A+FLX=A-7
`август, С+М, 10         →  август,N+M+FLX=N_M-10
авиокомп`ания, С+Ж, 1    →  авиокомпания,N+F+FLX=N_F-1
австр`ийски, ПРИ, 3      →  австрийски,A+FLX=A-3
автоб`ус, С+М, 11        →  автобус,N+M+FLX=N_M-11
автомат`ичен, ПРИ, 7     →  автоматичен,A+FLX=A-7
адрес`ирам, Г+Н+Т, 4     →  адресирам,V+IT+FLX=V_IT-4

NooJ formal descriptions
sm0, Ok, ''
smh, Ok, '2RCия'
sml, Ok, '2RCият'
sf0, Ok, '2RCа'
sfd, Ok, '2RCата'
sn0, Ok, '2RCо'
snd, Ok, '2RCото'
p0, Ok, '2RCи'
pd, Ok, '2RCите'
→
A-7 = <E>/sm0
    + <L2><S><R>ия<S1>/smh
    + <L2><S><R>ият<S1>/sml
    + <L2><S><R>а<S1>/sf0
    + <L2><S><R>ата<S1>/sfd
    + <L2><S><R>о<S1>/sn0
    + <L2><S><R>ото<S1>/snd
    + <L2><S><R>и<S1>/p0
    + <L2><S><R>ите<S1>/pd;

WordNet semantic relations
ILR                POS/POS            EW2.0     BulNet
HYPERONYMY         N/N V/V            94 844    15 838
NEAR ANTONYMY      N/N A/A V/V        7 642     1 847
PART MERONYMY      N/N                8 636     1 241
MEMBER MERONYMY    N/N                12 205    841
PORTION MERONYMY   N/N                787       107
SUBEVENT           V/V                409       162
CAUSES             V/V                439       104
SIMILAR TO         A/A V/V            22 196    1 479
VERB GROUP         V/V                1 748     848
ALSO SEE           A/A V/V            3 240     895

Other relations
ILR                POS/POS            EW2.0     BulNet
BE IN STATE        A/N                1 296     591
BG DERIVATIVE      N/V                36 630    6 469
DERIVED            A/N                6 809     1 071
PARTICIPLE         A/V                401       56
REGION DOMAIN      N/N V/N A/N B/N    1 280     4
USAGE DOMAIN       N/N V/N A/N B/N    983       22
CATEGORY DOMAIN    N/N V/N A/N B/N    6 166     638

Selected relations
• Synonymy (reflexive, symmetric, and transitive relation of equivalence);
• Hypernymy (inverse, asymmetric, and transitive relation between synonym sets);
• Meronymy (inverse, asymmetric, and transitive relation between synonym sets): Part meronymy; Member meronymy;
Portion meronymy.

Selected relations
• Similar to (symmetric relation between similar adjectival synsets);
• Verb group (symmetric relation between semantically related verb synsets);
• Also see (symmetric relation between verb or adjective synsets that are close in meaning);
• Category domain (asymmetric extralinguistic relation between synsets denoting a concept and the sphere of knowledge it belongs to).

DELAF semantic dictionaries
• These dictionaries consist of pairs of literals defined for the corresponding semantic relation:
  – car,automobile.N
  – auto,automobile.N
• All possible combinations between literals in the given synsets are listed:
  – car,automobile.N
  – cars,automobile.N
  – auto,automobile.N
  – autos,automobile.N

NooJ Semantic dictionaries
Synonymy relation
‘a plant consisting of buildings with facilities for manufacturing’
фабрика,N+FLX=ENG20-03196165-n
предприятие,N+FLX=ENG20-03196165-n
factory,N+FLX=ENG20-03196165-n
mill,N+FLX=ENG20-03196165-n
manufacturing plant,N+FLX=ENG20-03196165-n
manufactory,N+FLX=ENG20-03196165-n

NooJ Semantic dictionaries
Hypernymy relation
‘the organized action of making of goods and services for sale’
производство,N+FLX=ENG20-00859333-n
промишленост,N+FLX=ENG20-00859333-n
индустрия,N+FLX=ENG20-00859333-n
production,N+FLX=ENG20-00859333-n
industry,N+FLX=ENG20-00859333-n
manufacture,N+FLX=ENG20-00859333-n

Inflecting WordNet
<SYNSET>
  <ID>...</ID>
  <POS>...</POS>
  <SYNONYM>
    <LITERAL>otstranqwam (to remove)
      <SENSE>…</SENSE>
      <LNOTEGR>ГНТ12</LNOTEGR>
    </LITERAL>
  </SYNONYM>
  <ILR>...<TYPE>...</TYPE></ILR>
  <DEF>remove something concrete, as by lifting, pushing, taking off, etc. or remove something abstract</DEF>
  <BCS>...</BCS>
</SYNSET>

NooJ Semantic descriptions
‘the organized action of making of goods and services for sale’
ENG20-00859333-n = <E>/Hs0
    + то/Hsd
    + <L1>а<S1>/Hp0
    + <L1>ата<S1>/Hpd
    + <L9>мишленост<S9>/Ss0
    + <L9>мишлеността<S9>/Ssd
    + <L9>мишлености<S9>/Sp0
    + <L9>мишленостите<S9>/Spd
    + <B12>индустрия/Ss0
    + <B12>индустрията/Ssd
    + <B12>индустрии/Sp0
    + <B12>индустриите/Spd;
ENG20-00859333-n = <E>/Hs
    + <B10>industry/Ss
    + <B10>industries/Sp0
    + <B10>manufacture/Ss
    + <B10>manufactures/Sp;

After the nice solutions
• Lemmas which are not included in the BGD:
  – Classification of lemmas into existing inflection types;
  – Formal description of new inflection types;
  – Literals in Latin;
  – Validating WordNet.
• Semantic ambiguity – literals with two inflectional descriptions in the BGD;
• Compound words:
  – Formal description of inflection types;
  – Classification of compounds.

NooJ Compound semantic descriptions
ENG20-04182583-n = <E>/Ss0
    + <P>та/Ssd
    + <B>и<P><B>(и/p0 +ите/pd)
    + <B7>завод<P><B2>ен/Ss0
    + <B7>завод<P><B2>ния/Ssh
    + <B7>завод<P><B2>ният/Ssl
    + <B7>заводи<P><B2>ни/Sа0
    + <B7>заводи<P><B2>ните/Sа0
    + <B7>рафинерия/Ss0
    + <B7>рафинерия<P>та/Ssd
    + <B7>рафинерии<P><B>и/Sp0
    + <B7>рафинерии<P><B>ите/Spd;

Applications of the Semantic Dictionaries
• Information retrieval by means of semantic equivalence, with synonymy dictionaries;
• Information retrieval by means of semantic specification, with hyperonymy and meronymy dictionaries;
• Information retrieval by means of similarity;
• Information retrieval by means of thematic domain affiliation;
• Validation of the WordNet structure for completeness and consistency.

Future directions
• Extensions and enhancements of the semantic dictionaries by means of:
  – Extension of the dictionaries' coverage;
  – Addition of other semantic relations;
  – Inclusion of additional information in the entries.
• Integration of multilingual semantic extraction with NooJ using the Inter-Lingual-Index relation.
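
The descriptions on the "NooJ Semantic descriptions" slide can be read as small editing programs that turn the synset's first literal into every other literal and word form. The sketch below illustrates that reading in Python. It is not NooJ's implementation, and the names `apply_rule` and `expand` are hypothetical. The operator meanings are assumptions, since the slides do not define them: `<E>` leaves the word unchanged, `<Ln>` and `<Rn>` move a cursor left or right, `<Sn>` deletes n characters at the cursor, `<Bn>` deletes n characters before it, and a missing n counts as 1. This reading has been checked only against the hypernymy example. The `<P>` operator and the parenthesized alternatives used in compound descriptions are not handled.

```python
import re

# One token is either an operator such as <L9> or a run of literal text.
TOKEN = re.compile(r"<([ELRSB])(\d*)>|([^<]+)")


def apply_rule(lemma: str, rule: str) -> str:
    """Apply one rule (e.g. '<L1>а<S1>') to a lemma; cursor starts at the end."""
    chars = list(lemma)
    cur = len(chars)
    pos = 0
    while pos < len(rule):
        m = TOKEN.match(rule, pos)
        if m is None:
            raise ValueError(f"unsupported operator at: {rule[pos:]!r}")
        pos = m.end()
        op, count, text = m.groups()
        if text is not None:              # literal text: insert at the cursor
            chars[cur:cur] = text
            cur += len(text)
            continue
        n = int(count) if count else 1
        if op == "L":                     # move cursor left
            cur = max(0, cur - n)
        elif op == "R":                   # move cursor right
            cur = min(len(chars), cur + n)
        elif op == "S":                   # delete n characters at the cursor
            del chars[cur:cur + n]
        elif op == "B":                   # delete n characters before the cursor
            start = max(0, cur - n)
            del chars[start:cur]
            cur = start
        # op == "E": empty string, nothing to do
    return "".join(chars)


def expand(lemma: str, description: str) -> list[tuple[str, str]]:
    """Expand 'NAME = rule/tag + rule/tag + ...;' into (form, tag) pairs."""
    body = description.split("=", 1)[1].rstrip("; \n")
    forms = []
    for alternative in body.split("+"):
        rule, tag = alternative.strip().rsplit("/", 1)
        forms.append((apply_rule(lemma, rule), tag))
    return forms


if __name__ == "__main__":
    desc = ("ENG20-00859333-n = <E>/Hs0 + то/Hsd + <L1>а<S1>/Hp0 + "
            "<L9>мишленост<S9>/Ss0 + <B12>индустрия/Ss0;")
    for form, tag in expand("производство", desc):
        print(form, tag)
```

Under these assumptions, the rule `<L9>мишленост<S9>` applied to производство keeps the first three letters, inserts мишленост, and deletes the remaining nine letters, which gives промишленост. A single description can therefore list all literals of a synset together with their inflected forms.
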