Integrating Semantic Dictionaries
for English, French and Bulgarian
into the NooJ System for the
Purposes of Information Retrieval
Svetla Koeva, Max Silberztein
8th INTEX / NooJ Workshop,
30 May, 2005
Main research goals
• To provide an adequate methodology for
implementing natural language semantic
relations in the NooJ system:
– to create specialized Semantic Dictionaries for
English, French and Bulgarian based on WordNet
semantic relations;
– to provide a complete formalization of the inflection
of simple and compound words included in the
WordNet structure.
History
• The integration of semantic relations into the INTEX
system was initially proposed at the sixth INTEX
workshop.
• Later the idea was developed into the joint research
RILA project "Information retrieval based on semantic
relations":
– LASELDI, Université de Franche-Comté
– Department of Computational Linguistics, IBL, Bulgarian
Academy of Sciences.
Language resources
• Bulgarian grammatical dictionary (BGD) – over
83 000 lemmas and 1 100 000 word forms;
• English WordNet 2.0 – 115 424 synonymous sets;
• Bulgarian WordNet (BalkaNet project) – 22 867
synonymous sets;
• French WordNet (EuroWordNet project) – 33 512
synonymous sets;
• English dictionary – over 30 000 lemmas (not
inflected);
• French dictionary – extracted with INTEX.
Implementation tasks
• To transform the format of the BGD into the NooJ
standard;
• To create semantic dictionaries for Bulgarian and
English;
• To associate lemmas from the Bulgarian semantic
dictionaries with the corresponding inflection types;
• To add missing lemmas and inflection types in BGD,
if any;
• To create extensive dictionaries and corresponding
inflection types for compounds.
BGD – Information structure
design
• Category information –
6 classes: Noun, Verb, Adjective, Pronoun, Numeral,
Others (Adverb, Preposition, Conjunction, Particle,
Interjection);
• Paradigmatic information –
Personal, Transitive, Perfective, Common, …;
• Grammatical information –
Inflection, Conjugation, Sound alternations, ….
BGD – Grammatical subclasses
• Nouns – 22 subclasses with respect to their
Type (Common, Proper, Singularia tantum,
Pluralia tantum) and Gender;
• Verbs – 32 subclasses with respect to
Transitivity, Perfectiveness, and Personality;
• Adjectives – 2 subclasses;
• Pronouns – 26 subclasses with respect to their
Type and Possessor;
• Numerals – 6 subclasses.
BGD – Grammatical types
• Noun – Number, Definiteness, Counting form,
Case, Optional forms – 266 types;
• Verb – Person, Number, Tense, Mood, Voice,
Participles, Gender, Definiteness – 257 types;
• Adjective – Gender, Number, Definiteness –
30 types;
• Pronoun – Gender, Person, Number,
Definiteness, Case, Clitic, Possessing – 28
types;
• Numeral – Gender, Number, Definiteness,
Approximate form, Male form – 20 types.
BGD – Dictionary format
Dictionary entries:
а,ЧА,0
абсол`ютен, ПРИ, 7
`август, С+М, 10
авиокомп`ания, С+Ж, 1
австр`ийски, ПРИ, 3
автоб`ус, С+М, 11
автомат`ичен, ПРИ, 7
адрес`ирам, Г+Н+Т, 4
агит`ирам, Г+Н+Т, 4

Grammatical type ПРИ, 7:
sm0, Ok, ''
smh, Ok, '2RCия'
sml, Ok, '2RCият'
sf0, Ok, '2RCа'
sfd, Ok, '2RCата'
sn0, Ok, '2RCо'
snd, Ok, '2RCото'
p0, Ok, '2RCи'
pd, Ok, '2RCите'
Transforming BGD
Perl script (Dictionary, Grammatical types, Transliteration of labels) → NooJ dictionary

абсол`ютен, ПРИ, 7 → абсолютен,A+FLX=A-7
`август, С+М, 10 → август,N+M+FLX=N_M-10
авиокомп`ания, С+Ж, 1 → авиокомпания,N+F+FLX=N_F-1
австр`ийски, ПРИ, 3 → австрийски,A+FLX=A-3
автоб`ус, С+М, 11 → автобус,N+M+FLX=N_M-11
автомат`ичен, ПРИ, 7 → автоматичен,A+FLX=A-7
адрес`ирам, Г+Н+Т, 4 → адресирам,V+IT+FLX=V_IT-4
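The conversion shown above can be sketched in a few lines. This is only an illustration in Python, not the Perl script used in the project: the label table contains just the four labels visible on the slide, and the function name `bgd_to_nooj` is hypothetical.

```python
# Label transliteration, limited to the examples shown on the slide.
# Any other BGD label (e.g. ЧА) is not covered by this sketch.
LABELS = {
    "ПРИ": "A",       # adjective
    "С+М": "N+M",     # masculine noun
    "С+Ж": "N+F",     # feminine noun
    "Г+Н+Т": "V+IT",  # verb class, transliterated as on the slide
}

def bgd_to_nooj(line: str) -> str:
    """Turn 'lemma, LABEL, type' into 'lemma,CAT+FLX=CAT-type'."""
    lemma, label, type_no = [field.strip() for field in line.split(",")]
    lemma = lemma.replace("`", "")         # drop the stress mark
    category = LABELS[label]               # KeyError for unmapped labels
    paradigm = category.replace("+", "_")  # N+M becomes N_M in the FLX name
    return f"{lemma},{category}+FLX={paradigm}-{type_no}"

for entry in ["`август, С+М, 10", "адрес`ирам,Г+Н+Т,4"]:
    print(bgd_to_nooj(entry))
```

Unmapped labels raise an error on purpose, so that gaps in the transliteration table are noticed rather than silently copied into the NooJ dictionary.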
NooJ formal descriptions
BGD grammatical type (ПРИ, 7):
sm0, Ok, ''
smh, Ok, '2RCия'
sml, Ok, '2RCият'
sf0, Ok, '2RCа'
sfd, Ok, '2RCата'
sn0, Ok, '2RCо'
snd, Ok, '2RCото'
p0, Ok, '2RCи'
pd, Ok, '2RCите'
→ NooJ inflectional description:
A-7 = <E>/sm0 +
<L2><S><R>ия<S1>/smh +
<L2><S><R>ият<S1>/sml +
<L2><S><R>а<S1>/sf0 +
<L2><S><R>ата<S1>/sfd +
<L2><S><R>о<S1>/sn0 +
<L2><S><R>ото<S1>/snd +
<L2><S><R>и<S1>/p0 +
<L2><S><R>ите<S1>/pd;
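To see what such a description produces, the operators can be interpreted with a small cursor-based sketch. The semantics below are an assumption that is consistent with the slide examples (cursor starts at the end of the lemma; `<L>`/`<R>` move it, `<S>` deletes after it, `<B>` deletes before it, `<E>` is the empty string); the functions `apply_ops` and `expand` are hypothetical helpers, not part of NooJ.

```python
import re

# One token is either an operator such as <L2> or a run of literal text.
TOKEN = re.compile(r"<([A-Z])(\d*)>|([^<]+)")

def apply_ops(lemma: str, command: str) -> str:
    """Apply one inflection command to a lemma (assumed operator semantics)."""
    chars = list(lemma)
    pos = len(chars)                    # cursor starts after the last letter
    for op, count, text in TOKEN.findall(command):
        n = int(count) if count else 1
        if text:                        # literal text: insert at the cursor
            chars[pos:pos] = text
            pos += len(text)
        elif op == "E":                 # empty string: lemma unchanged
            pass
        elif op == "L":                 # move cursor left
            pos = max(0, pos - n)
        elif op == "R":                 # move cursor right
            pos = min(len(chars), pos + n)
        elif op == "S":                 # delete n characters after the cursor
            del chars[pos:pos + n]
        elif op == "B":                 # delete n characters before the cursor
            start = max(0, pos - n)
            del chars[start:pos]
            pos = start
        else:
            raise ValueError(f"unsupported operator <{op}>")
    return "".join(chars)

def expand(lemma: str, paradigm: str) -> dict:
    """Map each tag of a 'cmd/tag + cmd/tag;' paradigm to its word form."""
    forms = {}
    for item in paradigm.strip().rstrip(";").split("+"):
        command, tag = item.strip().rsplit("/", 1)
        forms[tag] = apply_ops(lemma, command)
    return forms

A_7 = ("<E>/sm0 + <L2><S><R>ия<S1>/smh + <L2><S><R>ият<S1>/sml + "
       "<L2><S><R>а<S1>/sf0 + <L2><S><R>ата<S1>/sfd + <L2><S><R>о<S1>/sn0 + "
       "<L2><S><R>ото<S1>/snd + <L2><S><R>и<S1>/p0 + <L2><S><R>ите<S1>/pd;")

for tag, form in expand("абсолютен", A_7).items():
    print(tag, form)
```

Under these assumptions the trailing `<S1>` has nothing left to delete at the end of the word, so it leaves the form unchanged.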
WordNet semantic relations
ILR                POS/POS        EW2.0     BulNet
HYPERONYMY         N/N V/V        94 844    15 838
NEAR ANTONYMY      N/N A/A V/V     7 642     1 847
PART MERONYMY      N/N             8 636     1 241
MEMBER MERONYMY    N/N            12 205       841
PORTION MERONYMY   N/N               787       107
SUBEVENT           V/V               409       162
CAUSES             V/V               439       104
SIMILAR TO         A/A V/V        22 196     1 479
VERB GROUP         V/V             1 748       848
ALSO SEE           A/A V/V         3 240       895
Other relations
ILR                POS/POS            EW2.0     BulNet
BE IN STATE        A/N                 1 296       591
BG DERIVATIVE      N/V                36 630     6 469
DERIVED            A/N                 6 809     1 071
PARTICIPLE         A/V                   401        56
REGION DOMAIN      N/N V/N A/N B/N     1 280         4
USAGE DOMAIN       N/N V/N A/N B/N       983        22
CATEGORY DOMAIN    N/N V/N A/N B/N     6 166       638
Selected relations
• Synonymy (reflexive, symmetric, and
transitive relation of equivalence);
• Hypernymy (inverse, asymmetric, and
transitive relation between synonym sets);
• Meronymy (inverse, asymmetric, and
transitive relation between synonym sets):
– Part meronymy;
– Member meronymy;
– Portion meronymy.
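Because hypernymy and meronymy are transitive, a retrieval system can follow the direct links repeatedly to collect every ancestor of a synset. A minimal sketch, assuming the direct links are stored as a dictionary of lists; the synset names below are invented for illustration and are not taken from WordNet.

```python
from collections import deque

def all_hypernyms(synset, direct):
    """Collect every synset reachable through direct hypernym links."""
    found = []
    queue = deque(direct.get(synset, []))
    while queue:
        node = queue.popleft()
        if node not in found:          # guard against revisiting a synset
            found.append(node)
            queue.extend(direct.get(node, []))
    return found

# Hypothetical toy hierarchy: each key lists its direct hypernyms.
DIRECT = {
    "car": ["motor_vehicle"],
    "motor_vehicle": ["vehicle"],
}

print(all_hypernyms("car", DIRECT))
```

The relation is asymmetric, so the same lookup started from the top of the hierarchy returns nothing.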
Selected relations
• Similar to (symmetric relation between similar
adjectival synsets);
• Verb group (symmetric relation between
semantically related verb synsets);
• Also see (symmetric relation between verb or adjective synsets that are close in meaning);
• Category domain (asymmetric extralinguistic
relation between synsets denoting a concept
and the sphere of knowledge it belongs to).
DELAF semantic dictionaries
• These dictionaries consist of pairs of literals defined
for the corresponding semantic relation:
– car,automobile.N
– auto,automobile.N
• All possible combinations between literals in the
given synsets are listed:
– car,automobile.N
– cars,automobile.N
– auto,automobile.N
– autos,automobile.N
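Listing all combinations amounts to pairing every inflected form of one literal with every other literal of the synset. The sketch below assumes the inflected forms are already available per literal and that pairs are generated in both directions; the function name `delaf_pairs` and the input layout are hypothetical.

```python
from itertools import permutations

def delaf_pairs(forms_by_literal, pos):
    """Build 'form,target.POS' lines for every ordered pair of literals."""
    lines = []
    for source, target in permutations(forms_by_literal, 2):
        for form in forms_by_literal[source]:
            lines.append(f"{form},{target}.{pos}")
    return lines

# Inflected forms per literal of one synset (supplied by hand here).
SYNSET = {
    "car": ["car", "cars"],
    "auto": ["auto", "autos"],
    "automobile": ["automobile", "automobiles"],
}

for line in delaf_pairs(SYNSET, "N"):
    if line.endswith(",automobile.N"):
        print(line)
```

The number of lines grows with the square of the synset size multiplied by the number of forms, which is one reason to prefer a more compact representation.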
NooJ Semantic dictionaries
Synonymy relation
‘a plant consisting of buildings with facilities for
manufacturing’
фабрика,N+FLX=ENG20-03196165-n
предприятие,N+FLX=ENG20-03196165-n
factory,N+FLX=ENG20-03196165-n
mill,N+FLX=ENG20-03196165-n
manufacturing plant,N+FLX=ENG20-03196165-n
manufactory,N+FLX=ENG20-03196165-n
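Since every literal of a synset carries the same `FLX` value, the synset identifier can serve as a key for synonym lookup across languages. A minimal sketch, assuming the entries are read as plain text lines in the format shown above; the helper names are hypothetical and only a few of the slide entries are used.

```python
from collections import defaultdict

ENTRIES = """\
фабрика,N+FLX=ENG20-03196165-n
factory,N+FLX=ENG20-03196165-n
mill,N+FLX=ENG20-03196165-n
manufacturing plant,N+FLX=ENG20-03196165-n
"""

def index_by_synset(text):
    """Group literals by the synset identifier stored in their FLX feature."""
    by_synset = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue
        literal, info = line.split(",", 1)
        features = info.split("+")
        synset_id = next(f[4:] for f in features if f.startswith("FLX="))
        by_synset[synset_id].append(literal)
    return by_synset

def synonyms(word, by_synset):
    """Return all other literals that share a synset with the given word."""
    return sorted({lit for lits in by_synset.values() if word in lits
                   for lit in lits if lit != word})

index = index_by_synset(ENTRIES)
print(synonyms("mill", index))
```

The same index answers queries in either language, because Bulgarian and English literals point to the same identifier.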
NooJ Semantic dictionaries
Hypernymy relation
‘the organized action of making of goods and services
for sale’
производство,N+FLX=ENG20-00859333-n
промишленост,N+FLX=ENG20-00859333-n
индустрия,N+FLX=ENG20-00859333-n
production,N+FLX=ENG20-00859333-n
industry,N+FLX=ENG20-00859333-n
manufacture,N+FLX=ENG20-00859333-n
Inflecting wordnet
<SYNSET>
<ID>...</ID>
<POS>...</POS>
<SYNONYM>
<LITERAL>
otstranqwam (to remove)
<SENSE>…</SENSE>
<LNOTEGR>ГНТ12</LNOTEGR>
</LITERAL>
</SYNONYM>
<ILR>...<TYPE>...</TYPE></ILR>
<DEF>
remove something concrete, as by lifting, pushing, taking off, etc. or remove
something abstract
</DEF>
<BCS>...</BCS>
</SYNSET>
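Each literal can be read together with its inflection note using a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree` on a reduced copy of the synset above; elided values stay elided, and the function name `literals_with_types` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Reduced copy of the slide's synset; "..." marks values not shown there.
SAMPLE = """
<SYNSET>
  <ID>...</ID>
  <POS>...</POS>
  <SYNONYM>
    <LITERAL>otstranqwam<SENSE>...</SENSE><LNOTEGR>ГНТ12</LNOTEGR></LITERAL>
  </SYNONYM>
</SYNSET>
"""

def literals_with_types(xml_text):
    """Return (literal, inflection note) pairs for one synset."""
    root = ET.fromstring(xml_text)
    pairs = []
    for literal in root.iter("LITERAL"):
        lemma = (literal.text or "").strip()   # text before the first child
        inflection = literal.findtext("LNOTEGR", default="").strip()
        pairs.append((lemma, inflection))
    return pairs

print(literals_with_types(SAMPLE))
```

The inflection note is what links a WordNet literal to its inflection type, so literals with an empty note are the ones that still need classification.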
NooJ Semantic descriptions
‘the organized action of making of goods and
services for sale’
ENG20-00859333-n = <E>/Hs0 + то/Hsd + <L1>а<S1>/Hp0
+ <L1>ата<S1>/Hpd + <L9>мишленост<S9>/Ss0 +
<L9>мишлеността<S9>/Ssd + <L9>мишлености<S9>/Sp0
+ <L9>мишленостите<S9>/Spd + <B12>индустрия/Ss0 +
<B12>индустрията/Ssd + <B12>индустрии/Sp0 +
<B12>индустриите/Spd;
ENG20-00859333-n = <E>/Hs + <B10>industry/Ss +
<B10>industries/Sp0 + <B10>manufacture/Ss +
<B10>manufactures/Sp;
After the nice solutions
• Lemmas which are not included in the BGD:
– Classification of lemmas into existing inflection types;
– Formal description of new inflection types;
– Literals in Latin;
– Validating WordNet.
• Semantic ambiguity - literals with two inflectional
descriptions in BGD;
• Compound words
– Formal description of inflection types;
– Classification of compounds.
NooJ Compound semantic
descriptions
ENG20-04182583-n = <E>/Ss0 + <P>та/Ssd +
<B>и<P><B>(и/p0 +ите/pd) +
<B7>завод<P><B2>ен/Ss0 +
<B7>завод<P><B2>ния/Ssh +
<B7>завод<P><B2>ният/Ssl +
<B7>заводи<P><B2>ни/Sа0 +
<B7>заводи<P><B2>ните/Sа0 +
<B7>рафинерия/Ss0 + <B7>рафинерия<P>та/Ssd
+ <B7>рафинерии<P><B>и/Sp0 +
<B7>рафинерии<P><B>ите/Spd;
Applications of the Semantic
Dictionaries
• Information retrieval by means of semantic
equivalence with synonymy dictionaries;
• Information retrieval by means of semantic
specification with hyperonymy and meronymy
dictionaries;
• Information retrieval by means of similarity;
• Information retrieval by means of thematic domain
affiliations;
• Validation of the WordNet structure for completeness
and consistency.
Future directions
• Extensions and enhancements of the semantic
dictionaries by means of:
– Extension of dictionary coverage;
– Addition of other semantic relations;
– Inclusion of additional information in the entries.
• Integration of multilingual semantic extraction with
NooJ using the Inter-Lingual-Index relation.