Supporting e-learning with automatic glossary extraction: Experiments with Portuguese
Rosa Del Gaudio, António Branco
RANLP, Borovets 2007

Presentation Plan
● LT4eL project
● ILIAS
● Corpus
● Tool
● Grammars
● Copula
● Other Verbs
● Punctuation
● Results
● Conclusion

LT4eL
● Improve retrieval and accessibility of learning objects (LOs) in learning management systems.
● Employ language technology resources and tools for the semi-automatic generation of descriptive metadata.
● Develop new functionalities such as a keyword extractor, a glossary candidate detector and semantic search, tuned for the various languages addressed in the project (Bulgarian, Czech, Dutch, English, German, Maltese, Polish, Portuguese, Romanian).

ILIAS

Objective
● Build a glossary automatically to support the e-learning process. In practice, this means extracting definitions from unstructured text (scientific papers, encyclopedias, web pages).
● Better access to information for students.
● Accelerate the work of the tutor.

ILIAS: Glossary Candidate Detector

The Corpus
• 274,000 tokens
• Tutorials, PhD theses, scientific papers
• 3 domains evenly represented:
  • e-learning
  • Technology for non-experts
  • Calimera

XML format

  <definingText continue="y" def="m147" def_type1="is_def" id="d5">
    <markedTerm dt="y" id="m147" kw="y">
      <tok base="intranet" class="word" ctag="PNM" id="t9032" sp="y">Intranet</tok>
    </markedTerm>
    <tok base="ser" class="word" ctag="V" id="t9033" msd="pi-3s" sp="y">é</tok>
    <tok base="uma" class="word" ctag="UM" id="t9034" msd="fs" sp="y">uma</tok>
    <tok base="rede" class="word" ctag="CN" id="t9035" msd="fs" sp="y">rede</tok>
    <tok base="desenvolver,desenvolvido" class="word" ctag="PPA" id="t9036" msd="fs" sp="y">desenvolvida</tok>
    <tok base="para" class="word" ctag="PREP" id="t9037" sp="y">para</tok>
    <tok base="processamento" class="word" ctag="CN" id="t9038" msd="ms" sp="y">processamento</tok>
    <tok base="de" class="word" ctag="PREP" id="t9039" sp="y">de</tok>
    <tok base="informação" class="word" ctag="CN" id="t9040" msd="fp" sp="y">informações</tok>
    <tok base="em" class="word" ctag="PREP" id="t9041" sp="y">em</tok>
    <tok base="uma" class="word" ctag="UM" id="t9042" msd="fs" sp="y">uma</tok>
    <tok base="empresa" class="word" ctag="CN" id="t9043" msd="fs" sp="y">empresa</tok>
    <tok base="ou" class="word" ctag="CJ" id="t9044" sp="y">ou</tok>
    <tok base="organização" class="word" ctag="CN" id="t9045" msd="fs">organização</tok>
    <tok class="punctuation" ctag="PNT" id="t9046" sp="y">.</tok>
  </definingText>

LxTransduce
• Matches trees using elements
• Quick
• Unicode friendly
• Freeware
• Easy to integrate into other tools (Java)
• Input: plain text or XML
• Regular expressions
• Substitution and markup
• Output: the same file with changes

Rules in lxtransduce

  <rule name="PARopen">
    <query match="tok[.~'^\($']"/>
  </rule>
  <rule name="PARcl">
    <query match="tok[.~'^\)$']"/>
  </rule>
  <rule name="parenthetic">
    <seq>
      <ref name="PARopen"/>
      <repeat-until name="tok">
        <ref name="PARcl"/>
      </repeat-until>
      <ref name="PARcl"/>
    </seq>
  </rule>
  <rule name="Conj">
    <query match="tok[@ctag = 'CJ']"/>
  </rule>
  <rule name="Coor">
    <!-- Conjunctions or comma -->
    <first>
      <query match="tok[. = ',']"/>
      <ref name="Conj" mult="+"/>
    </first>
  </rule>
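A minimal Python sketch of the token-matching logic that the parenthetic rule above expresses, assuming tokens are available as (text, ctag) pairs; the function and example data are illustrative only, not the lxtransduce API.

  # Sketch of the "parenthetic" rule: an opening bracket, any tokens, a closing bracket.
  # Tokens are assumed to be (text, ctag) pairs; names here are illustrative, not lxtransduce.
  def match_parenthetic(tokens, start):
      """Return the index of the closing ')' for a span opened at `start`, or None."""
      if start >= len(tokens) or tokens[start][0] != "(":   # PARopen
          return None
      i = start + 1
      while i < len(tokens) and tokens[i][0] != ")":        # repeat-until PARcl
          i += 1
      return i if i < len(tokens) else None                 # PARcl found (or not)

  # Example over a small POS-tagged token list:
  toks = [("rede", "CN"), ("(", "PNT"), ("Intranet", "PNM"), (")", "PNT")]
  print(match_parenthetic(toks, 1))  # -> 3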
= ',']"/> <ref name="Conj" mult="+"/> </first> </rule> First development phase ● ● ● ● ● Precision Recall F2 Gr 00 0.14 0.44 0.26 Gr 01 0.31 0.20 0.22 Less than 50% of the corpus Focus on the verb Precision: manually marked/all automatic Recall: correct automatic/manually marked F2 :3*(precision*recall)/2*precision+recall Second developing phase • 75% of the corpus for developing • 25% of the corpus for testing • Specific grammar/rules for each type Copula baseline grammar Verb “to be” third person singular or plural present indicative <rule name="SERdef"> <best> <ref name="Ser3"/> <ref name="PoderSer"/> </best> </rule> <rule name="euristic"> <seq> <repeat-until name="tok"> <ref name="SERdef" mult="+"/> </repeat-until> <ref name="SERdef" mult="+"/> <not> <ref name="PPA"/> </not> <ref name="tok" mult="*"/> <end/> </seq> </rule> Copula base result • Sentence level results • Problem with precision Copula Grammar Rules for is_type <!-- To Be 3rd person pl and s --> <rule name="Serdef"> <query match="tok[@ctag = ’V’ and @base=’ser’ and (@msd[starts-with(.,’fi-3’ )] or @msd[starts-with(.,’pi3’ )])] </rule> .... <rule name="copula1"> <seq> <ref name="SERdef"/> <best> <seq> <ref name="Art"/> <ref name="adj|adv|prep|" mult="*"/> <ref name="Noun" mult="+"/> </seq> .... </best> <ref name="tok" mult="*"/> <end/> </seq> </rule> Confronting Results Include that patterns that were excluded Try to gather the syntactic pattern of non definition and confront with the syntactic pattern of definition. Other_Verbs grammar • Collect verbs in a lexicon • Three different category: reflexive, active, passive. • 22 different verbs <lex word="chamar"> <cat>ref</cat> </lex> <lex word="chamar,chamado"> <cat>pas</cat> </lex> <rule name="Vpas"> <seq> <ref name="tok"/> <not> <ref name="not"/> </not> <ref name="tok" mult="?"/> <query match="tok[mylex(@base) and (@ctag='PPA')]" constraint="mylex(@base)/cat=' pas'"/> </seq> </rule> Results for verb_type • Analyze each verbs separately as with is_type • Richer syntactic patterns Punctuation Grammar Preliminary work ● Definition introduced by colon mark (most frequent) ● <rule name="punct_def"> <seq> <start/> <ref name="CompmylexSN" mult="+"/> <query match="tok[.~’^:\$’]"/> <ref name="tok" mult="+"/> <end/> </seq> </rule> All-in-one • Combination of the previous grammars • The type is not take into account to calculate precision and recall Conclusions and Future Work • Overall results: Recall 86%, Precision 14% • Difference among domains: the style of a document influence the result. • Improve the rules for verb_type and punc_type • Combining with other techniques such as ML