The CHAOS Project:
Theory and Practice
Fabio Massimo Zanzotto
Department of Computer Science, Systems and Production
University of Roma “Tor Vergata”
People
INVESTIGATORS
Roberto Basili
Fabio Massimo Zanzotto
Maria Teresa Pazienza
FORMER CONTRIBUTORS
Daniele Pighin
Daniele Previtali
Alessandro Bahgat
Marco Pennacchiotti
Massimo Di Nanni
Michele Vindigni
Luigi Mazzucchelli
Paola Velardi
Paolo Zirilli
Alessandro Cucchiarelli
Alessandro Marziali
Fabrizio Grisoli
Gianluca De Rossi
Outline
Theory: Customizable parsing architectures
XDG: eXtended Dependency Graph
Task oriented parsing design
Practice: System Implementation and Use
A component-based approach
An object-oriented platform
Linguistic data
Processing modules
How to use the parser in an application
Demo!!!
Theory
Customizable parsing architectures
Motivation
The Chaos Project unofficially began in ’96
… on the long tradition of ARIOSTO (Basili, Pazienza, Velardi) @ the
University of Rome “Tor Vergata” (RTV)
Aim
building robust parsers for Italian and for English
that use verb sub-categorization (syntactic) lexicons induced from
corpora
that can be used in applications
Constraints
use the long tradition @ RTV
“Social” background
Microtheories for microphenomena
Language analysis can be reduced to a cascade of modules (e.g., FSA)
Application-oriented language analysis (e.g., IE)
Robust (formerly, shallow) parsing approaches
Motivation
contribute-NP-PP(to)
value-NP-PP(at)
Inf(S1)
Inf(S2)
[ Mr. Gaubert ] [contributed] [real estate] [valued] [ at $ 25 million] [to the assets] [of Independent American]
Motivation
(found on vinyl supports)
Different NLP applications have different
performance constraints in terms of:
Accuracy
Throughput
Customizable parsing architectures are reusable
in different application scenarios if:
the architectural design supports performance
control
Customizable parsing
architectures
(found on vinyl supports)
Modularization
clarifies the interdependency between different
syntactic information (grammatical/lexicalized)
allows control of
throughput via eliciting modules
quality via a clear relation between modules
(prerequisites/contributions)
Modular approach
Syntactic parser
SP(S, K) = I
SP(S) = I
Syntactic parsing module:
P_i(S_i, K_i) = S_{i+1}
P_i(S_i) = S_{i+1}
Modular syntactic parser
SP = P_n ∘ ... ∘ P_2 ∘ P_1
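The composition of parsing modules into one parser can be illustrated with a minimal Java sketch. It is not the CHAOS code: the class name, the `compose` and `append` helpers, and the use of a list of layer names as the sentence representation are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

public class ModularParserSketch {
    /** Chains modules so that the first one in the list is applied first (P_1, then P_2, ...). */
    static <S> UnaryOperator<S> compose(List<UnaryOperator<S>> modules) {
        return input -> {
            S current = input;
            for (UnaryOperator<S> module : modules) {
                current = module.apply(current); // S_{i+1} = P_i(S_i)
            }
            return current;
        };
    }

    /** Hypothetical helper: returns a copy of the representation with one more layer. */
    static List<String> append(List<String> layers, String layer) {
        List<String> out = new ArrayList<>(layers);
        out.add(layer);
        return out;
    }

    public static void main(String[] args) {
        // Stand-ins for P_1 and P_2: each adds the annotation layer it contributes.
        List<UnaryOperator<List<String>>> modules = new ArrayList<>();
        modules.add(layers -> append(layers, "pos"));
        modules.add(layers -> append(layers, "chunks"));

        UnaryOperator<List<String>> parser = compose(modules);
        System.out.println(parser.apply(List.of("tokens"))); // prints [tokens, pos, chunks]
    }
}
```

Leaving a module out of the list gives a shorter, faster pipeline, which is the kind of throughput control the modular approach aims at.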
Modular approach
To push a modular approach we need:
a suitable annotation scheme
a classification of the processing modules
A suitable annotation
scheme
Requirements:
Modularization
a stable representation of partially analyzed
structures
Lexicalization
a clear representation of the (semantic) head of a
given structure able to activate the lexicalized rule
XDG:
Extended Dependency Graph
XDG combines constituency and
dependency based formalisms
XDG_{ΓΔ} = (C, D)
C = {(c, t, h) | c ⊆ S, t ∈ Γ, h ∈ c}
D = {(c1, c2, t) | c1, c2 ∈ C, t ∈ Δ}
Nice property: it allows persistent
ambiguity to be stored (for interpretations projected by
the same nodes)
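A graph of this kind can be sketched as a small Java data structure. The sketch is only an illustration under assumptions: the class and method names are hypothetical, constituents are modelled as token spans, and the token indexes and head choices for the example sentence are made up for the demonstration. It shows how two competing attachments of the same constituent can be stored side by side.

```java
import java.util.ArrayList;
import java.util.List;

public class XdgSketch {
    /** Constituent (c, t, h): token span [start, end), type t, head token index h. */
    static final class Constituent {
        final int start, end, head;
        final String type;
        Constituent(int start, int end, String type, int head) {
            if (head < start || head >= end) {
                throw new IllegalArgumentException("head must be a token inside the span");
            }
            this.start = start; this.end = end; this.type = type; this.head = head;
        }
    }

    /** Dependency (c1, c2, t): typed link from a governor to a dependent constituent. */
    static final class Dependency {
        final Constituent governor, dependent;
        final String type;
        Dependency(Constituent governor, Constituent dependent, String type) {
            this.governor = governor; this.dependent = dependent; this.type = type;
        }
    }

    final List<Constituent> constituents = new ArrayList<>();
    final List<Dependency> dependencies = new ArrayList<>();

    Constituent addConstituent(int start, int end, String type, int head) {
        Constituent c = new Constituent(start, end, type, head);
        constituents.add(c);
        return c;
    }

    void addDependency(Constituent governor, Constituent dependent, String type) {
        if (!constituents.contains(governor) || !constituents.contains(dependent)) {
            throw new IllegalArgumentException("both ends must be constituents of this graph");
        }
        dependencies.add(new Dependency(governor, dependent, type));
    }

    /** Governors proposed so far for a dependent; more than one means unresolved ambiguity. */
    List<Constituent> governorsOf(Constituent dependent) {
        List<Constituent> out = new ArrayList<>();
        for (Dependency d : dependencies) {
            if (d.dependent == dependent) out.add(d.governor);
        }
        return out;
    }

    public static void main(String[] args) {
        XdgSketch g = new XdgSketch();
        Constituent contributed = g.addConstituent(2, 3, "VPK", 2);
        Constituent valued = g.addConstituent(5, 6, "VPK", 5);
        Constituent toTheAssets = g.addConstituent(10, 13, "PPK", 12);
        // Both attachments are kept until a later module decides between them.
        g.addDependency(contributed, toTheAssets, "VP_PP");
        g.addDependency(valued, toTheAssets, "VP_PP");
        System.out.println(g.governorsOf(toTheAssets).size()); // prints 2
    }
}
```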
XDG:
Extended Dependency Graph
C are constituents, each with a syntactic head and a potential semantic governor
D are dependencies among constituents
Classification of parsing
modules
P_i(XDG_i, K_i) = P_i(XDG_i) = XDG_{i+1}
The classification is performed according
to:
the type of information K used
how they manipulate the sentence
representation
Task oriented parsing
design
Given:
The NLP application requirements R
The test-bed T
A pool of parsing modules PM
The design activity is:
The search for a combination of parsing
modules from PM that meets R on T
NLP application
requirements
Target phenomena: e.g., VP_PP, NP_PP,
etc.
Metrics:
Recall R per sentence
Precision P per sentence
F-measure per sentence
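As a worked example of the per-sentence metrics, the following Java sketch compares the dependencies predicted for one sentence with a gold annotation. It is an illustration under assumptions: dependencies are encoded as plain strings, the gold and predicted sets are made up, and the F-measure is taken to be the balanced one (equal weight on precision and recall), which the slides do not specify.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class SentenceMetricsSketch {
    /** Number of predicted dependencies that also occur in the gold annotation. */
    static int correct(Set<String> gold, Set<String> predicted) {
        Set<String> common = new HashSet<>(predicted);
        common.retainAll(gold);
        return common.size();
    }

    static double precision(Set<String> gold, Set<String> predicted) {
        return predicted.isEmpty() ? 0.0 : (double) correct(gold, predicted) / predicted.size();
    }

    static double recall(Set<String> gold, Set<String> predicted) {
        return gold.isEmpty() ? 0.0 : (double) correct(gold, predicted) / gold.size();
    }

    /** Balanced F-measure (assumption: equal weight on precision and recall). */
    static double fMeasure(double p, double r) {
        return (p + r == 0.0) ? 0.0 : 2 * p * r / (p + r);
    }

    public static void main(String[] args) {
        // Made-up gold and predicted dependencies for one sentence.
        Set<String> gold = Set.of("contributed>real estate", "contributed>to the assets",
                "valued>at $ 25 million", "assets>of Independent American");
        Set<String> predicted = Set.of("contributed>real estate", "valued>at $ 25 million",
                "valued>to the assets");

        double p = precision(gold, predicted); // 2 correct out of 3 predicted
        double r = recall(gold, predicted);    // 2 correct out of 4 gold
        System.out.println(String.format(Locale.ROOT, "P=%.3f R=%.3f F=%.3f", p, r, fMeasure(p, r)));
        // prints P=0.667 R=0.500 F=0.571
    }
}
```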
CHAOS: Levels of Analysis
Dependencies
Clauses
Chunks: NPK, VPK, PPK
POS: e.g., NNS, TO, VB, IN, PRP, MD
Verb dependencies and
Clause Boundaries
contribute-NP-PP(to)
value-NP-PP(at)
Inf(S1)
Inf(S2)
[ Mr. Gaubert ] [contributed] [real estate] [valued] [ at $ 25 million] [to the assets] [of Independent American]
Verb dependencies and
Clause Boundaries
The algorithm:
Initial Hypotheses:
Minimal boundaries of the clauses in the sentence
Derived Hierarchy
Until all verbs have been analyzed:
Take the rightmost unanalyzed verb v:
Take the lexicalized rules R(v) for the verb v
Find the dependencies of v
Augment the clause boundaries
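The loop above can be sketched in Java on the example sentence. This is a simplified illustration, not the CHAOS implementation: the class and method names are hypothetical, the lexicalized rules are reduced to flat lists of post-verbal slots, only arguments to the right of the verb are considered, and a clause is represented only by the index of its last chunk.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class ClauseBoundarySketch {
    static final class Chunk {
        final String type; // "NP", "VP" or "PP"
        final String head; // verb lemma for VP, preposition for PP, null for NP
        Chunk(String type, String head) { this.type = type; this.head = head; }
        String label() { return type.equals("PP") ? "PP(" + head + ")" : type; }
    }

    static final class Result {
        final Map<Integer, List<Integer>> dependencies = new TreeMap<>();
        final int[] clauseEnd; // for a verb chunk: index of the last chunk of its clause
        Result(int n) {
            clauseEnd = new int[n];
            for (int i = 0; i < n; i++) clauseEnd[i] = i; // minimal boundaries
        }
    }

    static Result analyze(List<Chunk> chunks, Map<String, List<String>> frames) {
        int n = chunks.size();
        Result result = new Result(n);
        for (int v = n - 1; v >= 0; v--) { // rightmost unanalyzed verb first
            if (!chunks.get(v).type.equals("VP")) continue;
            // R(v): the slots the verb can still fill (simplified frame).
            Set<String> open = new HashSet<>(frames.getOrDefault(chunks.get(v).head, List.of()));
            List<Integer> attached = new ArrayList<>();
            int i = v + 1;
            while (i < n && !open.isEmpty()) {
                if (chunks.get(i).type.equals("VP")) {
                    i = result.clauseEnd[i] + 1; // skip a clause that is already analyzed
                    continue;
                }
                if (open.remove(chunks.get(i).label())) {
                    attached.add(i);          // dependency v -> i
                    result.clauseEnd[v] = i;  // augment the clause boundary
                }
                i++;
            }
            result.dependencies.put(v, attached);
        }
        return result;
    }

    /** Chunks of the example sentence, in order. */
    static List<Chunk> example() {
        return List.of(new Chunk("NP", null), new Chunk("VP", "contribute"), new Chunk("NP", null),
                new Chunk("VP", "value"), new Chunk("PP", "at"), new Chunk("PP", "to"),
                new Chunk("PP", "of"));
    }

    /** contribute-NP-PP(to) and value-NP-PP(at), as flat slot lists. */
    static Map<String, List<String>> frames() {
        return Map.of("contribute", List.of("NP", "PP(to)"), "value", List.of("NP", "PP(at)"));
    }

    public static void main(String[] args) {
        Result r = analyze(example(), frames());
        System.out.println(r.dependencies);                       // prints {1=[2, 5], 3=[4]}
        System.out.println(r.clauseEnd[1] + " " + r.clauseEnd[3]); // prints 5 4
    }
}
```

Processing "valued" first closes its clause around "at $ 25 million", so that "contributed" can skip over that clause and reach "to the assets".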
Practice
System Implementation and Use
A Computational
Framework
Object-oriented backbone
Objects for the different data
Objects for the different sub-processes
Linguistic sub-processors as libraries
Coexisting languages: Java, C++, C, Prolog
System implementation
A component-based approach
An object-oriented platform
Linguistic data
Textual entities: Text, Paragraphs
XDG
Linguistic processors
A Component-based Approach
Advantages:
Computational efficiency
Rapid prototyping
Integration of different technologies
Easy reuse
Linguistic processors
Tokenizer, Complex Tokenizer
Dictionary lookup modules
Yellow page look-up
Morphology analyzer
Named Entity Recognition
Part-of-speech tagging
Chunker
Verb shallow analyzer
Shallow analyzer
Linguistic modules
Each process is encapsulated in an object
initialize()
Load lexicons and rules (general or domain specific)
finalize()
Release the module's rules and lexicons
run()
Enrich the input with the contributions of the process
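The three-method life cycle can be illustrated with a minimal Java sketch. The interface, the `Sentence` class and the `LemmaLookup` module are hypothetical and do not reproduce the CHAOS classes; the clean-up method is called `shutdown()` here only to avoid a clash with `java.lang.Object.finalize()`, and the tiny in-memory lexicon stands in for resources loaded from files.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ModuleLifecycleSketch {
    /** Hypothetical shared representation: tokens plus named annotation layers. */
    static final class Sentence {
        final List<String> tokens;
        final Map<String, List<String>> layers = new LinkedHashMap<>();
        Sentence(List<String> tokens) { this.tokens = tokens; }
    }

    /** Hypothetical interface mirroring initialize / run / finalize. */
    interface LinguisticModule {
        void initialize();      // load lexicons and rules
        void run(Sentence s);   // enrich the input with this module's contribution
        void shutdown();        // release lexicons and rules
    }

    /** Toy dictionary look-up module that adds a "lemma" layer. */
    static final class LemmaLookup implements LinguisticModule {
        private Map<String, String> lexicon;

        @Override public void initialize() {
            lexicon = new HashMap<>();
            lexicon.put("compra", "comprare");
            lexicon.put("comprano", "comprare");
        }

        @Override public void run(Sentence s) {
            if (lexicon == null) {
                throw new IllegalStateException("initialize() must be called before run()");
            }
            List<String> lemmas = new ArrayList<>();
            for (String token : s.tokens) {
                lemmas.add(lexicon.getOrDefault(token, token)); // unknown tokens stay as they are
            }
            s.layers.put("lemma", lemmas);
        }

        @Override public void shutdown() {
            lexicon = null;
        }
    }

    public static void main(String[] args) {
        LinguisticModule module = new LemmaLookup();
        module.initialize();
        Sentence s = new Sentence(List.of("compra", "pane"));
        module.run(s);
        module.shutdown();
        System.out.println(s.layers); // prints {lemma=[comprare, pane]}
    }
}
```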
Linguistic processors
Microtheories for microphenomena
Each processor implements its own theory:
It has its language for describing rules
It is written in its own programming language
Processor:
Yellow page look-up, Morphology analyzer
Dictionary
compra comprare d(a) v.tran.sempl 2.sing.imper.pres ~:u:~
compra comprare d(a) v.tran.sempl 3.sing.ind.pres ~:u:~
comprai comprare d(a) v.tran.sempl 1.sing.ind.pass_rem ~:u:~
comprammo comprare d(a) v.tran.sempl 1.plur.ind.pass_rem ~:u:~
compran comprare d(a) v.tran.sempl 3.plur.ind.pres ~:u:~
comprando comprare d(a) v.tran.sempl geru.pres ~:u:~
comprano comprare d(a) v.tran.sempl 3.plur.ind.pres ~:u:~
Processor:
Chunker
Rules
…
constituent_class([_cst1, _cst2, _cst3], 'VerFin', _mor, 1, 3) :-
    verb_finite(_cst1),
    verb_to_have(_cst1),
    verb_past_particle(_cst2),
    verb_to_be(_cst2),
    verb_past_particle(_cst3),
    common_morfology(_cst1,_mor).
…
Processor:
Verb Shallow Analyser
Sub-categorization lexicon
…
pattern(comprare,[
[(oggetto,Post),(per,Post)],
[(oggetto,Post),(da,Post),(per,Post)],
[(oggetto,Post),(a,Post),(per,Post)],[(oggetto,Post)]]).
pattern(comprendere,[[(oggetto,Post)],[],[(oggetto,Post)]]).
pattern(comprimere,[[(oggetto,Post)],[(oggetto,Post)]]).
pattern(compromettere,[[(con,Post)],[(oggetto,Post)]]).
pattern(comunicare,[[],
[(con,Post)],
[(a,Post)],
[(oggetto,Post),(a,Post)],[(oggetto,Post)]]).
…
Implemented Italian
Shallow Grammar
Constituent Categories
Part-of-Speech Tags
Chunk Types
Dependency Categories
Dependency Categories over Chunk Types
A survival user guide
Stand-alone version:
chaosparser -h
Client-server version:
chaosserver -h
chaosclient -h
XDG editor and GUI:
chaosgui
Using CHAOS in
applications
In Java applications:
ConfigurationHandler.initialize();
ConfigurationHandler.parseKBPropFile("LANGUAGE", "KB");
Parser ms = new Parser();
ms.initialize();
In non-Java applications:
Using one of the possible output formats:
XDG in XML
XDG in Prolog
XDG in QLF (in Prolog)
Perspective
Building a statistical Italian parser
Increasing the Italian annotated corpora
Reusing existing corpora
TUT
SITAL
VIT
Tools
XDG editor
DEMO!!!!
Syntactic annotation transformer