Building a Semantic Parser Overnight
Yushi Wang
Jonathan Berant
Percy Liang
T Raghuveer
Abstract
• A functionality-driven process for rapidly building a semantic parser in a new domain.
• The logical forms are meant to cover the desired set of compositional operators, and the canonical utterances are meant to capture the meaning of the logical forms (although clumsily).
• Crowdsourcing is then used to paraphrase these canonical utterances into natural utterances; the resulting data is used to train the semantic parser.
• Factors the problem into compositionality (captured by the grammar) and lexical variability (captured by paraphrases).
• Tested on 7 new domains.
Logical form
• The logical form of a sentence is the form obtained by abstracting out the subject matter of its content terms, or by regarding the content terms as mere placeholders or blanks on a form. In an ideal logical language, the logical form can be determined from syntax alone.
• Original argument:
– All humans are mortal.
– Socrates is human.
– Therefore, Socrates is mortal.
• Argument form (rendered in first-order notation below):
– All H are M.
– S is H.
– Therefore, S is M.
• One sentence may have multiple logical forms, and one logical form may correspond to multiple sentences.
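The syllogism above in standard first-order notation (a conventional rendering, not from the slides):
∀x (H(x) → M(x)), H(s) ⊢ M(s)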
Seed Lexicon (L)
• A fixed database w is a set of triples (e1, p, e2), where e1 and e2 are entities (e.g., article1, 2015) and p is a property (e.g., publicationDate).
• The purpose of L is simply to connect each predicate with some representation in natural language.
• L contains entries of the form <t → s[p]>, where:
– t is a natural language phrase (the representation)
– p is a database property/entity
– s is a syntactic category (e.g., RELNP, TYPENP)
• Example: <person → TYPENP[person]>, where “person” is the natural language representation and TYPENP[person] is the logical representation.
Examples
• “person” has the syntactic category TYPENP.
• Entities such as “alice” and “1950” are ENTITYNP.
• Properties such as “publication date” are RELNP.
• Unary predicates are realized as verb phrases (VP).
• Binaries are realized as either relational noun phrases (RELNP) or generalized transitive verbs (VP/NP).
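A minimal sketch of how such a seed lexicon might be stored, assuming a plain Python dictionary from phrases to (syntactic category, predicate) pairs; the layout and the lookup helper are illustrative, not the paper's data structures.

# Illustrative seed lexicon L: each phrase t maps to a pair
# (syntactic category s, database predicate p), i.e., <t -> s[p]>.
seed_lexicon = {
    "person":           ("TYPENP",   "person"),
    "article":          ("TYPENP",   "article"),
    "alice":            ("ENTITYNP", "alice"),
    "1950":             ("ENTITYNP", "1950"),
    "publication date": ("RELNP",    "publicationDate"),
    "cites":            ("VP/NP",    "cites"),   # generalized transitive verb
}

def entry(phrase):
    """Render the lexicon entry <t -> s[p]> for a phrase, if present."""
    if phrase not in seed_lexicon:
        return None
    category, predicate = seed_lexicon[phrase]
    return f"<{phrase} -> {category}[{predicate}]>"

print(entry("person"))            # <person -> TYPENP[person]>
print(entry("publication date"))  # <publication date -> RELNP[publicationDate]>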
Domain-General Grammar
Canonical Utterances
• The grammar pairs each canonical utterance with its logical form, e.g., “article that has the largest publication date” with argmax(type.article, publicationDate).
• Lambda DCS is the logical language used.
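A toy sketch of how the domain-general grammar pairs utterances with logical forms, building both in lockstep; the rule functions and string-based logical forms here are illustrative assumptions, not the paper's grammar.

# Each grammar rule composes a (canonical utterance, logical form) pair.
def typenp(noun, type_pred):
    return (noun, f"type.{type_pred}")

def superlative(np, relnp_phrase, relnp_pred):
    utterance, lf = np
    # Compose "X that has the largest <relnp>" with an argmax logical form.
    return (f"{utterance} that has the largest {relnp_phrase}",
            f"argmax({lf}, {relnp_pred})")

article = typenp("article", "article")
print(superlative(article, "publication date", "publicationDate"))
# ('article that has the largest publication date',
#  'argmax(type.article, publicationDate)')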
Paraphrasing
• Synonym level: “block” ⇒ “brick”
• RELNP rendered as a preposition: “meeting whose attendee is alice ⇒ meeting with alice”
• Complex RELNP, where the argument becomes embedded: “player whose number of points is 15 ⇒ player who scored 15 points”
• Superlative/comparative constructions rendered with RELNP-dependent words: “article that has the largest publication date ⇒ newest article”
Some examples
• “housing unit whose housing type is apartment ⇒ apartment”
• “university of student alice whose field of study is music” ⇒ “At which university did Alice study music?”
Assumptions
• Canonical compositionality:
Using a small grammar, all logical forms expressible in natural language can be realized compositionally based on the logical form.
• Sublexical compositionality:
– The hypothesis is that the sublexical compositional units are small, so only a small number of canonical utterances need to be crowdsourced to learn most of the language variability in the given domain.
– “parent of alice whose gender is female ⇒ mother of alice”
– “person that is author of paper whose author is X ⇒ co-author of X”
Bounded non-compositionality
Natural utterances expressing complex logical forms are compositional with respect to fragments of bounded size.
– “NP[number of NP[article CP[whose publication date is larger than NP[publication date of article 1]]]]” ⇒ “How many articles were published after article 1?”
Crowdsourcing
• Amazon Mechanical Turk (AMT) workers paraphrase the canonical utterances.
• Paraphrases that share the same canonical utterance are collapsed, while identical paraphrases that have distinct canonical utterances are deleted (see the sketch below).
• 26,098 examples were collected over all domains.
• 20 examples in each domain were manually analysed; 17% of the utterances were found to be inaccurate.
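A minimal sketch of that collapse/delete step, assuming the collected data arrives as (paraphrase, canonical utterance) pairs; the exact bookkeeping in the paper may differ.

from collections import defaultdict

# Collapse duplicate paraphrases of one canonical utterance and delete
# paraphrases that were given for distinct canonical utterances.
def clean(pairs):
    canonicals = defaultdict(set)
    for paraphrase, canonical in pairs:
        canonicals[paraphrase].add(canonical)
    # Keep a paraphrase only if it maps to exactly one canonical utterance.
    return [(p, cs.pop()) for p, cs in canonicals.items() if len(cs) == 1]

pairs = [
    ("meeting with alice", "meeting whose attendee is alice"),
    ("meeting with alice", "meeting whose attendee is alice"),   # collapsed
    ("alice's meeting",    "meeting whose attendee is alice"),
    ("alice's meeting",    "meeting whose organizer is alice"),  # deleted: ambiguous
]
print(clean(pairs))
# [('meeting with alice', 'meeting whose attendee is alice')]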
Domains (example natural utterance x and canonical utterance c per domain)
Model and Learning
• A log-linear distribution over candidate pairs (z, c) ∈ GEN(G ∪ Lx), where z is a logical form and c its canonical utterance (a sketch follows below).
• G: the domain-general grammar.
• Lx (or T(x)): rules triggered by matches in the input utterance, e.g., for “article published in 2015 that cites article 1”:
– 2015 → NP[2015]
– article 1 → NP[article1]
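A minimal sketch of such a log-linear distribution in Python, assuming a feature function phi(x, c, z) that returns a dictionary of feature values; the function names, the toy feature, and the candidate pairs are illustrative assumptions, not the paper's implementation.

import math

# Sketch of the log-linear model p(z, c | x) over candidates from
# GEN(G ∪ Lx). theta is a weight vector stored as a dict (an assumption).
def score(theta, features):
    return sum(theta.get(f, 0.0) * v for f, v in features.items())

def log_linear_probs(theta, x, candidates, phi):
    """candidates: list of (z, c) pairs; returns p(z, c | x) per pair."""
    scores = [score(theta, phi(x, c, z)) for z, c in candidates]
    m = max(scores)                        # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy usage with a single hypothetical word-overlap feature between x and c.
phi = lambda x, c, z: {"overlap": len(set(x.split()) & set(c.split()))}
theta = {"overlap": 1.0}
candidates = [
    ("argmax(type.article, publicationDate)",
     "article that has the largest publication date"),
    ("type.article", "article"),
]
print(log_linear_probs(theta, "newest article", candidates, phi))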
Features – Basic + Lexical
Accuracies
Analysis
• Tested on 7 domains.
• Data: facts generated using entities and properties.
• Training: 80%; test: 20%.
• Accuracy: the fraction of examples that yield the correct denotation (see the sketch below).
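A minimal sketch of that accuracy metric, assuming a predict() that maps an utterance to a logical form and an execute() that evaluates a logical form against the database; both are hypothetical names, not the paper's code.

def denotation_accuracy(examples, predict, execute):
    """examples: list of (utterance, gold_denotation) pairs."""
    correct = 0
    for utterance, gold in examples:
        z = predict(utterance)      # predicted logical form
        if execute(z) == gold:      # compare denotations, not logical forms
            correct += 1
    return correct / len(examples)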
Error Analysis
• 70% due to the paraphrasing model:
“restaurants that have waiters and you can sit outside” was parsed as “restaurant that has waiter service and that takes reservations”
• 12.5% due to reordering issues:
“What venue has fewer than two articles” was parsed as “article that has less than two venue”
Thank you
Sublexical compositionality
• The idea is that common, multi-part concepts are compressed to single words or simpler constructions.
• “person that is author of paper whose author is X ⇒ co-author of X”
• “person whose birthdate is birthdate of X ⇒ person born on the same day as X”
• “meeting whose start time is 3pm and whose end time is 5pm ⇒ meetings between 3pm and 5pm”
• “that allows cats and that allows dogs ⇒ that allows pets”