Building a Semantic Parser Overnight
Yushi Wang, Jonathan Berant, Percy Liang
Presented by T Raghuveer

Abstract
• A functionality-driven process for rapidly building a semantic parser in a new domain.
• The logical forms are meant to cover the desired set of compositional operators, and the canonical utterances are meant to capture the meaning of the logical forms (although clumsily).
• Crowdsourcing is then used to paraphrase these canonical utterances into natural utterances, and the resulting data is used to train the semantic parser.
• The paper studies compositionality and paraphrasing, and is evaluated on 7 new domains.

Logical form
• The logical form of a sentence is the form obtained by abstracting out the subject matter of its content terms, i.e., by regarding the content terms as mere placeholders or blanks on a form. In an ideal logical language, the logical form can be determined from syntax alone.
• Original argument:
– All humans are mortal.
– Socrates is human.
– Therefore, Socrates is mortal.
• Argument form:
– All H are M.
– S is H.
– Therefore, S is M.
• One sentence may have multiple logical forms, and one logical form may correspond to multiple sentences.

Seed Lexicon (L)
• A fixed database w is a set of triples (e1, p, e2), where e1 and e2 are entities (e.g., article1, 2015) and p is a property (e.g., publicationDate).
• The purpose of L is simply to connect each predicate with some representation in natural language.
• L contains entries of the form <t → s[p]>, where:
– t is a representation in natural language,
– p is a database property or entity,
– s is a syntactic category (e.g., RELNP, TYPENP).
• Example: <person → TYPENP[person]>, where "person" is the natural-language representation and TYPENP[person] is the logical representation.

Examples
• "person" has the syntactic category TYPENP.
• Entities such as "alice" and "1950" are ENTITYNP.
• Properties such as "publication date" are RELNP.
• Unary predicates are realized as verb phrases (VP); binaries as either relational noun phrases (RELNP) or generalized transitive verbs (VP/NP).

Domain-General Grammar and Canonical Utterances
• A small domain-general grammar combines lexicon entries into canonical utterances paired with logical forms, e.g., "article that has the largest publication date" and arg max(type.article, publicationDate) (see the sketch below).
• Lambda DCS is the logical language used.

Paraphrasing
• Synonym level: "block" ⇒ "brick".
• A RELNP can become a preposition: "meeting whose attendee is alice" ⇒ "meeting with alice".
• With a complex RELNP, the argument can become embedded: "player whose number of points is 15" ⇒ "player who scored 15 points".
• Superlative/comparative constructions can map to other RELNP-dependent forms: "article that has the largest publication date" ⇒ "newest article".

Some examples
• "housing unit whose housing type is apartment" ⇒ "apartment"
• "university of student alice whose field of study is music" ⇒ "At which university did Alice study music?"
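A minimal sketch of the generation step described above, not the authors' code: a tiny seed lexicon plus two illustrative grammar rules compositionally produce (canonical utterance, logical form) pairs. The rule set, the combinations generated, and the exact logical-form syntax are simplifying assumptions; the real system has a richer grammar and type-checks candidates against the database w.

```python
# Sketch: generate (canonical utterance, logical form) pairs from a seed
# lexicon and two illustrative domain-general grammar rules. Category names
# (TYPENP, RELNP, ENTITYNP) follow the paper; everything else is assumed.

# Seed lexicon L: natural-language form -> (syntactic category, predicate).
seed_lexicon = {
    "article":          ("TYPENP",   "type.article"),
    "publication date": ("RELNP",    "publicationDate"),
    "alice":            ("ENTITYNP", "alice"),
    "2015":             ("ENTITYNP", "2015"),
}

def generate_pairs():
    """Apply two toy grammar rules to every lexicon combination:
       NP -> TYPENP 'whose' RELNP 'is' ENTITYNP     (property filter)
       NP -> TYPENP 'that has the largest' RELNP    (superlative / argmax)
    The real system also type-checks logical forms against the database w,
    which is omitted here."""
    by_cat = lambda cat: [(t, p) for t, (c, p) in seed_lexicon.items() if c == cat]
    pairs = []
    for tn, tp in by_cat("TYPENP"):
        for rn, rp in by_cat("RELNP"):
            for en, ep in by_cat("ENTITYNP"):
                pairs.append((f"{tn} whose {rn} is {en}",
                              f"and({tp}, {rp}.{ep})"))
            pairs.append((f"{tn} that has the largest {rn}",
                          f"argmax({tp}, {rp})"))
    return pairs

for utterance, logical_form in generate_pairs():
    print(f"{utterance!r:50} => {logical_form}")
```

Each generated canonical utterance is clumsy but understandable, which is all the pipeline needs: crowd workers see only the utterance and paraphrase it into natural language.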
Assumptions
• Canonical compositionality: using a small grammar, all logical forms expressible in natural language can be realized compositionally based on the logical form.
• Sublexical compositionality: the hypothesis is that sublexical compositional units are small, so only a small number of canonical utterances need to be crowdsourced to learn most of the language variability in a given domain. Common multi-part concepts are compressed into single words or simpler constructions:
– "parent of alice whose gender is female" ⇒ "mother of alice"
– "person that is author of paper whose author is X" ⇒ "co-author of X"
– "person whose birthdate is birthdate of X" ⇒ "person born on the same day as X"
– "meeting whose start time is 3pm and whose end time is 5pm" ⇒ "meetings between 3pm and 5pm"
– "that allows cats and that allows dogs" ⇒ "that allows pets"
• Bounded non-compositionality: natural utterances expressing complex logical forms are compositional with respect to fragments of bounded size:
– "NP[number of NP[article CP[whose publication date is larger than NP[publication date of article 1]]]]" ⇒ "How many articles were published after article 1?"

Crowdsourcing
• Amazon Mechanical Turk (AMT) workers paraphrase the canonical utterances.
• Paraphrases that share the same canonical utterance are collapsed, while identical paraphrases that have distinct canonical utterances are deleted.
• 26,098 examples were collected over all domains.
• 20 examples in each domain were manually analysed; 17% of the utterances were found to be inaccurate.

Domains
• Seven domains, each with example natural utterances x and canonical utterances c.

Model and Learning
• A log-linear distribution is placed over candidate pairs (z, c) ∈ GEN(G ∪ Lx), where G is the domain-general grammar and Lx is the lexicon triggered by the input utterance x (a toy sketch of this model appears at the end of these notes).
• Example: for "article published in 2015 that cites article 1", Lx (also written T(x)) contains 2015 → NP[2015] and article 1 → NP[article1].
• Features: basic + lexical.

Accuracies and Analysis
• Tested on 7 domains.
• Data: facts generated using entities and properties; 80% of examples used for training, 20% for testing.
• Accuracy is the fraction of examples that yield the correct denotation.

Error Analysis
• 70% of errors are due to the paraphrasing model, e.g., "restaurants that have waiters and you can sit outside" parsed as "restaurant that has waiter service and that takes reservations".
• 12.5% are reordering issues, e.g., "What venue has fewer than two articles" parsed as "article that has less than two venue".
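To make the Model and Learning section concrete, here is a minimal sketch, assuming a toy single-feature map and hand-set weights; the paper's actual features (basic + lexical) and training procedure are much richer.

```python
# Sketch of a log-linear distribution over candidate pairs (z, c) in
# GEN(G ∪ Lx): p(z, c | x) ∝ exp(theta · phi(x, c, z)). The word-overlap
# feature and the hand-set weight are illustrative stand-ins for the
# paper's basic + lexical features and learned weights.
import math

def phi(x, c, z):
    """Toy feature map: unigram overlap between the input utterance x
    and the canonical utterance c."""
    overlap = len(set(x.lower().split()) & set(c.lower().split()))
    return {"word_overlap": float(overlap)}

def log_linear(x, candidates, theta):
    """Return [(z, c, p(z, c | x))] via a softmax over feature scores."""
    scores = [sum(theta.get(f, 0.0) * v for f, v in phi(x, c, z).items())
              for z, c in candidates]
    norm = sum(math.exp(s) for s in scores)
    return [(z, c, math.exp(s) / norm)
            for (z, c), s in zip(candidates, scores)]

# Usage: rank two candidate (logical form, canonical utterance) pairs
# for one input utterance.
x = "articles published in 2015"
candidates = [
    ("and(type.article, publicationDate.2015)",
     "article whose publication date is 2015"),
    ("and(type.article, author.alice)",
     "article whose author is alice"),
]
theta = {"word_overlap": 1.0}
for z, c, p in sorted(log_linear(x, candidates, theta), key=lambda t: -t[2]):
    print(f"p={p:.3f}  {c}  ->  {z}")
```

Roughly, theta is learned from the crowdsourced paraphrase pairs, and at test time the parser returns the logical form z of the highest-scoring candidate pair.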