AI Magazine, Spring 2016

2016 AAAI Fall Symposium Series
November 17–19, 2016
Arlington, Virginia
ISSN 0738-4602
Editorial Introduction: Beyond the Turing Test
Gary Marcus, Francesca Rossi, Manuela Veloso
My Computer Is an Honor Student — But How Intelligent Is It? Standardized Tests as a Measure of AI
Peter Clark, Oren Etzioni
How to Write Science Questions That Are Easy for People and Hard for Computers
Ernest Davis
Toward a Comprehension Challenge, Using Crowdsourcing as a Tool
Praveen Paritosh, Gary Marcus
The Social-Emotional Turing Challenge
William Jarrold, Peter Z. Yeh
Artificial Intelligence to Win the Nobel Prize and Beyond: Creating the Engine for Scientific Discovery
Hiroaki Kitano
Planning, Executing, and Evaluating the Winograd Schema Challenge
Leora Morgenstern, Ernest Davis, Charles L. Ortiz, Jr.
Why We Need a Physically Embodied Turing Test and What It Might Look Like
Charles L. Ortiz, Jr.
Measuring Machine Intelligence Through Visual Question Answering
C. Lawrence Zitnick, Aishwarya Agrawal, Stanislaw Antol, Margaret Mitchell, Dhruv Batra, Devi Parikh
Turing++ Questions: A Test for the Science of (Human) Intelligence
Tomaso Poggio, Ethan Meyers
I-athlon: Toward a Multidimensional Turing Test
Sam S. Adams, Guruduth Banavar, Murray Campbell
Software Social Organisms: Implications for Measuring AI Progress
Kenneth D. Forbus
Principles for Designing an AI Competition, or Why the Turing Test Fails as an Inducement Prize
Stuart M. Shieber
WWTS (What Would Turing Say?)
Douglas B. Lenat
102 The First International Competition on Computational Models of Argumentation
Matthias Thimm, Serena Villata, Federico Cerutti, Nir Oren, Hannes Strass, Mauro Vallati
105 The Ninth International Web Rule Symposium
Adrian Paschke
107 Fifteenth International Conference on Artificial Intelligence and Law
Katie Atkinson, Jack G. Conrad, Anne Gardner, Ted Sichelman
109 AAAI News
120 AAAI Conferences Calendar
Cover: Cognitive Orthoses, by Giacomo Marchesi, Brooklyn, New York. The guest editors for the 2016 special issue on Beyond the Turing Test are Gary Marcus, Francesca Rossi, and Manuela Veloso.
An Official Publication of the Association for the Advancement of Artificial Intelligence
ISSN 0738-4602 (print) ISSN 2371-9621 (online)
Submissions information is available on the AI Magazine website. Authors whose work is accepted for publication will be required to revise their work to conform reasonably to AI Magazine styles. Author's guidelines are also available on the website. If an article is accepted for publication, a new electronic copy will also be required. Although AI Magazine generally grants reasonable deference to an author's work, the Magazine retains the right to determine the final published form of every article.
Calendar items should be posted electronically (at least two months prior to the event or deadline), using the online calendar insertion form. News items should be sent to the News Editor, AI Magazine, 2275 East Bayshore Road, Suite 160, Palo Alto, CA 94303, (650) 328-3123. Please do not send news releases via either e-mail or fax, and do not send news releases to any of the other editors.
AI Magazine, 2275 East Bayshore Road, Suite 160, Palo Alto, CA 94303, (650) 328-3123; Fax (650) 321-4457. Web-based job postings can be made using the online form.
Microfilm, Back, or Replacement Copies
Replacement copies (for the current issue only) are available upon written request and a check for $10.00. Back issues are also available (cost may differ). Send replacement or back order requests to AAAI. Microform copies are available from ProQuest Information and Learning, 300 North Zeeb Road, Ann Arbor, MI 48106. Telephone (800) 521-3044 or (734) 761-4700.
Address Change
Please notify AAAI eight weeks in advance of a change of address. Send changes electronically via MemberClicks or by e-mailing [email protected].
AI Magazine (ISSN 0738-4602) is published quarterly in March, June, September, and December by the Association for the Advancement of Artificial Intelligence (AAAI), 2275 East Bayshore Road, Suite 160, Palo Alto, CA 94303, telephone (650) 328-3123. AI Magazine is a direct benefit of membership in AAAI. Membership dues are $145.00 individual, $75.00 student, and $285.00 academic / corporate libraries. A subscription price of $50.00 per year is included in dues; the balance of your dues may be tax deductible as a charitable contribution; consult your tax advisor for details. Inquiries regarding membership in the Association for the Advancement of Artificial Intelligence should be sent to AAAI at the above address.
PERIODICALS POSTAGE PAID at Palo Alto, CA, and additional mailing offices. Postmaster: Change Service Requested. Send address changes to AI Magazine, 2275 East Bayshore Road, Suite 160, Palo Alto, CA 94303.
Copying Articles for Personal Use
Copyright © 2016 by the Association for the Advancement of Artificial Intelligence. All rights reserved. No part of this publication may be reproduced in whole or in part without prior written permission. Unless otherwise stated, the views expressed in published material are those of the authors and do not necessarily reflect the policies or opinions of AI Magazine, its editors and staff, or the Association for the Advancement of Artificial Intelligence.
Authorization to photocopy items for internal or personal use, or the internal or personal use of specific clients, or for educational classroom use, is granted by AAAI, provided that the appropriate fee is paid directly to the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923. Telephone: (978) 750-8400. Fax: (978) 750-4470. E-mail: [email protected]. This consent does not extend to other kinds of copying, such as for general distribution, resale, advertising, Internet or internal electronic distribution, or promotion purposes, or for creating new collective works. Please contact AAAI for such permission.
AI Magazine and AAAI Press
Editor-in-Chief
David Leake, Indiana University
Editor-in-Chief Elect
Ashok Goel, Georgia Institute of Technology
Competition Reports Coeditors
Sven Koenig, University of
Southern California
Robert Morris, NASA Ames
Reports Editor
Robert Morris, NASA Ames
Worldwide AI Column Editor
Matthijs Spaan, Instituto Superior Técnico
AAAI Press Editor
Anthony Cohn, University of Leeds
Managing Editor
David Hamilton, The Live Oak Press, LLC.
Editorial Board
John Breslin, National University of Ireland
Gerhard Brewka, Leipzig University
Vinay K. Chaudhri, SRI International
Marie desJardins, University of Maryland,
Baltimore County
Kenneth Forbus, Northwestern University
Kenneth Ford, Institute for Human and
Machine Cognition
Ashok Goel, Georgia Institute of Technology
Sven Koenig, University of Southern California
Ramon Lopez de Mantaras, IIIA, Spanish Scientific Research Council
Sheila McIlraith, University of Toronto
Robert Morris, NASA Ames
Hector Munoz-Avila, Lehigh University
Pearl Pu, EPFL
Sandip Sen, University of Tulsa
Kirsten Brent Venable, Tulane University and
Chris Welty, IBM Research
Holly Yanco, University of Massachusetts,
Qiang Yang, Hong Kong University of Science
and Technology
Feng Zhao, Microsoft Research
AAAI Officials
Thomas G. Dietterich, Oregon State University
Subbarao Kambhampati, Arizona State University
Manuela Veloso, Carnegie Mellon University
Ted Senator
Standing Committees
Conference Chair
Shlomo Zilberstein, University of Massachusetts, Amherst
Awards, Fellows, and Nominating Chair
Manuela Veloso, Carnegie Mellon University
Finance Chair
Ted Senator
Conference Outreach Chair
Stephen Smith, Carnegie Mellon University
International Committee Chair
Sven Koenig, University of Southern California, USA
Membership Chair
Sven Koenig, University of Southern California
Publications Chair
David Leake, Indiana University
Symposium Chair and Cochair
Gita Sukthankar, University of Central Florida
Christopher Geib, Drexel University
Councilors (through 2016)
Toby Walsh, NICTA, The Australian National University
Sylvie Thiebaux, NICTA, The Australian National University, Australia
Francesca Rossi, University of Padova, Italy
Brian Williams, Massachusetts Institute of Technology, USA
Councilors (through 2017)
Sonia Chernova, Worcester Polytechnic Institute, USA
Vincent Conitzer, Duke University, USA
Boi Faltings, École polytechnique fédérale de Lausanne, Switzerland
Stephen Smith, Carnegie Mellon University, USA
Councilors (through 2018)
Charles Isbell, Georgia Institute of Technology, USA
Diane Litman, University of Pittsburgh, USA
Jennifer Neville, Purdue University, USA
Kiri L. Wagstaff, Jet Propulsion Laboratory, USA
AAAI Staff
Executive Director
Carol Hamilton
Diane Mela
Conference Manager
Keri Harvey
Membership Coordinator
Alanna Spencer
AI Journal
National Science Foundation
Microsoft Research
IBM Watson Group
Google, Inc.
Disney Research
Yahoo Labs
BigML
Nuance Communications
Oxford Internet Institute
Qatar Computing Research Institute
US Department of Energy
University of Michigan
Adventium Labs
CRA Computing Community Consortium
David Smith
Editorial Introduction to the Special Articles in the Spring Issue
Beyond the Turing Test
Gary Marcus, Francesca Rossi, Manuela Veloso
The articles in this special issue of AI
Magazine include those that propose
specific tests and those that look at the
challenges inherent in building robust,
valid, and reliable tests for advancing
the state of the art in AI.
Alan Turing's renowned test of intelligence, commonly
known as the Turing test, is an inescapable signpost in
AI. To people outside the field, the test — which hinges
on the ability of machines to fool people into thinking that
they (the machines) are people — is practically synonymous
with the quest to create machine intelligence. Within the
field, the test is widely recognized as a pioneering landmark,
but also is now seen as a distraction: designed over half a century ago, it is too crude to really measure intelligence. Intelligence is, after all, a multidimensional variable, and no one
test could ever definitively measure it.
Moreover, the original test, at least in its standard implementations, has turned out to be highly gameable, arguably
an exercise in deception rather than a true measure of anything especially correlated with intelligence. The much ballyhooed 2014 Turing test winner Eugene Goostman, for
instance, pretends to be a thirteen-year-old foreigner and
proceeds mainly by ducking questions and returning canned
one-liners; it cannot see, it cannot think, and it is certainly a
long way from genuine artificial general intelligence.
Our hope is to see a new suite of tests, part of what we have
dubbed the Turing Championships, each designed in
some way to move the field forward, toward previously unconquered territory. Most of the articles in
this special issue stem from our first workshop
toward creating such an event, held during the AAAI
Conference on Artificial Intelligence in January 2015
in Austin, Texas.
The articles in this special issue can be broadly
divided into those that propose specific tests, and
those that look at the challenges inherent in building
robust, valid, and reliable tests for advancing the
state of the art in artificial intelligence.
In the article My Computer is an Honor Student —
But How Intelligent Is It? Standardized Tests as a
Measure of AI, Peter Clark and Oren Etzioni argue
that standardized tests developed for children offer
one starting point for testing machine intelligence.
Ernest Davis in his article How to Write Science
Questions That Are Easy for People and Hard for
Computers, proposes an alternative test called
SQUABU (science questions appraising basic understanding) that aims to ask questions that are easy for
people but hard for computers.
In Toward a Comprehension Challenge, Using
Crowdsourcing as a Tool, Praveen Paritosh and Gary
Marcus propose a crowdsourced comprehension
challenge, in which machines will be asked to answer
open-ended questions about movies, YouTube
videos, stories, and podcasts.
The article The Social-Emotional Turing Challenge,
by William Jarrold and Peter Z. Yeh, considers the
importance of social-emotional intelligence and proposes a methodology for designing tests that assess
the ability of machines to infer things like motivations and desires (often referred to in the psychological literature as theory of mind).
In Artificial Intelligence to Win the Nobel Prize
and Beyond: Creating the Engine for Scientific Discovery, Hiroaki Kitano urges the field to build AI systems that can make significant, even Nobel-worthy,
scientific discoveries.
In Planning, Executing, and Evaluating the Winograd Schema Challenge, Leora Morgenstern, Ernest
Davis, and Charles L. Ortiz, Jr., describe the Winograd
Schema Challenge, a test of commonsense reasoning
that is set in a linguistic context.
In the article Why We Need a Physically Embodied
Turing Test and What It Might Look Like, Charles L.
Ortiz, Jr., argues for tests, such as a construction challenge (build something given a bag of parts), that
focus on four aspects of intelligence: language, perception, reasoning, and action.
Measuring Machine Intelligence Through Visual
Question Answering, by C. Lawrence Zitnick, Aishwarya Agrawal, Stanislaw Antol, Margaret Mitchell,
Dhruv Batra, and Devi Parikh, argues for using visual question answering as an essential part of a multimodal challenge to measure intelligence.
Tomaso Poggio and Ethan Meyers in Turing++
Questions: A Test for the Science of (Human) Intelligence, which also focuses on visual questions, propose to develop tests where competitors must not
only match human behavior but also do so in a way
that is consistent with human physiology, in this
way aiming to use a successor to the Turing test to
bridge between the fields of neuroscience, psychology, and artificial intelligence.
The article I-athlon: Toward a Multidimensional
Turing Test, by Sam Adams, Guruduth Banavar, and
Murray Campbell, proposes a methodology for
designing a test that consists of a series of varied
events, in order to test several dimensions of intelligence. Kenneth Forbus also argues for testing multiple dimensions of intelligence in his article Software
Social Organisms: Implications for Measuring AI Progress.
In the article Principles for Designing an AI Competition, or Why the Turing Test Fails as an Inducement Prize, Stuart Shieber discusses several desirable
features for an inducement prize contest, contrasting
them with the current Turing test.
Douglas Lenat in WWTS (What Would Turing
Say?) takes a step back and focuses instead on synergy between human and machine, and the development of conjoint superhuman intelligence.
While the articles included in this issue propose
and discuss several kinds of tests, and we hope to see
many of them being deployed very soon, they should
be considered merely as a starting point for the AI
community. Challenge problems, well chosen, can
drive not only media interest in the field but also scientific
progress. We hope therefore that many AI researchers
participate actively in formalizing and refining the
initial proposals described in these articles and discussed at our initial workshops.
In the meantime, we have created a website1 with
pointers to presentations, discussions, and most
importantly ways for interested researchers to get
involved, contribute, and participate in these successors to the Turing test.
Gary Marcus is founder and chief executive officer of Geometric Intelligence and a professor of psychology and neural science at New York University.
Francesca Rossi is a research scientist at the IBM T. J. Watson Research Center (on leave from the University of Padova).
Manuela Veloso is the Herbert A. Simon University Professor in the Computer Science Department at Carnegie Mellon University.
My Computer Is an
Honor Student —
But How Intelligent Is It?
Standardized Tests as
a Measure of AI
Peter Clark, Oren Etzioni
Given the well-known limitations of
the Turing test, there is a need for objective tests to both focus attention on, and
measure progress toward, the goals of
AI. In this paper we argue that machine
performance on standardized tests
should be a key component of any new
measure of AI, because attaining a high
level of performance requires solving significant AI problems involving language
understanding and world modeling —
critical skills for any machine that lays
claim to intelligence. In addition, standardized tests have all the basic requirements of a practical test: they are accessible, easily comprehensible, clearly
measurable, and offer a graduated progression from simple tasks to those
requiring deep understanding of the
world. Here we propose this task as a
challenge problem for the community,
summarize our state-of-the-art results
on math and science tests, and provide
supporting data sets.
Alan Turing (Turing 1950) approached the abstract question "Can machines think?" by replacing it with another,
namely, can a machine pass the imitation game (the Turing test)? In the years since, this test has been criticized as
being a poor replacement for the original enquiry (for example, Hayes and Ford [1995]), which raises the question: what
would a better replacement be? In this article, we argue that
standardized tests are an effective and practical assessment of
many aspects of machine intelligence, and should be part of
any comprehensive measure of AI progress.
While a crisp definition of machine intelligence remains
elusive, we can enumerate some general properties we might
expect of an intelligent machine. The list is potentially long
(for example, Legg and Hutter [2007]), but should at least
include the ability to (1) answer a wide variety of questions,
(2) answer complex questions, (3) demonstrate commonsense and world knowledge, and (4) acquire new knowledge
scalably. In addition, a suitable test should be clearly measurable, graduated (have a variety of levels of difficulty), not
gameable, ambitious but realistic, and motivating.
There are many other requirements we might add (for
example, capabilities in robotics, vision, dialog), and thus
any comprehensive measure of AI is likely to require a battery
of different tests. However, standardized tests meet a surprising number of requirements, including the four listed, and
thus should be a key component of a future battery of tests.
As we will show, the tests require answering a wide variety of
questions, including those requiring commonsense and
world knowledge. In addition, they meet all the practical
requirements, a huge advantage for any component of a
future test of AI.
Science and Math as Challenge Areas
Standardized tests have been proposed as challenge
problems for AI, for example, Bringsjord and Schimanski (2003), Bringsjord (2011), Bayer et al. (2005),
Fujita et al. (2014), as they appear to require significant advances in AI technology while also being
accessible, measurable, understandable, and motivating. They also enable us easily to compare AI performance with that of humans.
In our own work, we have chosen to focus on elementary and high school tests (for 6–18 year olds)
because the basic language-processing requirements
are surmountable, while the questions still present
formidable challenges for solving. Similarly, we are
focusing on science and math tests, and have recently achieved some baseline results on these tasks (Seo
et al. 2015, Koncel-Kedziorski et al. 2015, Khot et al.
2015, Li and Clark 2015, Clark et al. 2016). Other
groups have attempted higher level exams, such as
the Tokyo entrance exam (Strickland 2013), and
more specialized psychometric tests such as SAT
word analogies (Turney 2006), GRE word antonyms
(Mohammad et al. 2013), and TOEFL synonyms
(Landauer and Dumais 1997).
We also stipulate that the exams are taken exactly
as written (no reformulation or rewording), so that
the task is clear, standard, and cannot be manipulated or gamed. Typical questions from the New York
Regents 4th grade (9–10 year olds) science exams,
SAT math questions, and more are shown in the next
section. We have also made a larger collection of
challenge questions, drawn from these and other
exams, available on our website.1
We propose to leverage standardized tests, rather
than synthetic tests such as the Winograd schema
(Levesque, Davis, and Morgenstern 2012) or MCTest
(Richardson, Burges, and Renshaw 2013), because
they provide a natural sample of problems and more
directly suggest real-world applications in the areas
of education and science.
Exams and Intelligence
One pertinent question concerning the suitability of
exams is whether they are gameable, that is, answerable without requiring any real understanding of the
world. For example, questions might be answered
with a simple search-engine query or through simple
corpus statistics, without requiring any understanding of the underlying material. Our experience is that
while some questions are answerable in this way,
many are not. There is a continuum from (computationally) easy to difficult questions, where more difficult questions require increasingly sophisticated
internal models of the world. This continuum is
highly desirable, as it means that there is a low barrier to entry, allowing researchers to make initial
inroads into the task, while significant AI challenges
need to be solved to do well in the exam. The diversity of questions also ensures a variety of skills are
tested for, and guards against finding a simple shortcut that may answer them all without requiring any
depth of understanding. (This contrasts with the
more homogeneous Winograd schema challenge
[Levesque, Davis, and Morgenstern 2012], where the
highly stylized question format risks producing specialized solution methods that have little generality).
We illustrate these properties throughout this article.
In addition, 45–65 percent of the Regents science
exam questions (depending on the exam), and virtually all SAT geometry questions, contain diagrams that
are necessary for solving the problem. Similarly, the
answers to algebraic word problems are typically
numbers (see, for example, table 1). In all these cases,
a Google search or simple corpus statistics will not
answer these questions with any degree of reliability.
A second important question, raised by Davis in
his critique of standardized tests for measuring AI
(Davis 2014), is whether the tests are measuring the
right thing. He notes that standardized tests are
authored for people, not machines, and thus will be
testing for skills that people find difficult to master,
skipping over things that are easy for people but challenging for machines. In particular, Davis conjectures
that “standardized tests do not test knowledge that is
obvious for people; none of this knowledge can be
assumed in AI systems.” However, our experience is
generally contrary to this conjecture: although questions do not typically test basic world knowledge
directly, basic commonsense knowledge is frequently required to answer them. We will illustrate this in
detail throughout this article.
The New York Regents
Science Exams
One of the most interesting and appealing aspects of
elementary science exams is their graduated and
multifaceted nature: Different questions explore different types of knowledge and vary substantially in
difficulty (for a computer), from a simple lookup to
those requiring extensive understanding of the
world. This allows incremental progress while still
demanding significant advances for the most difficult
questions. Information retrieval and bag-of-words
methods work well for a subset of questions but eventually reach a limit, leaving a collection of questions
requiring deeper understanding. We illustrate some
of this variety here, using (mainly) the multiple
choice part of the New York Regents 4th Grade Science exams2 (New York State Education Department
2014). For a more detailed analysis, see Clark, Harrison, and Balasubramanian (2013). A similar analysis
can be made of exams at other grade levels and in
other subjects.
Basic Questions
Part of the New York Regents exam tests for relatively straightforward knowledge, such as taxonomic
(“isa”) knowledge, definitional (terminological)
knowledge, and basic facts about the world. Example
questions include the following.
(1) Which object is the best conductor of electricity?
(A) a wax crayon (B) a plastic spoon (C) a rubber eraser (D) an iron nail
(2) The movement of soil by wind or water is called (A)
condensation (B) evaporation (C) erosion (D) friction
(3) Which part of a plant produces the seeds? (A)
flower (B) leaves (C) stem (D) roots
This style of question is amenable to solution by
information-retrieval methods and/or use of existing
ontologies or fact databases, coupled with linguistic processing.
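As a concrete illustration of that baseline family, here is a minimal sketch of a co-occurrence answer scorer; the tokenizer, scoring function, and corpus format are our own invented simplifications, not the retrieval machinery of any system discussed here.

    from typing import List

    def tokenize(text: str) -> List[str]:
        # Lowercase and strip simple punctuation.
        return [w.strip("?.,()").lower() for w in text.split()]

    def score_option(question: str, option: str, corpus: List[str]) -> int:
        # Crude association score: count corpus sentences in which at least
        # one question word and at least one option word co-occur.
        q, o = set(tokenize(question)), set(tokenize(option))
        return sum(1 for sent in corpus
                   if q & set(tokenize(sent)) and o & set(tokenize(sent)))

    def answer(question: str, options: List[str], corpus: List[str]) -> str:
        # Pick the option most strongly associated with the question text.
        return max(options, key=lambda opt: score_option(question, opt, corpus))

A scorer of this kind can do well on taxonomic and definitional questions like (1)–(3) but, as the following subsections show, it degrades quickly once inference is required.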
Simple Inference
Many questions are unlikely to have answers explicitly written down anywhere, from questions requiring a relatively simple leap from what might be
already known to questions requiring complex modeling and understanding. An example requiring (simple) inference follows:
(4) Which example describes an organism taking in
nutrients? (A) dog burying a bone (B) A girl eating an
apple (C) An insect crawling on a leaf (D) A boy planting tomatoes in the garden
Answering this question requires knowledge that eating involves taking in nutrients, and that an apple
contains nutrients.
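The two-fact chain can be spelled out in a toy lookup (ours, not Aristo's machinery); the facts and predicates below are invented for this one example.

    # Toy fact base for question (4); a real system must acquire such
    # knowledge at scale rather than hand-code it.
    FACTS = {
        ("eating", "involves", "taking in nutrients"),
        ("apple", "contains", "nutrients"),
        ("burying", "involves", "hiding"),
        ("crawling", "involves", "moving"),
        ("planting", "involves", "placing in soil"),
    }

    def takes_in_nutrients(action: str, obj: str) -> bool:
        # Chain two facts: the action must involve taking in nutrients,
        # and the object acted on must contain nutrients.
        return ((action, "involves", "taking in nutrients") in FACTS
                and (obj, "contains", "nutrients") in FACTS)

    options = {"A": ("burying", "bone"), "B": ("eating", "apple"),
               "C": ("crawling", "leaf"), "D": ("planting", "tomatoes")}
    print([k for k, (act, obj) in options.items()
           if takes_in_nutrients(act, obj)])  # -> ['B']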
More Complex World Knowledge
Many questions appear to require both richer knowledge of the world, and appropriate linguistic knowledge to apply it to a question. As an example, consider the following question:
(5) Fourth graders are planning a roller-skate race.
Which surface would be the best for this race? (A)
gravel (B) sand (C) blacktop (D) grass
Strong cooccurrences between sand and surface,
grass and race, and gravel and graders (road-smoothing machines), throw off information-retrieval-based
guesses. Rather, a more reliable answer requires
knowing that a roller-skate race involves roller skat-
ing, that roller skating is on a surface, that skating is
best on a smooth surface, and that blacktop is
smooth. Obtaining these fragments of world knowledge and integrating them correctly is a substantial challenge.
As a second example, consider the following question:
(6) A student puts two identical plants in the same
type and amount of soil. She gives them the same
amount of water. She puts one of these plants near a
sunny window and the other in a dark room. This
experiment tests how the plants respond to (A) light
(B) air (C) water (D) soil
Again, information-retrieval methods and word correlations do poorly. Rather, a reliable answer requires
recognizing a model of experimentation (perform
two tasks, differing in only one condition), knowing
that being near a sunny window will expose the
plant to light, and that a dark room has no light in it.
As a third example, consider this question:
(7) A student riding a bicycle observes that it moves
faster on a smooth road than on a rough road. This
happens because the smooth road has (A) less gravity
(B) more gravity (C) less friction (D) more friction
A reliable processing of this question requires envisioning and comparing two different situations, overlaying a simple qualitative model on the situations
described (smoother → less friction → faster). It also
requires basic knowledge that bicycles move, and
that riding propels a bicycle.
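That qualitative chain can be made explicit with a toy sign-propagation sketch; the influence table is our invention for this one question, not a general qualitative reasoner.

    # Toy qualitative model: each edge carries the sign of its influence.
    # More smoothness means less friction; less friction means more speed.
    CHAIN = [("smoothness", "friction", -1), ("friction", "speed", -1)]

    def net_sign(start: str, end: str) -> int:
        # Multiply the influence signs along the chain from start to end.
        sign, node = 1, start
        for src, dst, s in CHAIN:
            if node == src:
                sign, node = sign * s, dst
            if node == end:
                break
        return sign

    print(net_sign("smoothness", "friction"))  # -> -1: less friction, answer (C)
    print(net_sign("smoothness", "speed"))     # -> +1: hence faster on the smooth road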
All the aforementioned examples require general
knowledge of the world, as well as simple science
knowledge. In addition, some questions more directly test basic commonsense knowledge, such as the following:
(8) A student reaches one hand into a bag filled with
smooth objects. The student feels the objects but does
not look into the bag. Which property of the objects
can the student most likely identify? (A) shape (B) color (C) ability to reflect light (D) ability to conduct electricity
This question requires, among other things, knowing
that touch detects shape, and that sight detects color.
Some questions require selecting the best explanation for a phenomenon, requiring a degree of metareasoning. For example, consider the following question:
(9) Apple trees can live for many years, but bean
plants usually live for only a few months. This statement suggests that (A) different plants have different
life spans (B) plants depend on other plants (C) plants
produce many offspring (D) seasonal changes help
plants grow
This requires not just determining whether the statement in each answer option is true (here, several of
them are), but whether it explains the statement given in the body of the question. Again, this kind of
question would be challenging for a retrieval-based system.
As a final example, consider the following question from the Texas Assessment of Knowledge and Skills exam3 (Texas Education Agency 2014):
(10) Which of these mixtures would be easiest to separate? (A) Fruit salad (B) Powdered lemonade (C) Hot chocolate (D) Instant pudding
This question requires a complex interplay of basic world knowledge and language to answer correctly.
Diagrams
A common feature of many elementary grade exams is the use of diagrams in questions. We choose to include these in the challenge because of their ubiquity in tests, and because spatial interpretation and reasoning is such a fundamental aspect of intelligence. Diagrams introduce several new dimensions to question answering, including spatial interpretation and correlating spatial and textual knowledge. Diagrammatic (nontextual) entities in elementary exams include sketches, maps, graphs, tables, and diagrammatic representations (for example, a food chain). Reasoning requirements include sketch interpretation, correlating textual and spatial elements, and mapping diagrammatic representations (graphs, bar charts, and so on) to a form supporting computation. Again, while there are many challenges, the level of difficulty varies widely, allowing a graduated plan of attack. Two examples are shown. The first, question 11 (figure 1), requires sketch interpretation, part identification, and label/part correlation. The second, question 12 (figure 2), requires recognizing and interpreting a spatial representation.
Figure 1. Question 11. (11) Which letter in the diagram points to the plant structure that takes in water and nutrients?
Figure 2. Question 12. (12) Which diagram correctly shows the life cycle of some insects?
Mathematics and Geometry
We also include elementary mathematics in our challenge scope, as these questions intrinsically require
mapping to mathematical models, a key requirement
for many real-world tasks. These questions are particularly interesting as they combine elements of language processing, (often) story interpretation, mapping to an internal representation (for example,
algebra), and symbolic computation. For example
(13) Molly owns the Wafting Pie Company. This
morning, her employees used 816 eggs to bake pumpkin pies. If her employees used a total of 1339 eggs
today, how many eggs did they use in the afternoon?
Such questions clearly cannot be answered by
information retrieval, and instead require symbolic
processing and alignment of textual and algebraic
elements (for example, Hosseini et al. 2014; KoncelKedziorski et al. 2015; Seo et al. 2014, 2015) followed by inference. Additional examples are shown
in table 1.
Note that, in addition to simple arithmetic capabilities, some capacity for world modeling is often
needed. Consider, for example, the following two
(14) Sara’s high school won 5 basketball games this
year. They lost 3 games. How many games did they
play in all?
(15) John has 8 orange balloons, but lost 2 of them.
How many orange balloons does John have now?
Both questions use the word lost, but the first question maps to an addition problem (5 + 3) while the
second maps to a subtraction problem (8 – 2). This
illustrates how modeling the entities, events, and
event sequences is required, in addition to basic algebraic skills.
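A toy version of the verb-categorization idea the authors cite (Hosseini et al. 2014) shows exactly where a verb-local rule breaks down; the lexicon and string handling below are our invented simplifications.

    # Toy verb lexicon: the sign of each verb's effect on the holder's count.
    # (Illustrative only; real systems learn such categories from data.)
    VERB_SIGN = {"has": +1, "won": +1, "bought": +1, "lost": -1, "gave": -1}

    def solve(story: str) -> int:
        # Sum the numbers in the story, signed by the most recent verb.
        total, sign = 0, +1
        for tok in story.lower().replace(",", " ").replace(".", " ").split():
            if tok in VERB_SIGN:
                sign = VERB_SIGN[tok]
            elif tok.isdigit():
                total += sign * int(tok)
        return total

    print(solve("John has 8 orange balloons, but lost 2 of them."))  # -> 6, correct
    print(solve("Sara's high school won 5 games. They lost 3 games."))
    # -> 2, wrong: the answer is 8, since games played = won + lost.
    # Getting (14) right requires modeling the events, not just the verbs.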
Table 1. Examples of Problems Solved by AlgeS, with the Returned Equations. (From Koncel-Kedziorski et al. [2015])
Problem: John had 20 stickers. He bought 12 stickers from a store in the mall and got 20 stickers for his birthday. Then John gave 5 of the stickers to his sister and used 8 to decorate a greeting card. How many stickers does John have left?
Equation: ((20 + ((12 + 20) – 8)) – 5) = x
Problem: Maggie bought 4 packs of red bouncy balls, 8 packs of yellow bouncy balls, and 4 packs of green bouncy balls. There were 10 bouncy balls in each package. How many bouncy balls did Maggie buy in all?
Equation: x = (((4 + 8) + 4) * 10)
Problem: Sam had 79 dollars to spend on 9 books. After buying them he had 16 dollars. How much did each book cost?
Equation: 79 = ((9 * x) + 16)
Problem: Fred loves trading cards. He bought 2 packs of football cards for $2.73 each, a pack of Pokemon cards for $4.01, and a deck of baseball cards for $8.95. How much did Fred spend on cards?
Equation: ((2 * 2.73) + (4.01 + 8.95)) = x

Finally, we also include geometry questions, as these combine both arithmetic and diagrammatic reasoning together in challenging ways. For example,
question 16 (figure 3) requires multiple skills (text processing, diagram interpretation, arithmetic, and aligning evidence from both text and diagram together). Although very challenging, there has been significant progress in recent years on this kind of problem (for example, Koncel-Kedziorski et al. [2015]). Examples of problems that current systems have been able to solve are shown in table 2.
Figure 3. Question 16. (16) In the diagram, AB intersects circle O at D, AC intersects circle O at E, AE = 4, AC = 24, and AB = 16. Find AD.
Testing for Commonsense
Possessing and using commonsense knowledge is a central property of intelligence (Davis and Marcus 2015). However, Davis (2014) and Weston et al.
(2015) have both argued that standardized tests do
not test “obvious” commonsense knowledge, and
hence are less suitable as a test of machine intelligence. For instance, using their examples, the following questions are unlikely to occur in a standardized test:
Can you make a watermelon fit into a bag by folding the watermelon?
If you look at the moon then shut your eyes, can
you still see the moon?
If John is in the playground and Bob is in the
office, then where is John?
Can you make a salad out of a polyester shirt?
Table 2. Examples of Problems That Current Systems Have Solved.
Questions and interpretations leading to correct solutions by GEOS. From Seo et al. (2015).
Question: In the diagram at the left, circle O has a radius of 5, and CE = 2. Diameter AC is perpendicular to chord BD. What is the length of BD? a) 12 b) 10 c) 8 d) 6 e) 4
Interpretation: Equals(RadiusOf(O), 5); Equals(LengthOf(CE), 2); Perpendicular(AC, BD); Equals(what, LengthOf(BD))
Question: In isosceles triangle ABC at the left, lines AM and CM are the angle bisectors of angles BAC and BCA. What is the measure of angle AMC? a) 110 b) 115 c) 120 d) 125 e) 130
Interpretation: BisectsAngle(AM, BAC); Equals(what, MeasureOf(AMC))
Question: In the figure at left, the bisector of angle BAC is perpendicular to BC at point D. If AB = 6 and BD = 3, what is the measure of angle BAC? a) 15 b) 30 c) 45 d) 60 e) 75
Interpretation: BisectsAngle(line, BAC); Perpendicular(line, BC); Equals(LengthOf(AB), 6); Equals(LengthOf(BD), 3); Equals(what, MeasureOf(BAC))

However, although such questions may not be directly posed in standardized tests, many questions
indirectly require at least some of this commonsense
knowledge in order to answer. For example, question
(6) (about plants) in the previous section requires
knowing (among other things) that if you put a plant
near X (a window), then the plant will be near X.
This is a flavor of blocks-world-style knowledge very
similar to that tested in many of Weston et al.’s
examples. Similarly question (8) (about objects in a
bag) requires knowing that touch detects shape, and
that not looking implies not being able to detect color. It also requires knowing that a bag filled with
objects contains those objects; a smooth object is
smooth; and if you feel something, you touch it.
These commonsense requirements are similar in style
to many of Davis’s examples. In short, at least some
of the standardized test questions seem to require the
kind of obvious commonsense knowledge that Davis
and Weston et al. call for in order to
derive an answer, even if the answers
themselves are less obvious. Conversely, if one authors a set of synthetic
commonsense questions, there is a significant risk of biasing the set toward
one’s own preconceived notions of
what commonsense means, ignoring
other important aspects. (This has
been a criticism sometimes made of
the Winograd schema challenge.) For
this reason we feel that the natural
diversity present in standardized tests,
as illustrated here, is highly beneficial,
along with their other advantages.
Other Aspects of Intelligence
Standardized tests clearly do not test
all aspects of intelligence, for example,
dialog, physical tasks, speech. However, besides question-answering and reasoning there are some less obvious
aspects of intelligence they also push
on: explanation, learning and reading,
and dealing with novel problems.
Explanation
Tests (particularly at higher grade levels) typically include questions that
not only ask for answers but also for
explanations of those answers. So, at
least to some degree, the ability to
explain an answer is required.
Learning and Reading
Reddy (1996) proposed the grand AI
challenge of reading a chapter of a
textbook and answering the questions
at the end of the chapter. While standardized tests do not directly test textbook reading, they do include questions,
sometimes long story questions. In
addition, acquiring the knowledge
necessary to pass a test will arguably
require breakthroughs in learning and
machine reading; attempts to encode
the requisite knowledge by hand have
to date been unsuccessful.
Dealing with Novel Problems
As our examples illustrate, test taking
is not a monolithic skill. Rather it
requires a battery of capabilities and
the ability to deploy them in potentially novel and unanticipated ways. In
this sense, test taking requires, to some
level, a degree of versatility and the
ability to handle new and surprising
problems that we would expect of an
intelligent machine.
State of the Art
on Standardized Tests
How well do current systems perform
on these tests? While any performance
figure will be exam specific, we can
provide some example data points
from our own research.
On nondiagram, multiple choice science questions (NDMC), our Aristo
system currently scores on average 75
percent (4th grade), 63 percent (8th
grade), and 41 percent (12th grade) on
(previously unseen) New York Regents
science exams (NDMC questions only,
typically four-way multiple choice). As
can be seen, questions become considerably more challenging at higher
grade levels. On a broader multistate
collection of 4th grade NDMC questions, Aristo scores 65 percent (unseen
questions). The data sets are available
online. Note that
these are the easier questions (no diagrams, multiple choice); other question types pose additional challenges
as we have described. No system to
date comes even close to passing a full
4th grade science exam.
On algebraic story problems such as
those in table 1, our AlgeS system
scores over 70 percent accuracy on story problems that translate into single
equations (Koncel-Kedziorski et al.
2015). Kushman et al. (2014) report
results on story problems that translate
to simultaneous algebraic equations.
On geometry problems such as those
in table 2, our GeoS system achieves a
49 percent score on (previously
unseen) official SAT questions, and a
score of 61 percent on a data set of
(previously unseen) SAT-like practice
questions. The relevant questions,
data, and software are available on the
Allen Institute’s website.4
If a computer were able to pass standardized tests, would it be intelligent?
Not necessarily, but it would demonstrate that the computer had several
critical skills we associate with intelli-
gence, including the ability to answer
sophisticated questions, handle natural
language, and solve tasks requiring
extensive commonsense knowledge of
the world. In short, it would mark a significant achievement in the quest
toward intelligent machines. Despite
the successes of data-driven AI systems,
it is imperative that we make progress
in these broader areas of knowledge,
modeling, reasoning, and language if
we are to make the next generation of
knowledgable AI systems a reality. Standardized tests can help to drive and
measure progress in this direction as
they present many of these challenges,
yet are also accessible, comprehensible,
incremental, and easily measurable. To
help with this, we are releasing data sets
related to this challenge.
In addition, in October 2015 we
launched the Allen AI Science Challenge,5 a competition run on kaggle.com
to build systems to answer eighth-grade science questions. The competition attracted over 700 participating
teams, and scores jumped from 32.5
percent initially to 58.8 percent by the
end of January 2016. Although the winner is not yet known at press time, this
successful impact demonstrates the
efficacy of standardized tests to focus
attention and research on these important AI problems.
Of course, some may claim that
existing data-driven techniques are all
that is needed, given enough data and
computing power; if that were so, that
in itself would be a startling result.
Whatever your bias or philosophy, we
encourage you to prove your case, and
take these challenges!
AI2’s data sets are available on the
Allen Institute’s website.5
References
Bayer, S.; Damianos, L.; Doran, C.; Ferro, L.; Fish, R.; Hirschman, L.; Mani, I.; Riek, L.; and Oshika, B. 2005. Selected Grand Challenges in Cognitive Science. MITRE Technical Report 05-1218. Bedford, MA: The MITRE Corporation.
Bringsjord, S. 2011. Psychometric Artificial Intelligence. Journal of Experimental and Theoretical Artificial Intelligence (JETAI) 23(3).
Bringsjord, S., and Schimanski, B. 2003. What Is Artificial Intelligence? Psychometric AI as an Answer. In Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence, 887–893. San Francisco: Morgan Kaufmann Publishers.
Clark, P.; Etzioni, O.; Khashabi, D.; Khot, T.; Sabharwal, A.; Tafjord, O.; and Turney, P. 2016. Combining Retrieval, Statistics, and Inference to Answer Elementary Science Questions. In Proceedings of the Thirtieth Conference of the Association for the Advancement of Artificial Intelligence. Menlo Park, CA: AAAI Press.
Clark, P.; Harrison, P.; and Balasubramanian, N. 2013. A Study of the Knowledge Base Requirements for Passing an Elementary Science Test. In AKBC'13: Proceedings of the 2013 Workshop on Automated Knowledge Base Construction. New York: Association for Computing Machinery.
Davis, E. 2014. The Limitations of Standardized Science Tests as Benchmarks for AI Research. Technical Report, New York University. arXiv preprint arXiv:1411.1629. Ithaca, NY: Cornell University Library.
Davis, E., and Marcus, G. 2015. Commonsense Reasoning and Commonsense Knowledge in Artificial Intelligence. Communications of the ACM 58(9): 92–103.
Fujita, A.; Kameda, A.; Kawazoe, A.; and Miyao, Y. 2014. Overview of Todai Robot Project and Evaluation Framework of Its NLP-Based Problem Solving. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2014). Paris: European Language Resources Association.
Hayes, P., and Ford, K. 1995. Turing Test Considered Harmful. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence. San Francisco: Morgan Kaufmann Publishers.
Hosseini, M.; Hajishirzi, H.; Etzioni, O.; and Kushman, N. 2014. Learning to Solve Arithmetic Word Problems with Verb Categorization. In EMNLP 2014: Proceedings of the Empirical Methods in Natural Language Processing Conference. Stroudsburg, PA: Association for Computational Linguistics.
Khot, T.; Balasubramanian, N.; Gribkoff, E.; Sabharwal, A.; Clark, P.; and Etzioni, O. 2015. Exploring Markov Logic Networks for Question Answering. In EMNLP 2015: Proceedings of the Empirical Methods in Natural Language Processing Conference. Stroudsburg, PA: Association for Computational Linguistics.
Koncel-Kedziorski, R.; Hajishirzi, H.; Sabharwal, A.; Ang, S. D.; and Etzioni, O. 2015. Parsing Algebraic Word Problems into Equations. Transactions of the Association for Computational Linguistics 3.
Kushman, N.; Artzi, Y.; Zettlemoyer, L.; and Barzilay, R. 2014. Learning to Automatically Solve Algebra Word Problems. In EMNLP 2014: Proceedings of the Empirical Methods in Natural Language Processing Conference. Stroudsburg, PA: Association for Computational Linguistics.
Landauer, T. K., and Dumais, S. T. 1997. A Solution to Plato's Problem: The Latent Semantic Analysis Theory of the Acquisition, Induction, and Representation of Knowledge. Psychological Review 104(2).
Legg, S., and Hutter, M. 2007. A Collection of Definitions of Intelligence. In Advances in Artificial General Intelligence: Concepts, Architectures, and Algorithms, Frontiers in Artificial Intelligence and Applications Volume 157. Amsterdam, The Netherlands: IOS Press.
Levesque, H.; Davis, E.; and Morgenstern, L. 2012. The Winograd Schema Challenge. In Principles of Knowledge Representation and Reasoning: Proceedings of the Thirteenth International Conference (KR2012), 552–561. Palo Alto, CA: AAAI Press.
Li, Y., and Clark, P. 2015. Answering Elementary Science Questions via Coherent Scene Extraction from Knowledge Graphs. In EMNLP 2015: Proceedings of the Empirical Methods in Natural Language Processing Conference. Stroudsburg, PA: Association for Computational Linguistics.
Mohammad, S. M.; Dorr, B. J.; Hirst, G.; and Turney, P. D. 2013. Computing Lexical Contrast. Computational Linguistics 39(3): 555–.
New York State Education Department. 2014. The Grade 4 Elementary-Level Science Test. Albany, NY: University of the State of New York.
Reddy, R. 1996. To Dream the Possible Dream. Communications of the ACM 39(5): 105–112.
Richardson, M.; Burges, C.; and Renshaw, E. 2013. MCTest: A Challenge Dataset for the Open-Domain Machine Comprehension of Text. In EMNLP 2013: Proceedings of the Empirical Methods in Natural Language Processing Conference. Stroudsburg, PA: Association for Computational Linguistics.
Seo, M.; Hajishirzi, H.; Farhadi, A.; and Etzioni, O. 2014. Diagram Understanding in Geometry Questions. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence (AAAI 2014). Palo Alto, CA: AAAI Press.
Seo, M.; Hajishirzi, H.; Farhadi, A.; Etzioni, O.; and Malcolm, C. 2015. Solving Geometry Problems: Combining Text and Diagram Interpretation. In EMNLP 2015: Proceedings of the Empirical Methods in Natural Language Processing Conference. Stroudsburg, PA: Association for Computational Linguistics.
Strickland, E. 2013. Can an AI Get into the University of Tokyo? IEEE Spectrum, 21 August.
Texas Education Agency. 2014. Texas Assessment of Knowledge and Skills. Austin, TX: State of Texas Education Agency.
Turing, A. 1950. Computing Machinery and Intelligence. Mind 59(236): 433–460.
Turney, P. D. 2006. Similarity of Semantic Relations. Computational Linguistics 32(3): 379–416.
Weston, J.; Bordes, A.; Chopra, S.; Mikolov, T.; and Rush, A. 2015. Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks. arXiv preprint arXiv:1502.05698v6. Ithaca, NY: Cornell University Library.
Peter Clark is the senior research manager
for Project Aristo at the Allen Institute for
Artificial Intelligence. His work focuses on
natural language processing, machine reasoning, and large knowledge bases, and the
interplay among these three areas. He has
received several awards including a AAAI
Best Paper (1997), Boeing Associate Technical Fellowship (2004), and AAAI Senior
Member (2014).
Oren Etzioni is chief executive officer of
the Allen Institute for Artificial Intelligence.
Beginning in 1991, he was a professor at the
University of Washington’s Computer Science Department. He has received several
awards, including the Robert Engelmore
Memorial Award (2007), the IJCAI Distinguished Paper Award (2005), AAAI Fellow
(2003), and a National Young Investigator
Award (1993).
How to Write Science
Questions That Are Easy for
People and Hard for Computers
Ernest Davis
As a challenge problem for AI systems, I propose the use of hand-constructed multiple-choice tests, with
problems that are easy for people but
hard for computers. Specifically, I discuss techniques for constructing such
problems at the level of a fourth-grade
child and at the level of a high school
student. For the fourth-grade-level questions, I argue that questions that require
the understanding of time, of impossible or pointless scenarios, of causality,
of the human body, or of sets of objects,
and questions that require combining
facts or require simple inductive arguments of indeterminate length can be
chosen to be easy for people, and are
likely to be hard for AI programs, in the
current state of the art. For the high
school level, I argue that questions that
relate the formal science to the realia of
laboratory experiments or of real-world
observations are likely to be easy for
people and hard for AI programs. I
argue that these are more useful benchmarks than existing standardized tests
such as the SATs or New York Regents
tests. Since the questions in standardized tests are designed to be hard for
people, they often leave many aspects of
what is hard for computers but easy for
people untested.
The fundamental paradox of artificial intelligence is that
many intelligent tasks are extremely easy for people but
extremely difficult to get computers to do successfully.
This is universally known as regards basic human activities
such as vision, natural language, and social interaction, but
it is true of more specialized activities, such as scientific reasoning, as well. As everyone knows, computers can carry out
scientific computations of staggering complexity and can
hunt through immense haystacks of data looking for minuscule needles of insights or subtle, complex correlations. However, as far as I know, no existing computer program can
answer the question, “Can you fold a watermelon?”
Perhaps that doesn’t matter. Why should we need computer programs to do things that people can already do easily? For the last 60 years, we have relied on a reasonable division of labor: computers do what they do extremely well —
calculations that are either extremely complex or require an
enormous, unfailing memory — and people do what they do
well — perception, language, and many forms of learning
and of reasoning. However, the fact that computers have
almost no commonsense knowledge and rely almost entirely on quite rigid forms of reasoning ultimately forms a serious limitation on the capacity of science-oriented applications including question answering; design, robotic
execution, and evaluation of experiments; retrieval, summarization, and high-quality translation of scientific documents; science educational software; and sanity checking of
the results of specialized software (Davis and Marcus 2016).
A basic understanding of the physical and natural world at
the level of common human experience, and an understanding of how the concepts and laws of formal science
relate to the world as experienced, is thus a critical objective
in developing AI for science. To measure progress
toward this objective, it would be useful to have standard benchmarks; and to inspire radically ambitious
research projects, it would be valuable to have specific challenges.
In many ways, the best benchmarks and challenges here would be those that are directly connected to real-world, useful tasks, such as understanding
texts, planning in complex situations, or controlling
a robot in a complex environment. However, multiple-choice tests also have their advantages. First, as
every teacher knows, they are easy to grade, though
often difficult to write. Second, multiple-choice tests
can enforce a much narrower focus, specifically on commonsense
physical knowledge, than more broadly based tasks can. In any more broadly based task, such
as those mentioned above, the commonsense reasoning will only be a small part of the task, and, to
judge by past experience, quite likely the part of the
task with the least short-term payoff. Therefore
research on these problems is likely to focus on the
other aspects of the problem and to neglect the commonsense reasoning.
If what we want is a multiple-choice science test as a
benchmark or challenge for AI, then surely the obvious thing to do is to use one of the existing multiple-choice challenge tests, such as the New York State
Regents’ test (New York State Education Department
2014) or the SAT. Indeed, a number of people have
proposed exactly that, and are busy working on
developing systems aimed at that goal. Brachman et
al. (2005) suggest developing a program that can pass
the SATs. Clark, Harrison, and Balasubramanian
(2013) propose a project of passing the New York
State Regents Science test for 4th graders. Strickland
(2013) proposes developing an AI that can pass the
entrance exams for the University of Tokyo. Ohlsson
et al. (2013) evaluated the performance of a system
based on ConceptNet (Havasi, Speer, and Alonso
2007) on a preprocessed form of the Wechsler Preschool and Primary Scale of Intelligence test. Barker
et al. (2004) describe the construction of a knowledge-based system that (more or less) scored a 3 (passing) on two sections of the high school chemistry
advanced placement test. The GEOS system (Seo et
al. 2015), which answers geometry problems from
the SATs, scored 49 percent on official problems and
61 percent on a corpus of practice problems.
The pros and cons of using standardized tests will
be discussed in detail later on in this article. For the
moment, let us emphasize one specific issue: standardized tests were written to test people, not to test
AIs. What people find difficult and what AIs find difficult are extremely different, almost opposite. Standardized tests include many questions that are hard
for people and practically trivial for computers, such
as remembering the meaning of technical terms or
performing straightforward mathematical calculation. Conversely, these tests do not test scientific
knowledge that “every [human] fool knows”; since
everyone knows it, there is no point in testing it.
However, this is often exactly the knowledge that AIs
are missing. Sometimes the questions on standardized
tests do test this kind of knowledge implicitly; but
they do so only sporadically and with poor coverage.
Another possibility would be to automate the construction of questions that are easy for people and
hard for computers. The success of CAPTCHA (von
Ahn et al. 2003) shows that it is possible automatically to generate images that are easy for people to
interpret and hard for computers; however, that is an
unusual case. Weston et al. (2015) propose to build a
system that uses a world model and a linguistic model to generate simple narratives in commonsense
domains. However, the intended purpose of this set
of narratives is to serve as a labeled corpus for an end-to-end machine-learning system. Having been generated by a well-understood world model and linguistic model, this corpus certainly cannot drive work on
original, richer, models of commonsense domains, or
of language, or of their interaction.
Having tabled the suggestion of using existing
standardized tests and having ruled out automatically constructed tests, the remaining option is to use
manually designed test problems. To be a valid test
for AI, such problems must be easy for people. Otherwise the test would be in danger of running into, or
at least being accused of, the superhuman human fallacy, in which we set benchmarks for AI that humans themselves could not attain.
At this point, we have reached, and hopefully to
some extent motivated, the proposal of this article. I
propose that it would be worthwhile to construct
multiple-choice tests that will measure progress
toward developing AIs that have a commonsense
understanding of the natural world and an understanding of how formal science relates to the commonsense view; tests that will be easy for human subjects but difficult for existing computers. Moreover,
as far as possible, that difficulty should arise from
issues inherent to commonsense knowledge and
commonsense reasoning rather than specifically
from difficulties in natural language understanding
or in visual interpretation, to the extent that these
can be separated.
These tests will collectively be called science questions appraising basic understanding — or SQUABU
(pronounced skwaboo). In this article we will consider two specific tests. SQUABU-Basic is a test designed
to measure commonsense understanding of the natural world that an elementary school child can be
presumed to know, limited to material that is not
explicitly taught in school because it is too obvious.
The questions here should be easy for any contemporary child of 10 in a developed country.
SQUABU-HighSchool is a test designed to measure
how well an AI can integrate concepts of high school
chemistry and physics with a commonsense under-
standing of the natural world. The questions here are
designed to be reasonably easy for a student who has
completed high school physics, though some may
require a few minutes' thought. The knowledge of the
subject matter is intended to be basic; the problems
are intended to require a conceptual understanding
of the domain, qualitative reasoning about mathematical relations, and basic geometry, but do not
require memory for fine details or intricate exact calculations. These two particular levels were chosen in
part because the 4th grade New York Regents exam
and the physics SATs are helpful points of contrast.
By commonsense knowledge I emphatically do
not mean that I am considering AIs that will replicate
the errors, illusions, and flaws in physical reasoning
that are well known to be common in human cognition. I am here interested only in those aspects of
commonsense reasoning that are valid and that
enhance or underlie formal scientific thinking.
Because of the broad scope of the questions involved, it would be hard to be very confident that AI systems will find any particular question difficult. This is in contrast to the Winograd schema
challenge (Levesque, Davis, and Morgenstern 2012),
in which both the framework and the individual
questions have been carefully designed, chosen, and
tuned so that, with fair confidence, each individual
question will be difficult for an automated system. I
do not see any way to achieve that level of confidence for either level of SQUABU; there may be some
questions that can be easily solved. However, I feel
quite confident that at most a few questions would
be easily solved.
It is also difficult to be sure that an AI program will
get the right answer on specific questions in the categories I’ve marked below as “easy”; AI programs
have ways of getting confused or going on the wrong
track that are very hard to anticipate. (An example is
the Toronto problem that Watson got wrong [Welty,
undated].) However, AI programs exist that can
answer these kinds of questions with a large degree of reliability.
I will begin by discussing the kinds of problems
that are easy for the current generation of computers; these must be avoided in SQUABU. Then I will discuss some general rules and techniques for developing questions for SQUABU-Basic and SQUABU-HighSchool. After that I will return to the issue of standardized tests and their pros and cons for this purpose; finally comes the conclusion.
Problems That Are
Easy for Computers
As of the date of writing (May 2015), the kinds of
problems that tend to arise on standardized tests that
are “easy for computers” (that is, well within the state
of the art) include terminology, taxonomy, and exact calculation.

Terminology

The first of these is retrieving the definition of (for human students) obscure jargon. For example, as Clark (2015) remarks, the following problem from the New York State 4th grade Regents Science test is easy for AI programs:
The movement of soil by wind or water is known as
(A) condensation (B) evaporation (C) erosion (D) friction
If you query a search engine for the exact phrase
“movement of soil by wind or water,” it returns
dozens of pages that give that phrase as the definition of erosion.
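To make this concrete, here is a minimal sketch in Python of answering the question by exact-phrase matching. The snippet corpus and the counting heuristic are invented for illustration; a real system would query a web-scale index.

# Minimal sketch of definition retrieval by exact-phrase matching.
# The snippet "corpus" below is invented for illustration.
SNIPPETS = [
    "Erosion is the movement of soil by wind or water.",
    "Condensation is the process by which water vapor becomes liquid.",
    "Friction is a force that opposes motion between two surfaces.",
]

def answer_definition(phrase, choices):
    # Pick the choice that co-occurs most often with the queried phrase.
    counts = {c: 0 for c in choices}
    for snippet in SNIPPETS:
        text = snippet.lower()
        if phrase.lower() in text:
            for c in choices:
                if c.lower() in text:
                    counts[c] += 1
    return max(choices, key=lambda c: counts[c])

choices = ["condensation", "evaporation", "erosion", "friction"]
print(answer_definition("movement of soil by wind or water", choices))
# -> erosion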
Taxonomy

Constructing taxonomic hierarchies of categories
and individuals organized by subcategory and
instance relations can be considered a solved problem in AI. Enormous, quite accurate hierarchies of
this kind have been assembled through web mining;
for instance Wu et al. (2012) report that the Probase
project had 2.6 million categories and 20.7 million
isA pairs, with an accuracy of 92.8 percent.
Finding the features of these categories, and carrying out inheritance, particularly overridable inheritance, is certainly a less completely solved problem,
but is nonetheless sufficiently solved that problems
based on inheritance must be considered as likely to
be easy for computers.
For example, a question such as the following may
well be easy:
Which of the following organs does a squirrel not
have: (A) a brain (B) gills (C) a heart (D) lungs?
(This does require an understanding of not, which is
by no means a feature of all IR programs; but it is well
within the scope of current technology.)
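As a minimal sketch of the inheritance involved, consider the following toy hierarchy in Python; the categories and feature lists are invented for illustration and are not drawn from Probase or any real knowledge base.

# Toy taxonomic knowledge base with feature inheritance.
ISA = {"squirrel": "mammal", "mammal": "vertebrate", "fish": "vertebrate"}
FEATURES = {
    "vertebrate": {"brain", "heart"},
    "mammal": {"lungs"},
    "fish": {"gills"},
}

def organs(kind):
    # Collect features inherited up the isA chain.
    found = set()
    while kind is not None:
        found |= FEATURES.get(kind, set())
        kind = ISA.get(kind)
    return found

def which_lacks(kind, choices):
    # Answer "which of the following does X not have?"
    has = organs(kind)
    return [c for c in choices if c not in has]

print(which_lacks("squirrel", ["brain", "gills", "heart", "lungs"]))
# -> ['gills']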
Exact Calculation
Problems that involve retrieving standard exact physical formulas, and then using them in calculations,
either numerical or symbolic, are easy. For example,
questions such as the following SAT-level physics
problems are probably easy (Kaplan [2013], p. 294):
A 40 Ω resistor in a closed circuit has 20 volts across it.
The current flowing through the resistor is (A) 0.5 A;
(B) 2 A; (C) 20 A; (D) 80 A; (E) 800 A.
A horizontal force F acts on a block of mass m that is
initially at rest on a floor of negligible friction. The
force acts for time t and moves the block a displacement d. The change in momentum of the block is (A)
F/t; (B) m/t; (C) Fd; (D) Ft; (E) mt.
The calculations are simple, and, for examples like
these, finding the standard formula that matches the
word problem can be done with high accuracy using
standard pattern-matching techniques.
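As a sketch of what such a pipeline looks like for the resistor problem above, the following Python fragment matches the word problem to Ohm's law and computes I = V/R = 20/40 = 0.5 A, answer (A). The pattern table, with its single entry, is our own illustration, not a description of any deployed system.

import re

# One-entry table mapping a textual pattern to the formula it triggers.
FORMULAS = {
    r"(\d+(?:\.\d+)?) ?Ω resistor .* (\d+(?:\.\d+)?) volts":
        lambda r, v: v / r,   # Ohm's law: I = V / R
}

def solve(problem):
    for pattern, formula in FORMULAS.items():
        m = re.search(pattern, problem)
        if m:
            return formula(*(float(g) for g in m.groups()))
    return None

text = "A 40 Ω resistor in a closed circuit has 20 volts across it."
print(solve(text), "A")   # -> 0.5 A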
One might be inclined to think that AI programs
would have trouble with the kind of brain teaser in
which the naïve brute-force solution is horribly complicated but there is some clever way of looking at the
problem that makes it simple. However, these probably will not be effective challenges for AI. The AI program will, indeed, probably not find the clever
approach; however, like John von Neumann in the
well-known anecdote,1 the AI program will be able to
do the brute force calculation faster than ordinary
people can work out the clever solution.
SQUABU-Basic

What kind of science questions, then, are easy for people and hard for computers? In this section I will consider this question in the context of SQUABU-Basic, which does not rely on book learning. Later, I will consider the question in the context of SQUABU-HighSchool, which tests the integration of high school science with commonsense reasoning.
Temporal Reasoning

In principle, representing temporal information in AI systems is almost entirely a solved problem, and carrying out temporal reasoning is largely a solved problem. The known representational systems for temporal knowledge (for example, those discussed in Reiter [2001] and in Davis [1990, chapter 5]) suffice for all
but a handful of the situations that arise in temporal
reasoning;2 almost all of the purely temporal inferences that come up can be justified in established
temporal theories; and most of these can be carried
out reasonably efficiently, though not all, and there
is always room for improvement.
However, in practical terms, time is often seriously neglected in large-scale knowledge-based systems,
although CYC (Lenat, Prakash, and Shepherd 1986)
is presumably an exception. Mitchell et al. (2015)
specifically mention temporal information as an issue
unaddressed in NELL, and systems like ConceptNet
(Havasi, Speer, and Alonso 2007) seem to be entirely
unsystematic in how they deal with temporal issues.
More surprisingly, the abstract meaning representation (AMR),3 a recent project to manually annotate a
large body of text with a formal representation of its
meaning, has decided to exclude temporal information from its representation. (Frankly, I think this
may well be a short-sighted decision, which will be
regretted later.) Thus, there is a common impression
that temporal information is either too difficult or
not important enough to deal with in AI systems.
Therefore, if a temporal fact is not stated explicitly, then it is likely to be hard for existing AI systems
to derive. Examples include the following:
Problem B.1 Sally’s favorite cow died yesterday. The
cow will probably be alive again (A) tomorrow; (B)
within a week; (C) within a year; (D) within a few
years; (E) The cow will never be alive again.
Problem B.2 Malcolm Harrison was a farmer in Virginia who died more than 200 years ago. He had a
dozen horses on his farm. Which of the following is
most likely to be true: (A) All of Harrison’s horses are
dead. (B) Most of Harrison’s horses are dead, but a few
might be alive. (C) Most of Harrison’s horses are alive,
but a few might have died. (D) Probably all of Harrison’s horses are alive.
Problem B.3 Every week during April, Mike goes to
school from 9 AM to 4 PM, Monday through Friday.
Which of the following statements is true (only one)?
(A) Between Monday 9 AM and Tuesday 4 PM, Mike is
always in school. (B) Between Monday 9 AM and Tuesday 4 PM, Mike is never in school. (C) Between
Monday 4 PM and Friday 9 AM, Mike is never in
school. (D) Between Saturday 9 AM and Monday 8
AM, Mike is never in school. (E) Between Sunday 4 PM
and Tuesday 9 AM, Mike is never in school. (F) It
depends on the year.
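Claims like those in problem B.3 become mechanically checkable once the schedule is represented explicitly. The following brute-force Python sketch (the encoding is ours) tests two of the options minute by minute:

MINUTES_PER_DAY = 24 * 60
WEEK = 7 * MINUTES_PER_DAY

def in_school(m):
    # Minute-of-week m, with minute 0 = Monday 12:00 AM;
    # school runs 9 AM to 4 PM, Monday (day 0) through Friday (day 4).
    day, minute_of_day = divmod(m % WEEK, MINUTES_PER_DAY)
    return day < 5 and 9 * 60 <= minute_of_day < 16 * 60

def t(day, hour):
    return day * MINUTES_PER_DAY + hour * 60

def never_in_school(start, end):
    return not any(in_school(m) for m in range(start, end))

# Option (D): Saturday 9 AM through Monday 8 AM of the next week.
print(never_in_school(t(5, 9), WEEK + t(0, 8)))    # -> True
# Option (E): Sunday 4 PM through Tuesday 9 AM of the next week.
print(never_in_school(t(6, 16), WEEK + t(1, 9)))   # -> False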
With regard to question B.2, the AI can certainly find
the lifespan of a horse on Wikipedia or some similar
source. However, answering this question requires
combining this with the additional facts that lifespan
measures the time from birth to death, and that if
person P owns horse H at time T, then both P and H
are alive at time T. This connects to the feature “combining multiple facts” discussed later.
This seems like it should be comparatively easy to
do; I would not be very surprised if AI programs could
solve this kind of problem 10 years from now. On the
other hand, I am not aware of much research in this area.
Inductive Arguments
of Indeterminate Length
AI programs tend to be bad at arguments about
sequences of things of an indeterminate number. In
the software verification literature, there are techniques for this, but these have hardly been integrated into the AI literature.
Examples include the following:
Problem B.4 Mary owns a canary named Paul. Does
Paul have any ancestors who were alive in the year
1750? (A) Definitely yes. (B) Definitely no. (C) There is
no way to know.
Problem B.5 Tim is on a stony beach. He has a large
pail. He is putting small stones one by one into the
pail. Which of the following is true: (A) There will never be more than one stone in the pail. (B) There will
never be more than three stones in the pail. (C) Eventually, the pail will be full, and it will not be possible
to put more stones in the pail. (D) There will be more
and more stones in the pail, but there will always be
room for another one.
Impossible and Pointless Scenarios
If you cook up a scenario that is obviously impossible
for no very interesting reason, then it is quite likely
that no one has gone to the trouble of stating on the
web that it is impossible, and that the AI cannot figure that out.
Of course, if all the questions of this form have the
answer “this is impossible,” then the AI or its designer will soon catch on to that fact. So these have to be
counterbalanced by questions about scenarios that
are in fact obviously possible, but so pointless that no
one will have bothered to state that they are possible
or that they occurred.
Examples include the following:
Problem B.6 Is it possible to fold a watermelon?
Problem B.7 Is it possible to put a tomato on top of a watermelon?
Problem B.8 Suppose you have a tomato and a whole
watermelon. Is it possible to get the tomato inside the
watermelon without cutting or breaking the watermelon?
Problem B.9 Which of the following is true: (A) A
female eagle and a male alligator could have a baby.
That baby could either be an eagle or an alligator. (B)
A female eagle and a male alligator could have a baby.
That baby would definitely be an eagle. (C) A female
eagle and a male alligator could have a baby. That
baby would definitely be an alligator. (D) A female
eagle and a male alligator could have a baby. That
baby would be half an alligator and half an eagle. (E)
A female eagle and a male alligator cannot have a baby.
Problem B.10 If you brought a canary and an alligator
together to the same place, which of the following
would be completely impossible: (A) The canary could
see the alligator. (B) The alligator could see the canary.
(C) The canary could see what is inside the alligator’s
stomach. (D) The canary could fly onto the alligator’s
stick became longer. (D) After the pin is pulled out,
the stick no longer has a length.
Putting Facts Together
Questions that require combining facts that are likely to be expressed in separate sources are likely to be
difficult for an AI. As already discussed, B.2 is an
example. Another example:
Problem B.15 George accidentally poured a little
bleach into his milk. Is it OK for him to drink the
milk, if he’s careful not to swallow any of the bleach?
This requires combining the facts that bleach is a
poison, that poisons are dangerous even when diluted, that bleach and milk are liquids, and that it is difficult to separate two liquids that have been mixed.
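For B.2, the combination reduces to a couple of lines of arithmetic over bounds; the numeric bounds in the sketch below are our illustrative assumptions, not facts taken from the article.

# Combining facts for problem B.2 (illustrative encoding and numbers).
years_since_harrison_died = 200   # "died more than 200 years ago"
max_horse_lifespan = 45           # generous upper bound, in years

# Ownership implies each horse was alive while Harrison was alive, hence
# at least 200 years ago; no horse outlives max_horse_lifespan.
all_horses_dead = years_since_harrison_died > max_horse_lifespan
print(all_horses_dead)   # -> True, supporting answer (A)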
Human Body
Of course, people have an unfair advantage here.
Problem B.16 Can you see your hand if you hold it
behind your head?
Problem B.17 If a person has a cold, then he will probably get well (A) In a few minutes. (B) In a few days or
a couple of weeks. (C) In a few years. (D) He will never get well.
Problem B.18 If a person cuts off one of his fingers,
then he will probably grow a new finger (A) In a few
minutes. (B) In a few days or a couple of weeks. (C) In
a few years. (D) He will never grow a new finger.
Sets of Objects

Physical reasoning programs are good at reasoning about problems with fixed numbers of objects, but not as good at reasoning about problems with indeterminate numbers of objects.
Problem B.11 Suppose you have two books that are
identical except that one has a white cover and one
has a black cover. If you tear a page out of the white
book what will happen? (A) The same page will fall out
of the black book. (B) Another page will grow in the
black book. (C) The page will grow back in the white
book. (D) All the other pages will fall out of the white
book. (E) None of the above.
Spatial Properties of Events
Basic spatial properties of events may well be difficult
for an AI to determine.
Problem B.12 When Ed was born, his father was in
Boston and his mother was in Los Angeles. Where was
Ed born? (A) In Boston. (B) In Los Angeles. (C) Either
in Boston or in Los Angeles. (D) Somewhere between
Boston and Los Angeles.
Problem B.13 Joanne cut a chunk off a stick of cheese.
Which of the following is true? (A) The weight of the
stick didn’t change. (B) The stick of cheese became
lighter. (C) The stick of cheese became heavier. (D)
After the chunk was cut off, the stick no longer had a
measurable weight.
Problem B.14 Joanne stuck a long pin through the middle of a stick of cheese, and then pulled it out. Which of the following is true? (A) The stick remained the same length. (B) The stick became shorter. (C) The stick became longer. (D) After the pin is pulled out, the stick no longer has a length.

Many causal sequences that are either familiar or obvious are unlikely to be discussed in the corpus.
Problem B.19 There is a jar right-side up on a table,
with a lid tightly fastened. There are a few peanuts in
the jar. Joe picks up the jar and shakes it up and down,
then puts it back on the table. At the end, where,
probably, are the peanuts? (A) In the jar. (B) On the
table, outside the jar. (C) In the middle of the air.
Problem B.20 There is a jar right-side up on a table,
with a lid tightly fastened. There are a few peanuts on
the table. Joe picks up the jar and shakes it up and
down, then puts it back on the table. At the end,
where, probably, are the peanuts? (A) In the jar. (B) On
the table, outside the jar. (C) In the middle of the air.
SQUABU-HighSchool

The construction of SQUABU-HighSchool is quite different from SQUABU-Basic. SQUABU-HighSchool relies largely on the same gaps in an AI's understanding that we have described earlier for SQUABU-Basic. However, since the object is to appraise the AI's
understanding of the relation between formal science and commonsense reasoning, the choice of
domain becomes critical; the domain must be one
where the relation between the two kinds of knowledge is both deep and evident to people.
Figure 1: A Chemistry Experiment.
One fruitful source of these kinds of domains is
simple high school level science lab experiments. On
the one hand experiments draw on or illustrate concepts and laws from formal science; on the other
hand, understanding the experimental set up often
requires commonsense reasoning that is not easily
formalized. Experiments also must be physically
manipulable by human beings and their effects must
be visible (or otherwise perceptible) to human
beings; thus, the AI’s understanding of human powers of manipulation and perception can also be tested. Often, an effective way of generating questions is
to propose some change in the setup; this may either
create a problem or have no effect.
I have also found basic astronomy to be a fruitful
domain. Simple astronomy involves combining general principles, basic physical knowledge, elementary
geometric reasoning, and order-of-magnitude reasoning.
A third category of problem is problems in everyday settings where formal scientific analysis can be
brought to bear.
One general caveat: I am substantially less confident that high school students would in fact do well
on my sample questions for SQUABU-HighSchool
than that fourth-graders would do well on the sample questions for SQUABU-Basic. I feel certain that
they should do well, and that something is wrong if
they do not do well, but that is a different question.
Chemistry Experiment
Read the following description of a chemistry experiment,4 illustrated in figure 1. A small quantity of
potassium chlorate (KClO3) is heated in a test tube,
and decomposes into potassium chloride (KCl) and
oxygen (O2). The gaseous oxygen expands out of the
test tube, goes through the tubing, bubbles up
through the water in the beaker, and collects in the
inverted beaker over the water. Once the bubbling has stopped, the experimenter raises or lowers the beaker until the levels of the water inside and outside the beaker are equal. At this point, the pressure in the beaker is equal to atmospheric pressure. Measuring the volume of the gas collected over the water, and correcting for the water vapor that is mixed in with the oxygen, the experimenter can thus measure the amount of oxygen released in the reaction.
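For reference, the pressure claim in the description follows from standard hydrostatics; the equation below is our addition, not part of the test material. If the water level inside the beaker stands a height h above the level outside the beaker, then

\[ P_{\mathrm{gas}} + \rho g h = P_{\mathrm{atm}} \]

where \(\rho\) is the density of water and \(g\) is the gravitational acceleration. The trapped gas is at atmospheric pressure exactly when h = 0, which is what equalizing the levels achieves.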
Problem H.1: If the right end of the U-shaped tube
were outside the beaker rather than inside, how would
that change things? (A) The chemical decomposition
would not occur. (B) The oxygen would remain in the
test tube. (C) The oxygen would bubble up through
the water in the basin to the open air and would not
be collected in the beaker. (D) Nothing would change.
The oxygen would still collect in the beaker, as shown.
Problem H.2: If the beaker had a hole in the base (on
top when inverted as shown), how would that change
things? (A) The oxygen would bubble up through the
beaker and out through the hole. (B) Nothing would
change. The oxygen would still collect in the beaker, as
shown. (C) The water would immediately flow out
from the inverted beaker into the basin and the beaker
would fill with air coming in through the hole.
Figure 2. Millikan Oil-Drop Experiment.

Problem H.3 If the test tube, the beaker, and the U-tube were all made of stainless steel rather than glass, how would that change things? (A) Physically it would make no difference, but it would be impossible to see and therefore impossible to measure. (B) The chemical decomposition would not occur. (C) The oxygen would seep through the stainless steel beaker. (D) The beaker would break. (E) The potassium chloride would accumulate in the beaker.
Problem H.4 Suppose the stopper in the test tube were
removed, but that the U-tube has some other support
that keeps it in its current position. How would that
change things? (A) The oxygen would stay in the test
tube. (B) All of the oxygen would escape to the outside
air. (C) Some of the oxygen would escape to the outside air, and some would go through the U-shaped
tube and bubble up to the beaker. So the beaker would
get some oxygen but not all the oxygen.
Problem H.5 The experiment description says, “The
experimenter raises or lowers the beaker until the levels of the water inside and outside the beaker are equal. At this point, the pressure in the beaker is equal
to atmospheric pressure.” More specifically: Suppose
that after the bubbling has stopped, the level of water
in the beaker is higher than the level in the basin (as
seems to be shown in the right-hand picture). Which
of the following is true: (A) The pressure in the beaker
is lower than atmospheric pressure, and the beaker
should be lowered. (B) The pressure in the beaker is
lower than atmospheric pressure, and the beaker
should be raised. (C) The pressure in the beaker is
higher than atmospheric pressure, and the beaker
should be lowered. (D) The pressure in the beaker is
higher than atmospheric pressure, and the beaker
should be raised.
Problem H.6 Suppose that instead of using a small
amount of potassium chlorate, as shown, you put in
enough to nearly fill the test tube. How will that
change things? (A) The chemical decomposition will
not occur. (B) You will generate more oxygen than the
beaker can hold. (C) You will generate so little oxygen
that it will be difficult to measure.
Problem H.7 In addition to the volume of the gas in
the beaker, which of the following are important to
measure accurately? (A) The initial mass of the potassium chlorate. (B) The weight of the beaker. (C) The
diameter of the beaker. (D) The number and size of the
bubbles. (E) The amount of liquid in the beaker.
Problem H.8 The illustration shows a graduated beaker.
Suppose instead you use an ungraduated glass beaker. How
will that change things? (A) The oxygen will not collect
properly in the beaker. (B) The experimenter will not know
whether to raise or lower the beaker. (C) The experimenter
will not be able to measure the volume of gas.
Problem H.9 At the start of the experiment, the beaker
needs to be full of water, with its mouth in the basin below
the surface of the water in the basin. How is this state
achieved? (A) Fill the beaker with water right-side up, turn it upside down, and lower it upside down into the basin. (B) Put the beaker right-side up into the basin below the
surface of the water; let it fill with water; turn it upside
down keeping it underneath the water; and then lift it
upward, so that the base is out of the water, but keeping
the mouth always below the water. (C) Put the beaker
upside down into the basin below the surface of the water;
and then lift it back upward, so that the base is out of the
water, but keeping the mouth always below the water. (D)
Put the beaker in the proper position, and then splash
water upward from the basin into it. (E) Put the beaker in
its proper position, with the mouth below the level of the
water; break a small hole in the base of the beaker; suction
the water up from the basin into the beaker using a
pipette; then fix the hole.
Millikan Oil-Drop Experiment
Problem H.10: In the Millikan oil-drop experiment, a tiny
oil drop charged with a single electron was suspended
between two charged plates (figure 2). The charge on the
plates was adjusted until the electric force on the drop
exactly balanced its weight. How were the plates charged?
(A) Both plates had a positive charge. (B) Both plates had
a negative charge. (C) The top plate had a positive charge,
and the bottom plate had a negative charge. (D) The top
plate had a negative charge, and the bottom plate had a
positive charge. (E) The experiment would work the same,
no matter how the plates were charged.
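For reference, the balance condition stated in the problem is the standard relation (our addition, not part of the test material):

\[ qE = mg \]

where q is the charge on the drop, E the magnitude of the field between the plates, m the drop's mass, and g the gravitational acceleration.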
Problem H.11: If the oil drop started moving upward, Millikan would (A) Increase the charge on the plates. (B)
Reduce the charge on the plates. (C) Increase the charge on
the drop. (D) Reduce the charge on the drop. (E) Make the
drop heavier. (F) Make the drop lighter. (G) Lift the
bottom plate.
Problem H.12: If the oil drop fell onto the bottom
plate, Millikan would (A) Increase the charge on the
plates. (B) Reduce the charge on the plates. (C)
Increase the charge on the drop. (D) Reduce the charge
on the drop. (E) Start over with a new oil drop.
Problem H.13: The experiment demonstrated that the
charge is quantized; that is, the charge on an object is
always an integer multiple of the charge of the electron, not a fractional or other noninteger multiple. To
establish this, Millikan had to measure the charge on
(A) One oil drop. (B) Two oil drops. (C) Many oil drops.
Astronomy Problems
Problem H.14: Does it ever happen that there is an
eclipse of the sun one day and an eclipse of the moon
the next?
Problem H.15: Does it ever happen that someone on
Earth sees an eclipse of the moon shortly after sunset?
Problem H.16: Does it ever happen that someone on
Earth sees an eclipse of the moon at midnight?
Problem H.17: Does it ever happen that someone on
Earth sees an eclipse of the moon at noon?
Problem H.18: Does it ever happen that one person on
Earth sees a total eclipse of the moon, and at exactly
the same time another person sees the moon uneclipsed?
Problem H.19: Does it ever happen that one person on
Earth sees a total eclipse of the sun, and at exactly the
same time another person sees the sun uneclipsed?
Problem H.20: Suppose that you are standing on the
moon, and Earth is directly overhead. How soon will
Earth set? (A) In about a week. (B) In about two weeks.
(C) In about a month. (D) Earth never sets.
Problem H.21: Suppose that you are standing on the
moon, and the sun is directly overhead. How soon will
the sun set? (A) In about a week. (B) In about two
weeks. (C) In about a month. (D) The sun never sets.
Problem H.22: You are looking in the direction of a
particular star on a clear night. The planet Mars is on
a direct line between you and the star. Can you see the star?
Problem H.23: You are looking in the direction of a
particular star on a clear night. A small planet orbiting
the star is on a direct line between you and the star.
Can you see the star?
Problem H.24: Suppose you were standing on one of
the moons of Jupiter. Ignoring the objects in the solar
system, which of the following is true: (A) The pattern
of stars in the sky looks almost identical to the way it
looks on Earth. (B) The pattern of stars in the sky looks
very different from the way it looks on Earth.
Problem H.25: Nearby stars exhibit parallax due to the
annual motion of Earth. If a star is nearby, and is in the
plane of Earth’s revolution, and you track its relative
motion against the background of very distant stars
over the course of a year, what figure does it trace? (A)
A straight line. (B) A square. (C) An ellipse. (D) A cycloid.
Problem H.26: If a star is nearby, and the line from
Earth to the star is perpendicular to the plane of
Earth’s revolution, and you track its relative motion
against the background of very distant stars over the
course of a year, what figure does it trace? (A) A
straight line. (B) A square. (C) An ellipse. (D) A cycloid.
Problems in Everyday Settings
Problem H.27: Suppose that you have a large closed
barrel. Empty, the barrel weighs 1 kg. You put into the
barrel 10 gm of water and 1 gm of salt, and you dissolve the salt in the water. Then you seal the barrel
tightly. Over time, the water evaporates into the air in
the barrel, leaving the salt at the bottom. If you put
the barrel on a scale after all the water has evaporated,
the weight will be (A) 1000 gm (B) 1001 gm (C) 1010
gm (D) 1011 gm (E) Water cannot evaporate inside a
closed barrel.
Problem H.28: Suppose you are in a room where the
temperature is initially 62 degrees. You turn on a
heater, and after half an hour, the temperature
throughout the room is now 75 degrees, so you turn
off the heater. The door to the room is closed; however there is a gap between the door and the frame, so air
can go in and out. Assume that the temperature and
pressure outside the room remain constant over the
time period. Comparing the air in the room at the
start to the air in the room at the end, which of the following is true: (A) The pressure of the air in the room
has increased. (B) The air in the room at the end occupies a larger volume than the air in the room at the
beginning. (C) There is a net flow of air into the room
during the half hour period. (D) There is a net flow of
air out of the room during the half hour period. (E)
Impossible to tell from the information given.
Problem H.29: The situation is the same as in problem
H.28, except that this time the room is sealed, so that
no air can pass in or out. Which of the following is
true: (A) The pressure of the air in the room has
increased. (B) The pressure of the air in the room has
decreased. (C) The air in the room at the end occupies
a larger volume than the air in the room at the beginning. (D) The air in the room at the end occupies a
smaller volume than the air in the room at the beginning. (E) The ideal gas constant is larger at the end
than at the beginning. (F) The ideal gas constant is
smaller at the end than at the beginning.
Problem H.30: You blow up a toy rubber balloon, and tie
the end shut. The air pressure in the balloon is: (A) Lower than the air pressure outside. (B) Equal to the air pressure outside. (C) Higher than the air pressure outside.
Apparent Advantages
of Standardized Tests
An obvious alternative to creating our own SQUABU
test is to use existing standardized tests. However, it
seems to me that the apparent advantages of using
standardized tests as benchmarks are mostly either
minor or illusory. The advantages that I am aware of
are the following:
Standardized Tests Exist
Standardized tests exist, in large number; they do not
have to be created. This “argument from laziness” is
not entirely to be sneezed at. The experience of the
computational linguistics community shows that, if
you take evaluation seriously, developing adequate
evaluation metrics and test materials requires a very
substantial effort. However, the experience of the
computational linguistics community also suggests
that, if you take evaluation seriously, this effort cannot be avoided by using standardized tests. No one in
the computational linguistics community would
dream of proposing that progress in natural language
processing (NLP) should be evaluated in terms of
scores on the English language SATs.
Investigator Bias
Entrusting the issue of evaluation measures and
benchmarks to the same physical reasoning community that is developing the programs to be evaluated
is putting the foxes in charge of the chicken coops.
The AI researchers will develop problems that fit their
own ideas of how the problems should be solved.
This is certainly a legitimate concern; but I expect in
practice much less distortion will be introduced this
way than by taking tests developed for testing people
and applying them to AI.
Vetting and Documentation
Standardized tests have been carefully vetted and
the performance of the human population on them
is very extensively documented. On the first point,
it is not terribly difficult to come up with correct
tests. On the second point, there is no great value to
the AI community in knowing how well humans of
different ages, training, and so on do on this problem. It hardly matters which questions can be
solved by 5-year-olds, which by 12-year-olds, and which by 17-year-olds, since, for the foreseeable
future, all AI programs of this kind will be idiot
savants (when they are not simply idiots), capable
of superhuman calculations at one minute, and subhuman confusions at the next. There is no such
thing as the mental age of an AI program; the abilities and disabilities of an AI program do not correspond to those of any human being who has ever
existed or could ever exist.
Public Acceptance
Success on standardized tests is easily accepted by the
public (in the broad sense, meaning everyone except
researchers in the area), whereas success on metrics
we have defined ourselves requires explanation, and
will necessarily be suspect. This, it seems to me, is the
one serious advantage of using standardized tests.
Certainly the public is likely to take more interest in
the claim that your program has passed the SAT, or
even the fourth-grade New York Regents test, than in
the claim that it has passed a set of questions that
you yourself designed and whose most conspicuous
feature is that they are spectacularly easy.
However, this is a double-edged sword. The public
can easily jump to the conclusion that, since an AI
program can pass a test, it has the intelligence of a
human that passes the same test. For example, Ohlsson et al. (2013) titled their paper “Verbal IQ of a
Four-Year Old Achieved by an AI System.”5 Unfortunately, this title was widely misinterpreted as a claim
about verbal intelligence or even general intelligence. Thus, an article in ComputerWorld (Gaudin
2013) had the headline “Top Artificial Intelligence
System Is As Smart As a 4-Year-Old;” the Independent
published an article “AI System Found To Be as
Clever as a Young Child after Taking IQ Test;” and
articles with similar titles were published in many
other venues. These headlines are of course absurd; a
four-year-old can make up stories, chat, occasionally
follow directions, invent words, learn language at an
incredible pace; ConceptNet (the AI system in question) can do none of these.
Finally, some standardized tests, including the SATs,
are not published and are available to researchers
only under stringent nondisclosure agreements. It
seems to me that AI researchers should under no circumstances use such a test with such an agreement.
The loss from the inability to discuss the program’s
behavior on specific examples far outweighs the gain
from using a test with the imprimatur of the official
test designer. This applies equally to Haroun and
Hestenes’ (1985) well-known basic physics test; in
any case, it would seem from the published information that that test focuses on testing understanding of force and energy rather than testing the relation of formal physics to basic world knowledge. The
same applies to the restrictions placed by
on the use of their data sets.
Standardized tests carry an immense societal burden and must meet a wide variety of very stringent
constraints. They are taken by millions of students
annually under very plain testing circumstances (no
use of calculators, let alone Internet). They bear a disproportionate share in determining the future of
those students. They must be fair across a wide range
of students. They must conform to existing curricula. They must maintain a constant level of difficulty,
both across the variants offered in any one year, and
from one year to the next. They are subject to
intense scrutiny by large numbers of critics, many of
them unfriendly. These constraints impose serious
limitations on what can be asked and how exams
can be structured.
In developing benchmarks for AI physical reasoning, we are subject to none of these constraints. Why
tie our own hands by confining ourselves to standardized tests? Why not take advantage of our freedom?
Conclusion

I have not worked out all the practical
issues that would be involved in actually offering one of the SQUABU tests
as an AI challenge, but I feel confident
that it can be done, if there is any
interest in it.
The kind of knowledge tested in
SQUABU is, of course, only a small part
of the knowledge of science that a K–
12 student possesses; however, it is one
of the fundamental bases underlying
all scientific knowledge. An AI system
for general scientific knowledge that
cannot pass the SQUABU challenge,
no matter how vast its knowledge base
and how powerful its reasoning
engine, is built on sand.
Acknowledgments

Thanks to Peter Clark, Gary Marcus, and Andrew Sundstrom for valuable comments.
Notes

1. See Nasar (1998), p. 80.
2. There may be some unresolved issues in
the theory of continuously branching time.
4. Do not attempt to carry out this experiment based on the description here. Potassium chlorate is explosive, and safety precautions, not described here, must be taken.
5. They have since changed the title to “Measuring an Artificial Intelligence System’s Performance on a Verbal IQ Test for Young Children.”
References

Barker, K.; Chaudhri, V. K.; Chaw, S. Y.;
Clark, P.; Fan, J.; Israel, D.; Mishra, S.; Porter,
B.; Romero, P.; Tecuci, D.; and Yeh, P. 2004.
A Question-Answering System for AP
Chemistry: Assessing KR&R Technologies.
In Principles of Knowledge Representation and
Reasoning: Proceedings of the Ninth International Conference. Menlo Park, CA: AAAI Press.
Brachman, R.; Gunning, D.; Bringsjord, S.; Genesereth, M.; Hirschman, L.; and Ferro, L. 2005. Selected Grand Challenges in Cognitive Science. MITRE Technical Report 051218. Bedford, MA: The MITRE Corporation.
Brown, T. L.; LeMay, H. E.; Bursten, B.; and Burdge, J. R. 2003. Chemistry: The Central Science, ninth edition. Upper Saddle River, NJ: Prentice Hall.
Clark, P., and Etzioni, O. 2016. My Computer Is an Honor Student — But How Intelligent Is It? Standardized Tests as a Measure of AI. AI Magazine 37(1).
Clark, P.; Harrison, P.; and Balasubramanian, N. 2013. A Study of the Knowledge Base
Requirements for Passing an Elementary
Science Test. In AKBC’13: Proceedings of the
2013 Workshop on Automated Knowledge Base
Construction. New York: Association for
Computing Machinery.
Davis, E. 1990. Representations of Commonsense Knowledge. San Mateo, CA: Morgan Kaufmann.
Davis, E., and Marcus, G. 2016. The Scope
and Limits of Simulation in Automated Reasoning. Artificial Intelligence 233(April): 60–72.
Gaudin, S. 2013. Top Artificial Intelligence
System Is as Smart as a 4-Year Old, Computerworld July 15, 2013.
Haroun, I., and Hestenes, D. 1985. The Initial Knowledge State of College Physics Students. American Journal of Physics 53(11): 1043–1055.
Havasi, C.; Speer, R.; and Alonso, J. 2007.
Conceptnet 3: A Flexible Multilingual
Semantic Network for Common Sense
Knowledge. Paper presented at the Recent
Advances in Natural Language Processing
Conference, Borovets, Bulgaria, September.
Kaplan. 2013. Kaplan SAT Subject Test:
Physics. 2013–2014. New York: Kaplan Publishing.
Lenat, D.; Prakash, M.; and Shepherd, M.
1986. CYC: Using Common Sense Knowledge to Overcome Brittleness and Knowledge Acquisition Bottlenecks. AI Magazine
6(4): 65–85.
Levesque, H.; Davis, E.; and Morgenstern, L.
2012. The Winograd Schema Challenge. In
Principles of Knowledge Representation and
Reasoning: Proceedings of the Thirteenth International Conference. Palo Alto, CA: AAAI
Mitchell, T.; Cohen, W.; Hruschka, E.; Talukdar, P.; Betteridge, J.; Carlson, A.; Dalvi, B.;
Gardner, M.; Kisiel, B.; Krishnamurthy, J.;
Lao, N.; Mazaitis, K.; Mohamed, T.; Nakashole, N.; Platanios, E.; Ritter, A.; Samadi, M.;
Settles, B.; Wang, R.; Wijaya, D.; Gupta, A.;
Chen, X.; Saparov, A.; Greaves, M.; and Welling, J. 2015. Never-Ending Learning. In Proceedings of the Twenty-Ninth AAAI Conference on
Artificial Intelligence. Palo Alto: AAAI Press.
Nasar, S. 1998. A Beautiful Mind: The Life of
Mathematical Genius and Nobel Laureate John
Nash. New York: Simon and Schuster.
New York State Education Department.
2014. The Grade 4 Elementary-Level Science Test. Albany, NY: University of the
State of New York.
Ohlsson, S.; Sloan, R. H.; Turán, G.; and
Urasky, A. 2013. Verbal IQ of a Four-Year
Old Achieved by an AI System. Paper presented at the Eleventh International Symposium on Logical Foundations of Commonsense Reasoning, Ayia Napa, Cyprus,
27–29 May.
Reiter, R. 2001. Knowledge in Action: Logical
Foundations for Specifying and Implementing
Dynamical Systems. Cambridge, Mass.: The
MIT Press.
Seo, M.; Hajishiri, H.; Farhadi, A.; Etzioni,
O.; and Malcolm, C. 2015. Solving Geometry Problems: Combining Text and Diagram
Interpretation. In EMNLP 2015: Proceedings
of the Empirical Methods in Natural Language
Processing. Stroudsburg, PA: Association for
Computational Linguistics.
Strickland, E. 2013. Can an AI Get into the
University of Tokyo? IEEE Spectrum, 21 August 2013.
von Ahn, L.; Blum, M.; Hopper, N.; and
Langford, J. 2003. CAPTCHA: Using Hard AI
Problems for Security. In Proceedings of the
Annual International Conference on the Theory and Applications of Cryptographic Techniques (EUROCRYPT-03). Carson City, NV:
International Association for Cryptologic Research.
Welty, C. undated. Why Toronto? Unpublished MS.
Weston, J.; Bordes, A.; Chopra, S.; Mikolov,
T.; and Rush, A. 2015. Towards AI-Complete
Question Answering: A Set of Prerequisite Toy
Tasks. arXiv preprint arXiv:1502.05698v6.
Ithaca, NY: Cornell University Library.
Wu, W.; Li, H.; Wang, H.; and Zhu, K.Q.
2012. Probase: A Probabilistic Taxonomy for
Text Understanding. In Proceedings of the
2012 ACM SIGMOD International Conference on Management of Data, 481–492.
New York: Association for Computing
Ernest Davis is a professor of computer science at New York University. His research
area is automated commonsense reasoning,
particularly commonsense spatial and physical reasoning. He is the author of Representing and Acquiring Geographic Knowledge
(1986), Representations of Commonsense
Knowledge (1990), and Linear Algebra and
Probability for Computer Science Applications
(2012); and coeditor of Mathematics, Substance and Surmise: Views on the Meaning and
Ontology of Mathematics (2015).
Toward a Comprehension
Challenge, Using
Crowdsourcing as a Tool
Praveen Paritosh, Gary Marcus
Human readers comprehend vastly
more, and in vastly different ways, than
any existing comprehension test would
suggest. An ideal comprehension test for
a story should cover the full range of
questions and answers that humans
would expect other humans to reasonably learn or infer from a given story. As
a step toward these goals we propose a
novel test, the crowdsourced comprehension challenge (C3), which is constructed by repeated runs of a three-person game, the Iterative Crowdsourced
Comprehension Game (ICCG). ICCG
uses structured crowdsourcing to comprehensively generate relevant questions
and supported answers for arbitrary stories, whether fiction or nonfiction, presented across a variety of media such as
videos, podcasts, and still images.
Artificial Intelligence (AI) has made enormous advances,
yet in many ways remains superficial. While the AI scientific community had hoped that by 2015 machines
would be able to read and comprehend language, current
models are typically superficial, capable of understanding sentences in limited domains (such as extracting movie times and
restaurant locations from text) but without the sort of wide-coverage comprehension that we expect of any teenager.
Comprehension itself extends beyond the written word;
most adults and children can comprehend a variety of narratives, both fiction and nonfiction, presented in a wide variety
of formats, such as movies, television and radio programs,
written stories, YouTube videos, still images, and cartoons.
They can readily answer questions about characters, setting,
motivation, and so on. No current test directly investigates
such a variety of questions or media. The closest thing that
one might find are tests like the comprehension questions in
a verbal SAT, which only assess reading (video and other formats are excluded) and tend to emphasize tricky questions
designed to discriminate between strong and weak human
readers. Basic questions that would be obvious to most
humans — but perhaps not to a machine — are excluded.
Yet it is hard to imagine an adequate general AI that could
not comprehend with at least the same sophistication and
breadth as an average human being, and easy to imagine that
progress in building machines with deeper comprehension
could radically alter the state of the art. Machines that could
comprehend with the sophistication and breadth of humans
could, for instance, learn vastly more than current systems
from unstructured texts such as Wikipedia and the daily news.
How might one begin to test broad-coverage comprehension in a machine?
In principle, the classic Turing test might be one
way to assess the capacity of a computer to comprehend a complex discourse, such as a narrative. In
practice, the Turing test has proved to be highly
gameable, especially as implemented in events such
as the Loebner competitions, in which the tests are
too short (a few minutes) to allow any depth (Shieber
1994; Saygin, Cicekli, and Akman 2003). Furthermore, empirical experimentation has revealed that
the best way to “win” the Turing test is to evade most
questions, answering with jokes and diversionary tactics. This winds up teaching us little about the capacity of machines to comprehend narratives, fictional
or otherwise.
As part of the Turing Championships, we (building
on Marcus [2014]) would like to see a richer test of
comprehension, one that is less easily gamed, and
one that probes more deeply into the capacity of
machines to understand materials that might be read
or otherwise perceived.
We envision that such a challenge might be structured into separate tracks for audio, video, still
images, images with captions, and so forth, including
both fiction and nonfiction. But how might one generate the large number of questions that provide the
requisite breadth and depth? Li et al. (forthcoming)
suggest one strategy, focused on generating “journalist-style” questions (who, what, when, where, why)
for still images.1 Poggio and Meyers (2016) and Zitnick et al. (2016) suggest approaches aimed at testing
question answering from still images. Here, we suggest a more general procedure, suitable for a variety of
media and a broad range of questions, using crowdsourcing as the primary engine.
In the remainder of this article we briefly examine
what comprehension consists of, discuss some existing approaches to assessing it, present desiderata for
a comprehension challenge, and then turn toward
crowdsourcing and how it can help define a meaningful comprehension challenge.
What Is Human Comprehension?
Human comprehension entails identifying the
meaning of a text as a connected whole, beyond a
series of individual words and sentences (Kintsch and
van Dijk 1978, Anderson and Pearson 1984, Rapp et
al. 2007). Comprehension reflects the degree to
which appropriate, meaningful connections are
established between elements of text and the reader’s
prior knowledge.
Referential and causal/logical relations are particularly important in establishing coherence, by
enabling readers to keep track of objects, people,
events, and the relational information connecting
facts and events mentioned in the text. The relations that readers must infer are not necessarily obvious. They can be numerous and complex; extend
over long spans of the text; involve extensive back-
ground commonsense, social, cultural, and world
knowledge; and require coordination of multiple
pieces of information.
Human comprehension involves a number of different cognitive processes. Davis (1944), for instance,
describes a still-relevant taxonomy of different skills
tested in reading comprehension tests, and shows
empirical evidence regarding performance variance
across these nine different skills: knowledge of word
meanings; ability to select the appropriate meaning
for a word or phrase in light of its particular contextual setting; ability to follow the organization of a
passage and to identify antecedents and references in
it; selecting the main thought of a passage; answering
questions that are specifically answered in a passage;
answering questions that are answered in a passage
but not in the words in which the question is asked;
drawing inferences from a passage about its content;
recognition of literary devices used in a passage and
determination of its tone and mood; inferring a
writer’s purpose, intent, and point of view.
Subsequent research into comprehension examining long-term performance data of humans shows
that comprehension is not a single gradable dimension, but comprises many distinct skills (for example,
Keenan, Betjemann, and Olson [2008]). Most extant
work examines small components of comprehension, rather than the capacity of machines to comprehend a complete discourse in its entirety.
Existing Approaches for Measuring
Machine Comprehension
How can we test progress in this area? In this section,
we summarize current approaches to measuring
machine comprehension.
AI has a wide variety of evaluations in the form of
shared evaluations and competitions, many of
which bear on the question of machine comprehension. For example, TREC-8 (Voorhees 1999)
introduced the question-answering track in which
the participants were given a collection of documents and asked to answer factoid questions such
as “How many calories are in a Big Mac?” or “Where
is the Taj Mahal?” This led to a body of research in
applying diverse techniques in information retrieval
and structured databases to question answering and
comprehension tasks (Hirschman and Gaizauskas
2001). The Recognizing Textual Entailment (RTE)
Challenge (Dagan, Glickmann, and Magnini 2006)
is another competition with relevance to comprehension. Given two text fragments, the task requires
recognizing whether the meaning of one text is
entailed by (can be inferred from) the other text.
From 2004 to 2013, eight RTE Challenges were
organized with the aim of providing researchers
with concrete data sets on which to evaluate and
compare their approaches. Neither the TREC nor the RTE competition, however, addresses the breadth and depth of human comprehension we envision.
One approach to testing broader-coverage
machine comprehension seeks to leverage the existing diverse battery of human comprehension tests,
such as SATs, domain-specific science tests, and so on
(for example, Barker et al. [2004] and Clark and
Etzioni [2016]). The validity of standardized tests lies
in their ability to identify humans who are more likely to succeed at a certain task, such as in the practice
of medicine or law.
As such, human tests are intended to effectively
discriminate among intelligent human applicants,
but as E. Davis (2016) notes, they do not necessarily
contain classes of questions relevant to discriminating between human and artificial intelligence: questions that are easy for humans but difficult for
machines, that are subjective, and so on.
Recent work on commonsense reasoning points to
one possible alternative approach. The Winograd
Schema Challenge (Levesque, Davis, and Morgenstern 2012; Morgenstern et al. 2016), for instance,
can be seen as comprehension in a microcosm: a single story in a single sentence or very short passage
with a single binary question that can in principle be
reliably answered only by a system that has some
commonsense knowledge. In each question there is a
special word, such as that underlined in the following example, that can be replaced by an alternative
word in a way that fundamentally changes the sentence’s meaning.
The trophy would not fit into the brown suitcase
because it was too big/small.
What was too big/small?
Answer 0: the trophy
Answer 1: the suitcase
In each example, the reader’s challenge is to disambiguate the passage. By design, clever tricks
involving word order or other features of words or
groups of words will not work. In the example above,
contexts where “big” can appear are statistically quite
similar to those where “small” can appear, and yet
the answer must change. The claim is that doing better than guessing requires readers to figure out what
is going on; for example, a failure to fit is caused by
one of the objects being too big and the other being
too small, and readers must determine which is which.
SQUABU, for “science questions appraising basic
understanding” (Davis 2016), generalizes this
approach into a test-construction methodology and
presents a series of test materials for machines at
fourth-grade and high school levels. Unlike the
human counterparts of such tests, which focus on
academic material, these tests focus on commonsense knowledge such as the understanding of time,
causality, impossible or pointless scenarios, the
human body, combining facts, making simple inductive arguments of indeterminate length, relating for-
mal science to the real world, and so forth. Here are
two example questions from SQUABU for the fourth-grade level:
Sally’s favorite cow died yesterday. The cow will probably be alive again (A) tomorrow; (B) within a week;
(C) within a year; (D) within a few years; (E) The cow
will never be alive again.
Is it possible to fold a watermelon?
Winograd schemas and SQUABU demonstrate
some areas where standardized tests lack coverage for
testing machines. Both tests, however, are entirely
generated by experts and are difficult to scale to large
numbers of questions and domains; neither is directed at broad-coverage comprehension.
Desiderata for a
Comprehension Challenge
In a full-coverage test of comprehension, one might
want to be able to ask a much broader range of questions. Suppose, for example, that a candidate software program is confronted with a just-published spy
thriller, for which there are no web-searchable CliffsNotes yet written. An adequate system (Marcus 2014,
Schank 2013) should be able to answer questions
such as the following: Who did what to whom? Who
was the protagonist? Was the CIA director good or
evil? Which character leaked the secrets? What were
those secrets? What did the enemy plan to do with
those secrets? Where did the protagonist live? Why
did the protagonist fly to Moscow? How does the story make the reader/writer feel? And so forth. A good
comprehension challenge should evaluate the full
breadth and depth of human comprehension, not
just knowledge of common sense. To our knowledge,
no previous test or challenge has tried to do this in a
general way.
Another concern with existing test-construction
methodology for putative comprehension challenges
is the lack of transparency in the test creation and
curation process. Namely, why does a test favor some
questions and certain formulations over others?
There is a central, often-unspoken role of the test
curator in choosing the questions to ask, which is a
key aspect of the comprehension task.
Given a news article, story, movie, podcast, novel,
radio program, or photo — referred to as a document
from this point forward — an adequate test should
draw from a full breadth of all document-relevant
questions with document-supported answers that
humans can infer.
We suggest that the coverage goal of the comprehension challenge can be phrased as an empirical criterion:
A comprehension test should cover the full range of
questions and answers that humans would expect other humans to reasonably learn or infer from a given document.
How can we move toward this goal?
SPRING 2016 25
The C3 Test
We suggest that the answer begins with crowdsourcing. Previous work has shown that crowdsourcing
can be instrumental in creating large-scale shared
data sets for evaluation and benchmarking.
The major benefits of crowdsourcing are enabling
scaling to broader coverage (for example, of domains,
languages), building significantly larger data sets, and
capturing broader sets of answers (Arroyo and Welty
2014), as well as gathering empirical data regarding
reliability and validity of the test (Paritosh 2012).
ImageNet (Deng et al. 2009), for example, is a
large-scale crowdsourced image database consisting
of 14 million images with over a million human
annotations, organized by the Wordnet lexicon; it
has been a catalyst for recent computer vision
research with deep convolutional networks
(Krizhevsky, Sutskever, and Hinton 2012). Freebase
(Bollacker et al. 2008) is a large database of human-curated structured knowledge that has similarly
sparked research on fact extraction (Mintz et al. 2009;
Riedel, Yao, and McCallum 2010).
Christoforaki and Ipeirotis (2014) present a
methodology for crowdsourcing the construction of
tests using the questions and answers on the community question-answering site Stack Overflow.2
This work shows that open-ended question and
answer content can be turned into multiple-choice
questions using crowdsourcing. Using item response
theory on crowdsourced performance on the test
items, they were able to identify the relative difficulty of each question.
MCTest (Richardson, Burges, and Renshaw 2013)
is a crowdsourced comprehension test corpus that
consists of approximately 600 fictional stories written by Amazon Mechanical Turk crowd workers.
Additionally, the crowd workers generated multiple-choice questions and their correct answers, as well as
plausible but incorrect answers. The workers were
given guidelines regarding the story, questions, and
answers, such as that they should ask questions that
make use of information in multiple sentences. The
final test corpus was produced by manual curation of
the resulting stories, questions, and answers. This
approach is promising, as it shows that it is possible
to generate comprehension tests using crowdsourcing. However, much like the standardized and commonsense tests, the test-curation process here is neither
entirely transparent nor generalizable to other types
of documents and questions.
The question at hand is whether we can design
reliable processes for crowdsourcing the construction
of comprehension tests that provide us with measurable signals and guarantees of quality, relevance, and
coverage, not just whether we can design a test.
As an alternative, and as a starting point for further discussion, we propose here a crowdsourced
comprehension challenge (C3). At the root is a document-focused imitation game, which we call the iterative crowdsourcing comprehension game (ICCG), the
goal of which is to generate a systematic and comprehensive set of questions and validated answers relevant to any given document (video, text story, podcast, or other). Participants are incentivized to
explore questions and answers exhaustively, until the
game terminates with an extensive set of questions
and answers. The C3 is then produced by aggregating
and curating questions and answers generated from
multiple iterations of the ICCG.
The structure, which necessarily depends on cooperative yet independent judgments from multiple
humans, is inspired partly by Luis von Ahn’s work.
For example, in the two-player ESP game (von Ahn
and Dabbish 2004) for image labeling, the goal is to
guess what label your partner would give to the
image. Once both players have typed the exact same
string, they win the round, and a new image appears.
This game and others in the games with a purpose
series (von Ahn 2006) introduced the methodology
of input agreement (Law and von Ahn 2009), where
the goal of the participants is to try to agree on the
input, encouraging them to model the other participant. The ICCG extends this to a three-person imitation game, itself partially in the spirit of Turing’s original test (Turing 1950).
The Iterative Crowdsourcing
Comprehension Game
The iterative crowdsourcing comprehension game
(ICCG) is a three-person game. Participants are randomly assigned to fill one of three roles in each run
of the game: reader (R), guesser (G), or judge (J). Players are sampled from a norming population of interest (for example, one might make tests at the second-grade level or college level). They should not know
each other and should be identified only by
anonymized screen names that are randomly
assigned afresh in each round. They cannot communicate with each other besides the allowed game moves.
Only the judge and the reader have access to the
document (as defined earlier, text, image, video, podcast, and others); the guesser is never allowed to see it.
The only thing readers and judges have in common is
this document that they can both comprehend.
The purpose of the game is to generate a comprehensive set of document-relevant questions (with
corresponding document-supported answers) as an
outcome. The judge’s goal is to identify who is the
genuine document holder. The reader’s goal is to
prove possession of the document, by asking document-relevant questions and by providing document-supported answers. The guesser’s goal is to
establish possession of the document, by learning
from prior questions and answers.
A game consists of a sequence of rounds, as depicted in figure 1. A shared whiteboard is used for keeping track of questions and answers, which are pub-
lished at the end of each round. The whiteboard is
visible to all participants and allows the guesser to
learn about the content of the document as the game
proceeds. (Part of the fun for the guesser lies in making leaps from the whiteboard in order to make educated guesses about new questions.)
Each round begins with randomly assigning either
the reader or the guesser to play the questioner for
the round. The questioner writes down a question for
this round. The reader’s goal, while playing questioner, is to ask novel questions that have reliable
document-supported answers. As the game proceeds,
the reader is incentivized to exhaust the space of document-supported questions to be distinguished from
the guesser. The reader, as questioner, does not earn
points for asking questions that the guesser could
answer correctly using nondocument knowledge or
conclude from prior questions and answers on the
whiteboard. When the questioner is the guesser, the
goal is to ask revealing questions in order to learn as much
about the story as possible, as quickly as possible.
At this point we have a question, from either the
reader or guesser. The question is shared with the
other participant,3 who independently answers.
The judge is presented with both the question and
the two answers with authors anonymized and
attempts to identify which one is the reader. This
anonymization is done afresh for the next round.
The objective of both the reader and guesser is to be
chosen as the reader by the judge, so both are incentivized to ask questions and generate answers that
will convince the judge that they are in possession of
the document.
The round is scored using this simple rubric: The
judge earns a point for identifying the reader correctly, and the reader or guesser earns a point for
being identified as the document holder by the judge.
At the end of each round, the question and the
reader’s and guesser’s answers are published on the
whiteboard. The reader’s job is exhaustively to ask
document-relevant questions, without generating
questions that the guesser could extract from the
accumulated whiteboard notes; the guesser's job is to
glean as much information as possible to improve at answering future questions.
Initially, it is very easy for the judge to identify the
reader. However, roughly every other round the
guesser (when chosen to be the questioner) gets to
ask a question and learn the reader’s and judge’s
answers to that question. The main strategic goal of
the guesser is to erode their disadvantage, the lack of
access to the document, as quickly as possible. For
example, the guesser might begin by asking basic
information-gathering questions: who, what, where,
when, why, and how questions.4 The increased knowledge of the document revealed through the questions
and answers should improve guessing performance
over rounds.
Figure 1. The Iterative Crowdsourcing Comprehension Game. (Flow of a round: questioner generates question; reader and guesser answer; judge attempts to identify the reader; round is scored; whiteboard is updated; the loop continues while the judge can still identify the reader better than chance.)
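To make the round mechanics concrete, here is a minimal sketch of one ICCG round in Python. Nothing in it is a published implementation: the Whiteboard class, the ask/answer/identify interfaces, and the scoring dictionary are our own illustrative inventions, standing in for whatever interface a real deployment would use.

import random
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Whiteboard:
    """Public record of published (question, reader answer, guesser answer) triples."""
    rounds: List[Tuple[str, str, str]] = field(default_factory=list)

def play_round(reader, guesser, judge, board):
    """One ICCG round. The reader and guesser objects are assumed to expose
    ask(board) and answer(question, board); the judge exposes
    identify(question, answers), returning the index of the answer it
    believes came from the reader. These interfaces are hypothetical."""
    # Either the reader or the guesser is randomly assigned questioner.
    questioner = random.choice([reader, guesser])
    question = questioner.ask(board)

    # Reader and guesser answer independently.
    answers = {"reader": reader.answer(question, board),
               "guesser": guesser.answer(question, board)}

    # Authorship is anonymized afresh before the judge sees the answers.
    order = ["reader", "guesser"]
    random.shuffle(order)
    pick = judge.identify(question, [answers[role] for role in order])
    picked_role = order[pick]

    # Scoring rubric: the judge earns a point for a correct identification,
    # and whoever is picked as the document holder earns a point.
    scores = {"judge": int(picked_role == "reader"),
              "reader": int(picked_role == "reader"),
              "guesser": int(picked_role == "guesser")}

    # The question and both answers are published for later rounds.
    board.rounds.append((question, answers["reader"], answers["guesser"]))
    return scores

A full game would call play_round repeatedly, stopping once the judge's identification accuracy over recent rounds drops to chance.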
The game concludes when all attempts at adding
further questions fail to discriminate between the
guesser and reader. This implies that the corpus of
questions and answers collected on the whiteboard is
a comprehensive set, that is, sufficient to provide an
understanding comparable to having read the document. There can be many different sets of questions,
Figure 2. An Example Whiteboard. Created for the document "For sale: baby shoes, never worn." (The whiteboard table records, for each round, the questioner, the question — for example, "Is it a happy story?", "Who were the shoes for?", "How many characters are in the story?", "What is happening to the shoes?", "When were the shoes worn?" — the reader's and guesser's answers, such as "A baby," and the judge's identification.)
due to sequence effects and variance in
participants. We repeat the ICCG many times to collect the raw material for the
construction of the crowdsourced
comprehension challenge.
Figure 2 depicts an example whiteboard after several rounds of questioning for a simple document, a six-word
novel attributed to Ernest Hemingway.
Constructing the Crowdsourced
Comprehension Challenge
Given a document, each run of the
game above produces a set of document-relevant questions and document-validated answers, ultimately
producing a comprehensive (or at least
extensive) set of questions. By aggregating across multiple iterations of the
game with the same document, we
obtain a large corpus of document-relevant questions and validated answers.
This is the raw data for constructing
the comprehension test. Finalizing the
test requires further aggregation, deduplication, and filtering using crowdsourced methods, for example, the
Find/Fix/Verify methodology (Bernstein et al. 2010).
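As a rough illustration of that aggregation step, the sketch below pools the whiteboards from many runs of the game on the same document and keeps each question once, with its supported answers. The whitespace normalization and the vote threshold are crude stand-ins for the crowdsourced Find/Fix/Verify passes, not a reconstruction of them.

from collections import defaultdict

def aggregate_whiteboards(whiteboards, min_votes=2):
    """Merge (question, answer) pairs from many ICCG runs on one document.

    whiteboards: iterable of lists of (question, answer) pairs.
    min_votes: hypothetical threshold standing in for crowdsourced
    verification of each answer."""
    votes = defaultdict(lambda: defaultdict(int))
    for board in whiteboards:
        for question, answer in board:
            # Crude normalization as a stand-in for real de-duplication.
            q = " ".join(question.lower().split())
            a = " ".join(answer.lower().split())
            votes[q][a] += 1

    test = {}
    for question, answer_counts in votes.items():
        accepted = [a for a, n in answer_counts.items() if n >= min_votes]
        if accepted:  # keep only questions with well-supported answers
            test[question] = accepted
    return test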
This approach suggests that comprehension must be considered relative to
a population. This turns our original
goal for the challenge — full range of
questions and answers that humans
would expect other humans to reasonably learn or infer from a given document — into an empirical and crowdsourceable goal. Additionally, this
allows us to design testing instruments
tailored across skill levels, ages, or
domains, as well as adaptable to a wide
swath of cultural contexts, by sampling participants from different populations.
Figure 3 depicts the process of constructing the final test, which features
the crowdsourced collection of the
question–answer pairs.
In this way, a broad-coverage comprehension challenge, C3, can be constructed using crowdsourcing. By varying
the population, we can construct
comprehension tests that reveal the
comprehension of second graders or
doctors. In addition, by varying the
format of questions and answers,
open-ended, multiple choice, Boolean,
and others, or restricting allowable
questions to be of a certain type, we
can construct different challenges.
Conclusions and Future Work
Improved machine comprehension
would be a vital step toward more general artificial intelligence and could
potentially have enormous benefits for
humanity, if machines could integrate
medical, scientific, and technological
information in ways that were humanlike.
Here we propose C3, the crowdsourced comprehension challenge,
and one candidate technique for generating such tests, the ICCG, which
yields a comprehensive, relevant, and
human-validated corpus of questions
and answers for arbitrary content, fiction or nonfiction, presented in a variety of forms. The game also produces
human-level performance data for
constructing tests, which with suitable
participants (such as second graders or
adult native speakers of a certain language) could be used to yield a range of
increasingly challenging benchmarks.
It could also be tailored to specific
areas of knowledge and inference (for
example, the domain of questions
could be restricted to commonsense
understanding, to science or medicine,
or to cultural and social understanding). Unlike specific tests of expertise,
this is a general test-generation procedure whose scope is all questions that
can be reliably answered by humans
(either in general, or drawn from a
population of interest) holding the document.
Of course, more empirical and theoretical work is needed to implement,
validate, and refine the ideas proposed
here. Variations of the ICCG might be
useful for different data-collection
processes (for example, Paritosh [2015]
explores a version where the individual reader and guesser are replaced by
samples of readers and guessers). An
important area of future work is the
design of incentives to make the game more engaging and useful (for example, Prelec [2004]). We
believe that crowdsourced processes for the design of
human-level comprehension tests will be an invaluable addition to the arsenal of assessments of
machine intelligence and will spur research in deep
understanding of language.
Acknowledgments
The authors would like to thank Ernie Davis, Stuart
Shieber, Peter Norvig, Ken Forbus, Doug Lenat, Nancy
Chang, Eric Altendorf, David Huynh, David Martin,
Nick Hay, Matt Klenk, Jutta Degener, Kurt Bollacker,
and participants and reviewers of the Beyond the Turing Test Workshop at AAAI 2015 for insightful comments and suggestions on the ideas presented here.
Notes
1. This is part of the VisualGenome corpus, visualgenome.
3. One might also secure an answer from the judge, as a validity check and to gain a broader range of acceptable answers
(for example, shoes or baby shoes might both work for a question about the Hemingway story shown in figure 2).
4. The popular Twenty Questions game is a much simpler
version, where the guesser tries to identify an object within
twenty yes/no questions. Questions such as “Is it bigger
than a breadbox?” or “Does it involve technology for communications, entertainment, or work?” allow the questioner to cover a broad range of areas using a single question.
References
Anderson, R. C., and Pearson, P. D. 1984. A Schema-Theoretic View of Basic Processes in Reading Comprehension. In Handbook of Reading Research Volume 1, 255–291. London.
Barker, K.; Chaudhri, V. K.; Chaw, S. Y.; Clark, P.; Fan, J.; Israel, D.; Mishra, S.; Porter, B.; Romero, P.; Tecuci, D.; and Yeh, P. 2004. A Question-Answering System for AP Chemistry: Assessing KR&R Technologies. In Principles of Knowledge Representation and Reasoning: Proceedings of the Ninth International Conference, 488–497. Menlo Park, CA: AAAI Press.
Bernstein, M. S.; Little, G.; Miller, R. C.; Hartmann, B.; Ackerman, M. S.; Karger, D. R.; Crowell, D.; and Panovich, K.
2010. Soylent: A Word Processor with a Crowd Inside. In
Proceedings of the 23rd Annual ACM Symposium on User Interface Software and Technology, 313–322. New York: Association
for Computing Machinery.
Bollacker, K.; Evans, C.; Paritosh, P.; Sturge, T.; and Taylor, J. 2008. Freebase: A Collaboratively Created Graph Database for Structuring Human Knowledge. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2008, 1247–1250. New York: Association for
Computing Machinery.
Christoforaki, M., and Ipeirotis, P. 2014. STEP: A Scalable
Testing and Evaluation Platform. In Proceedings of the Second
AAAI Conference on Human Computation and Crowdsourcing.
Palo Alto, CA: AAAI Press.
Figure 3. Crowdsourced Comprehension Challenge Generation. (Multiple whiteboards of {question, answer} pairs, each produced from a document such as a story, news article, image, or video, are aggregated and curated into the challenge, C3.)
Clark, P., and Etzioni, O. 2016. My Computer Is an Honor
Student — But How Intelligent Is It? Standardized Tests as a
Measure of AI. AI Magazine 37(1).
Dagan, I.; Glickman, O.; and Magnini, B. 2006. The PASCAL
Recognising Textual Entailment Challenge. In Machine
Learning Challenges, Lecture Notes in Computer Science Volume 3944, 177–190.
Berlin: Springer.
Davis, E. 2016. How to Write Science Questions That Are Easy for People and Hard for
Computers. AI Magazine 37(1).
Davis, F. B. 1944. Fundamental Factors of
Comprehension in Reading. Psychometrika
9(3): 185–197.
Deng, J.; Dong, W.; Socher, R.; Li, L. J.; Li, K.; and Fei-Fei, L. 2009. Imagenet: A Large-Scale Hierarchical Image Database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 248–255. Piscataway, NJ: Institute of Electrical and Electronics Engineers. 10.1109/CVPR.2009.5206848
Hirschman, L., and Gaizauskas, R. 2001.
Natural Language Question Answering: The
View from Here. Natural Language Engineering 7(04): 275–300.
Keenan, J. M.; Betjemann, R. S.; and Olson,
R. K. 2008. Reading Comprehension Tests
Vary in the Skills They Assess: Differential
Dependence on Decoding and Oral Comprehension. Scientific Studies of Reading
Kintsch, W., and van Dijk, T. A. 1978.
Toward a Model of Text Comprehension
and Production. Psychological Review 85(5):
Krizhevsky, A.; Sutskever, I.; and Hinton, G.
E. 2012. Imagenet Classification with Deep
Convolutional Neural Networks. In
Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural
Information Processing Systems 2012, 1097–
1105. La Jolla, CA: Neural Information Processing Systems Foundation, Inc.
Law, E., and von Ahn, L. 2009. Input-Agreement: A New Mechanism for Collecting
Data Using Human Computation Games. In
Proceedings of the SIGCHI Conference on
Human Factors in Computing Systems, 1197–
1206. New York: Association for Computing Machinery.
Levesque, H.; Davis, E.; and Morgenstern, L.
2012. The Winograd Schema Challenge. In
Principles of Knowledge Representation and
Reasoning: Proceedings of the Thirteenth International Conference (KR2012), 552–561. Palo
Alto, CA: AAAI Press.
Marcus, G. 2014. What Comes After the
Turing Test? New Yorker (June 9).
Mintz, M.; Bills, S.; Snow, R.; and Jurafsky,
D. 2009. Distant Supervision for Relation
Extraction Without Labeled Data. In Pro-
ceedings of the Joint Conference of the 47th
Annual Meeting of the Association for Computational Linguistics, 1003–1011. Stroudsburg,
PA: Association for Computational Linguistics.
Morgenstern, L.; Davis, E.; and Ortiz, C. L. Jr.
2016. Planning, Executing, and Evaluating
the Winograd Schema Challenge. AI Magazine 37(1).
Paritosh, P. 2012. Human Computation
Must Be Reproducible. In CrowdSearch 2012:
Proceedings of the First International Workshop
on Crowdsourcing Web Search. Ceur Workshop Proceedings Volume 842. Aachen,
Germany: RWTH-Aachen University.
Paritosh, P. 2015. Comprehensive Comprehension: A Document-Focused, HumanLevel Test of Comprehension. Paper presented at Beyond the Turing Test, AAAI
Workshop W06, Austin, TX, January 25.
Poggio, T., and Meyers, E. 2016. Turing++
Questions: A Test for the Science of
(Human) Intelligence. AI Magazine 37(1).
Prelec, D. 2004. A Bayesian Truth Serum for
Subjective Data. Science 306(5695): 462–
Rapp, D. N.; Broek, P. V. D.; McMaster, K. L.;
Kendeou, P.; and Espin, C. A. 2007. Higher-Order Comprehension Processes in Struggling Readers: A Perspective for Research
and Intervention. Scientific Studies of Reading 11(4): 289–312.
Richardson, M.; Burges, C. J.; and Renshaw,
E. 2013. MCTest: A Challenge Dataset for
the Open-Domain Machine Comprehension of Text. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP 2013). Stroudsburg, PA: Association for
Computational Linguistics.
Riedel, S.; Yao, L.; and McCallum, A. 2010.
Modeling Relations and Their Mentions
Without Labeled Text. In Machine Learning
and Knowledge Discovery in Databases: Proceedings of the European Conference, ECML
PKDD 2010. Lecture Notes in Artificial Intelligence Volume 6322, 148–163. Berlin: Springer.
Saygin, A. P.; Cicekli, I.; and Akman, V.
2003. Turing Test: 50 Years Later. In The Turing Test: The Elusive Standard of Artificial
Intelligence, ed. J. H. Moor, 23–78. Berlin: Springer.
Schank, R. C. 2013. Explanation Patterns:
Understanding Mechanically and Creatively.
London: Psychology Press.
Shieber, S. M. 1994. Lessons from a Restricted Turing Test. Communications of the ACM
37(6): 70–78.
Turing, A. M. 1950. Computing Machinery
and Intelligence. Mind 59(236): 433–460.
von Ahn, L. 2006. Games with a Purpose.
Computer 39(6): 92–94.
von Ahn, L., and Dabbish, L. 2004. Labeling
Images with a Computer Game. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 319–326. New
York: Association for Computing Machinery.
Voorhees, E. M. 1999. The TREC-8 Question
Answering Track Report. In Proceedings of
The Eighth Text Retrieval Conference, TREC
1999, 77–82. Gaithersburg, MD: National
Institute of Standards and Technology.
Zitnick, C. L.; Agrawal, A.; Antol, S.; Mitchell, M.; Batra, D.; and Parikh, D. 2016. Measuring Machine Intelligence Through Visual
Question Answering. AI Magazine 37(1).
Praveen Paritosh is a senior research scientist at Google leading research in the
areas of human and machine intelligence.
He designed the large-scale human-machine curation systems for Freebase and
the Google Knowledge Graph. He was the
co-organizer and chair for the SIGIR WebQA
2015 workshop, the Crowdsourcing at Scale
2013, the shared task challenge at HCOMP
2013, and Connecting Online Learning and
Work at HCOMP 2014, CSCW 2015, and
CHI 2016 toward the goal of galvanizing
research at the intersection of crowdsourcing, natural language understanding,
knowledge representation, and rigorous
evaluations for artificial intelligence.
Gary Marcus is a professor of psychology
and neural science at New York University
and chief executive officer and founder of
Geometric Intelligence, Inc. He is the
author of four books, including Kluge: The
Haphazard Evolution of the Human Mind and
Guitar Zero, and numerous academic articles
in journals such as Science and Nature. He
writes frequently for The New York Times
and The New Yorker, and is coeditor of the
recent book, The Future of the Brain: Essays
By The World’s Leading Neuroscientists.
The Social-Emotional
Turing Challenge
William Jarrold, Peter Z. Yeh
Social-emotional intelligence is an
essential part of being a competent
human and is thus required for humanlevel AI. When considering alternatives
to the Turing test it is therefore a capacity that is important to test. We characterize this capacity as affective theory of
mind and describe some unique challenges associated with its interpretive or
generative nature. Mindful of these
challenges we describe a five-step
method along with preliminary investigations into its application. We also
describe certain characteristics of the
approach such as its incremental
nature, and countermeasures that make
it difficult to game or cheat.
The ability to make reasonably good predictions about
the emotions of others is an essential part of being a
socially functioning human. Without it we would not
know what actions will most likely make others around us
happy versus mad or sad. Our abilities to please friends, placate enemies, inspire our children, and secure cooperation
from our colleagues would suffer.
For these reasons a truly intelligent human-level AI will
need the ability to reason about other agents’ emotions in
addition to intellectual capabilities embodied in other tasks
such as the Winograd schema challenge, textbook reading
and question answering (Gunning et al. 2010, Clark 2015),
image understanding, or task planning. Thinking at the
human level also requires the ability to have reasonable hunches about other agents’ emotions.
Social-Emotional Intelligence
as Affective Theory of Mind
The ability to predict and understand another agent’s
emotional reactions is subsumed by a cognitive
capacity that goes by various names including folk
psychology, naïve psychology, mindreading, empathy, and theory of mind. We prefer the last of these terms,
considering it more precise and more frequently
used by psychologists nowadays. Theory of mind
encompasses the capacity to attribute and explain
the mental states of others such as beliefs, desires,
intentions, and emotions. In this article, we focus on
affective theory of mind because it restricts itself to
emotions. We further restrict ourselves to consensual affective theory of mind (AToM) to rule out idiosyncratic beliefs of particular individuals.
Is There a Logic to Emotion?
Each of us humans has our own oftentimes unique
affective reaction to a given situation. Although we
live in the same world, our emotional interpretations
of it are multitudinous. Does this mean that emotion
is an “anything goes” free-for-all? In spite of the
extreme variability in our affective evaluations, there
nonetheless seems to be a rationality, a logic, of what
constitutes a viable, believable, or sensible emotional response to a given situation.
When we hear of someone's emotional reaction to
a situation, sometimes we think to ourselves, "I
would have responded the same way.” For other reactions, we might say, “That would not be my reaction,
but I can certainly understand why he or she would
feel that way." At still other times, another's actual
emotional reaction may fall far afield of our prediction and we say, "I cannot make any sense out of his
or her reaction.”
For these reasons there does appear to be some sort
of “logic” to emotion. Yet, how do we resolve the tension between the extreme possible richness and variability in emotional response and the sense that only
certain reactions are sensible, legitimate, or understandable?
In the next two sections, we show how the concepts of falsifiability — the possibility of proving an
axiom or prediction incorrect (for example, all swans
are white is disproven by finding a black swan [Popper 2005]) — and generativity — the capacity of a system to be highly productive and original — play an
important role in the resolution of this tension. Later, in the Proposed Framework section, we shall see
how these two concepts influence the methods we
propose for assessing machine social-emotional intelligence.
Falsifiability and AToM
In our approach to assessing affective theory of mind,
we take the term theory seriously. Prominent philosophers of science claim that scientific theories are, by
definition, falsifiable (Popper 2005). Although an
optimistic agent may view a situation with a glass
half full bias and pessimistic agents may tend to view
the very same situations with a glass half empty bias,
they can still both be correct. How then do we
demonstrate the falsifiability of affective theory of
mind? The answer comes when one considers a predicted emotion paired with the explanation of this
prediction. If we consider both together then we
have a theory that is falsifiable.
Consider the following situation and the following appraisals:
Situation: Sue and Mary notice it is raining.
Appraisal U1: Sue feels happy because she expects the
sun will come out tomorrow.
Appraisal U2: Mary feels sad because she hates rain and
it will probably keep on raining.
Although some of us may tend to agree more with
one or the other’s reaction, virtually all of us will
judge both of these replies as potentially valid (modulo some relatively minor assumptions about normal
personality differences). By contrast, consider what
happens if we invert the emotions felt by each character:
Appraisal R1: Mary feels sad because she expects the
sun will come out tomorrow.
Appraisal R2: Sue feels happy because she hates rain
and it will probably keep on raining.
We take it as a given that the vast majority of typical humans representative of a given cultural group
will judge the immediately above appraisals as
invalid or extremely puzzling.
In sum, emotion is not an anything goes phenomenon — we have demonstrated that some appraisals
violate our intuitions about what makes sense.
Although there are a multitude of different emotions
that could make sense, falsifiability is demonstrable
when one considers the predicted emotion label
along with its explanation (Jarrold 2004). As will be
described next, falsifiability of AToM is important in
the context of Turing test alternatives.
A Generative AToM
Leaving falsifiability aside, there remains the need to
provide an account for the multitude of potential
emotional appraisals of a situation. The need is
addressed by viewing appraisal not as an inference
but rather as a generative process.
Generative processes are highly productive, able to
produce novel patterns of outputs such as cellular
automata, generative grammars, and fractals such as
the Mandelbrot or Julia set. Ortony (2001) posited
that generative capacity is critical to computational
accounts of emotion.
As a demonstration of this generativity, consider
the range of appraisals obtained from “college sophomore” participants in Jarrold (2004) (see table 1).
Table 1. Five Human Appraisals of a Simple Scenario.
Scenario: Tracy wants a banana. Mommy gives Tracy an apple. How will Tracy feel? (Choose from happy, sad, or indifferent.)
1. She'll feel happy even though she didn't get exactly what she wanted; it is still something.
2. Because nonetheless she still has something to eat, just not exactly what she wanted.
3. She will feel indifferent as long as she likes apples too. It isn't exactly what she wanted, but she was probably just hungry and if she likes apples then she would be satisfied because it would do the same thing as a banana.
4. Because she was probably excited about eating the banana that day and when mom gave her an apple instead she probably felt disappointed and wondered why her mom wouldn't give her what she wanted.
5. She did not get what she wanted.
Although research subjects were presented with a very
simple scenario, answers ranged from happy to indifferent to sad. The explanations for a given emotion
also varied in terms of assumptions, focus, and complexity.
Note that the inferences in explanations are often
not deductions derived strictly from scenario premises. They can contain abductions or assumptions (for
example, in table 1, row 3 “she is probably just hungry”) and a series of subappraisals (for example, row
4 excitement yielding to disappointment).
Furthermore, note that the above data were generated in response to very simple scenarios derived
from an autism therapy workbook (Howlin, Baron-Cohen, and Hadwin 1999). Imagine the generative
diversity attainable in real-world appraisals where the
scenarios can include N preceding chapters in a novel or a person’s life history.
Typical humans predict and explain another’s
emotions and find it easy to generate, understand,
and evaluate the full range of appraisal phenomena
described above. For this reason it is important that
human-level AI models of emotion be able to emulate this generative capacity.
In the remainder of this article, we will first describe
how test items are involved in a five-stage framework
or methodology for conducting an evaluation of
computational social-emotional intelligence. Challenges to the integrity of the test are anticipated and
countermeasures are described. Finally, issues with
the specifics of implementing this framework are discussed.
Proposed Framework
Each of the framework’s five stages (see figure 1) is
described: first, developing the test items; second,
obtaining ground truth; third, computational modeling; and, finally, two stages of evaluation. In these
last two evaluation stages models are judged on the
basis of two corresponding tasks: (1) generating
appraisals (stage 4) and (2) their ability to evaluate
others' appraisals — some of which have been
manipulated (stage 5).
Test Items
The framework revolves around the ability of a system to predict the emotions of agents in particular
situations in a human-like way across a sufficiently
large number of test items. As will be explained in
detail, test items are questions posed to examinees
(both humans and machines). They require the
examinee to generate appraisals (answers to the questions). Machine-generated appraisals are evaluated in
terms of how well they compare to the human-generated ones.
Items have the following structural elements: (1) a
scenario that is posed to the human or machine
examinee and that consists of (1a) a target character
whose emotion is to be predicted and (1b) a scenario involving the target (and possibly other characters); and (2) a
two-part emotion question that prompts the examinee to (2a) select through multiple choice an emotion descriptor that best matches the emotion he,
she, or it predicts will likely be felt by the target character, and (2b) explain why the character might feel
that way.
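The structure above maps naturally onto a small record type. The sketch below is only our reading of elements (1a)-(2b), with illustrative field names; it is not a format proposed by the authors.

from dataclasses import dataclass
from typing import List

@dataclass
class TestItem:
    """One social-emotional test item, following the structure above."""
    target_character: str        # (1a) whose emotion is to be predicted
    scenario: str                # (1b) situation involving the target
    emotion_choices: List[str]   # (2a) multiple-choice emotion descriptors
    explanation_prompt: str      # (2b) prompt to explain the prediction

item = TestItem(
    target_character="Tracy",
    scenario="Tracy wants a banana. Mommy gives Tracy an apple for lunch.",
    emotion_choices=["happy", "sad", "indifferent"],
    explanation_prompt="Explain why she will feel that way (in less than 50 words).",
)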
Figure 1. High Level Schematic of the Framework's Five Stages. (Stage 1: Generate Scenario Items; Stage 2: Humans Generate Appraisals; Stage 3: Develop Models; Stage 4: Humans Evaluate Appraisals, Model versus Human; Stage 5: Models Evaluate Appraisals.)

Stage 1: Generate Scenario Items
The purpose of stage one is to produce a set of scenario items that can be used later in the evaluation. The range of scenarios circumscribes the breadth of the modeling task.
In the early years of the competition, we will focus on simple scenarios (for example, "Eric wanted to ride the train but his father took him in the car. Was he happy or sad?") and in later years, move to ever more complex material from brief stories to, much later, entire novels.

Stage 2: Obtain Human-Generated Appraisals
The overall goal of this stage is to obtain a ground truth for the test. Concretely, the goal of this stage is to task a group of human participants to generate at least one appraisal for items produced in stage 1. Generating an appraisal involves choosing an emotion to answer the emotion question and producing an explanation for that answer.
Given the generativity of emotional appraisal we expect a wide range of responses even for a single scenario instance. Recall the example of appraisal data derived from the simple scenario in table 1.
The range of distinct appraisals should increase with the range of possible emotions from which to choose, the length of the allowable explanation, and the number of participants. That said, the increase at some point will level off because the themes of the nth participant's appraisal will start to overlap with those of earlier participants.
While the number of different scenario instances may circumscribe the generative breadth we require our computational models to cover, one might also say that the generative depth of the model is circumscribed by the number of distinct appraisals generated for each scenario.
Some of the resulting human-generated appraisals can be passed to the next stage as training data for modeling. The remainder are sequestered as a test set to be used during evaluation phases.

Stage 3: Develop Appraisal Models
The contestants, computational modelers, are challenged to develop a model that for any given scenario instance can (1) predict an appropriate emotion label for the target scenario character (for example, happy, sad, and so on); and (2) generate an appropriate natural language (NL) explanation for this prediction. Appropriate is judged by human raters in stage 4 in reference to human-generated appraisals. Contestants are given a sample of scenario instances and the corresponding human-generated appraisals to train or engineer their models.
Stage 4: Evaluate Appraisals:
Model Versus Human
The purpose of this stage is to obtain an evaluation of
how well a given model performs appraisal in comparison to humans. This is achieved by a new group
of human participants serving as raters. The input to
this process is a set of appraisals including human-generated ones from stage 2 and model-generated
ones from stage 3.
Valence Reversal
Before being submitted to a human judge, each
appraisal has a 50 percent chance of being subject to
an experimental manipulation known as valence
reversal. Operationally, this means replacing the
emotion label of a given appraisal with a different
label of preferably “opposite” emotional valence.
Under such a manipulation, happy would be
replaced with sad, and sad with happy. For example:
Situation: Eric wants a train ride and his father gives
him one.
Unreversed Appraisal: Eric feels happy because he got
what he wanted.
Reversed Appraisal: Eric feels sad because he got what
he wanted.
Reversal provides a contrast variable. We expect
the statistical effect of reversal on appraisal quality to
be strong. In contrast, if the model’s appraisals are
adequate, then among unreversed appraisals there
should be no significant difference between humanversus model-generated appraisals. This methodology was successfully used in Jarrold (2004) and this
article is essentially a scaling up of that approach.
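Operationally, the manipulation is a coin flip followed by a label swap. A minimal sketch, assuming a fixed table of opposite-valence labels (the table and the appraisal representation are ours):

import random

# Hypothetical table of "opposite" valence labels.
OPPOSITES = {"happy": "sad", "sad": "happy"}

def maybe_reverse(emotion, explanation, p=0.5):
    """With probability p, swap the emotion label for one of opposite
    valence, keeping the explanation; record the reversal status."""
    if random.random() < p and emotion in OPPOSITES:
        return OPPOSITES[emotion], explanation, "reversed"
    return emotion, explanation, "unreversed"

# maybe_reverse("happy", "Eric got what he wanted.") yields either
# ("happy", ..., "unreversed") or ("sad", ..., "reversed").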
Submission to Human Evaluators
Either the reversed or unreversed version of each
appraisal is administered to at least one judge. The
judges are to rate appraisals independently according
to some particular subjective measure(s) of quality
such as commonsensicality, believability, novelty,
and so on. The measure is specified by the contest
organizers. Judges are blinded to the reversal status
— reversed or unreversed — and source — human or
machine — of each item.
Stage 5: Model and Evaluate Human Meta-Appraisal
The purpose of this stage is to evaluate a model’s ability not to generate but rather to validate appraisals.
This capacity is important because human-level
AToM involves not just the capacity to make one
decent prediction and explanation of another agent’s
emotions in a given situation. It also involves
breadth, the ability to assess the validity of any of the
multitude of the generatable appraisals of that situation. If a model’s pattern of quality ratings for all the
stage 4 appraisals — be they model or human generated, reversed or unreversed — matches the pattern
of ratings given by stage 4 human judges, then it
demonstrates the full generative breadth of understanding.
The capacity for validating appraisals is important
for another reason — detecting the authenticity of an
emotional reaction. Consider the following:
Bob: How are you today?
Fred: Deeply depressed — no espresso.
People know that Fred is kidding. A deep depression is not a believable or commonsensical appraisal
of a situation in which one is missing one’s espresso.
The input to stage 5 is the output of stage 4, that
is, human evaluations of appraisals. The appraisals
evaluated include all manner of appraisals generated
in prior stages: that is, both human and machine
generated, both unreversed and reversed. These rated
appraisals are segregated by the organizers into two
groups, a training set and a test set.
Modelers are given the training set and tasked
with enhancing their preexisting models by giving
them the ability to evaluate the validity of others’
appraisals. Once modeling is completed, the organizers evaluate the enhanced self-reflective models
against the test data. Model appraisal ratings should
be similar to human ratings — unreversed appraisals
should receive high-quality ratings, and reversed
ones, poorer ratings.
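One plausible way to score this stage — whether a model's quality ratings track the human judges' pattern over the shared test set — is a rank correlation over paired ratings. The sketch below uses Spearman correlation as one candidate measure; the choice of statistic is ours, not the authors'.

from scipy.stats import spearmanr

def stage5_agreement(human_ratings, model_ratings):
    """Rank agreement between human and model quality ratings over the
    same test-set appraisals, reversed and unreversed alike."""
    rho, _ = spearmanr(human_ratings, model_ratings)
    return rho

# A model with human-level meta-appraisal should both correlate with the
# human ratings and, like the humans, rate reversed appraisals poorly.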
This phase may add new layers of model complexity and may be too difficult for the early years.
Thus, for reasons of incrementality we consider it a
stage that is phased in gradually over successive years.
Issues in Implementation
In this section we discuss specific issues associated
with actually running the experiments and competitions.
Incrementality
Hector Levesque (2011) described the benefits of an
incremental staged approach. Any challenge should
be matched to existing capabilities. If too easy, the
challenge will not be discriminative nor exciting
enough to attract developers. If too hard, solutions
will fail to generalize and developers will be discouraged. In addition, systems advance every year. In
view of all of these needs, it is best to have a test for
which it is easy to raise or lower the bar.
How can incrementality be implemented within
the framework? As will be explained in the next section, parameterization of scenarios provides one relatively low-effort means of adapting the difficulty of
the test.
Parameterization of Test Scenarios
It is important to have a large number of test scenarios. More scenarios mean more training data, a
more fine-grained evaluation, and a greater guarantee of
comprehensive coverage. Cohen et al. (1998) used
parameterization to create numerous natural language test questions that deviate from sample questions in specific controlled ways. The space of variation within a given parameterization can be
combinatorially large thus ensuring the ability to
cover a broad range of materials. Parameterization
was successfully used by Sosnovsky, Shcherbinina,
and Brusilovsky (2003) to produce large numbers of
training and test items for human education with
relatively low effort.
A parameterized scenario is essentially a scenario
template. Such templates can be created by taking an
existing scenario and replacing particular objects in
the scenario with variables of the appropriate type.
Consider the following scenario instance:
Scenario: Tracy wants a banana. Mommy gives Tracy
an apple for lunch.
Emotion Question: How will Tracy feel? (Choose from
one of happy or sad.)
Explanation: Explain why she will feel that way (in less
than 50 words).
This item can be parameterized by replacing Tracy,
banana, Mommy, and others with variables as shown below.
Scenario Template
<target-character> wants <object1>. <alt-character> gives <target-character> <object2> for <condition>.
Answer Template
Emotion: How does <target-character> feel?
Choose from: <range of emotion terms / levels>
Explanation: <answer constraints — length, vocabulary, and others>
The range for each parameter is specified by the
test administrator. For example the range for
<object1> could include any object within the vocabulary of a four year old (for example, banana, lump of
coal, chocolate, napkin). Additional item instances
are instantiated by choosing values for the parameters of a given template. If parameters can take on a
large set of values, a very large set of items can be generated.
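As an illustration, instantiating a template over its parameter ranges takes only a few lines of code. The template string and the toy parameter ranges below are hypothetical, patterned on the example above:

import itertools
import string

# A toy template and parameter ranges patterned on the example above.
TEMPLATE = string.Template(
    "$target wants $object1. $alt gives $target $object2 for $condition.")
RANGES = {
    "target": ["Tracy", "Eric"],
    "alt": ["Mommy", "his father"],
    "object1": ["a banana", "a train ride"],
    "object2": ["an apple", "a car ride"],
    "condition": ["lunch", "the afternoon"],
}

def instantiate(template, ranges):
    """Yield every scenario instance in the cross product of parameter values."""
    keys = list(ranges)
    for values in itertools.product(*(ranges[k] for k in keys)):
        yield template.substitute(dict(zip(keys, values)))

# Five parameters with two values each already give 2**5 = 32 instances;
# the item count grows combinatorially with every added parameter value.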
To meet the needs of incrementality, one can
increase (or decrease) the level of difficulty by
increasing the range of values that scenario parameters may take on. Alternatively one can add more
How the Framework Prevents Gaming the Evaluation
Like any contest, it can be gamed by clever trickery
that violates the spirit of the rules and evades constructive progress in the field. We describe a variety of
gaming tactics and how the framework prevents them.
Bag of Words to Predict Emotion
A bag of words (BOW) classifier assigns an input document to one of a predefined set of categories based
on weighted word frequencies. Thus, one “cheat” is
to use this simple technique to predict the correct
emotion label.
One problem is that such classifiers ignore word
order — thus “John loves Mary” and “Mary loves
John” would assign the same emotion to Mary. Further, they are not generative and thus unable to produce novel explanations necessary in stage 4. In stage
5, it is hard to imagine how such a shallow approach
would do well in evaluating the match between a scenario plus the appraisal emotion and explanation.
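The word-order blindness is easy to demonstrate: under a bag-of-words representation the two sentences are literally the same object. A small sketch using a plain counter:

from collections import Counter

def bag_of_words(text):
    """Bag-of-words representation: word frequencies, order discarded."""
    return Counter(text.lower().split())

# Identical representations, so any classifier built on them must assign
# the same emotion to Mary in both sentences.
assert bag_of_words("John loves Mary") == bag_of_words("Mary loves John")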
In stage 5, a chatbot will not do well because the task
involves no NL generation — it just involves producing scores rating the quality of an appraisal.
In stage 4, the case against the chatbot is more
involved. A chatbot hack for this stage would be to
choose an arbitrary emotion and generate an explanation
through a chatbot. Chatty or snarky explanations
might sound human but contain no specific content.
Such explanations would intentionally be a form of
empty speech hand-crafted by the modeler to go
with any chosen emotion. For example, a Eugene
Goostman-like agent could choose happy or sad and
provide the same explanation, “Tracy feels that way
just because that’s the way she is.”
A related but slightly more sophisticated tactic is
always to choose the same emotion but devise a hand-crafted appraisal that could go with virtually any scenario. For example, "Tracy feels happy because she
has a very upbeat personality — no matter what happens she’s always looking on the bright side.”
There are several reasons a chatbot will likely fail.
First, we expect chatbots may be detectable through
the human ratings. Although humans may sometimes provide answers like the above, more often
than not, we expect their answers to exhibit greater
specificity to the scenario and emotion chosen. We
suspect that direct answers will generally receive
higher ratings than chatty ones. Unlike the Turing
test, there is no chance to build conversational rapport because there is no conversation and thus little
for the chatbot to hide behind.
If necessary, contest administrators can give specific instructions to human judges to penalize
appraisals that are ironic, chatty, not specific to the
scenario, and so on. These considerations could be
woven into a single overall judgment score per
appraisal or by allowing for additional rating scales
(for example, one dimension might be believability,
another could be specificity, and so on). Elaborating
the instructions in this way demands more training
of judges and raises some issues associated with interrater reliability and multidimensional scoring.
The second countermeasure leverages falsifiability
and the valence reversal manipulation done to all
appraisals (machine as well as human generated) in
stage 4. A chatbot lacks an (affective) theory of mind
and thus does not know what kind of emotion goes
with what kind of explanation in an appraisal. There
should therefore be little to no dependency between
its emotion labels and explanations. Put another way,
being “theory free,” chatbot “predictions” about other agents’ appraisals are not falsifiable. Thus, valencereversed appraisals from a chatbot will likely not be
judged worse than their unreversed counterparts.
Thus if a given appraisal and its reversed counterpart
score about as well, this should factor negatively in
that contestant’s overall score.
Contest Evolution
An attractive design feature of this method is the
number of contest configuration variables that can
be readjusted each year in response to advancing
technology, pitfalls, changing goals, or emphasis.
If organizers want to maximize the generative productivity of contestants' models they can use fewer
scenario instances; involve more human participants
to generate more appraisals at stage 2; allow longer
appraisal explanations with a larger vocabulary; and/or reward models that generate multiple appraisals
per scenario.
By contrast, to maximize the breadth of appraisal
domains organizers can have more scenario templates, more parameters in a template, more parameter values for a given parameter; or adjust the size of
vocabulary allowed for a scenario.
To increase an appraisal’s algorithm sophistication
one can increase the number of characters in each
scenario, increase the number of emotions to chose
between, or allow multiple or mixed emotions to be
The first contests should involve a small handful of
emotions because Jarrold (2004) demonstrated there
is a tremendous amount of complexity yet to be
modeled simply to distinguish between happy and sad.
Affective reasoning requires a substantial body of
commonsense knowledge. To bound the amount of
such background knowledge required and focus
efforts on affective reasoning, organizers can decrease
the diversity of scenario characters — for example,
human children ages 3 to 5; narrow the range of scenario parameters to a focused knowledge domain; or
restrict the vocabulary or length allowed in explanations.
In later contest years, there may be rater disagreement for some of the more nuanced or subtle scenario or appraisal pairs due to differing cultural or
social-demographic representativeness factors. A
variety of options present themselves — make rater
“cultural group” a contextual variable; increase the
cultural homogeneity of the human raters; or remove
appraisals with low interrater reliability from the test set.
It is possible that considerable numbers of participants will be required at certain stages. For example,
modelers may desire a large number of appraisals to
be generated in stage 2 as training data. Prior work in
dialog systems (Yang et al. 2010) or the creation of
ImageNet (Su, Deng, and Fei-Fei 2012) (to pick just
two of many crowdsourced studies) has shown that
large numbers of people can be recruited online (for
example, through Amazon Mechanical Turk) as a
form of crowdsourcing. It is hoped that over successive years a large library of scenarios, each with a large
number of appraisals and associated human ratings,
could be collected in this way to compose
an emotion-oriented ImageNet analog.
Public Interest
Newsworthiness and public excitement are important because prior competitive challenges such as
RoboCup, IBM Watson, and Deep Blue have demonstrated how these factors drive talented individuals
and other resources to attack a problem. One factor
helping the social-emotional Turing challenge is that
emotional content has mass appeal and may be less
dry than other challenges such as chess.
Stage 5, where machine- and human-generated
appraisals are judged side by side, may be the most
accessible media-worthy part of the framework. Prior
stages may be reserved for a qualifying round, which
may be of more scientific interest. Akin to the Watson competition, both human and machine contestants may be placed side by side while scenarios are
presented to them in real time. Judges will score each
appraisal blind to whether it was human versus
machine generated. Scores can be read off one by one
akin to a gymnastics competition.
Conclusion
We argue for the importance of assessing social-emotional intelligence among Turing test alternatives. We
focus on a specific aspect of this capacity, affective
theory of mind, which enables prediction and explanation of others’ emotional reactions to situations.
We explain how a generative logic can account for
the diversity yet specificity of predicted affective reactions. The falsifiability of these predictions is leveraged in a five-stage framework for assessing the
degree to which computer models can emulate this
behavior. Issues in implementation are discussed
including the importance of incremental challenge,
parameterization, and resisting hacks. It is hoped
that over successive years a large set of scenarios,
appraisals, and ratings would accrue and compose a
kind of affective version of ImageNet.
Acknowledgments
We would like to thank Deepak Ramachandran for
some helpful discussions.
References
Clark, P. 2015. Elementary School Science and Math Tests as
a Driver for AI: Take the Aristo Challenge! In Proceedings of
the Twenty-Ninth AAAI Conference on Artificial Intelligence,
4019–4021. Palo Alto, CA: AAAI Press.
Cohen, P. R.; Schrag, R.; Jones, E.; Pease, A.; Lin, A.; Starr, B.;
Gunning, D.; and Burke, M. 1998. The DARPA High-Performance Knowledge Bases Project. AI Magazine 19(4): 25.
Gunning, D.; Chaudhri, V. K.; Clark, P. E.; Barker, K.; Chaw,
S.-Y.; Greaves, M.; Grosof, B.; Leung, A.; McDonald, D. D.;
Mishra, S.; Pacheco, J.; Porter, B.; Spaulding, A.; Tecuci, D.;
and Tien, J. 2010. Project Halo Update — Progress Toward
Digital Aristotle. AI Magazine 31(3): 33–58.
Howlin, P.; Baron-Cohen, S.; and Hadwin, J. 1999. Teaching
Children with Autism to Mind-Read: A Practical Guide for
Teachers and Parents. Chichester, NY: J. Wiley & Sons.
Jarrold, W. 2004. Towards a Theory of Affective Mind. Ph.D.
Dissertation, Department of Educational Psychology, University of Texas at Austin, Austin, TX.
Levesque, H. J. 2011. The Winograd Schema Challenge. In
Logical Formalizations of Commonsense Reasoning: Papers from
the 2011 AAAI Spring Symposium, 63–68. Palo Alto, CA: AAAI Press.
Ortony, A. 2001. On Making Believable Emotional Agents
Believable. In Emotions in Humans and Artifacts, ed. R. Trappl, P. Petta, and S. Payr, 189–213. Cambridge, MA: The MIT Press.
Popper, K. 2005. The Logic of Scientific Discovery. New York:
Routledge / Taylor & Francis.
Sosnovsky, S.; Shcherbinina, O.; and Brusilovsky, P. 2003.
Web-Based Parameterized Questions as a Tool for Learning.
In Proceedings of E-Learn 2003: World Conference on E-Learning in Corporate, Government, Healthcare, and Higher Education, 309–316. Waynesville, NC: Association for the
Advancement of Computing in Education.
Su, H.; Deng, J.; and Fei-Fei, L. 2012. Crowdsourcing Annotations for Visual Object Detection. In Human Computation:
Papers from the 2012 AAAI Workshop. AAAI Technical Report
WS-12-08, 40–46. Palo Alto, CA: AAAI Press.
Yang, Z.; Li, B.; Zhu, Y.; King, I.; Levow, G.; and Meng, H.
2010. Collection of User Judgments on Spoken Dialog System with Crowdsourcing. In 2010 IEEE Spoken Language
Technology Workshop (SLT 2010), 277–282. Piscataway, NJ:
Institute of Electrical and Electronics Engineers.
William Jarrold is a senior scientist at Nuance Communications. His research in intelligent conversational assistants
draws upon expertise in ontology, knowledge representation and reasoning, natural language understanding, and
statistical natural language processing (NLP). Throughout
his career he has developed computational models to augment and understand human cognition. In prior work at
the University of California, Davis and the SRI Artificial
Intelligence Lab he has applied statistical NLP to the differential diagnosis of neuropsychiatric conditions. At SRI and
the University of Texas he developed ontologies for intelligent tutoring (HALO) and cognitive assistants (CALO). Early in his career he worked at MCC and Cycorp developing
ontologies to support commonsense reasoning in Cyc — a
large general-purpose knowledge-based system. His Ph.D. is
from the University of Texas at Austin and his BS is from the
Massachusetts Institute of Technology.
Peter Z. Yeh is a senior principal research scientist at
Nuance Communications. His research interests lie at the
intersection of semantic technologies, data and web mining, and natural language understanding. Prior to joining
Nuance, Yeh was a research lead at Accenture Technology
Labs where he was responsible for investigating and applying AI technologies to various enterprise problems ranging
from data management to advanced analytics. Yeh is currently working on enhancing interpretation intelligence
within intelligent virtual assistants and automatically constructing large-scale knowledge repositories necessary to
support such interpretations. He received his Ph.D. in computer science from The University of Texas at Austin.
Artificial Intelligence to
Win the Nobel Prize and Beyond:
Creating the Engine for
Scientific Discovery
Hiroaki Kitano
This article proposes a new grand
challenge for AI: to develop an AI system that can make major scientific discoveries in biomedical sciences and that
is worthy of a Nobel Prize. There are a
series of human cognitive limitations
that prevent us from making accelerated scientific discoveries, particularity in
biomedical sciences. As a result, scientific discoveries are left at the level of a
cottage industry. AI systems can transform scientific discoveries into highly
efficient practices, thereby enabling us
to expand our knowledge in unprecedented ways. Such systems may outcompute all possible hypotheses and
may redefine the nature of scientific
intuition, hence the scientific discovery
What is the single most significant capability that
artificial intelligence can deliver? What pushes the
human race forward? Our civilization has
advanced largely by scientific discoveries and the application
of such knowledge. Therefore, I propose the launch of a
grand challenge to develop AI systems that can make significant scientific discoveries. Because biomedical science is a field with great potential social impact, and one that suffers particularly from information overflow and the limitations of human cognition, I believe that the initial focus of this challenge should be on the biomedical sciences, though the challenge can be applied to other areas
later. The challenge is “to develop an AI system that can
make major scientific discoveries in biomedical sciences and
that is worthy of a Nobel Prize and far beyond.” While recent
progress in high-throughput “omics” measurement technologies has enabled us to generate vast quantities of data,
scientific discoveries themselves still depend heavily upon
individual intuition, while researchers are often overwhelmed by the sheer amount of data, as well as by the complexity of the biological phenomena they are seeking to
understand. Even now, scientific discovery remains something akin to a cottage industry, but a great transformation
seems to have begun. This is an ideal domain and the ideal
timing for AI to make a difference. I anticipate that, in the
near future, AI systems will make a succession of discoveries
that have immediate medical implications, saving millions
of lives, and totally changing the fate of the human race.
Grand Challenges as a
Driving Force in AI Research
Throughout the history of research into artificial
intelligence, a series of grand challenges have been
significant driving factors. Advances in computer
chess demonstrated that a computer can exhibit
human-level intelligence in a specific domain. In
1997, IBM’s chess computer Deep Blue defeated
human world champion Garry Kasparov (Hsu 2004).
Various search algorithms, parallel computing, and
other computing techniques originating from computer chess research have been applied in other
fields. IBM took on another challenge when it set the
new goal of building a computer that could win the
TV quiz show Jeopardy! In this task, which involved
the real-time answering of open-domain questions
(Ferrucci et al. 2010, Ferrucci et al. 2013), IBM’s Watson computer outperformed human quiz champions.
IBM is currently applying technology from Watson
as part of its business in a range of industrial and
medical fields. In an extension of prior work on computer chess, Japanese researchers have even managed
to produce a machine capable of beating human
grand masters of Shogi, a Japanese chess variant with
a significantly larger number of possible moves.
RoboCup is a grand challenge founded in 1997
that traverses the fields of robotics and soccer. The
aim of this initiative is to promote the development
by the year 2050 of a team of fully autonomous
humanoid robots that is able to beat the most recent
winners of the FIFA World Cup (Kitano et al. 1997).
This is a task that requires both an integrated, collective intelligence and exceptionally high levels of
physical performance. Since the inaugural event, the
scheme has already given birth to a series of technologies that have been deployed in the real world.
For example, KIVA Systems, a technology company
that was formed based largely on technologies from
Cornell University’s team for RoboCup’s Small Size
League, provided a highly automated warehouse management system; the company was acquired by Amazon in
2012. Various robots that were developed for the Rescue Robot League — a part of RoboCup focused on
disaster rescue — have been deployed in real-world
situations, including search and rescue operations at
New York’s World Trade Center in the aftermath of
the 9/11 terror attacks, as well as for surveillance missions following the accident at the Fukushima Daiichi Nuclear Power Plant.
These grand challenges present a sharp contrast
with the Turing test, aimed as they are at the development of superhuman capabilities as opposed to the
Turing test’s attempts to answer the question “Can
machines think?” by creating a machine that can
generate humanlike responses to natural language
dialogues (Turing 1950). These differing approaches
present different scientific challenges, and, while
going forward we may expect some cross-fertilization
between these processes, this article focuses on the
grand challenge of building superhuman capabilities.
History provides many insights into changes over
time in the technical approaches to these challenges.
In the early days of AI research, it was widely accepted that a brute force approach would not work for
chess, and that heuristic programming was essential
for very large and complex problems (Feigenbaum
and Feldman 1963). Actual events, however, confounded this expectation. Among the features critical for computer chess were the massive computing
capability required to search millions of moves; vast
memory to store a record of all past games; and a
learning mechanism to evaluate the quality of each
move and adjust search paths accordingly. Computing power, memory, and learning have proven to
hold the winning formula, overcoming sophisticated
heuristics. The 1990s saw a similar transformation of
approach in speech recognition, where rule-based
systems were outperformed by data- and computing-driven systems based on hidden Markov models (Lee
1988). Watson, the IBM computer that won the Jeopardy! quiz show, added new dimensions of massively
parallel heterogeneous inference and real-time stochastic reasoning. Coordination of multiple different
reasoning systems is also key when it comes to Shogi. Interestingly, similar technical features are also
critical in bioinformatics problems (Hase et al. 2013;
Hsin, Ghosh, and Kitano 2013). Elements currently
seen as critical include massively parallel heterogeneous computing, real-time stochastic reasoning,
limitless access to information throughout the network, and sophisticated multistrategy learning.
Recent progress in computer Go added a combination of deep learning, reinforcement learning, and tree search to the winning formula (Silver et al.
2016). Challenges such as those described have been
highly effective in promoting AI research. By demonstrating the latest advances in AI, and creating high-impact industrial applications, they continue to contribute to the progress of AI and its applications.
The Scientific Discovery
Grand Challenge
It is time to make an even greater stride, by imagining and initiating a new challenge that may change
our very principles of intelligence and civilization.
While scientific discovery is not the only driving
force of our civilization, it has been one of the most
critical factors. Creating AI systems with a very high
capability for scientific discovery will have a profound impact, not only in the fields of AI and computer science, but also in the broader realms of science and technology. It is a commonly held
perception that scientific discoveries take place after
years of dedicated effort or at a moment of great
serendipity. The process of scientific discovery as we
know it today is considered unpredictable and inefficient and yet is blithely accepted. I would argue,
however, that the practice of scientific discovery is
stuck at a level akin to that of a cottage industry. I
believe that the productivity and fundamental
modalities of the scientific discovery process can be
dramatically improved. The real challenge is to trigger a revolution in science equivalent to the industrial revolution.
It should be noted that machine discovery, or discovery informatics (Gil et al. 2014, Gil and Hirsh
2012), has long been a major topic for AI research.
BACON (Langley and Simon 1987), DENDRAL
(Lindsay et al. 1993), AM, and EURISKO (Lenat and
Brown 1984) are just some of the systems of this
nature developed to date.
We must aim high. What distinguishes the proposed challenge from past efforts is its focus on biomedical sciences in the context of dramatic increases
in the amount of information and data available,
along with levels of interconnection of experimental
devices that were unavailable in the past. It is also set
apart by the focus on research, with the extremely
ambitious goal of facilitating major scientific discoveries in the biomedical sciences that may go on to
earn the Nobel Prize in Physiology or Medicine, or
achieve even more profound results. This is the
moonshot in AI. Just as the Apollo project’s goal went
beyond the moon (Kennedy 1961, 1962), the goals of
this project go far beyond the Nobel Prize. The goal
is to promote a revolution in scientific discovery and
to enable the fastest-possible expansion in the
knowledge base of mankind. The development of AI
systems with such a level of intelligence would have
a profound impact on the future of humanity.
Human Cognitive Limitations in
Biomedical Sciences
There are fundamental difficulties in biomedical
research that overwhelm the cognitive capabilities of
humans. This problem became even more pronounced with the emergence of systems biology
(Kitano 2002a, 2002b). Some of the key problems are
outlined below.
First, there is the information horizon problem.
Biomedical research is flooded with data and publications at a rate of production that goes far beyond
human information-processing capabilities. Over 1
million papers are published each year, and this rate
is increasing rapidly. Researchers are already overwhelmed by the flood of papers and data, some of
which may be contradictory, inaccurate, or misused.
It is simply not possible for any researcher to read, let
alone comprehend, such a deluge of information in
order to maintain consistent and up-to-date knowledge. The amount of experimental data is exploding
at an even faster pace, with widespread use of high-throughput measurement systems. Just as the rapidly expanding universe creates a cosmic event horizon
that prevents even light emitted in the distant past
from reaching us, thus rendering it unobservable, the
never-ending abundance of publications and data
creates an information horizon that prevents us from
observing a whole picture of what we have discovered and what data we have gathered. It is my hope
that, with the progress instigated by the challenge I
am proposing, AI systems will be able to compile a
vast body of intelligence in order to mitigate this
problem (Gil et al. 2014).
Second, there is also the problem of an information gap. Papers are written in language that frequently involves ambiguity, inaccuracy, and missing
information. Efforts to develop a large-scale comprehensive map of molecular interactions (Caron et al.
2010, Matsuoka et al. 2013, Oda and Kitano 2006,
Oda et al. 2005) or any kind of biological knowledge
base of any form will encounter this problem (see
sidebar). Human interpretation, and hence human-based knowledge extraction, depends largely on subjectively filling in the gaps with the reader's own knowledge, which results in arbitrary interpretations of the knowledge in the text.
Obviously, solving this problem requires going far beyond the information-carrying capacity of the language of a given text (Li, Liakata, and Rebholz-Schuhmann 2014). It also
involves actively searching for missing information
to discern what is missing and how to find it. It is
important to capture details of the interactions within a process rather than merely an abstracted overview, because researchers are already well aware of the overall interactions and expect such knowledge bases, or maps, to provide a consistent and comprehensive yet in-depth description of each interaction. Similar issues exist when it comes to understanding images from experiments: how to interpret the images, how to check their consistency with the sum of past data, how to identify differences and the reasons for them, and how to recover missing information on experimental conditions and protocols.
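To make this concrete, one might represent each extracted statement so that its unknown slots remain explicit rather than being silently filled in by the reader. The following Python sketch does this for the sentence discussed in the sidebar; the class and field names are hypothetical, invented only for this illustration.

from dataclasses import dataclass
from typing import Optional

@dataclass
class TransportEvent:
    """One extracted molecular-interaction statement, with unknowns kept explicit."""
    cargo: str                          # what is transported
    transporter: str                    # what does the transporting
    source: Optional[str] = None        # compartment of origin, if stated
    destination: Optional[str] = None   # compartment of arrival, often unstated
    cargo_forms: Optional[str] = None   # which forms of the cargo participate

    def open_questions(self):
        """Slots a reader (or an AI system) must fill from other sources."""
        return [name for name, value in vars(self).items() if value is None]

# The sentence from Shimada, Gulli, and Peter (2000), as discussed in the sidebar:
event = TransportEvent(cargo="Far1-Cdc24 complex", transporter="Msn5",
                       source="nucleus")
print(event.open_questions())   # ['destination', 'cargo_forms']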
Third, there is a problem of phenotyping inaccuracy. The word phenotyping refers to representation
and categorization of biological anomalies such as
disease, effects of genetic mutations, and developmental defects. Phenotyping is generally performed
based on subjective interpretation and consensus of
medical practitioners and biologists, described using
terms that are relatively easy to understand. This
practice itself is tightly linked with human cognitive
limitations. Biomedical sciences have to deal with
complex biological systems that are highly nonlinear and multidimensional. Naïve delineation of observations into coarse categories can create significant inaccuracies and lead to misdiagnosis and an inaccurate understanding of biological phenomena (figure 1a). This is a practical clinical problem, as shown by rare disease cases in which patients took decades to be diagnosed and the initial misdiagnosis rate was almost 40 percent (EURORDIS 2007).
An Example of Missing Information in a Biological Statement
[Sidebar figure: the sentence "In contrast, in response to mating pheromones, the Far1-Cdc24 complex is exported from the nucleus by Msn5," annotated with the questions: From the nucleus to where? Is Msn5 within the nucleus? Are all forms of Far1-Cdc24 exported? Can all forms of Msn5 do this?]
Biomedical science is a knowledge-intensive, empirical science. Currently, knowledge is embedded in the
text and images in publications. The figure exemplifies a case of missing information implicit in biomedical papers. Take the example of the following typical sentence from a biology paper: “In contrast, in
response to mating pheromones, the Far1-Cdc24 complex is exported from the nucleus by Msn5” (taken
from the abstract by Shimada, Gulli, and Peter [2000]). We can extract knowledge on a specific molecular
interaction involving the Far1-Cdc24 complex and Msn5 and represent this graphically. The sentence itself
does not, however, describe where the Far1-Cdc24 complex is exported to, and where Msn5 is located. In
such cases, researchers can fill in the conceptual gaps from their own biological knowledge. However, it is
not clear if all forms of the Far1-Cdc24 complex will become the subject of this interaction, nor if all forms
of Msn5 can conduct this export process. In this case, the general biological knowledge of researchers will
generally prove insufficient to fill in such gaps, thereby necessitating either the inclusion of a specific clarifying statement elsewhere in the paper, or the need to search other papers and databases to fill this gap.
Clinical diagnosis is a process of observation, categorization of observed results, and hypothesis generation
on a patient’s disease status. Misdiagnosis leads to
inappropriate therapeutic interventions. Identifying the proper feature combination for each axis, the proper dimensionality of the representational space, and the proper granularity of categorization would significantly improve diagnosis, and hence therapeutic efficacy (figure 1b). Extremely complex feature combinations for each axis, extremely high-dimensional representations, and extremely fine-grained categorization, an approach that can be termed extreme classification, would dramatically improve diagnostic accuracy. Since many diseases are constellations of very large numbers of disease subtypes, such extreme classification would enable us to identify specific patient subgroups that cannot be isolated as distinct groups at present, and would lead to specific therapeutic options. An emerging problem, however, is that humans may not be able to comprehend what exactly each category means in relation to their own biomedical knowledge, which was developed on the basis of today's coarse, low-dimensional categorizations.
[Figure 1: a nonlinear object plotted against feature A and feature B, with false positives and false negatives marked. Annotations ask which feature or feature combinations to use (for example, Y-axis = feature A versus Y-axis = f(feature A, feature B, feature D)) and what the best granularity for categorization is (for example, a coarse Low/Mid/High scale versus a fine-grained Low-low through High-high scale).]
Figure 1. Problems in the Representation and Categorization of Biological Objects and Processes.
Left figure modified from Kitano (1993). Figure 1a is an example of an attempt to represent a nonlinear object, assumed to be a simplification of a phenotype, in a simple two-dimensional feature space with coarse categories such as Low, Mid, and High. The object is best covered by the condition "feature A = Mid and feature B = Mid," but this inevitably results in inaccuracy (false positives and false negatives). Improving the accuracy of nonlinear object coverage requires the proper choice of the feature combination for each axis, the proper dimensionality of the representational space, and the proper choice of categorization granularity (figure 1b).
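As a rough numerical illustration of the granularity argument in figure 1 (a sketch invented for this rewrite, not taken from the article), the following Python snippet approximates a made-up nonlinear "phenotype" region with majority-vote grid cells and shows how false positives and false negatives shrink as the categorization becomes finer.

import numpy as np

rng = np.random.default_rng(0)
pts = rng.random((100_000, 2))                   # (feature A, feature B) samples
inside = (pts[:, 0] - 0.5) ** 2 + (pts[:, 1] - 0.5) ** 2 < 0.09  # invented nonlinear region

def grid_error(n_bins):
    """Label whole grid cells by majority vote, then measure how often the
    coarse category disagrees with the underlying region."""
    cells = (pts * n_bins).astype(int).clip(0, n_bins - 1)
    idx = cells[:, 0] * n_bins + cells[:, 1]
    votes = np.zeros(n_bins * n_bins)
    counts = np.zeros(n_bins * n_bins)
    np.add.at(votes, idx, inside)                # in-region points per cell
    np.add.at(counts, idx, 1)                    # all points per cell
    cell_label = votes > counts / 2              # majority label per cell
    predicted = cell_label[idx]
    fp = np.mean(predicted & ~inside)            # false positives
    fn = np.mean(~predicted & inside)            # false negatives
    return fp, fn

for bins in (3, 9, 27):                          # Low/Mid/High versus finer grains
    print(bins, grid_error(bins))

Finer grids (more bins) trace the nonlinear boundary more faithfully, which is the point figure 1b makes about granularity, dimensionality, and feature choice.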
Another closely related problem is that of cognitive bias. Due to the unavoidable use of language and
symbols in our process of reasoning and communication, our thought processes are inevitably biased.
As discussed previously, natural language does not
properly represent biological reality. Alfred Korzybski's statement that "the map is not the territory"
(Korzybski 1933) is especially true in biomedical sciences (figure 2). Vast knowledge of the field comes in
the form of papers that are full of such biases. Our
ability to ignore inaccuracies and ambiguity facilitates our daily communication, yet poses serious limitations on scientific inquiry.
Then there is the minority report problem. Biology is an empirical science, meaning knowledge is
accumulated based on experimental findings. Due to
the complexity of biological systems, diversity of
individuals, uncertainty of experimental conditions,
and other factors, there are substantial deviations
and errors in research outcomes. While consensus
among a majority of reports can be considered to
portray the most probable reality regarding a specific
aspect of biological systems, reports exist that are not
consistent with this majority (figure 3).
Whether such minority reports can be discarded as
errors or false reports is debatable. While some will
naturally fall into this category, others may be correct,
and may even report unexpected biological findings
that could lead to a major discovery. How can we distinguish between such erroneous reports and those
with the potential to facilitate major discoveries?
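One simple way to operationalize this triage, sketched below in Python with invented data and thresholds, is to separate reports into a consensus set and a minority set using a robust outlier score, flagging the minority for follow-up rather than discarding it.

import numpy as np

def triage_reports(values, k=3.5):
    """Split reported measurements into a consensus set and minority reports.

    Minority reports are flagged for follow-up, not discarded: as discussed
    above, some of them may encode genuine discoveries.
    """
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median)) or 1e-9   # robust spread estimate
    score = 0.6745 * np.abs(values - median) / mad     # modified z-score
    return values[score <= k], values[score > k]

# Hypothetical replicate measurements of the same interaction strength:
consensus, minority = triage_reports([1.02, 0.97, 1.05, 0.99, 1.01, 2.80])
print(consensus)   # the majority cluster around 1.0
print(minority)    # [2.8]: an error, or possibly something new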
Are We Ready to Embark
on This Challenge?
I have described some of the human cognitive limitations that act as obstacles to efficient biomedical
research, and that AI systems may be able to resolve
during the course of the challenge I am proposing.
Interestingly, there are a few precedents that may
provide a useful starting point. Of the early efforts to
mitigate the information horizon problem, research
using IBM’s Watson computer is currently focused on
the medical domain. The intention is to compile the
vast available literature and present it in a coherent
manner, in contrast to human medical practitioners
and researchers who cannot read and digest the
entire available corpus of information. Watson was
used in a collaboration between IBM, Baylor College
of Medicine, and MD Anderson Cancer Center that
led to the identification of novel modification sites
of p53, an important protein for cancer suppression
(Spangler et al. 2014). The recent DARPA Big Mechanism Project (BMP) aims at the automated extraction of
[Figure 2: a schematic contrasting one reality with two different human cognitive representations of it.]
Figure 2. The Same Reality Can Be Expressed Differently, or the Same Linguistic Expressions May Represent Different Realities.
[Figure 3: a distribution of report frequencies, with the majority of reports clustered around the average value and a minority lying far from it.]
Figure 3. Should Minority Reports Be Discarded? Or Might They Open Up Major Discoveries?
large-scale molecular interactions related to cancer
(Cohen 2014).
With regard to problems of phenotyping inaccuracy, progress in machine learning as exemplified in
deep learning may enable us to resolve some cognitive issues. There are particular hopes that computers
may learn to acquire proper features for representing
complex objects (Bengio 2009; Bengio, Courville,
and Vincent 2013; Hinton 2011). Deep phenotyping
is an attempt to develop much finer-grained and in-depth phenotyping than current practice provides to
establish highly accurate diagnosis, patient classification, and precision clinical decisions (Frey, Lenert,
and Lopez-Campos 2014; Robinson 2012), and some pioneering researchers are using deep learning (Che et al. 2015). Combining deep phenotyping with personal genomics and other comprehensive measurements would dramatically improve diagnostic accuracy and therapeutic effectiveness, as well as drug discovery efficiency.
For generating hypotheses and verifying them,
Ross King and his colleagues have developed a systematic robot scientist that can infer possible biological hypotheses and design simple experiments using
a defined-protocol automated system to analyze
orphan genes in budding yeast (King et al. 2009a,
2009b; King et al. 2004). While this brought only a
moderate level of discovery within the defined context of budding yeast genes, the study represented an
integration of bioinformatics-driven hypothesis generation and automated experimental processes. Such
an automatic experimental system has great potential for expansion and could become a driving force
for research in the future.
Most experimental devices these days are highly
automated and connected to networks. In the near
future, it is likely that many will be supplemented by
high-precision robotics systems, enabling AI systems
not only to access digital information but also to
design and execute experiments. That would mean
that every detail of experimental results, including
incomplete or erroneous data, could be stored and
made accessible. Such progress would have a dramatic impact on the issues of long-tail distribution and
dark data in science (Heidorn 2008).
Crowdsourcing of science, or citizen science, offers
many interesting opportunities, and great potential
for integration with AI systems. The protein-folding
game FoldIt, released in 2008, demonstrated that
with proper redefinition of a scientific problem, ordinary citizens can contribute to the process of scientific discovery (Khatib et al. 2011). The patient-powered research network PatientsLikeMe is another example
of how motivated ordinary people can contribute to
science (Wicks et al. 2015, Wicks et al. 2011). While
successful deployment of community-based science
requires carefully designed missions, clear definition
of problems, and the implementation of appropriate
user interfaces (Kitano, Ghosh, and Matsuoka 2011),
crowdsourcing may offer an interesting opportunity
for AI-based scientific discovery. This is because, with
proper redefinition of a problem, a system may also
help to facilitate the best use of human intelligence.
There are efforts to develop platforms that can
connect a broad range of software systems, devices,
databases, and other necessary resources. The Garuda
platform is an effort to develop an open application
programming interface (API) platform aimed at
attaining a high-level of interoperability among biomedical and bioinformatics analysis tools, databases,
devices, and others (Ghosh et al. 2011). The Pegasus
and Wings system is another example that focuses on
sharing the workflow of scientific activities (Gil et al.
2007). A large-scale collection of workflows from the scientific community, capturing possible sequences of analyses and experiments that could be used and reformulated by AI systems, would be a powerful
knowledge asset. With globally interconnected high-performance computing systems such as InfiniCortex (Michalewicz et al. 2015), we are now getting ready to
undertake this new and formidable challenge. Such
research could form the partial basis of this challenge. At the same time, we still require a clear game
plan, or at the very least an initial hypothesis.
Scientific Discovery as a
Search Problem: Deep
Exploration of Knowledge Space
What is the essence of discovery? To rephrase the
question, what could be the engine for scientific discovery? Consistent and broad-ranging knowledge is
essential, but does not automatically lead to new discoveries. When I talk about this initiative, many scientists ask whether AI can be equipped with the necessary intuition for discovery. In other words, can AI
systems be designed to ask the “right” questions that
may lead to major scientific discoveries? While this
certainly appears to be a valid question, let us think
more deeply here. Why is asking the right question
important? It may be due to resource constraints
(such as the time for which researchers can remain
active in their professional careers), budget, competition, and other limitations. Efficiency is, therefore,
the critical factor in the success of this challenge.
When time and resources are abundant, the importance of asking the right questions is reduced. One
might arrive at important findings after detours, so
the route is not of particular significance. At the same
time, science has long relied to a certain extent on
serendipity, where researchers have made major discoveries by accident. Reflecting on such observations, it is possible to arrive at the hypothesis that
the critical aspect of scientific discovery is how many
hypotheses can be generated and tested, including
examples that may seem highly unlikely.
This suggests the potential of a brute-force approach to scientific discovery.
[Figure 4: a closed loop connecting the entire hypothetical body of scientific knowledge, the hypotheses generated to date (which may include errors and noise), the knowledge in the AI system (portions believed to be correct may in fact be false), and papers, databases, and dark data (which contain errors, inconsistencies, and even fabrications). Some hypotheses require experimental verification; newly verified hypotheses may be consistent with current knowledge or may generate contradictions.]
Figure 4. Bootstrapping of Scientific Discovery and Knowledge Accumulation.
Correct and incorrect knowledge, data, and experimental results are involved throughout this process, though some may be ambiguous.
Scientific discovery requires an iterative cycle aimed at expanding our knowledge on this fragile ground. The aim is to compute, verify, and
integrate every possible hypothesis, thereby building a consistent body of knowledge.
In this approach, AI systems generate and verify as many hypotheses as possible. Such
an approach may differ from the way in which scientists traditionally conduct their research, but could
become a computational alternative to the provision
of scientific insights. It should be stressed that while
the goal of the grand challenge is to make major scientific discoveries, this does not necessarily mean
those discoveries should be made as if by human scientists.
The brute-force approach empowered by machine
learning and heterogeneous inference has already
provided the basis of success for a number of grand
challenges to date. As long as a hypothesis can be verified, scientific discovery can also incorporate computing to search for probable correct hypotheses from
among the full range of possible ones. The fundamental thrust should be toward massive combinatorial hypothesis generation, the maintenance of a consistent repository of global knowledge, and perhaps a
number of other fundamental principles that we may
not be aware of at present. Thus, using computing to generate and verify the full range of logically possible hypotheses as quickly as possible would mitigate resource constraints and enable us to examine even unexpected or seemingly far-fetched ideas.
Such an approach would significantly reduce the
need to ask the right questions, thereby rendering scientific intuition obsolete, and perhaps even enabling
us to explore computational serendipity.
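A minimal sketch of what such massive combinatorial hypothesis generation might look like, at toy scale, follows; the vocabulary and the consistency check are invented for this illustration, and a real system would enumerate and screen orders of magnitude more candidates before ranking them for automated experimental verification.

from itertools import product

# Hypothetical vocabulary; a real system would derive these from its knowledge base.
genes = ["GENE_A", "GENE_B", "GENE_C"]
relations = ["activates", "inhibits"]
contexts = ["normal", "stress"]

def all_hypotheses():
    """Enumerate every syntactically well-formed hypothesis, however unlikely."""
    for g1, rel, g2, ctx in product(genes, relations, genes, contexts):
        if g1 != g2:
            yield (g1, rel, g2, ctx)

def consistent_with_knowledge(h, knowledge):
    """Cheap screen: discard hypotheses that contradict established facts."""
    g1, rel, g2, ctx = h
    opposite = "inhibits" if rel == "activates" else "activates"
    return (g1, opposite, g2, ctx) not in knowledge

knowledge = {("GENE_A", "activates", "GENE_B", "normal")}
candidates = [h for h in all_hypotheses() if consistent_with_knowledge(h, knowledge)]
print(len(candidates), "hypotheses left to rank and send for (virtual) experiments")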
The engine of discovery should be a closed-loop
system of hypothesis generation and verification,
knowledge maintenance, knowledge integration, and
so on (figure 4) and should integrate a range of technologies (figure 5). Fundamentally speaking,
hypotheses, along with constraints imposed on
[Figure 5: the evolution of key elements across grand challenges, culminating in possible elements for scientific discovery such as massive hypothesis generation, distributed and massively parallel access over networks, cyberphysical and crowd integration, and active data.]
Figure 5. Evolution of Key Elements in Grand Challenges and Possible Elements of the Scientific Discovery Grand Challenge.
Computing, memory, and learning have long been key elements in computer chess. Further techniques have originated from the application of computers to the quiz show Jeopardy! To facilitate scientific discovery, an even more complex and sophisticated range of functions
is required. The term twilight-zone reasoning refers to the parsing of data and publications that may be highly ambiguous, error-prone, or
faulty. The elements introduced here represent general ideas on how to approach the scientific discovery grand challenge, rather than originating from precise technical analysis of the necessary functionalities.
hypothesis generation and the initial validation
process, would be derived from the vast body of
knowledge to be extracted from publications, databases, and automatically executed experiments. Successfully verified hypotheses would be added to the
body of knowledge, enabling the bootstrapping
process to continue. It is crucial to recognize that not
all papers and data to emerge from the scientific community are correct or reliable; they contain substantial errors, missing information, and even fabrications. It may be extremely difficult to reproduce the
published experimental results, and some may prove
impossible to re-create (Prinz, Schlange, and Asadullah 2011). At the same time, major progress is continually being made in the field of biomedical science. How can this be possible if such a high
proportion of papers present results that are false or
not reproducible? While individual reports may contain a range of problems, collective knowledge has
the potential to uncover truths from even an errorprone scientific process. This is a twilight zone of scientific discovery, and AI systems need to be able to
reason in the twilight zone. The proposed challenge
would shed light on this conundrum.
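The closed loop of figure 4, including its twilight-zone aspect, can be caricatured in a few lines of Python: repeated noisy verification gradually pulls the system's degree of belief toward the truth, and believed hypotheses are folded back into the knowledge base. Everything here (hypothesis names, error rate, update rule) is invented for illustration.

import random

random.seed(1)
confidence = {}                     # hypothesis -> current degree of belief

def noisy_experiment(h, truth, error_rate=0.1):
    """Experiments in the 'twilight zone': results are sometimes simply wrong."""
    correct = truth[h]
    return correct if random.random() > error_rate else not correct

def update(h, outcome, lr=0.2):
    """Blend each new (possibly erroneous) result into the running belief."""
    prior = confidence.get(h, 0.5)
    confidence[h] = prior + lr * ((1.0 if outcome else 0.0) - prior)

# Ground truth is unknown to the system; it exists here only to simulate the lab.
truth = {"H1": True, "H2": False}
for _ in range(50):                 # the closed loop of figure 4, in miniature
    for h in truth:
        update(h, noisy_experiment(h, truth))

believed = {h for h, c in confidence.items() if c > 0.5}
print(confidence, "->", believed)   # repetition tends to pull belief toward truth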
Advanced Intelligence
What is certain is that such a system would substantially reinforce the intellectual capabilities of humans
in a manner that is entirely without precedent and
that holds the potential to change fundamentally the
way science is conducted.
The first-ever defeat of a chess grand master by an
AI system was followed by the emergence of a new
style of chess known as advanced chess, in which
human and computer work together as a team, to
take on similarly equipped competitors. This partnership may be considered a form of human-computer symbiosis in intelligent activities. Similarly, we
can foresee that in the future sophisticated AI systems
and human researchers will work together to make
major scientific discoveries. Such an approach can be
considered “advanced intelligence.”
Advanced intelligence as applied to scientific discovery would go beyond existing combinations of AI
and human experts. Just as most competitive biomedical research institutions are now equipped with
high-throughput experimental systems, I believe that
AI systems will become a fundamental part of the
infrastructure for top-level research institutions in
the future. This may involve a substantial level of crowd intelligence, in which both qualified researchers and ordinary people contribute, each to different tasks, thereby forming a collaborative intelligence that could be ably and efficiently orchestrated by AI systems. Drawing
this idea out to its extreme, it may be possible to
place AI systems at the center of a network of intelligent agents — comprising both other AI systems and
humans — to coordinate large-scale intellectual
activities. Whether this path would ultimately make
our civilization more robust (by facilitating a series of
major scientific discoveries) or more fragile (due to
extensive and excessive dependence on AI systems) is
yet to be seen. However, just as Thomas Newcomen’s
atmospheric engine was turned into a modern form
of steam engine by James Watt to become the driving
force of the industrial revolution, AI scientific discovery systems have the potential to drive a new revolution that leads to new frontiers of civilization.
References
Bengio, Y. 2009. Learning Deep Architectures for AI. Foundations and Trends in Machine Learning 2(1): 1–127.
Bengio, Y.; Courville, A.; and Vincent, P. 2013. Representation Learning: A Review and New Perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(8): 1798–1828.
Caron, E.; Ghosh, S.; Matsuoka, Y.; Ashton-Beaucage, D.; Therrien, M.; Lemieux, S.; Perreault, C.; Roux, P.; and Kitano, H. 2010. A Comprehensive Map of the mTOR Signaling Network. Molecular Systems Biology 6: 453.
Che, Z.; Kale, D.; Li, W.; Bahadori, M. T.; and Liu, Y. 2015. Deep Computational Phenotyping. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: Association for Computing Machinery.
Cohen, P. 2014. Big Mechanism [Project Announcement]. Arlington, VA: Defense Advanced Research Projects Agency.
EURORDIS. 2007. Survey of the Delay in Diagnosis for 8 Rare Diseases in Europe (EurordisCare2). Brussels, Belgium: EURORDIS Rare Diseases Europe.
Feigenbaum, E., and Feldman, J. 1963. Computers and Thought. New York: McGraw-Hill Book Company.
Ferrucci, D.; Brown, E.; Chu-Carroll, J.; Fan, J.; Gondek, D.; Kalyanpur, A.; Lally, A.; Murdock, J. W.; Nyberg, E.; Prager, J.; Schlaefer, N.; and Welty, C. 2010. Building Watson: An Overview of the DeepQA Project. AI Magazine 31(3): 59–79.
Ferrucci, D.; Levas, A.; Bagchi, S.; Gondek, D.; and Mueller, E. 2013. Watson: Beyond Jeopardy! Artificial Intelligence 199–200 (June–July): 93–105.
Frey, L. J.; Lenert, L.; and Lopez-Campos, G. 2014. EHR Big Data Deep Phenotyping: Contribution of the IMIA Genomic Medicine Working Group. Yearbook of Medical Informatics 9: 206–211.
Ghosh, S.; Matsuoka, Y.; Asai, Y.; Hsin, K. Y.; and Kitano, H. 2011. Software for Systems Biology: From Tools to Integrated Platforms. Nature Reviews Genetics 12(12): 821–832.
Gil, Y.; Greaves, M.; Hendler, J.; and Hirsh, H. 2014. Amplify Scientific Discovery with Artificial Intelligence. Science 346(6206): 171–172.
Gil, Y., and Hirsh, H. 2012. Discovery Informatics: AI Opportunities in Scientific Discovery. In Discovery Informatics: The Role of AI Research in Innovating Scientific Processes: Papers from the AAAI Fall Symposium, 1–6. Technical Report FS-12-03. Palo Alto, CA: AAAI Press.
Gil, Y.; Ratnakar, V.; Deelman, E.; Mehta, G.; and Kim, J. 2007. Wings for Pegasus: Creating Large-Scale Scientific Applications Using Semantic Representations of Computational Workflows. In Proceedings of the 19th Innovative Applications of Artificial Intelligence Conference (IAAI-07). Palo Alto, CA: AAAI Press.
Hase, T.; Ghosh, S.; Yamanaka, R.; and Kitano, H. 2013. Harnessing Diversity Towards the Reconstructing of Large Scale Gene Regulatory Networks. PLoS Computational Biology 9(11): e1003361.
Heidorn, P. B. 2008. Shedding Light on the Dark Data in the Long Tail of Science. Library Trends 57(2): 280–299.
Hinton, G. 2011. A Better Way to Learn Features. Communications of the ACM 54(10).
Hsin, K. Y.; Ghosh, S.; and Kitano, H. 2013. Combining Machine Learning Systems and Multiple Docking Simulation Packages to Improve Docking Prediction Reliability for Network Pharmacology. PLoS One 8(12): e83922.
Hsu, F.-H. 2004. Behind Deep Blue: Building the Computer That Defeated the World Chess Champion. Princeton, NJ: Princeton University Press.
Kennedy, J. F. 1961. Special Message to Congress on Urgent National Needs, 25 May 1961. Papers of John F. Kennedy. Presidential Papers. President's Office Files. JFKPOF-034-030. John F. Kennedy Presidential Library, Boston, MA.
Kennedy, J. F. 1962. Address at Rice University on the Nation's Space Effort, 12 September 1962. Accession Number USG:15 reel 29. John F. Kennedy Presidential Library, Boston, MA.
Khatib, F.; DiMaio, F.; Foldit Contenders Group; Foldit Void Crushers Group; Cooper, S.; Kazmierczyk, M.; Gilski, M.; Krzywda, S.; Zabranska, H.; Pichova, I.; Thompson, J.; Popović, Z.; Jaskolski, M.; and Baker, D. 2011. Crystal Structure of a Monomeric Retroviral Protease Solved by Protein Folding Game Players. Nature Structural and Molecular Biology 18(10): 1175–1177.
King, R. D.; Rowland, J.; Oliver, S. G.; Young, M.; Aubrey, W.; Byrne, E.; Liakata, M.; Markham, M.; Pir, P.; Soldatova, L. N.; Sparkes, A.; Whelan, K. E.; and Clare, A. 2009a. The Automation of Science. Science 324(5923): 85–89.
King, R. D.; Rowland, J.; Oliver, S. G.; Young, M.; Aubrey, W.; Byrne, E.; Liakata, M.; Markham, M.; Pir, P.; Soldatova, L. N.; Sparkes, A.; Whelan, K. E.; and Clare, A. 2009b. Make Way for Robot Scientists. Science 325(5943): 945.
King, R. D.; Whelan, K. E.; Jones, F. M.; Reiser, P. G.; Bryant, C. H.; Muggleton, S. H.; Kell, D. B.; and Oliver, S. G. 2004. Functional Genomic Hypothesis Generation and Experimentation by a Robot Scientist. Nature 427(6971): 247–252.
Kitano, H. 1993. Challenges of Massive Parallelism. In Proceedings of the 13th International Joint Conference on Artificial Intelligence, 813–834. San Mateo, CA: Morgan Kaufmann Publishers.
Kitano, H. 2002a. Computational Systems Biology. Nature 420(6912): 206–210.
Kitano, H. 2002b. Systems Biology: A Brief Overview. Science 295(5560): 1662–1664.
Kitano, H.; Asada, M.; Kuniyoshi, Y.; Noda, I.; Osawa, E.; and Matsubara, H. 1997. RoboCup: A Challenge Problem for AI. AI Magazine 18(1): 73–85.
Kitano, H.; Ghosh, S.; and Matsuoka, Y. 2011. Social Engineering for Virtual 'Big Science' in Systems Biology. Nature Chemical Biology 7(6): 323–326.
Korzybski, A. 1933. Science and Sanity: An Introduction to Non-Aristotelian Systems and General Semantics. Chicago: Institute of General Semantics.
Langley, P., and Simon, H. 1987. Scientific Discovery: Computational Explorations of the Creative Processes. Cambridge, MA: The MIT Press.
Lee, K. F. 1988. Automatic Speech Recognition: The Development of the SPHINX System. New York: Springer.
Lenat, D., and Brown, J. 1984. Why AM and EURISKO Appear to Work. Artificial Intelligence 23(3): 269–294.
Li, C.; Liakata, M.; and Rebholz-Schuhmann, D. 2014. Biological Network Extraction from Scientific Literature: State of the Art and Challenges. Briefings in Bioinformatics 15(5): 856–877.
Lindsay, R.; Buchanan, B.; Feigenbaum, E.; and Lederberg, J. 1993. DENDRAL: A Case Study of the First Expert System for Scientific Hypothesis Formation. Artificial Intelligence 61(2): 209–261.
Matsuoka, Y.; Matsumae, H.; Katoh, M.; Eisfeld, A. J.; Neumann, G.; Hase, T.; Ghosh, S.; Shoemaker, J. E.; Lopes, T.; Watanabe, T.; Watanabe, S.; Fukuyama, S.; Kitano, H.; and Kawaoka, Y. 2013. A Comprehensive Map of the Influenza A Virus Replication Cycle. BMC Systems Biology 7: 97.
Michalewicz, M.; Poppe, Y.; Wee, T.; and Deng, Y. 2015. InfiniCortex: A Path to Reach Exascale Concurrent Supercomputing Across the Globe Utilising Trans-Continental InfiniBand and Galaxy of Supercomputers. Position paper presented at the Third Big Data and Extreme-Scale Computing Workshop (BDEC), Barcelona, Spain, 29–30 January.
Oda, K., and Kitano, H. 2006. A Comprehensive Map of the Toll-Like Receptor Signaling Network. Molecular Systems Biology 2: 2006.0015.
Oda, K.; Matsuoka, Y.; Funahashi, A.; and Kitano, H. 2005. A Comprehensive Pathway Map of Epidermal Growth Factor Receptor Signaling. Molecular Systems Biology 1: 2005.0010.
Prinz, F.; Schlange, T.; and Asadullah, K. 2011. Believe It or Not: How Much Can We Rely on Published Data on Potential Drug Targets? Nature Reviews Drug Discovery 10(9): 712.
Robinson, P. N. 2012. Deep Phenotyping for Precision Medicine. Human Mutation 33(5): 777–780.
Shimada, Y.; Gulli, M. P.; and Peter, M. 2000. Nuclear Sequestration of the Exchange Factor Cdc24 by Far1 Regulates Cell Polarity During Yeast Mating. Nature Cell Biology 2(2): 117–124.
Silver, D.; Huang, A.; Maddison, C.; Guez, A.; Sifre, L.; van den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; Dieleman, S.; Grewe, D.; Nham, J.; Kalchbrenner, N.; Sutskever, I.; Lillicrap, T.; Leach, M.; Kavukcuoglu, K.; Graepel, T.; and Hassabis, D. 2016. Mastering the Game of Go with Deep Neural Networks and Tree Search. Nature 529(7587): 484–489.
Spangler, S.; Wilkins, A.; Bachman, B.; Nagarajan, M.; Dayaram, T.; Haas, P.; Regenbogen, S.; Pickering, C. R.; Corner, A.; Myers, J. N.; Stanoi, I.; Kato, L.; Lelescu, A.; Labire, J. J.; Parikh, N.; Lisewski, A. M.; Donehower, L.; Chen, Y.; and Lichtarge, O. 2014. Automated Hypothesis Generation Based on Mining Scientific Literature. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: Association for Computing Machinery.
Turing, A. M. 1950. Computing Machinery and Intelligence. Mind 59(236): 433–460.
Wicks, P.; Lowe, M.; Gabriel, S.; Sikirica, S.; Sasane, R.; and Arcona, S. 2015. Increasing Patient Participation in Drug Development. Nature Biotechnology 33(2): 134–135.
Wicks, P.; Vaughan, T. E.; Massagli, M. P.; and Heywood, J. 2011. Accelerated Clinical Discovery Using Self-Reported Patient Data Collected Online and a Patient-Matching Algorithm. Nature Biotechnology 29(5): 411–414.
Hiroaki Kitano is director of Sony Computer Science Laboratories, Inc.; president of the Systems Biology Institute; a professor at the Okinawa Institute of Science and Technology; and a group director of the Laboratory of Disease Systems Modeling at the Integrative Medical Sciences Center at RIKEN. Kitano is a founder of RoboCup. He received the Computers and Thought Award in 1993 and the Nature Award for Creative Mentoring in Science in 2009. His current research focuses on systems biology, artificial intelligence for biomedical scientific discovery, and their applications.
Planning, Executing, and
Evaluating the Winograd
Schema Challenge
Leora Morgenstern, Ernest Davis, Charles L. Ortiz, Jr.
The Winograd Schema Challenge
(WSC) was proposed by Hector
Levesque in 2011 as an alternative to
the Turing test. Chief among its features
is a simple question format that can
span many commonsense knowledge
domains. Questions are chosen so that
they do not require specialized knowledge or training and are easy for
humans to answer. This article details
our plans to run the WSC and evaluate the results.
The Winograd Schema Challenge (WSC) (Levesque, Davis, and Morgenstern 2012) was proposed by Hector
Levesque in 2011 as an alternative to the Turing test.
Turing (1950) had first introduced the notion of testing a
computer system’s intelligence by assessing whether it could
fool a human judge into thinking that it was conversing with
a human rather than a computer. Although intuitively appealing
and arbitrarily flexible — in theory, a human can ask the
computer system that is being tested wide-ranging questions
about any subject desired — in practice, the execution of the
Turing test turns out to be highly susceptible to systems that
few people would wish to call intelligent.
The Loebner Prize Competition (Christian 2011) is in particular associated with the development of chatterbots that
are best viewed as successors to ELIZA (Weizenbaum 1966),
the program that fooled people into thinking that they were
talking to a human psychotherapist by cleverly turning a person’s statements into questions of the sort a therapist would
ask. The knowledge and inference that characterize conversations of substance — for example, discussing alternate
metaphors in sonnets of Shakespeare — and which Turing
presented as examples of the sorts of conversation that an
intelligent system should be able to produce, are absent in
these chatterbots. The focus is merely on engaging in surface-level conversation that can fool some humans, who do not delve too deeply into the conversation, into thinking for at least a few minutes that they are speaking to another person. The widely reported triumph of the chatterbot Eugene Goostman in fooling 10 out of 30 judges into judging, after a five-minute conversation, that it was human (University of Reading 2014), was due precisely to the system's facility
for this kind of shallow conversation.
Winograd Schemas
In contrast to the Loebner Prize Competition, the
Winograd Schema Challenge is designed to test a system’s ability to understand natural language and use
commonsense knowledge. Winograd schemas (WSs)
are best understood by first considering Winograd
schema halves, which are sentences with at least one
pronoun and two possible referents for that pronoun,
along with a question that asks which of the two referents is correct. An example1 is the following:
The customer walked into the bank and stabbed one
of the tellers. He was immediately taken to the emergency room.
Who was taken to the emergency room? The customer
/ the teller
The correct answer is the teller. We know this because
of all the commonsense knowledge that we have
about stabbings, injuries, and how they are treated.
We know that if someone is stabbed, he is very likely
to be seriously wounded, and that if someone is seriously wounded, he needs medical attention. We
know, furthermore, that people with acute and serious injuries are frequently treated at emergency
rooms. Moreover, there is no indication in the text
that the customer has been injured, and therefore no
apparent reason for him to be taken to the emergency room. We reason with much of this information when we determine that the referent of who in
the second sentence in the example is the teller
rather than the customer.
So far, we are just describing the problem of pronoun disambiguation. Winograd schemas, however,
have a twist: they are constructed so that there is a
special word (or short phrase) that can be substituted
for one of the words (or short set of words) in the sentence, causing the other candidate pronoun referent
to be correct. For example, consider the above sentence with the words police station substituted for
emergency room:
The customer walked into the bank and stabbed one
of the tellers. He was immediately taken to the police
station.
Who was taken to the police station? The customer / the teller
The correct answer now is the customer. To get the right answer, we use our knowledge of what frequently happens in crime scenarios — that the alleged perpetrator is arrested and taken to the police station for questioning and booking — together with our knowledge that stabbing someone is generally considered a crime. Since the text tells us that the customer did the stabbing, we conclude that it must be the customer, rather than the teller, who is taken to the police station.
The existence of the special word is one way to ensure that test designers do not inadvertently construct a set of problems in which word order or sentence structure can be used by test takers to help the disambiguation process. For example, if a sentence with a subject and an object is followed by a phrase or sentence that starts with a pronoun, the subject is more likely than the object to be the referent of the pronoun. A test taker who is given a Winograd schema half, however, knows not to rely on this heuristic, because the existence of the special word or set of words negates it. For instance, in the example, who refers to the subject when the special set of words is police station but to the object when the special set of words is emergency room.
There are three additional restrictions that we place on Winograd schemas: First, humans should be able to disambiguate these questions easily. We are testing whether systems are as intelligent as humans, not more intelligent.
Second, they should not be solvable by selectional restrictions alone. For example, the following would be an invalid Winograd schema:
The women stopped taking the pills, because they were carcinogenic / pregnant.
What were carcinogenic / pregnant? The women / the pills
This example is invalid because one merely needs to know that women, but not pills, can be pregnant, and that pills, but not women, can be carcinogenic, in order to solve this pronoun disambiguation problem. While this fact can also be viewed as a type of commonsense knowledge, it is generally shallower than the sort of commonsense knowledge exemplified by the emergency room / police station example above, in which one needs to reason about several commonsense facts together. The latter is the sort of deeper commonsense knowledge that we believe is characteristic of human intelligence and that we would like the Winograd Schema Challenge to test.
Third, they should be search-engine proof to the extent possible. Winograd schemas should be constructed so that it is unlikely that one could use statistical properties of corpora to solve these problems.
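One plausible way for a test corpus to store a schema together with its special-word alternation is sketched below in Python. The class, field names, and helper method are hypothetical, invented for this illustration; the data is the emergency room / police station schema discussed above.

from dataclasses import dataclass

@dataclass
class WinogradSchema:
    """A schema as described above: one template, two special words, and a
    correct referent that flips with the choice of special word."""
    template: str        # contains "{special}" where the special words go
    question: str        # also contains "{special}"
    candidates: tuple    # the two possible referents for the pronoun
    answers: dict        # special word -> correct referent

    def halves(self):
        """Yield the two Winograd schema halves an entrant would actually see."""
        for word, referent in self.answers.items():
            yield (self.template.format(special=word),
                   self.question.format(special=word),
                   referent)

ws = WinogradSchema(
    template=("The customer walked into the bank and stabbed one of the "
              "tellers. He was immediately taken to the {special}."),
    question="Who was taken to the {special}?",
    candidates=("the customer", "the teller"),
    answers={"emergency room": "the teller", "police station": "the customer"},
)
for text, question, answer in ws.halves():
    print(question, "->", answer)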
Executing and Evaluating the
Winograd Schema Challenge
When the Winograd Schema Challenge was originally conceived and developed, details of the execution
of the challenge were left unspecified. In May 2013,
the participants at Commonsense-2013, the Eleventh
Symposium on Logical Formalizations of Commonsense Reasoning, agreed that focusing on the Winograd Schema Challenge was a high priority for
researchers in commonsense reasoning. In July 2014,
Nuance Communications announced its sponsorship
of the Winograd Schema Challenge Competition
(WSCC), with cash prizes awarded for top computer
systems surpassing some threshold of
performance on disambiguating pronouns in Winograd schemas. At the
time this article was written, the first
competition was scheduled to be held
at IJCAI-2016 in July 2016 in New York, New York, assuming there are systems entered into the competition. Because doing well at the WSC is
difficult, it is possible no systems will
be entered at that time; in this case, the
first competition will be delayed until
we have received notification of interested entrants. Subsequent competitions will be held annually, biennially,
or at some other set interval of time to
be determined.
During the last year, we have developed a set of rules for the competition
that are intended to facilitate test corpus development and participation of
serious entrants. While some parts will
naturally change from one competition to the next — date and time, obviously, as well as hardware limitations
— we expect the overall structure of
the competition to remain the same.
Exact details are given at the Winograd
Schema Challenge Competition website;2 the general structure and requirements are discussed next.
The competition will consist of a
maximum of two rounds: a qualifying
round and a final round. There will be
at least 60 questions in each round.
Each set of questions will have been
tested on at least three human adult
annotators. At least 90 percent of the
questions in the test set will have been
answered correctly by all human annotators. The remaining questions in the
test set (no more than 10 percent of
the test set) will have been answered
correctly by at least half of the human
annotators. This will ensure that the
questions in the test set are those for
which pronoun disambiguation is easy for humans.
It is possible that no system will
progress beyond the first level, in
which case the second round will not
be held. The threshold required to
move from the first to the second level, or to achieve a prize, must be at
least 90 percent or no more than three
percentage points below the interannotator agreement achieved on the test
set, whichever is greater. (For example,
if interannotator agreement on a test is
95 percent, the required system score is
92 percent.)
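As a worked example of the threshold rule just described (a small sketch, assuming scores are expressed in percentage points), the required score can be computed as follows.

def qualifying_threshold(agreement_pct):
    """Required system score in percent: at least 90, or no more than three
    points below the interannotator agreement, whichever is greater."""
    return max(90.0, agreement_pct - 3.0)

print(qualifying_threshold(95.0))   # 92.0, matching the example in the text
print(qualifying_threshold(91.0))   # 90.0, since the 90 percent floor applies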
Pronoun Disambiguation
Problems in the Winograd
Schema Challenge
The first round will consist of pronoun
disambiguation problems (PDPs) that
are taken directly or modified from
examples found in literature, biographies, autobiographies, essays, news
analyses, and news stories; or have
been constructed by the organizers of
the competition. The second round
will consist of halves of Winograd
schemas; almost all of these will have
been constructed by the competition organizers.
Some examples of the sort of pronoun disambiguation problems that
could appear in the first round follow:
Example PDP 1
Mrs. March gave the mother tea and
gruel, while she dressed the little baby
as tenderly as if it had been her own.
She dressed: Mrs. March / the mother
As if it had been: tea / gruel / baby
Example PDP 2
Tom handed over the blueprints he
had grabbed and, while his companion spread them out on his knee,
walked toward the yard.
His knee: Tom / companion
Example PDP 3
One chilly May evening the English
tutor invited Marjorie and myself into
her room.
Her room: the English tutor / Marjorie
Example PDP 4
Mariano fell with a crash and lay
stunned on the ground. Castello
instantly kneeled by his side and
raised his head.
His head: Mariano / Castello
The following can be noted from these
examples: (1) A PDP can be taken
directly from text (example PDP 3 is
taken from Vera Brittain’s autobiography Testament of Youth) or may be
modified (examples PDP 1, 2, and 4 are
modified slightly from the novels Little
Women, Tom Swift and His Airship, and
The Pirate City: An Algerine Tale). (2) A
pronoun disambiguation problem may
consist of more than one sentence, as
in example PDP 4. In practice, we will
rarely use PDPs that contain more than
three sentences. (3) There may be multiple pronouns and therefore multiple
ambiguities in a sentence, as in example PDP 1. In practice, we will have
only a limited number of cases of multiple PDPs based on a single sentence
or set of sentences, since misinterpreting a single text could significantly
lower one’s score if it is the basis for
multiple PDPs.
As in Winograd schemas, a substantial amount of commonsense knowledge appears to be needed to disambiguate pronouns. For example, one
way to reason that she in she dressed
(example PDP 1) refers to Mrs. March
and not the mother, is to realize that
the phrase “as if it had been her own”
implies that it (the baby) is not actually her own; that is, she is not the mother and must, by process of elimination,
be Mrs. March. Similarly one way to
understand that the English tutor is
the correct referent of her in example
PDP 3 is through one’s knowledge of
the way invitations work: X typically
invites Y into X’s domain, and not into
Z’s domain. In particular, X does not
invite Y into Y’s domain. Similar
knowledge of etiquette comes into
play in example PDP 2: one way to
understand that the referent of his is
Tom is through the knowledge that X
typically spreads documents out over
X’s own person, and not Y’s person.
(Other knowledge that comes into play
is the fact that a person doesn’t have a
lap while he is walking, and the structure of the sentence entails that Tom is
the individual who walks to the yard.)
Why Have PDPs in the WSC Competition?
From the point of view of the computer system taking the test, there is no
difference between Winograd schemas
and pronoun disambiguation problems.3 In either case, the system must
choose between two (or more) possible
referents for a pronoun.
Nevertheless, the move from a competition that is run solely on Winograd
schemas to a competition that in its
first round runs solely on pronoun disambiguation problems requires some
The primary reason for having PDPs
is entirely pragmatic. As originally conceived, the Winograd Schema Challenge was meant to be a one-time challenge. An example corpus of more
than 100 Winograd schemas was
developed and published on the web.1
Davis developed an additional 100
Winograd schemas to be used in the
course of that one-time challenge.
Since Nuance’s decision to sponsor the
Winograd Schema Challenge Competition, however, the competition is likely to be run at regular intervals, perhaps yearly. Creating Winograd schemas is difficult, requiring creativity and inspiration, and is too burdensome to do on a yearly or biennial basis.
By running the first round on PDPs,
the likelihood of advancing to the second round without being able to
answer correctly many of the Winograd schemas in the competition is
minimized. Indeed, if a system can
advance to the second round, we
believe there is a good chance that it
will successfully meet the Winograd
Schema Challenge.
Once we had decided on using PDPs
in the initial round, other advantages
became apparent:
First, pronoun disambiguation problems occur very frequently in natural
language text in the wild. One finds
examples in many genres, including
fiction, science fiction, biographies,
and essays. In contrast Winograd
schemas are fabricated natural language text and might be considered
irrelevant to automated natural language processing in the real world. It is
desirable to show that systems are proficient at handling the general pronoun disambiguation problem, which
is a superset of the Winograd Schema
Challenge. This points toward a real-world task that a system excelling in
this competition should be able to do.
Second, a set of PDPs taken from the
wild, and from many genres of writing,
may touch on different aspects of commonsense knowledge than that which
a single person or small group of people could come up with when creating
Winograd schemas.
At the same time it is important to
keep in mind one of the original purposes of Winograd schemas — that the
correct answer be dependent on commonsense knowledge rather than sentence structure and word order — and
to choose carefully a set of PDPs that
retain this property. In addition, strong
preference will be given to PDPs that
do not rely on selectional restriction or
on syntactical characteristics of corpora, and which are of roughly the same
complexity as Winograd schemas.
The aim of this competition is to
advance science; all results obtained
must be reproducible, and communicable to the public. As such, any winning entry is encouraged to furnish to
the organizers of the Winograd
Schema Challenge Competition its
source code and executable code, and
to use open source databases or knowledge bases or make its databases and
knowledge structures available for
independent verification of results. If
an organization cannot do this, other
methods for assuring reproducibility of
results will be considered, such as furnishing a detailed trace of execution.
Details of such methods will be published on the Winograd Schema Challenge Competition website. Entries
that do not satisfy these requirements,
even if excelling at the competition,
will be disqualified.
An individual representing an organization’s entry must be present at the
competition, and must bring a laptop
on which the entry will run. The specifications of the laptop to be used are
given at the Winograd Schema Challenge Competition website. It is
assumed that the laptop will have a
hard drive no larger than one terabyte,
but researchers may negotiate this
point and other details of laptop specifications with organizers. Reasonable
requests will be considered.
Some entries will need to use the
Internet during the running of the test.
This will be allowed but restricted. The
room in which the competition will
take place will have neither wireless nor
cellular access to the Internet. Internet
access will be provided through a high-speed wired cable modem or fiber optic
service. Access to a highly restricted set
of sites will be provided. Access to the
Google search engine will be allowed.
All access to the Internet will be monitored and recorded.
If any entry that is eligible for a prize
has accessed the Internet during the
competition, it will be necessary to ver-
ify that the system can achieve similar
results at another undisclosed time.
The laptop on which the potentially
prize-winning system has run must be
given to the WSCC organizers. They
will then run the system on the test at
some undisclosed time during a two-week period following the competition. Following the system run, organizers will compare the results obtained
with the results achieved during the
competition, and check that they are
reasonably close. Assuming that the
code contains statistical algorithms,
the answers may not be identical
because what is retrieved through
Internet query will not be exactly the
same; however, the differences should
be relatively small.
In the three weeks following the
competition, researchers with winning
or potentially winning entries will be
expected to submit to WSCC organizers a paper explaining the algorithms,
knowledge sources, and knowledge
structures used. These papers will be
posted on the website. Publication on the website
does not preclude any other publication. Entries not submitting such a
paper will be disqualified.
Provisional results will be announced the day after the competition. Three weeks after the competition, final results will be announced.
AI Community’s Potential Gain
Publishing papers on approaches to
solving the Winograd Schema Challenge is required for those eligible for a
prize and highly encouraged for everyone else. All papers submitted will be
posted on the Winograd Schema Challenge Competition website; it is hoped
that in addition they will be submitted
and published in other venues. A central aim of the Winograd Schema Challenge is that it ought to serve as motivation for research in commonsense
reasoning, and we are eager to see the
many directions that this research will take.
WSCC organizers will try to use the
data obtained from running the competition to assess progress in automating commonsense reasoning by calculating the proportion of correct results
in various subfields of commonsense
reasoning. The existing example corpus and test corpus of Winograd
schemas have been developed with the
goal of automating commonsense reasoning, and span many areas of common sense, including physical, spatial,
and social reasoning, as well as commonsense knowledge about many
common domains such as transportation, criminal acts, medical treatment,
and household furnishings. PDPs will
be chosen with this goal and with
these areas of commonsense in mind
as well.
Current plans are to annotate example PDPs and WSs with some of the
commonsense areas that might prove
useful in disambiguating the text. The
WSCC organizers will choose an annotation scheme that is (partly) based on
an existing taxonomy, such as that given by OpenCyc4 or DBPedia.5 Note
that a PDP or WS might be annotated
with several different commonsense
domains. An entire test corpus, annotated in this way, may prove useful in
assessing a system’s proficiency in specific domains of commonsense reasoning. For example, a system might correctly answer 65 percent of all PDPs
and WSs that involve spatial reasoning; but correctly answer only 15 percent of all PDPs and WSs involving
social reasoning. Assuming the sentences are of roughly the same complexity, this could indicate that the system is more proficient at spatial
reasoning than at social reasoning.
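The sketch below shows one way such a per-domain breakdown could be computed. The data layout, in which each test item carries a set of domain annotations and a correctness flag, is an assumption for illustration.

```python
# A minimal sketch of computing per-domain accuracy over an annotated
# test corpus. The (domains, is_correct) layout is a hypothetical
# representation, since a single PDP or WS may carry several annotations.
from collections import defaultdict

def accuracy_by_domain(results):
    totals = defaultdict(int)    # items annotated with each domain
    correct = defaultdict(int)   # items answered correctly per domain
    for domains, is_correct in results:
        for domain in domains:
            totals[domain] += 1
            correct[domain] += int(is_correct)
    return {d: correct[d] / totals[d] for d in totals}

# Example: one spatial item answered correctly, plus one item annotated
# as both spatial and social answered incorrectly.
print(accuracy_by_domain([({"spatial"}, True),
                          ({"spatial", "social"}, False)]))
# {'spatial': 0.5, 'social': 0.0}
```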
The systems that excel in answering
PDPs and WSs correctly should be
capable of markedly improved natural
language processing compared to current systems. For example, in translating from English to French, Google
Translates often translates pronouns
incorrectly, using incorrect gender,
presumably because it cannot properly
determine pronoun references; the
technology underlying a system that
wins the WSCC could improve Google
Translate’s performance in this regard.
More broadly, a system that contains
the commonsense knowledge that
facilitates correctly answering the
many PDPs and WSs in competition
should be capable of supporting a wide
range of commonsense reasoning that
would prove useful in many AI applications, including planning, diagnostics, story understanding, and narrative generation.
The sooner a system wins the Winograd Schema Challenge Competition,
the sooner we will be able to leverage
the commonsense reasoning that such
a system would support. Even before
the competition is won, however, we
look forward to AI research benefiting
from the commonsense knowledge
and reasoning abilities that researchers
build into the systems that will participate in the challenge.
Acknowledgments
This article grew out of an invited talk
by the first author at the Beyond Turing Workshop organized by Gary Marcus, Francesca Rossi, and Manuela
Veloso at AAAI-2015; the ideas were
further developed through conversations and email with the second and
third authors after the conclusion of
the workshop, and during a very productive panel session on the WSC at
Commonsense-2015, held as part of
the AAAI Spring Symposium Series.
Thanks especially to Andrew Gordon,
Jerry Hobbs, Ron Keesing, Pat Langley,
Gary Marcus, and Bob Sloane for helpful discussions.
Notes
1. See E. Davis’s web page, A Collection of
Winograd Schemas, 2012: www.cs.nyu.
3. Except that possibly there may be more
than two choices in a PDP, which is disallowed in WSs by construction. So if a system notices three or more possibilities for
an answer, it could know that it is dealing
with a PDP. But it is a distinction without a
difference; this knowledge does not seem to
lead to any new approach for solution.
References
Christian, B. 2011. Mind Versus Machine. The Atlantic, March.
Levesque, H.; Davis, E.; and Morgenstern, L. 2012. The Winograd Schema Challenge. In Principles of Knowledge Representation and Reasoning: Proceedings of the Thirteenth International Conference (KR2012), 552–561. Palo Alto, CA: AAAI Press.
Turing, A. 1950. Computing Machinery and Intelligence. Mind 59(236): 433–460.
University of Reading. 2014. Turing Test Success Marks Milestone in Computing History. Press Release, June 8, 2014. Communications Office, University of Reading, Reading, UK.
Weizenbaum, J. 1966. ELIZA — A Computer Program for the Study of Natural Language Communication Between Man and Machine. Communications of the ACM 9(1): 36–45.
Leora Morgenstern is a technical fellow
and senior scientist at Leidos Corporation.
Her research focuses on developing innovative techniques in knowledge representation and reasoning, targeted toward deep
understanding of large corpora in a wide
variety of domains, including legal texts,
biomedical research, and social media. She
heads the Executive Committee that has run the
biennial Commonsense Symposium series
since 1991. She received a BA in mathematics from the City College of New York and a
Ph.D. in computer science from Courant
Institute of Mathematical Sciences, New
York University.
Ernest Davis is a professor of computer science at New York University. His research
area is automated commonsense reasoning,
particularly commonsense spatial and physical reasoning. He is the author of Representing and Acquiring Geographic Knowledge
(1986), Representations of Commonsense
Knowledge (1990), and Linear Algebra and
Probability for Computer Science Applications
(2012); and coeditor of Mathematics, Substance and Surmise: Views on the Meaning and
Ontology of Mathematics (2015).
Charles Ortiz is the director of the Nuance
Natural Language and AI Laboratory. His
research is in collaborative multiagent systems, knowledge representation and reasoning (causation, counterfactuals, and
commonsense reasoning), and robotics
(cognitive and team-based robotics). His
previous positions include director of
research in collaborative multiagent systems at the AI Center at SRI International,
adjunct professor at the University of California, Berkeley, and postdoctoral research
fellow at Harvard University. He received an
S.B. in physics from the Massachusetts Institute of Technology and a Ph.D. in computer and information science from the University of Pennsylvania.
Why We Need a Physically
Embodied Turing Test and
What It Might Look Like
Charles L. Ortiz, Jr.
The Turing test, as originally conceived, focused on language and reasoning; problems of perception and action
were conspicuously absent. To serve as a
benchmark for motivating and monitoring progress in AI research, this article
proposes an extension to that original
proposal that incorporates all four of
these aspects of intelligence. Some initial suggestions are made regarding how
best to structure such a test and how to
measure progress. The proposed test also
provides an opportunity to bring these
four important areas of AI research back
into sync after each has regrettably
diverged into a fairly independent area
of research of its own.
For Alan Turing, the problem of creating an intelligent
machine was to be reduced to the problem of creating a
thinking machine (Turing 1950). He observed, however,
that such a goal was somewhat ill-defined: how was one to
conclude whether or not a machine was thinking (like a
human)? So Turing replaced the question with an operational
notion of what it meant to think through his now famous
Turing test. The details are well known to all of us in AI. One
feature of the test worth emphasizing, however, is its direct
focus on language and its use: in its most well known form,
the human interrogator can communicate but not see the
computer and the human subject participating in the test.
Hence, in a sense, it has always been tacitly assumed that
physical embodiment plays no role in the Turing test. Hence,
if the Turing test is to represent the de facto test for intelligence, having a body is not a prerequisite for demonstrating
intelligent behavior.1
The general acceptance of the Turing test as a sensible
measure of achievement in the quest to make computers
intelligent has naturally led to an emphasis on equating
intelligence with cogitation and communication. But, of
course, in AI this has only been part of the story: disembodied thought alone will not get one very far in the world. The
enterprise to achieve AI has always equally concerned itself
with the problems of perception and action. In the physical
world, this means that an agent needs to be able to perform
physical actions and understand the physical actions of others.
Also of concern for the field of AI is the problem of how to
quantify progress and how to support incremental
development; it is, by now, pretty much agreed upon
that the Turing test represents a rather weak tool for
measuring the level of demonstrable intelligence or
thinking associated with a particular subject, be it
human or artificial. The passing of the test by a
machine would certainly justify one in announcing
the arrival of human-level AI, but along the way, it
can only provide a rather crude measure. To address
this deficiency, variants of the Turing test have been
proposed and are being pursued; one notable example is the Winograd Schema Challenge2 that supports
incremental testing and development (Levesque,
Davis, and Morgenstern 2012). The Winograd
Schema Challenge does not, however, address the
physical embodiment concerns that are the subject
of this article. Nevertheless, any proposed alternative
must bring with it a reasonable set of quantifiable
measures of performance.
So, what is it about the Turing test that makes it
unsuitable for gauging progress in intelligent perception and action?3 From the perspective of action, the
Turing test can only be used to judge descriptions of
actions that one could argue were sufficiently
detailed to be, in principle, executable. Consider
some simple everyday ascriptions of action: “Little
Johnny tied his shoelace,” or “LeBron James just hit
a layup.” If perception is taken completely out of the
picture, a purely linguistic description of these types
of actions is rather problematic (read: a royal pain):
one would have to write down a set of rules or axioms
that correctly captured the appropriate class of movement actions and how they were stitched together to
produce a particular spatiotemporally bounded highlevel movement, in this case, bona fide instances of
shoelace tying or basketball layups. A more sensible
alternative might involve learning from many examples, along the lines demonstrated by Li and Li
(2010). And for that, you need to be able to perceive.
It’s hard for me to describe a shoe-tying to you if you
have never seen one or could never see one.4
However, consider now the problem of judging the
feasibility of certain actions without perception, such
as reported by the statement, “the key will not fit in
the lock.” Through a process of spatial reasoning, an
agent can determine whether certain objects (such as
a ball) might fit into certain other objects (such as a
suitcase). However, this sort of commonsense reasoning could only help with our example during initial considerations: perhaps to conclude whether a
particular key was a candidate for fitting into a particular lock given that it was of a particular type. After
all, old antique keys, car keys, and house keys all look
different. However, it would still be quite impossible
to answer the question, “Will the key fit?” without
being able to physically perceive the key and the keyhole, physically manipulating the key, trying to get it
into the hole, and turning the key.5 It’s no surprise,
then, that the challenges that these sorts of actions
raise have received considerable attention in the
robotics literature: Matt Mason at CMU categorizes
them as paradigmatic examples of “funneling
actions” in which other artifacts in the environment
are used to guide an action during execution (Mason
2001). Note that from a purely linguistic standpoint,
the details of such action types have never figured
into the lexical semantics of the corresponding verb.
From a commonsense reasoning perspective in AI,
their formalization has not been attempted for the
reasons already given.6
These observations raise the question of whether
verbal behavior and reasoning are the major indicators of intelligence, as Descartes and others believed.
The lessons learned from AI over the last 50 years
should suggest that they do not. Equally challenging
and important are problems of perception and
action. Perhaps these two problems have historically
not received as much attention due to a rather firmly held belief that what separates human from beast
is reasoning and language: all animals can see and
act, after all: one surely should not ascribe intelligence to a person simply because he or she can, for
example, open a door successfully. However, any
agent that can perform only one action — opening a
door — is certainly not a very interesting creature, as
neither is one that can utter only one particular sentence. It is, rather, the ability to choose and compose
actions for a very broad variety of situations that distinguishes humans. In fact, humans possess a rather
impressive repertoire of motor skills that distinguish
them from lower primates: highly dexterous,
enabling actions as diverse as driving, playing the
piano, dancing, playing football, and others. And certainly, from the very inception of AI, problems of
planning and acting appeared center stage
(McCarthy and Hayes 1969).
Functional Individuation of Objects
The preceding illustrations served to emphasize the
difficulty in reasoning and talking about many
actions without the ability to perceive them. However, our faculty of visual perception by itself, without
the benefit of being able to interact with an object or
reason about its behavior, runs up against its own difficulties when it attempts to recognize correctly
many classes of objects.
For example, recognizing something as simple as a
hinge requires not only that one can perceive it as
something that resembles those hinges seen in the
past, but also that one can interact with it to conclude that it demonstrates the necessary physical
behavior: that is, that it consists of two planes that
can rotate around a common axis. Finally, one must
also be able to reason about the object in situ. The
latter requires that one can reason commonsensically to determine whether it is sufficiently rigid, can be
attached to two other objects (such as a door and a wall), and is also constructed so that it can bear the weight of one or both of those objects. So this very simple example involving the functional individuation of an object requires, by necessity, the integration of perception, action, and commonsense reasoning. The challenge tasks described in the next section nicely highlight the need for such integrated capabilities.

Figure 1. Collaboratively Setting Up a Tent. A major challenge is to coordinate and describe actions, such as “Hold the pole like this while I attach the rope.”

The Challenge
This leads finally to the question of what would constitute a reasonable physically embodied Turing test
that would satisfy the desiderata so far outlined:
physical embodiment coupled with reasoning and
communication, support for incremental development, and the existence of clear quantitative measures of progress.
In my original description of this particular challenge, I attempted to parallel the original Turing test
as much as possible. I imagined a human tester communicating with a partially unseen robot and an
unseen human; the human would have access to a
physically equivalent but teleoperated pair of robot
manipulators. The tester would not be able to see the
body of either, only the mechanical arms and video
sensors. Significant differences in the appearance of
motion between the two could be reduced through
stabilizing software to smooth any jerky movements.
The interrogator would interact with the human
and robot subject through language, as in the Turing
test, and would be able to ask questions or make commands that would lead to the appropriate physical
actions. The tester would also be able to demonstrate actions physically.
However, some of the participants of the workshop
at which this idea was first presented7 observed that
particular expertise involving tele-operation might
render comparisons difficult. The participants of the
workshop agreed that the focus should instead be on
defining a set of progressively more challenging
problem types. The remainder of this document follows that suggestion.
This challenge will consist of two tracks: the construction track and the exploration track.
The construction track’s focus will be on building
predefined structures (such as a tent or modular furniture) given a combination of verbal instructions
and diagrams or pictures. A collaborative subtrack
will extend this to multiple individuals, a human
agent and a robotic agent.
The exploration track will be more improvisational in flavor and focus on experiments in building,
modifying, and interacting with complex structures
in terms of more abstract mental models, possibly
acquired through experimentation itself. These structures can be static (for example, as in figure 3) or dynamic (as in figure 6).

Figure 2. The IkeaBots Developed at the Massachusetts Institute of Technology Can Collaborate on the Construction of Modular Furniture.
Communication through natural language will be
an integral part of each track. One of the principal
goals of this challenge is to demonstrate grounding
of language both during execution of a task and after
completion. For example, for both the exploration
and the construction tracks, the agents must be able
to accept initial instructions, describe and explain
what they are doing, accept critique or guidance, and
consider hypothetical changes.8
The Construction Track
The allowable variability of target structures in the
construction track is expected to be less than in the
exploration track. The construction task will involve
building predefined structures that would be specified through a combination of natural language and
pictures. Examples might include an object such as a
tent (figure 1) or putting together Ikea-like furniture
(figure 2). Often, ancillary information in the form of
diagrams or snapshots plays an important role in
instructions (see, for example, figure 4). During the
task challenge definition phase, the degree to which
this complex problem can be limited (or perhaps
included as part of another challenge) will be investigated. Crowdsourced sites that contain such
instructions might be useful to consult in this regard.9
The collaboration task requires that the artificial
and human agents exchange information before and
during execution to guide the construction task. A
teammate might ask for help through statements
such as, “Hold the tent pole like this while I tighten
the rope”; the system must reason commonsensically about the consequences of the planned action
involving the rope-tightening for the requested task
as well as how an utterance such as “Hold . . . like this
. . .” should be linguistically interpreted and coordinated with the simultaneous visual interpretation.
Rigidity of materials, methods of attachment, and
the structural function of elements (that is, that tent
poles are meant to hold up the fabric of a tent) will
be varied, as will the intended functionality of the finished product (for example, a tent
should keep water out and also not fall apart when
someone enters it). Eventually, time to completion
could also be a metric; however, for now, these proposed tasks are of sufficient difficulty that the major
concern should simply be success.
The description given here of the construction task
places emphasis on robotic manipulation; however,
there are nonmanipulation robotic tasks that could
be incorporated into the challenge that also involve
an integration of perception, reasoning, and action.
Examples include finding a set of keys, counting the number of chairs in a room, and delivering a message to some person carrying a suitcase.10 An organization committee that will be selected for this challenge will investigate the proper mix of such tasks into the final challenge roadmap.

Table 1. Some Possible Levels of Progression for the Construction Track Challenge Tasks.
Abilities demonstrated:
1. Construction by one agent: basic physical, perceptual, and motor skills.
2. Monitoring the activity (perceive progress, identify obstacles), contribute help.
3. Reference (“hold like <this>”), offer help, explain, question answering (“why did you let go?”), narrate activity as necessary.
Certain capabilities might best be first tested somewhat independently; for example, perception faculties might be tested for by having the agent watch a human perform the task and being able to narrate what it observes.
There are many robotic challenges involving
manipulation and perception related to this challenge. However, a number of recent existence proofs
provide some confidence that such a challenge can
be initiated now. The final decisions on subchallenge
definition will be made by the organizing committee.
As the complexity of these tasks increases, one can
imagine their real-world value in robot-assistance
tasks as demanding as, say, repairing roads, housing
construction, or setting up camp on Mars.
The IkeaBot system (figure 2) developed at MIT is
one such existence proof: it demonstrates the collaboration of teams of robots in assembling Ikea furniture during which robots are able to ask automatically for help when needed (Knepper et al. 2013). Other
work involving communication and human-robot
collaboration coupled with sophisticated laboratory
manipulation capabilities has been demonstrated at
Carnegie Mellon University, and represents another
good starting point (Strabala et al. 2012).
Research in computer vision has made impressive
progress lately (Li and Li 2010, Le et al. 2012),
enabling the learning and recognition of complex
movements and feature-rich objects. It is hoped that
this challenge would motivate extensions that factor functional considerations into any object-recognition process.
Finally, the organization committee hopes to be
able to leverage robotic resources under other activities such as the RoboCup Home Challenge,11 as
much as possible.
The Exploration Track
If you’ve ever watched a child play with toys such as
Lego blocks, you know that the child does not start
with a predefined structure in mind. There is a strong
element of improvisation and experimentation during a child’s interactions, exploring possible structures, adapting mental models (such as that of a
house or car), experimenting with sequences of
attachment, modifying structures, and so on. Toys
help a child groom the mind-body connection, serving as a sort of laboratory for exploring commonsense notions of space, objects, and physics.
For the exploration track, I therefore propose
focusing on the physical manipulation of children’s
toys, such as Lego blocks (figure 3). The main difference between the two tracks is that the exploration
track supports experimentation involving the modification of component structures, adjusting designs
according to resources available (number of blocks,
for example), and exploring states of stability during
execution. These are all possible because of the simple modular components that agents would work
with. The exploration track would also allow for testing the ability of intelligent agents to build a dynamic system and describe its operation in commonsense terms.
Incremental progression of difficulty would be
possible by choosing tasks to roughly reflect levels of
child development.
Table 2 summarizes possible levels of progression.
The idea is to create scenarios with a pool of physical
resources that could support manipulation, commonsense reasoning, and abstraction of structures
and objects, vision, and language (for description,
explanation, hypothetical reasoning, and narrative).
Figure 3 illustrates a static complex structure while
the object in figure 6 involves the interaction of
many parts. In the latter case, success in the construction of the object also involves observing and
demonstrating that the end functionality is the
intended one. In the figure, there is a small crank at
the bottom left that results in the turning of a long
screw, which lifts metal balls up a column into
another part of the assembly in which the balls fall
down ramps turning various wheels and gates along
the way. A description along the lines of the last sentence is an example of the sort of explanation that a robot should be able to provide, in which the abstract objects are functionally individuated, in the manner described earlier.

Table 2. A Sequence of Progressively More Sophisticated Skills to Guide the Definition of Subtask Challenges Within the Exploration Track.
1. Simple manipulation: create a row of blocks; then a wall.
2. Construction and abstraction: connect two walls; then build a “house”; size depends on number of blocks available.
3. Modification: add integrated structures, such as a parking garage, to the house.
4. Narrative generation: “This piece is like a hinge that needs to be placed before the wall around it; otherwise it won’t fit later” (said while installing the door of a house structure).
5. Explanation: “The tower fell because the base was too …”
6. Hypothetical reasoning: “What will happen if you remove this?”
Figure 5 shows another assembly that demonstrates the creation of new objects (such as balls from
clay, a continuous substance), operating a machine
that creates small balls, fitting clay into syringes, and
making lollipop shapes with swirls made from multiple color clays.12 Tasks involving explaining the operation of such a device, demonstrating its operation,
having a particular behavior replicated, and answering questions about processes involved are all beyond
the abilities of current AI systems.
Manipulation of Lego blocks and other small toy
structures would require robotic manipulators capable of rather fine movements. Such technology exists
in robotic surgical systems as well as in less costly
components under development by a number of companies.

Figure 3. An Abstract Structure of a House Built Using Lego Blocks.

Relation to Research in Commonsense Reasoning

The more ambitious exploration track emphasizes
the development of systems that can experiment on
their own, intervening into the physical operation of
a system and modifying the elements and connections of the system to observe the consequences and,
in the process, augment their own commonsense
knowledge. Rather than having a teacher produce
many examples, such self-motivated exploring
agents would be able to create alternative scenarios
and learn from them on their own. Currently this is
all done by hand; for example, if one wants to encode
the small bit of knowledge that captures the fact that
not tightening the cap on a soda bottle will cause it to lose its carbonation, one would write down a suitable set of axioms. The problem, of course, is that there is so much of this sort of knowledge.

Figure 4. Instructions Often Require Pictures or Diagrams. The step-by-step instructions are for a Lego-like toy. Notice that certain pieces, such as the window or wheels, are unrecognizable as such unless they are placed in the correct context of the overall structure.
Research in cognitive science suggests the possibility of the existence of bodies of core commonsense
knowledge (Tenenbaum 2015). The exploration track
provides a setting for exploring these possibilities.
Perhaps within such a laboratory paradigm, the role
of traditional commonsense reasoning research
would shift to developing general principles, such as
models of causation or collaboration. AI systems
would then instantiate such principles during self-directed experimentation.
The proposed tests will provide an opportunity to
bring four important areas of AI research (language,
reasoning, perception, and action) back into sync
after each has regrettably diverged into a fairly independent area of research.
Figure 5. Manipulation and Object
Formation with Nonrigid Materials.
This article was not about the blocks world and it has
not argued for the elimination of reasoning from
intelligent systems in favor of a stronger perceptual
component. This article argued that the Turing test
was too weak an instrument for testing all aspects of
intelligence and, inspired by the Turing test, proposed an alternative that was argued to be more suitable for motivating and monitoring progress in settings that demand an integrated deployment of
perceptual, action, commonsense reasoning, and language faculties. The challenge described in this document differs from other robotic challenges in terms
of its integrative aspects. Also unique here is the perspective on agent embodiment as leading to an agent-initiated form of experimentation (the world as a physical laboratory) that can trigger commonsense learning.

Figure 6. The Exploration Track Will Also Involve Dynamic Toys with Moving Parts and Some Interesting Aggregate Physical Behavior. The modularity afforded by toys makes this much easier than working with large, expensive systems. This picture is a good illustration of the need for functional understanding of elements of a structure. In the picture, the child can turn a crank at the bottom left — a piece that has functional significance — that turns a large red vertical screw that then lifts metal balls up a shaft, after which they fall through a series of ramps, turning various gears along the way.
The considerable span of time that has elapsed
since Turing proposed his famous test should be sufficient for the field of AI to devise more comprehensive tests that stress the abilities of physically embodied intelligent systems to think as well as do.
Notes
1. One should resist the temptation here of equating intelligence with being smart in the human sense, as in having
a high IQ. That has rarely been the case in AI where we have
usually been quite happy to try to replicate everyday
human behavior. In the remainder of this article, I will use
the term intelligence in this more restrictive, technical sense.
2. Winograd Challenge, 2015,
3. I certainly would not deny that a program that passed the
Turing test was intelligent. What I am suggesting is that it
would not be intelligent in a broad enough set of areas for
the many problems of interest to the field of AI. The Turing
test was never meant as a necessary test of intelligence, only
a sufficient one. The arguments that I am presenting, then,
suggest that the Turing test also does not represent a sufficient condition for intelligence, only evidence for intelligence (Shieber 2004).
4. I take this point to be fairly uncontroversial in AI: a manual describing some action (such as setting up a tent) is often fairly useless without the pictures.
5. A similar observation was made in the context of the spatial manipulation of buttons (Davis 2011).
6. Put most simply, the best that the Turing test could test
for is whether a subject would answer correctly to something like, “Suppose I had a key that looked like . . . and a
lock that looked like . . . Would it fit?” How on earth is one
to find something substantive to substitute (that is, to say)
for the ellipses here that would have any relevant consequence for the desired conclusion in the actual physical world?
7. Beyond the Turing Test: AAAI-15 Workshop WS06. January 25, 2015, Austin, Texas.
8. One might be concerned that the inclusion of language
is overly ambitious. However, without it one would be left
with a set of challenge problems that could just as easily be
sponsored by the robotics or computer vision communities
alone. The inclusion of language makes this proposed challenge more appropriately part of the concerns of general AI.
9. See, for example,
10. I am grateful to an anonymous reviewer for bringing up
this point.
12. See
References
Davis, E. 2011. Qualitative Spatial Reasoning in Interpreting Narrative. Keynote talk presented at the 2011 Conference on Spatial Information Theory, September 14, Belfast.
Knepper, R. A.; Layton, T.; Romanishin, J.; and Rus, D. 2013.
IkeaBot: An Autonomous Multirobot Coordinated Furniture Assembly System. In Proceedings of the 2013 IEEE International Conference on Robotics and Automation (ICRA). Piscataway, NJ: Institute of Electrical and Electronics Engineers.
Le, Q. V.; Ranzato, M.; Monga, R.; Devin, M.; Chen, K.; Corrado, G. S.; Dean J.; and Ng, A. Y. 2012. Building High-Level Features Using Large Scale Unsupervised Learning. In Proceedings of the 29th International Conference on Machine
Learning. Madison, WI: Omnipress.
Levesque, H.; Davis, E.; and Morgenstern, L. 2012. The
Winograd Schema Challenge. In Principles of Knowledge Representation and Reasoning: Proceedings of the Thirteenth International Conference (KR2012), 552–561. Palo Alto, CA: AAAI Press.
Li, F. F., and Li, L.-J. 2010. What, Where, and Who? Telling the Story of an Image by Activity Classification,
Scene Recognition, and Object Categorization. In Computer
Vision: Detection, Recognition, and Reconstruction, Studies in
Computational Intelligence Volume 285. Berlin: Springer.
Mason, M. T. 2001. Mechanics of Robotic Manipulation. Cambridge, MA: The MIT Press.
McCarthy, J., and Hayes, P. J. 1969. Some Philosophical Problems from the Standpoint of Artificial Intelligence. In Machine Intelligence 4, 463–502. Edinburgh, UK: Edinburgh University Press.
Shieber, S., ed. 2004. The Turing Test: Verbal Behavior as the
Hallmark of Intelligence. Cambridge, MA: The MIT Press.
Strabala, K.; Lee, M. K.; Dragan, A.; Forlizzi, J.; and Srinivasa,
S. 2012. Learning the Communication of Intent Prior to
Physical Collaboration. In Proceedings of the 21st IEEE International Symposium on Robot and Human Interactive Communication. Piscataway, NJ: Institute of Electrical and Electronics Engineers.
Tenenbaum, J. 2015. Cognitive Foundations for Commonsense Knowledge Representation. Invited talk presented at
the AAAI 2015 Spring Symposium on Knowledge Representation and Reasoning: Integrating Symbolic and Neural
Approaches. Alexandria, VA, 23–25 March.
Turing, A. M. 1950. Computing Machinery and Intelligence.
Mind 59(236): 433–460.
Charles Ortiz is director of the Laboratory for Artificial
Intelligence and Natural Language at Nuance Communications. Prior to joining Nuance, he was the director of
research in collaborative multiagent systems at the AI Center at SRI International. His research interests and contributions are in multiagent systems (collaborative dialog-structured assistants and logic-based BDI theories), knowledge
representation and reasoning (causation, counterfactuals,
and commonsense reasoning), and robotics (cognitive and
team robotics). He is also involved in the organization of
the Winograd Schema Challenge with Leora Morgenstern
and others. He holds an S.B. in physics from the Massachusetts Institute of Technology and a Ph.D. in computer and
information science from the University of Pennsylvania.
He was a postdoctoral research fellow at Harvard University
and has taught courses at Harvard and the University of California, Berkeley (as an adjunct professor), and has presented tutorials at many technical conferences.
Measuring Machine Intelligence
Through Visual Question Answering
C. Lawrence Zitnick, Aishwarya Agrawal, Stanislaw Antol,
Margaret Mitchell, Dhruv Batra, Devi Parikh
As machines have become more
intelligent, there has been a renewed
interest in methods for measuring their
intelligence. A common approach is to propose tasks at which humans excel but machines find difficult.
However, an ideal task should also be
easy to evaluate and not be easily gameable. We begin with a case study exploring the recently popular task of image
captioning and its limitations as a task
for measuring machine intelligence. An
alternative and more promising task is
visual question answering, which tests
a machine’s ability to reason about language and vision. We describe a data
set, unprecedented in size and created
for the task, that contains more than
760,000 human-generated questions
about images. Using around 10 million
human-generated answers, researchers
can easily evaluate the machines.
Humans have an amazing ability to both understand
and reason about our world through a variety of senses or modalities. A sentence such as “Mary quickly
ran away from the growling bear” conjures both vivid visual
and auditory interpretations. We picture Mary running in the
opposite direction of a ferocious bear with the sound of the
bear being enough to frighten anyone. While interpreting a
sentence such as this is effortless to a human, designing intelligent machines with the same deep understanding is anything but. How would a machine know Mary is frightened?
What is likely to happen to Mary if she doesn’t run? Even
simple implications of the sentence, such as “Mary is likely
outside” may be nontrivial to deduce.
How can we determine whether a machine has achieved
the same deep understanding of our world as a human? In
our example sentence above, a human’s understanding is
rooted in multiple modalities. Humans can visualize a scene
depicting Mary running, they can imagine the sound of the
bear, and even how the bear’s fur might feel when touched.
Conversely, if shown a picture or even an auditory recording
of a woman running from a bear, a human may similarly
describe the scene. Perhaps machine intelligence could be
tested in a similar manner? Can a machine use natural language to describe a picture similar to a human? Similarly,
could a machine generate a scene given a written description? In fact these tasks have been a goal of artificial intelligence research since its inception. Marvin Minsky famously
stated in 1966 (Crevier 1993) to one of his students, “Connect a television camera to a computer and get the machine
to describe what it sees.” At the time, and even today, the full complexities of this task are still being discovered.

Figure 1. Example Image Captions Written for an Image, Sorted by Caption Length.
A man holding a beer bottle with two hands and looking at it.
A man in a white t-shirt looks at his beer bottle.
A man with black curly hair is looking at a beer.
A man holds a bottle of beer examining the label.
A guy holding a beer bottle.
A man holding a beer bottle.
A man holding a beer.
A man holds a bottle.
Man holding a beer.
Image Captioning
Are tasks such as image captioning (Barnard and
Forsyth 2001; Kulkarni et al. 2011; Mitchell et al.
2012; Farhadi et al. 2010; Hodosh, Young, and Hockenmaier 2013; Fang et al. 2015; Chen and Zitnick
2015; Donahue et al. 2015; Mao et al. 2015; Kiros,
Salakhutdinov, and Zemel 2015; Karpathy and Fei-Fei
2015; Vinyals et al. 2015) promising candidates for
testing artificial intelligence? These tasks have advantages, such as being easy to describe and being capable of capturing the imagination of the public
(Markoff 2014). Unfortunately, tasks such as image
captioning have proven problematic as actual tests of
intelligence. Most notably, the evaluation of image
captions may be as difficult as the image captioning
task itself (Elliott and Keller 2014; Vedantam, Zitnick,
and Parikh 2015; Hodosh, Young, and Hockenmaier
2013; Kulkarni et al. 2011; Mitchell et al. 2012). It has
been observed that captions judged to be good by
human observers may actually contain significant
variance even though they describe the same image
(Vedantam, Zitnick, and Parikh 2015). For instance,
see figure 1. Many people would judge the longer,
more detailed captions as better. However, the details
described by the captions vary significantly, for
example, two hands, white T-shirt, black curly hair,
label, and others. How can we evaluate a caption if
there is no consensus on what should be contained
in a good caption? However, for shorter, less detailed
captions that are commonly written by humans, a
rough consensus is achieved: “A man holding a beer
bottle.” This leads to the somewhat counterintuitive
conclusion that captions humans like aren’t necessarily humanlike.
The task of image captioning also suffers from
another less obvious drawback. In many cases it
might be too easy! Consider an example success from
a recent paper on image captioning (Fang et al.
2015), figure 4. Upon first inspection this caption
appears to have been generated from a deep understanding of the image. For instance, in figure 4 the
machine must have detected a giraffe, grass, and a
tree. It understood that the giraffe was standing, and
the thing it was standing on was grass. It knows the
tree and giraffe are next to each other, and so on. Is
this interpretation of the machine’s depth of understanding correct? When judging the results of an AI
system, it is important to analyze not only its output
but also the data used for its training. The results in
figure 4 were obtained by training on the Microsoft
common objects in context (MS COCO) data set (Lin
et al. 2014). This data set contains five independent
captions written by humans for more than 120,000
images (Chen et al. 2015). If we examine the image in
figure 4 and the images in the training data set we
can make an interesting observation. For many testing images, there exist a significant number of
semantically similar training images, figure 4 (right).
Figure 2. Example Images and Questions in the Visual Question-Answering Data Set.

If two images share enough semantic similarity, it is possible a single caption could describe them both.
This observation leads to a surprisingly simple
algorithm for generating captions (Devlin et al.
2015). Given a test image, collect a set of captions
from images that are visually similar. From this set,
select the caption with highest consensus (Vedantam, Zitnick, and Parikh 2015), that is, the caption
most similar to the other captions in the set. In many
cases the consensus caption is indeed a good caption.
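The following is a rough sketch of this borrowing scheme. The use of nearest-neighbor retrieval over generic image features and a pluggable caption-similarity function are assumptions for illustration, not the exact design of Devlin et al. (2015).

```python
# A rough sketch of the consensus-caption baseline described above.
# The feature representation and the caption_sim function (for example,
# a BLEU- or CIDEr-style similarity) are assumptions for illustration.
import numpy as np

def consensus_caption(test_feat, train_feats, train_captions,
                      caption_sim, k=50):
    """Borrow a caption from the k visually nearest training images,
    choosing the one most similar to the others in that neighborhood."""
    # 1. Retrieve the k nearest training images by feature distance.
    dists = np.linalg.norm(train_feats - test_feat, axis=1)
    neighbors = np.argsort(dists)[:k]
    pool = [c for i in neighbors for c in train_captions[i]]
    # 2. Pick the caption with the highest consensus score.
    return max(pool, key=lambda c: sum(caption_sim(c, o) for o in pool))
```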
When judged by humans, 21.6 percent of these borrowed captions are judged to be equal to or better
than those written by humans for the image specifically. Despite its simplicity, this approach is competitive with more advanced approaches that use recurrent neural networks (Chen and Zitnick 2015;
Donahue et al. 2015; Mao et al. 2015; Kiros, Salakhutdinov, and Zemel 2015; Karpathy and Fei-Fei 2015;
Vinyals et al. 2015) and other language models (Fang
et al. 2015) that can achieve 27.3 percent when compared to human captions. Even methods using recurrent neural networks commonly produce captions
that are identical to training captions even though
they’re not explicitly trained to do so. If captions are
generated by borrowing them from other images,
these algorithms are clearly not demonstrating a
deep understanding of language, semantics, and
their visual interpretation. In comparison, the odds of two humans repeating a sentence are quite low.
One could make the case that the fault is not with
the algorithms but in the data used for training. That
is, the data set contains too many semantically similar images. However, even in randomly sampled
images from the web, a photographer bias is found.
Humans capture similar images to each other. Many
of our tastes or preferences are conventional.
Visual Question Answering
As we demonstrated using the task of image captioning, determining a multimodal task for measuring a
machine’s intelligence is challenging. The task must
be easy to evaluate, yet hard to solve. That is, its evaluation shouldn’t be as hard as the task itself, and it
must not be solvable using shortcuts or cheats. To
solve these two problems we propose the task of visual question answering (VQA) (Antol et al. 2015;
Geman et al. 2015; Malinowski and Fritz 2014; Tu et
al. 2014; Bigham et al. 2010; Gao et al. 2015).
Figure 3. Distribution of Questions by Their First Four Words.
The ordering of the words starts toward the center and radiates outwards. The arc length is proportional to the number of
questions containing the word. White areas indicate words with contributions too small to show.
The task of VQA requires a machine to answer a
natural language question about an image as shown
in figure 2. Unlike the captioning task, evaluating
answers to questions is relatively easy. The simplest
approach is to pose the questions with multiple choice
answers, much like standardized tests administered to
students. Since computers don’t get tired of reading
through long lists of answers, we can even increase the
length of the answer list. Another more challenging
option is to leave the answers open ended. Since most
answers are single words such as yes, blue, or two, evaluating their correctness is straightforward.
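As a sketch of how such open-ended grading can work against the multiple human answers collected for each question, the rule below gives full credit when a predicted answer matches at least three human answers. This thresholded rule follows the spirit of the evaluation released with the data set, but the details here, including the normalization, should be read as assumptions.

```python
# A minimal sketch of open-ended answer grading against the ten human
# answers collected per question. The min(matches / 3, 1) rule follows
# the spirit of the VQA evaluation; exact normalization is an assumption.
def normalize(answer: str) -> str:
    return answer.strip().lower()

def open_ended_accuracy(predicted: str, human_answers: list) -> float:
    """Full credit if at least three humans gave the predicted answer."""
    matches = sum(normalize(predicted) == normalize(a) for a in human_answers)
    return min(matches / 3.0, 1.0)

print(open_ended_accuracy("Yes", ["yes"] * 8 + ["no"] * 2))  # 1.0
print(open_ended_accuracy("no", ["yes"] * 8 + ["no"] * 2))   # ~0.67
```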
Is the visual question-answering task challenging?
The task is inherently multimodal, since it requires
knowledge of language and vision. Its complexity is
further increased by the fact that many questions
require commonsense knowledge to answer. For
instance, if you ask, “Does the man have 20/20
vision?” you need the commonsense knowledge that
having 20/20 vision implies you don’t wear glasses.
Going one step further, one might be concerned that
commonsense knowledge is all that’s needed to
answer the questions. For example if the question
was “What color is the sheep?,” our common sense
would tell us the answer is white. We may test the sufficiency of commonsense knowledge by asking subjects to answer questions without seeing the accompanying image. In this case, human subjects did
indeed perform poorly (33 percent correct), indicating that common sense may be necessary but is not
sufficient. Similarly, we may ask subjects to answer
the question given only a caption describing the
image. In this case the humans performed better (57
percent correct), but still not as accurately as those
able to view the image (78 percent correct). This
helps indicate that the VQA task requires more detailed information about an image than is typically provided in an image caption.

Figure 4. Example Image Caption and a Set of Semantically Similar Images. Left: an image caption, “A giraffe standing in the grass next to a tree,” generated from Fang et al. (2015). Right: a set of semantically similar images in the MS COCO training data set for which the same caption could apply.
How do you gather diverse and interesting questions for 100,000s of images? Amazon’s Mechanical
Turk provides a powerful platform for crowdsourcing
tasks, but the design and prompts of the experiments
must be carefully chosen. For instance, we ran trial
experiments prompting the subjects to write questions that would be difficult for a toddler, alien, or
smart robot to answer. Upon examination, we determined that questions written for a smart robot were
most interesting given their increased diversity and
difficulty. In comparison, the questions stumping a
toddler were a bit too easy. We also gathered three
questions per image and ensured diversity by displaying the previously written questions and stating,
“Write a different question from those above that
would stump a smart robot.” In total over 760,000
questions were gathered.1
The diversity of questions supplied by the subjects
on Amazon’s Mechanical Turk is impressive. In figure
3, we show the distribution of words that begin the
questions. The majority of questions begin with
What and Is, but other questions include How, Are,
Does, and others. Clearly no one type of question
dominates. The answers to these questions have a
varying diversity depending on the type of question.
Since the answers may be ambiguous, for example,
“What is the person looking at?” we collected 10
answers per question. As shown in figure 5, many
question types are simply answered yes or no. Other
question types such as those that start with “What
is” have a greater variety of answers. An interesting
comparison is to examine the distribution of answers
when subjects were asked to answer the questions
with and without looking at the image. As shown in
Figure 5 (bottom), there is a strong bias to many
questions when subjects do not see the image. For instance, “What color” questions invoke red as an answer, or for questions that are answered by yes or no, yes is highly favored.

Figure 5. Distribution of Answers Per Question Type. Top: when subjects provide answers when given the image. Bottom: when not given the image.
Finally it is important to measure the difficulty of
the questions. Some questions such as “What color is
the ball?” or “How many people are in the room?”
may seem quite simple. In contrast, other questions
such as “Does this person expect company?” or
“What government document is needed to partake in
this activity?” may require quite advanced reasoning
to answer. Unfortunately, the difficulty of a question is in many cases ambiguous. A question’s difficulty depends as much on the person or machine answering the question as on the question itself. Each person or machine has different competencies.
In an attempt to gain insight into how challenging each question is to answer, we asked human subjects to guess how old a person would need to be to
answer the question. It is unlikely most human subjects have adequate knowledge of human learning
development to answer the question correctly. However, this does provide an effective proxy for question
difficulty.
Figure 6. Example Questions Judged to Be Answerable by Different Age Groups. The percentage of questions falling into each age group is shown in parentheses: 3–4 (15.3%), 5–8 (39.7%), 9–12 (28.4%), 13–17 (11.2%), 18+ (5.5%). Examples range from “Is that a bird in the sky?” and “How many zebras are there?” to “What type of architecture is this?” and “How many calories are in this pizza?”
That is, questions judged to be answerable
by a 3–4 year old are easier than those judged answerable by a teenager. Note, we make no claims that
questions judged answerable by a 3–4 year old will
actually be answered correctly by toddlers. This
would require additional experiments performed by
the appropriate age groups. Since the task is ambiguous, we collected 10 responses for each question. In
figure 6 we show several questions for which a
majority of subjects picked the specified age range.
Surprisingly, the perceived age needed to answer the questions is fairly well distributed across the different age ranges. As expected, the questions that were judged answerable by an adult (18+) generally require specialized knowledge, whereas those answerable by a toddler (3–4) are more generic.
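A minimal sketch of the majority-vote reduction described above, assuming each question has 10 age-bracket judgments; the bracket labels match figure 6, but the data structures are illustrative.

```python
from collections import Counter

AGE_BRACKETS = {"3-4", "5-8", "9-12", "13-17", "18+"}  # labels from figure 6

def perceived_difficulty(responses):
    """Reduce subject judgments to one age bracket, or None.

    Returns a bracket only when a strict majority of the (typically
    10) subjects agrees, mirroring the criterion used for figure 6.
    """
    assert all(r in AGE_BRACKETS for r in responses)
    bracket, votes = Counter(responses).most_common(1)[0]
    return bracket if votes > len(responses) / 2 else None

print(perceived_difficulty(["3-4"] * 6 + ["5-8"] * 4))  # -> 3-4
print(perceived_difficulty(["3-4"] * 5 + ["5-8"] * 5))  # -> None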
Abstract Scenes
The visual question-answering task requires a variety
of skills. The machine must be able to understand the
image, interpret the question, and reason about the
answer. Many researchers exploring AI may not be interested in the low-level tasks involved in perception and computer vision. Many of the questions may even be impossible to solve given the current capabilities of state-of-the-art computer vision algorithms. For instance, the question “How many cellphones are in the image?” may
not be answerable if the computer vision algorithms
cannot accurately detect cellphones. In fact, even for
state-of-the-art algorithms many objects are difficult
to detect, especially small objects (Lin et al. 2014).
To enable multiple avenues for researching VQA,
we introduce abstract scenes into the data set (Antol,
Zitnick, and Parikh 2014; Zitnick and Parikh 2013;
Zitnick, Parikh, and Vanderwende 2013; Zitnick,
Vedantam, and Parikh 2015). Abstract scenes, or cartoon images, are created from sets of clip art (figure 7). The scenes are created by human subjects using a graphical user interface that allows them to arrange a wide variety of objects. For clip art depicting humans, the poses and expressions may also be changed.
Using the interface, a wide variety of scenes can be
created including ordinary scenes, scary scenes, or
funny scenes.
Since the type of clip art and its properties are
exactly known, the problem of recognizing objects
and their attributes is greatly simplified. This provides researchers an opportunity to study more
directly the problems of question understanding and
answering. Once computer vision algorithms catch
up, perhaps some of the techniques developed for
abstract scenes can be applied to real images. The
abstract scenes may be useful for a variety of other
tasks as well, such as learning commonsense knowledge (Zitnick, Parikh, and Vanderwende 2013; Antol,
Zitnick, and Parikh 2014; Chen, Shrivastava, and
Gupta 2013; Divvala, Farhadi, and Guestrin 2014;
Vedantam et al. 2015).
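Because the clip art in each abstract scene is placed through the interface, the ground truth of a scene can be stored symbolically, which is why recognition becomes trivial. The sketch below illustrates the idea; the field names are invented for exposition and do not reflect the actual format of the abstract scenes data.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ClipArt:
    """One clip-art instance; fields are illustrative only."""
    category: str                  # for example, "boy", "tree", "cake"
    x: float                       # position within the scene
    y: float
    depth: int                     # layering and apparent scale
    pose: Optional[str] = None     # meaningful for human clip art
    expression: Optional[str] = None

scene = [
    ClipArt("boy", 120, 300, 1, pose="sitting", expression="scared"),
    ClipArt("tree", 40, 100, 2),
    ClipArt("cake", 200, 310, 1),
]

# Recognition is trivial by construction: "Is there a cake?" becomes a
# lookup, so research effort can go to understanding the question.
print(any(obj.category == "cake" for obj in scene))
```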
While visual question answering appears to be a
promising approach to measuring machine intelligence for multimodal tasks, it may prove to have
unforeseen shortcomings.
Figure 7. Example Abstract Scenes and Their Questions in the Visual Question-Answering Data Set. Sample questions: “How many glasses are on the table?” “What is the woman reaching for?” “Is this person expecting company?” “What is just under the tree?” “Do you think the boy on the ground has broken legs?” “Why is the boy on the right freaking out?” “Are the kids in the room the grandchildren of the adults?” “What is on the bookshelf?”
We’ve explored several
baseline algorithms that perform poorly when compared to human performance. As the data set is
explored, it is possible that solutions may be found
that don’t require true AI. However, with proper analysis we hope to update the data set continuously to reflect the current progress of the field. As certain question or image types become too easy to answer, we can add new questions and images. Other
modalities may also be explored such as audio and
text-based stories (Fader, Zettlemoyer, and Etzioni 2013a, 2013b; Weston et al. 2015; Richardson, Burges, and Renshaw 2013).
In conclusion, we believe designing a multimodal
challenge is essential for accelerating and measuring
the progress of AI. Visual question answering offers
one approach for designing such challenges that allows for easy evaluation while maintaining the difficulty of the task. As the field progresses, our tasks and challenges should be continuously reevaluated to ensure they are of appropriate difficulty given the state of research. Importantly, these tasks should be designed to push the frontiers of AI research and help ensure their solutions lead us toward systems that are truly AI-complete.
References
Antol, S.; Agrawal, A.; Lu, J.; Mitchell, M.; Batra, D.; Zitnick, C. L.; and Parikh, D. 2015. VQA: Visual Question Answering. Unpublished paper deposited in The Computing Research Repository (CoRR) 1505.00468. Association for Computing Machinery.
Antol, S.; Zitnick, C. L.; and Parikh, D. 2014. Zero-Shot Learning via Visual Abstraction. In Computer Vision–ECCV 2014: Proceedings of the 13th European Conference, Part IV. Lecture Notes in Computer Science Volume 8692. Berlin: Springer.
Barnard, K., and Forsyth, D. 2001. Learning the Semantics of Words and Pictures. In Proceedings of the IEEE International Conference on Computer Vision (ICCV-01), 408–415. Los Alamitos, CA: IEEE Computer Society.
Bigham, J.; Jayant, C.; Ji, H.; Little, G.; Miller, A.; Miller, R.; Miller, R.; Tatarowicz, A.; White, B.; White, S.; and Yeh, T. 2010. VizWiz: Nearly Real-Time Answers to Visual Questions. In Proceedings of the 23rd Annual ACM Symposium on User Interface Software and Technology. New York: Association for Computing Machinery.
Chen, X., and Zitnick, C. L. 2015. Mind’s Eye: A Recurrent Visual Representation for Image Caption Generation. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: Institute of Electrical and Electronics Engineers.
Chen, X.; Fang, H.; Lin, T. Y.; Vedantam, R.; Gupta, S.; Dollár, P.; and Zitnick, C. L. 2015. Microsoft COCO Captions: Data Collection and Evaluation Server. Unpublished paper deposited in The Computing Research Repository (CoRR) 1504.00325. Association for Computing Machinery.
Chen, X.; Shrivastava, A.; and Gupta, A. 2013. NEIL: Extracting Visual Knowledge from Web Data. In Proceedings of the IEEE International Conference on Computer Vision (ICCV 2013). Piscataway, NJ: Institute of Electrical and Electronics Engineers.
Crevier, D. 1993. AI: The Tumultuous History of the Search for Artificial Intelligence. New York: Basic Books.
Devlin, J.; Gupta, S.; Girshick, R.; Mitchell, M.; and Zitnick, C. L. 2015. Exploring Nearest Neighbor Approaches for Image Captioning. Unpublished paper deposited in The Computing Research Repository (CoRR) 1505.04467. Association for Computing Machinery.
Divvala, S.; Farhadi, A.; and Guestrin, C. 2014. Learning Everything About Anything: Webly-Supervised Visual Concept Learning. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: Institute of Electrical and Electronics Engineers.
Donahue, J.; Hendricks, L. A.; Guadarrama, S.; Rohrbach, M.; Venugopalan, S.; Saenko, K.; and Darrell, T. 2015. Long-Term Recurrent Convolutional Networks for Visual Recognition and Description. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: Institute of Electrical and Electronics Engineers.
Elliott, D., and Keller, F. 2014. Comparing Automatic Evaluation Measures for Image Description. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: Association for Computational Linguistics.
Fader, A.; Zettlemoyer, L.; and Etzioni, O. 2013a. Open Question Answering over Curated and Extracted Knowledge Bases. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: Association for Computing Machinery.
Fader, A.; Zettlemoyer, L.; and Etzioni, O. 2013b. Paraphrase-Driven Learning for Open Question Answering. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: Association for Computational Linguistics.
Fang, H.; Gupta, S.; Iandola, F. N.; Srivastava, R.; Deng, L.; Dollár, P.; Gao, J.; He, X.; Mitchell, M.; Platt, J. C.; Zitnick, C. L.; and Zweig, G. 2015. From Captions to Visual Concepts and Back. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: Institute of Electrical and Electronics Engineers.
Farhadi, A.; Hejrati, M.; Sadeghi, M. A.; Young, P.; Rashtchian, C.; Hockenmaier, J.; and Forsyth, D. 2010. Every Picture Tells a Story: Generating Sentences from Images. In Computer Vision–ECCV 2010: Proceedings of the 11th European Conference on Computer Vision, Part IV. Lecture Notes in Computer Science Volume 6314. Berlin: Springer.
Gao, H.; Mao, J.; Zhou, J.; Huang, Z.; Wang, L.; and Xu, W. 2015. Are You Talking to a Machine? Dataset and Methods for Multilingual Image Question Answering. Unpublished paper deposited in The Computing Research Repository (CoRR) 1505.05612. Association for Computing Machinery.
Geman, D.; Geman, S.; Hallonquist, N.; and Younes, L. 2015. A Visual Turing Test for Computer Vision Systems. Proceedings of the National Academy of Sciences 112(12).
Hodosh, M.; Young, P.; and Hockenmaier, J. 2013. Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics. Journal of Artificial Intelligence Research 47: 853–899.
Karpathy, A., and Fei-Fei, L. 2015. Deep Visual-Semantic Alignments for Generating Image Descriptions. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: Institute of Electrical and Electronics Engineers.
Kiros, R.; Salakhutdinov, R.; and Zemel, R. 2015. Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models. Unpublished paper deposited in The Computing Research Repository (CoRR) 1411.2539. Association for Computing Machinery.
Kulkarni, G.; Premraj, V.; Dhar, S.; Li, S.; Choi, Y.; Berg, A. C.; and Berg, T. L. 2011. Baby Talk: Understanding and Generating Simple Image Descriptions. In Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: Institute of Electrical and Electronics Engineers.
Lin, T. Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; and Zitnick, C. L. 2014. Microsoft COCO: Common Objects in Context. In Computer Vision–ECCV 2014: Proceedings of the 13th European Conference, Part V. Lecture Notes in Computer Science Volume 8693. Berlin: Springer.
Malinowski, M., and Fritz, M. 2014. A Multi-World Approach to Question Answering about Real-World Scenes Based on Uncertain Input. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, 1682–1690. La Jolla, CA: Neural Information Processing Systems Foundation.
Mao, J.; Xu, W.; Yang, Y.; Wang, J.; Huang, Z.; and Yuille, A. L. 2015. Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN). Unpublished paper deposited in arXiv, arXiv:1412.6632. Ithaca, NY: Cornell University.
Markoff, J. 2014. Researchers Announce Advance in Image-Recognition Software. New York Times, Science Section (November 17).
Mitchell, M.; Han, X.; Dodge, J.; Mensch, A.; Goyal, A.; Berg, A.; Yamaguchi, K.; Berg, T.; Stratos, K.; and Daumé, H. 2012. Midge: Generating Image Descriptions from Computer Vision Detections. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics. Stroudsburg, PA: Association for Computational Linguistics.
Richardson, M.; Burges, C.; and Renshaw, E. 2013. MCTest: A Challenge Dataset for the Machine Comprehension of Text. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA: Association for Computational Linguistics.
Tu, K.; Meng, M.; Lee, M. W.; Choe, T. E.; and Zhu, S. C. 2014. Joint Video and Text Parsing for Understanding Events and Answering Queries. IEEE MultiMedia 21(2): 42–70.
Vedantam, R.; Lin, X.; Batra, T.; Zitnick, C. L.; and Parikh, D. 2015. Learning Common Sense through Visual Abstraction. In Proceedings of the IEEE International Conference on Computer Vision (ICCV 2015). Piscataway, NJ: Institute of Electrical and Electronics Engineers.
Vedantam, R.; Zitnick, C. L.; and Parikh, D. 2015. CIDEr: Consensus-Based Image Description Evaluation. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: Institute of Electrical and Electronics Engineers.
Vinyals, O.; Toshev, A.; Bengio, S.; and Erhan, D. 2015. Show and Tell: A Neural Image Caption Generator. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: Institute of Electrical and Electronics Engineers.
Weston, J.; Bordes, A.; Chopra, S.; and Mikolov, T. 2015. Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks. Unpublished paper deposited in arXiv, arXiv:1502.05698. Ithaca, NY: Cornell University.
Zitnick, C. L., and Parikh, D. 2013. Bringing Semantics into Focus Using Visual Abstraction. In Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: Institute of Electrical and Electronics Engineers.
Zitnick, C. L.; Parikh, D.; and Vanderwende, L. 2013. Learning the Visual Interpretation of Sentences. In Proceedings of the IEEE International Conference on Computer Vision (ICCV 2013). Piscataway, NJ: Institute of Electrical and Electronics Engineers.
Zitnick, C. L.; Vedantam, R.; and Parikh, D. 2015. Adopting Abstract Images for Semantic Scene Understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence, Issue 99.
C. Lawrence Zitnick is interested in a broad range of topics related to visual recognition, language, and commonsense reasoning. He developed the PhotoDNA technology
used by Microsoft, Facebook, Google, and various law
enforcement agencies to combat illegal imagery on the web.
He received the Ph.D. degree in robotics from Carnegie Mellon University in 2003. In 1996, he coinvented one of the
first commercial portable depth cameras. Zitnick was a principal researcher in the Interactive Visual Media group at
Microsoft Research, and an affiliate associate professor at the
University of Washington at the time of the writing of this
article. He is now a research manager at Facebook AI Research.
Aishwarya Agrawal is a graduate student in the Bradley
Department of Electrical and Computer Engineering at Virginia Polytechnic Institute and State University. Her
research interests lie at the intersection of machine learning, computer vision, and natural language processing.
Stanislaw Antol is a Ph.D. student in the Computer Vision
Lab at Virginia Polytechnic Institute and State University.
His research area is computer vision — in particular, finding
new ways for humans to communicate with vision algorithms.
Margaret Mitchell is a researcher in Microsoft’s NLP Group.
She works on grounded language generation, focusing on
how to help computers communicate based on what they
can process. She received her MA in computational linguistics from the University of Washington, and her Ph.D. from
the University of Aberdeen.
Dhruv Batra is an assistant professor at the Bradley Department of Electrical and Computer Engineering at Virginia
Polytechnic Institute and State University, where he leads
the VT Machine Learning and Perception group. He is a
member of the Virginia Center for Autonomous Systems
(VaCAS) and the VT Discovery Analytic Center (DAC). He
received his M.S. and Ph.D. degrees from Carnegie Mellon
University in 2007 and 2010, respectively. His research
interests lie at the intersection of machine learning, computer vision, and AI.
Devi Parikh is an assistant professor in the Bradley Department of Electrical and Computer Engineering at Virginia
Polytechnic Institute and State University and an Allen Distinguished Investigator of Artificial Intelligence. She leads
the Computer Vision Lab at VT, and is also a member of the
Virginia Center for Autonomous Systems (VaCAS) and the
VT Discovery Analytics Center (DAC). She received her M.S.
and Ph.D. degrees from the Electrical and Computer Engineering Department at Carnegie Mellon University in 2007
and 2009, respectively. She received her B.S. in electrical and
computer engineering from Rowan University in 2005. Her
research interests include computer vision, pattern recognition, and AI in general, and visual recognition problems in particular.
Turing++ Questions: A Test for
the Science of (Human) Intelligence
Tomaso Poggio, Ethan Meyers
There is a widespread interest among scientists in understanding a specific and well-defined form of intelligence, that is, human intelligence. For this reason we propose a stronger version of the original Turing test. In particular, we describe here an open-ended set of Turing++ questions that we are developing at the Center for Brains, Minds, and Machines at MIT — that is, questions about an image. For the Center for Brains, Minds, and Machines the main research goal is the science of intelligence rather than the engineering of intelligence — the hardware and software of the brain rather than just absolute performance in face identification. Our Turing++ questions fully reflect these research priorities.
It is becoming increasingly clear that there is an infinite number of definitions of intelligence. Machines that are intelligent in different narrow ways have been built since the 1950s. We are now entering a golden age for the engineering of intelligence and the development of many different kinds of intelligent machines. At the same time there is a widespread interest among scientists in understanding a specific and well-defined form of intelligence, that is, human intelligence. For this reason we propose a stronger version of
the original Turing test. In particular, we describe here an
open-ended set of Turing++ questions that we are developing
at the Center for Brains, Minds, and Machines (CBMM) at
MIT — that is, questions about an image. Questions may
range from what is there to who is there, what is this person
doing, what is this girl thinking about this boy, and so on.
The plural in questions is to emphasize that there are many
different intelligent abilities in humans that have to be characterized, and possibly replicated in a machine, from basic
visual recognition of objects, to the identification of faces, to gauging emotions, to social intelligence, to language, and
much more. Recent advances in cognitive neuroscience have
shown that even in the more limited domain of visual intelligence, answering these questions requires different competences and abilities, often rather independent of each other and often corresponding to separate modules in the brain. The
term Turing++ is to emphasize that our goal is understanding
human intelligence at all of Marr’s levels — from the level of
the computations to the level of the underlying circuits.
Answers to the Turing++ questions should thus be given in
terms of models that match human behavior and human
physiology — the mind and the brain. These requirements
are well beyond the original Turing test. A whole scientific field that we call the science of (human) intelligence is
required to make progress in answering our Turing++
questions. It is connected to neuroscience and to the
engineering of intelligence but also separate from
both of them.
Definitions of Intelligence
We may call a person intelligent and even agree
among us. But what about a colony of ants and their
complex behavior? Is this intelligence? Were the mechanical computers built by Turing to decode the encrypted messages of the German U-boats actually intelligent? Is Siri intelligent? The truth is that the question “What is intelligence?” is ill-posed, as there are many different answers: an infinite number of different kinds of intelligence. This is fine for engineers, who may be happy to build many different types of intelligent machines. The scientists among us may instead prefer to focus on a question that is well defined and can be posed in a scientific way: the question of human intelligence. In the rest of the paper we use the term intelligence to mean human intelligence.
Understanding Human Intelligence
Consider the problem of visual intelligence. Understanding such a complex system requires understanding it at different levels (in the Marr sense; see
Poggio [1981, 2012]), from the computations to the
underlying circuits. Thus we need to develop algorithms that provide answers of the type humans do.
But we really need to achieve more than just simulating the brain’s output, more than what Turing asked. We
need to understand what understanding an image by
a human brain means. We need to understand the
algorithms used by the brain, but we also need to
understand the circuits that run these algorithms.
This may also be useful if we want to be sure that our
model is not just faking the output of a human brain
by using a giant lookup table of what people usually
do in similar situations, as hinted at the end of the
movie Ex Machina. Understanding a computer means
understanding the level of the software and the level
of the hardware. Scientific understanding of human
intelligence requires something similar — understanding of the mind as well as of the brain.
Using Behavior and
Physiology as a Guide
To constrain our search for intelligent algorithms, we
are focusing on creating computational models that
match human behavior and neural physiology. There
are several reasons why we are taking this approach.
The first reason, as hinted above, is to avoid superficial
solutions that mimic intelligent behavior under very
limited circumstances, but that do not capture the true
essence of the problem. Such superficial solutions
have been a prominent approach to the traditional
Turing test going back to the ELIZA program written in
the 1960s (Weizenbaum 1966). While these approaches might occasionally fool humans, they do not
address many of the fundamental issues and thus this
approach will fail to match many aspects of human
behavior. A second related reason is that algorithms
might appear to perform well when tested under limited circumstances, but when compared to the full
range of human abilities they might not do nearly as
well. For example, deep neural networks work very
well on object-recognition tasks, but also fail in simple
ways that would never be seen in human behavior
(Szegedy et al. 2013). By directly comparing computer
systems’ results to human behavioral results we
should be able to assess whether a system that is displaying intelligent behavior is truly robust (Sinha et al.
2006). A final reason is that studying primate physiology can give us guidance about how to approach the
problem. For example, recognizing people based on their faces appears to occur in discrete face patches in
the primate brain (see Freiwald and Tsao [2010], and
the section below). By understanding the computational roles of these patches we aim to understand the
algorithms that are used by primates to solve these
tasks (Meyers et al. 2015).
Intelligence Is One Word
but Many Problems
Recent advances in cognitive neuroscience have
shown that different competencies and abilities are
needed to solve visual tasks, and that they seem to
correspond to separate modules in the brain. For
instance, the apparently similar questions of object
and face recognition (what is there versus who is
there) involve rather distinct parts of the visual cortex (for example, the lateral occipital cortex versus a
section of the fusiform gyrus). The word intelligence
can be misleading in this context, like the word life
was during the first half of the last century when popular scientific journals routinely wrote about the
problem of life, as if there were a single substratum of
life waiting to be discovered to unveil the mystery
completely. Of course, speaking today about the
problem of life sounds amusing: biology is a science
dealing with many different great problems, not just
one. Thus we think that intelligence is one word but
many problems, not one but many Nobel prizes. This
is related to Marvin Minsky’s view of the problem of
thinking, well captured by the slogan Society of Mind. In the same way, a real Turing test is a broad
set of questions probing the main aspects of human
thinking. Because “intelligence” encompasses a large
set of topics, we have chosen visual intelligence in
human and nonhuman primates as a primary focus.
Our approach at the Center for Brains, Minds, and
Machines to visual intelligence includes connections
Figure 1. Street Fair.
Courtesy of Boris Katz, CBMM, from the LabelMe database.
to some developmental, spatial, linguistic, and social
questions. To further sharpen our focus, we are
emphasizing measuring our progress using questions,
described in more detail below, that might be viewed
as extensions of the Turing test. We have dubbed
these Turing++ questions. Computational models we
develop will be capable of responding to queries
about visual scenes and movies — who, what, why,
where, how, with what motives, with what purpose,
and with what expectations. Unlike a conventional
engineering enterprise that tests only absolute (computational) performance, we will require that our
models exhibit consistency with human performance/behavior, with human and primate physiology,
and with human development. The term Turing++
refers to these additional levels of understanding that
our models and explanations must satisfy.
The Turing++ Questions
Our choice of questions follows in part from our
understanding of human intelligence grounded in
the neuroscience of the brain. Each question roughly corresponds to a distinct neural module in the
brain. We have begun defining an initial set of such
problems/questions about visual intelligence, since
vision is our entry point into the problem of intelli-
gence. We call such questions Turing++ questions
because they are inspired by the classical Turing test
but go well beyond it. Traditional Turing tests permit
counterfeiting and require matching only a narrowly defined level of human performance. Successfully
answering Turing++ questions will require us not
only to build systems that emulate human performance, but also to ensure that such systems are consistent with our data on human behavior, brains, neural systems, and development. An open-ended set of
Turing++ questions can be effectively used to measure progress in studying the brain-based intelligence
needed to understand images and video.
As an example consider the image shown in figure
1. A deep-learning network might locate faces and
people. One could not interrogate such a network,
however, with a list of Turing++ questions such as
What is there? Who is there? What are they doing?
How, in detail, are they performing actions? Are they
friends or enemies or strangers? Why are they there?
What will they do next? Have you seen anything like
this before?
We effortlessly recognize objects, agents, and
events in this scene. We, but not a computer program, could recognize that this is a street market; several people are shopping; three people are conversing
around a stroller; a woman is shopping for a shirt;
Figure 2. Macaque Visual Cortex Patches Involved in Face Perception.
Courtesy of Winrich Freiwald, CBMM. Modified from Tsao, D. Y.; Moeller, S.; and Freiwald, W. A. 2008. Comparing Face Patch Systems in Macaques and Humans. Proceedings of the National Academy of Sciences 105(49): 19514–19519.
although the market takes place on a street, clearly
no cars are permitted to drive down it; we can distinguish between the pants that are for sale and the
pants that people are wearing. We, but not a computer program, could generate a narrative about the
scene. It’s a fairly warm, sunny day at a weekend market. The people surrounding the stroller are a mother and her parents. They are deciding where they
would like to eat lunch.
We would assess the performance of a model built
to answer questions like these by evaluating (1) how
similarly to humans our neural models of the brain
answer the questions, and (2) how well their implied
physiology correlates with human and primate data
obtained by using the same stimuli.
Our Turing++ questions require more than a good
imitation of human behavior; our computer models
should also be humanlike at the level of the implied
physiology and development. Thus the CBMM test
of models uses Turing-like questions to check for
humanlike performance/behavior, humanlike physiology, and humanlike development.
Because we aim to understand the brain and the
mind and to replicate human intelligence, the challenge intrinsic to the testing is not to achieve best
absolute performance, but performance that correlates strongly with human intelligence measured in
terms of behavior and physiology. We will compare
models and theories with fMRI and MEG recordings, and will use data from the latter to inform our
models. Physiological recordings in human patients
and monkeys will allow us to probe neural circuitry
during some of the tests at the level of individual
neurons. We will carry out some of the tests in
babies to study the development of intelligence.
The series of tests is open ended. The initial ones,
such as face identification, are tasks that computers
are beginning to do and where we can begin to develop models and theories of how the brain performs
the task. The later ones, such as generating stories
explaining what may have been going on in the
videos and answering questions about previous
answers, are goals for the next few years of the center
and beyond.
The modeling and algorithm development will be
guided by scientific concerns, incorporating constraints and findings from our work in cognitive
development, human cognitive neuroscience, and
systems neuroscience. These efforts likely would not
produce the most effective AI programs today (measuring success against objectively correct performance); the core assumption behind this challenge is
that by developing such programs and letting them
learn and interact, we will get systems that are ultimately intelligent at the human level.
An Example of a Turing++ Question:
Who Is There (Face Identification)
The Turing++ question that is most ripe, in the sense that it may be possible to answer it at all the required levels, is
face identification. We have data about human performance in face identification — from a field that is
called psychophysics of face recognition. We know
which patches of visual cortex in humans are
involved in face perception by using fMRI techniques.
We can identify the homologue areas in the visual
cortex of the macaque, where there is a similar network of interconnected patches, shown in figure 2. In the monkey it is possible to record from individual neurons in the various patches and characterize their properties. Neurons in patch ML are view- and identity-tuned, while neurons in AM are identity-specific but more view-invariant.
Neurons in the intermediate patch AL
tend to be mirror-symmetric: if they
are tuned to a view they are also likely
to be tuned to the symmetric one.
We begin to have models that perform face identification well and are
consistent with the architecture and
the properties of face patches (that is,
we can make a correspondence
between stages in the algorithm and
properties of different face patches).
The challenge is to have performance
that correlates highly with human performance on the same data sets of face
images and that predicts the behavior of neurons in the face patches for the same stimuli.
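As one hedged illustration of the behavioral half of this challenge, the sketch below computes the Pearson correlation between a model’s and humans’ per-image face-identification accuracy on the same images. The formula is standard; the accuracy values are invented.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented per-image accuracies on the same face images.
human = [0.95, 0.80, 0.40, 0.99, 0.60]
model = [0.90, 0.85, 0.35, 0.97, 0.70]

# A Turing++-style evaluation asks for a high correlation, not just
# high mean accuracy: the model should be hard where humans are hard.
print(round(pearson(human, model), 3))
```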
In September of 2015, CBMM organized the first Turing++ questions workshop, focused on face identification.
The title of the workshop was A Turing++ Question: Who Is There? The workshop introduced databases and reviewed the state of existing models for answering the question who is there at the levels of performance and neural circuitry.
The Science of Intelligence
For the Center for Brains, Minds, and
Machines the main research goal is the
science of intelligence rather than the
engineering of intelligence — the
hardware and software of the brain
rather than just absolute performance
in face identification. Our Turing++
questions fully reflect these research priorities.
The emphasis on answers at the different levels of behavior and neural
circuits reflects the levels-of-understanding paradigm (Marr 2010). The
argument is that a complex system —
like a computer and like the
brain/mind — must be understood at
several different levels, such as hardware and algorithms/computations.
Though Marr emphasizes that explanations at different levels are largely
independent of each other, it has been
argued (Poggio 2012) that it is now
important to reemphasize the connections between levels, which were
described in the original paper about
levels of understanding (Marr and Poggio 1977). In that paper we argued that
one ought to study the brain at different levels of organization, from the
behavior of a whole organism to the
signal flow, that is, the algorithms, to
circuits and single cells. In particular,
we expressed our belief that (1)
insights gained on higher levels help to
ask the right questions and to do
experiments in the right way on lower
levels and (2) it is necessary to study
nervous systems at all levels simultaneously. Otherwise there are not
enough constraints for a unique solution to the problem of human intelligence.
References
Freiwald, W. A., and Tsao, D. Y. 2010. Functional Compartmentalization and Viewpoint Generalization Within the Macaque Face-Processing System. Science 330: 845–851.
Marr, D. 2010. Vision. Cambridge, MA: The MIT Press.
Marr, D., and Poggio, T. 1977. From Understanding Computation to Understanding Neural Circuitry. In Neuronal Mechanisms in Visual Perception, ed. E. Poppel, R. Held, and J. E. Dowling. Neurosciences Research Program Bulletin 15: 470–488.
Meyers, E.; Borzello, M.; Freiwald, W.; and Tsao, D. 2015. Intelligent Information Loss: The Coding of Facial Identity, Head Pose, and Non-Face Information in the Macaque Face Patch System. Journal of Neuroscience 35(18).
Poggio, T. 1981. Marr’s Computational Approach to Vision. Trends in Neurosciences 10(6): 258–262.
Poggio, T. 2012. The Levels of Understanding Framework, Revised. Perception 41(9).
Reichardt, W., and Poggio, T. 1976. Visual Control of Orientation Behavior in the Fly: A Quantitative Analysis. Quarterly Review of Biophysics 9(3): 311–375.
Sinha, P.; Balas, B.; Ostrovsky, Y.; and Russell, R. 2006. Face Recognition by Humans: 19 Results All Computer Vision Researchers Should Know About. Proceedings of the IEEE 94(11): 1948–1962.
Szegedy, C.; Zaremba, W.; Sutskever, I.; Bruna, J.; Erhan, D.; Goodfellow, I. J.; and Fergus, R. 2013. Intriguing Properties of Neural Networks. Unpublished paper deposited in The Computing Research Repository (CoRR) abs/1312.6199. Association for Computing Machinery.
Weizenbaum, J. 1966. ELIZA — A Computer Program for the Study of Natural Language Communication Between Man and Machine. Communications of the ACM 9(1): 36–45.
Tomaso A. Poggio is the Eugene McDermott Professor in the Department of Brain
and Cognitive Sciences at the Massachusetts Institute of Technology, and the director of the new National Science Foundation
Center for Brains, Minds, and Machines at
the Massachusetts Institute of Technology,
of which MIT and Harvard are the main
member institutions. He is a member of
both the Computer Science and Artificial
Intelligence Laboratory and of the McGovern Brain Institute. He is an honorary member of the Neuroscience Research Program,
a member of the American Academy of Arts
and Sciences, a Founding Fellow of AAAI,
and a founding member of the McGovern
Institute for Brain Research. Among other
honors he received the Laurea Honoris
Causa from the University of Pavia for the
Volta Bicentennial, the 2003 Gabor Award,
the Okawa Prize 2009, the AAAS Fellowship,
and the 2014 Swartz Prize for Theoretical
and Computational Neuroscience. His
research has always been interdisciplinary,
between brains and computers. It is now
focused on the mathematics of learning
theory, the applications of learning techniques to computer vision, and especially
on computational neuroscience of the visual cortex.
Ethan Meyers is an assistant professor of
statistics at Hampshire College. He received
his BA from Oberlin College in computer
science, and his Ph.D. in computational
neuroscience from MIT. His research examines how information is coded in neural
activity, with a particular emphasis on
understanding the processing that occurs in
high-level visual and cognitive brain
regions. To address these questions, he
develops computational tools that can analyze high-dimensional neural recordings.
I-athlon: Toward a
Multidimensional Turing Test
Sam S. Adams, Guruduth Banavar, Murray Campbell
While the Turing test is a well-known method for evaluating machine
intelligence, it has a number of drawbacks that make it problematic as a rigorous and practical test for assessing
progress in general-purpose AI. For
example, the Turing test is deception
based, subjectively evaluated, and narrowly focused on language use. We suggest that a test would benefit from
including the following requirements:
focus on rational behavior, test several
dimensions of intelligence, automate as
much as possible, score as objectively as
possible, and allow incremental
progress to be measured. In this article
we propose a methodology for designing
a test that consists of a series of events,
analogous to the Olympic Decathlon,
which complies with these requirements. The approach, which we call the
I-athlon, is intended ultimately to
enable the community to evaluate
progress toward machine intelligence in
a practical and repeatable way.
The Turing test, as originally described (Turing 1950),
has a number of drawbacks as a rigorous and practical
means of assessing progress toward human-level intelligence. One major issue with the Turing test is the requirement for deception. The need to fool a human judge into
believing that a computer is human seems to be peripheral,
and even distracting, to the goal of creating human-level
intelligence. While this issue can be sidestepped by modifying the test to reward rational intelligent behavior (rational
Turing test) rather than humanlike intelligent behavior, there
are additional drawbacks to the original Turing test, including its language focus, complex evaluation, subjective evaluation, and the difficulty in measuring incremental progress.
Figure 1. The Olympic Decathlon.
Language focused: While language use is perhaps the
most important dimension of intelligence, there are
many other dimensions that are relevant to intelligence, for example, visual understanding, creativity,
reasoning, planning, and others.
Complex evaluation: The Turing test, if judged rigorously, is expected to require extensive human input to
prepare, conduct, and evaluate.1
Subjective evaluation: Tests that can be objectively evaluated are more useful in a practical sense, requiring
less testing to achieve a reliable result.
Difficult to measure incremental progress: In an unrestricted conversation, it is difficult to know the relative importance of various kinds of successes and failures. This adds an additional layer of subjectivity in
trying to judge the degree of intelligence.
In this article we propose an approach to measuring
progress toward intelligent systems through a set of
tests chosen to avoid some of the drawbacks of the
Turing test. In particular, the tests (1) reward rational behavior (as opposed to humanlike behavior); (2)
exercise several dimensions of intelligence in various
combinations; (3) limit the requirement for human
input in test creation and scoring; (4) use objective
scoring to the extent possible; (5) permit measuring
of incremental progress; (6) make it difficult to engineer a narrow task-specific system; and (7) eliminate,
as much as possible, the possibility of gaming the system, as in the deception scenarios for the classic Turing test.
The proposed approach, called here the I-athlon,
by analogy with the Olympic Decathlon2 (figure 1), is
intended to provide a framework for constructing a
set of tests that require a system to demonstrate a
wide variety of intelligent behaviors. In the
Olympics, 10 events test athletes across a wide variety of athletic abilities as well as learned skills. In
addition, the Decathlon tests their stamina and focus
as they move among the 10 events over the two days
of the competition. In all events, decathletes compete against specialist athletes, so it is not uncommon for them to fail to win any particular event. It is
their aggregate score that declares them the World’s
Greatest Athlete. One of the values of this approach
for the field of artificial intelligence is that it would
be inclusive of specialist systems that might achieve
high levels of proficiency, and be justly recognized
for the achievement, while still encouraging generalist systems to compete on the same level playing field.
Principles for Constructing
a Set of Tests
Given our desire for broad-based, automated, objectively scored tests that can measure incremental
progress and compare disparate systems on a common ground, we propose several principles for the
construction of I-athlon events:
Events Should Focus on Testing Proficiency in a Small
Number of Dimensions.
Testing a single dimension at a time could fall prey to
a switch system, where a number of narrow systems
are loosely coupled through a switch that selects the
appropriate system for the current event. While
events should be mostly self-contained, it may make
sense to use the results of one event as the input for another.
Events Should All Be Measured Against a Common,
Simple Model of Proficiency.
A common scoring model supports more direct comparisons and categorizations of systems. We propose
a simple five-level rating system for use across all
events. Levels one through four will represent levels
of human proficiency based on baseline data gathered from crowdsourced human competitions. Level
five will represent superhuman proficiency, an X-factor over human level four, so there is a clear, unambiguous measure of achievement above human level.
Levels one through four could be mapped to human
age ranges or levels of proficiency, though some tests
will not map to human development and proficiency but to domain expertise. It will be the responsibility of the developers of each event to map their scoring algorithms to these levels, and the overall
I-athlon score for any competing system will be a
standard formula applied to attainment of these levels.
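The article leaves the level cutoffs and the aggregation formula to event designers; the sketch below is one illustrative reading, with invented thresholds, of how raw event scores might map to the five levels and roll up into an overall score.

```python
def proficiency_level(raw_score, human_cutoffs, superhuman_cutoff):
    """Map an event's raw score to the five-level scale.

    human_cutoffs: three cutoffs separating levels 1-4, derived from
    crowdsourced human baselines. Scores beyond the superhuman cutoff
    (an X-factor over level four) earn level 5. All numbers here are
    assumptions for illustration, not values from the article.
    """
    if raw_score >= superhuman_cutoff:
        return 5
    level = 1
    for cutoff in human_cutoffs:
        if raw_score >= cutoff:
            level += 1
    return level

def iathlon_score(levels):
    # A placeholder aggregate: the article calls for a standard
    # formula over attained levels but does not fix one.
    return sum(levels)

events = {"planning": 72.0, "video": 41.0}
levels = [proficiency_level(s, [30, 50, 70], 95) for s in events.values()]
print(levels, iathlon_score(levels))
```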
Multiple Events.
The overall goal of this effort is to create broadly intelligent systems rather than narrow savants. As in the
Olympic Decathlon, the total score across events
should be more important than the score in any one
event. Relative value of proficiency level achievement
needs to recognize that all events are not equal in
intelligence value. This might be difficult to agree on,
and even the Olympic Decathlon scoring system has
evolved over time to reflect advances in performance.3
Event Tests Should Be Automatically Generated
Without Significant Human Intervention.
One of the major drawbacks to the current Turing
test is its requirement for extensive human involvement in performing and evaluating the test. This
requirement for direct human involvement effectively rules out highly desirable approaches to developing solutions that operate much faster than humans
can interact with effectively. Another challenge in
designing a good replacement for the Turing test is
eliminating, as much as possible, the potential for
someone to game the system. At the very least this
means that specific test instances must not be reused
except for repeatability and validation. Automatic
generation of repeatable high-quality tests is a significant research area on its own, and this approach
allows for more efficient division of labor across the
AI research community. Some researchers may focus
on defining or improving events, possibly in collaboration with other disciplines like psychology or philosophy. Some may focus on developing test generators and scoring systems. Others may develop
systems to compete in existing I-athlon events themselves. Generators should be able to reproduce specific tests using the same pseudorandom seed value
so tests can be replayed for head-to-head competition
and to allow massively parallel search and simulation
of the solution space.
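A minimal sketch of the seeding requirement: a generator driven by its own pseudorandom stream reproduces a test exactly from its seed, which supports archiving and head-to-head replays. The blocks-world parameters are invented for illustration.

```python
import random

def generate_blocks_test(seed, num_blocks=8):
    """Deterministically generate one test instance from a seed.

    Using a dedicated Random instance (rather than the global one)
    guarantees that the same seed always reproduces the same test,
    as required for documentation and repeatable competitions.
    """
    rng = random.Random(seed)
    return [
        {"width": rng.randint(1, 4),
         "height": rng.randint(1, 4),
         "x": rng.uniform(0.0, 10.0)}
        for _ in range(num_blocks)
    ]

assert generate_blocks_test(42) == generate_blocks_test(42)  # replayable
assert generate_blocks_test(42) != generate_blocks_test(43)  # fresh test
```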
Human intelligence has many facets and comes in
many varieties and combinations. Philosophers, psychologists, cognitive and computer scientists have
debated the definition of intelligence for centuries,
and there are many different factorings of what we
here call the “dimensions of intelligence.” Our goal
in this article is not to declare a definitive set of
dimensions or even claim complete coverage of the
various aspects of human intelligence. We take up
this terminology to enable us to identify aspects of
intelligence that might be tested separately and in
combinations for the purpose of evaluating the capabilities of AI systems compared to humans. The
dimensions listed below are not all at the same level
of abstraction; indeed, proficiency at some dimensions will require proficiency at several others. We
fully expect there to be debate over which aspects of
intelligence should be tested for separately or in concert with others. Our goal here is to define an
approach that moves the AI research community in
the positive direction of coordinated effort toward
achieving human-level AI in computer systems. As
stated earlier, we believe reaching this goal will
require such a coordinated effort, and a key aspect of
coordination is the ability to assess incremental
progress toward the goal in a commonly accepted
manner. What follows is a brief description of what
we consider good candidates for I-athlon events (figure 2).
Image Understanding — Identify both the content
and context of a given image, the objects, their attributes and relationships to each other in the image,
implications of scene background and object arrangement.
Diagram Understanding — Given a diagram,
describe each of the elements and their relationships,
identify the intended purpose/message of the diagram (infographic, instructional, directional, design,
and others).
Speech Generation — Given a graph of concepts
describing a situation, deliver an appropriate verbal/auditory presentation of the situation.
Natural Language Generation — Given nonverbal
information, provide a natural language description
sufficient to identify the source information among a set of alternatives.
Event Tests Should Be Automatically Scored Without
Significant Human Intervention.
Deception of human judges became the primary
strategy for the classic Turing test instead of honest
attempts at demonstrating true artificial intelligence.
Human bias on the part of the panel of judges also
made the results of each run of the Turing test highly unpredictable and even suspect. To the degree possible, scoring should be consistent and unambiguous,
with clearly defined performance criteria aligning
with standard proficiency level scoring. These scoring constraints should also significantly influence
test design and generation itself. To prevent tampering and other fakery, all test generators and scoring
systems should run in a common secure cloud, and
all tests and results should be immutably archived
there for future validation.
The Scoring System Should Reward Proficiency over Dimensions of Intelligence.
Figure 2. Good Candidates for the I-athlon Events. The figure illustrates candidate events spanning dimensions such as theory of mind, natural language, and common sense.
Natural Language Understanding — Given a verbal
description of a situation, select the image that best
describes the situation. Vary the amount of visual distraction.
Collaboration — Given descriptions of a collection
of agents with differing capabilities, describe how to
achieve one or more goals within varying constraints
such as time, energy consumption, and cost.
Competition — Given two teams of agents, their
capabilities and a zero-sum goal, describe both offensive and defensive strategies for each team for winning, initially based on historical performance but
eventually in near real time.
Reasoning — Given a set of states, constraints, and
rules, answer questions about inferred states and relationships. Explain the answers. Variations require use
of different logics and combinations of them.
Reasoning Under Uncertainty — Given a set of probable states, constraints, and rules, answer questions
about inferred states and relationships. Explain the answers.
Creativity — Given a goal and a set of assets, construct a solution. Vary by number and variety of
assets, complexity of goals, environmental constraints. Alternatively, provide a working solution
and attempt to improve it. Explain your solution.
Video Understanding — Given a video sequence,
describe its contents, context, and flow of activity.
Identify objects and characters, their degree of
agency and theory of mind. Predict next activity for
characters. Identify purpose of video (educational,
how-to, sporting event, storytelling, news, and others). Answer questions about the video and explain the answers.
Initiative — Given a set of agents with different
capabilities, goals, and attitudes, organize and direct
a collaborative effort to achieve a goal. Key here is
utilizing theory of mind to build and maintain the
team throughout the activity.
Learning — Given a collection of natural language
documents, successfully answer a series of questions
about the information expressed in the documents.
Vary question complexity and corpora size for different levels. Similar tests can be constructed for nonverbal or mixed media.
Planning — Given a situation in an initial state,
describe a plan to achieve a desired end state. Vary
the number and variety of elements, and the complexity of initial and end states, as well as the constraints to be obeyed in the solution (for example,
time limit).
Common Sense Physics — Given a situation and a
proposed change to the situation, describe the reactions to the change and the final state. Vary the complexity of the situation and the number of changes
and their order.
Language Translation — Given text/speech in one
language, translate it to another language. Vary by
simplicity of text, number of idioms used, slang, and so on.
Interaction — Given a partial dialogue transcript
between two or more agents, predict what will be the
next interactions in the exchange. Alternatively, given an anonymous collection of statements and a
description of multiple agents, assign the statements
to each agent and order the dialogue in time.
Embodiment — Given an embodiment with a collection of sensors and effectors, and an environment
surrounding that body, perform a given task in the
environment. Vary the number and sophistication of
sensors and effectors and tasks, the complexity of the
environment, the time allowed. Added bonus for
adapting to sensors/effectors added or disabled during the test.
Audio Understanding — Given an audio sequence,
describe the scene with any objects, actions, and
implications. Vary length and clarity, along with
complexity of audio sources in the scene.
Diagram Generation — Given a verbal description
of a process, generate a series of diagrams describing
the process. Alternatively use video input.
Imagination — Given a set of objects and agents
from a common domain along with their attributes
and capabilities, construct and describe a plausible
scenario. Score higher for richer, more complex interactions involving more agents and objects. Alternatively, provide two or more sets of objects and agents
from different domains and construct a plausible scenario incorporating both sets. Score higher for more
interaction across domains.
Approach for Designing
I-athlon Events
Given the requirement for automatic test generation
and scoring, we have explored applying the
CAPTCHA (von Ahn et al. 2003) approach to the general design of I-athlon events, and the results are
intriguing. CAPTCHA, which stands for “Completely
Automated Public Turing test to tell Computers and
Humans Apart,” was originally conceived as a means
to validate human users of websites while restricting
programmatic access by bots. By generating warped
or otherwise obscured images of words or alphanumeric sequences, the machine or human desiring to
access the website had to correctly declare the original sequence of characters that was used to generate
the test image, a task that was far beyond the ability
of current optical character recognition (OCR) programs or other known image processing algorithms.
Over time, an arms race of sorts has evolved, with
systems learning to crack various CAPTCHA schemes,
which in turn has driven the development of more
difficult CAPTCHA images. The effectiveness or security of CAPTCHA-based human interaction proofs
(HIPs) is not our interest here, but an explicit side
effect of the evolution of CAPTCHA technology is:
once an existing CAPTCHA-style test is passed by a
system, an advance has been achieved in AI. We feel
that by applying this approach to other dimensions
of intelligence we can motivate and sustain continual progress in achieving human-level AI and beyond.
There are several keys to developing a good
CAPTCHA-style test, many of which have to do with
its application as a cryptographic security measure.
For our purposes, however, we are only concerned
with the generalization of the approach for automat-
ed test generation and scoring where both humans
and machines can compete directly, not for any security applications. For the original CAPTCHA images
consisting of warped and obscured text, the generation script was designed to create any number of
testable images, and the level of obscuration was
carefully matched to what was relatively easy for
most humans while being nearly impossible for
machines. This pattern can be followed to develop I-athlon event tests by keeping the test scenario the
same each time but varying the amount of information provided or the amount of noise in that information for each level of proficiency. This approach
could be adapted for many of the dimensions of
intelligence described above.
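The sketch below illustrates that pattern under stated assumptions: a fixed scenario, a single noise knob, and an assumed mapping from proficiency level to noise, so that harder levels are generated from the same material.

```python
import random

# Assumed mapping from proficiency level to the fraction of the
# signal that is corrupted; harder levels get noisier tests.
NOISE_BY_LEVEL = {1: 0.05, 2: 0.15, 3: 0.30, 4: 0.50, 5: 0.70}

def corrupt(text, level, seed):
    """Produce a level-graded test item from a fixed scenario.

    The scenario (here, a string to be recovered) stays the same;
    only the amount of noise varies with the proficiency level,
    mirroring how CAPTCHA obscuration was tuned to human ability.
    """
    rng = random.Random(seed)
    noise = NOISE_BY_LEVEL[level]
    return "".join("#" if rng.random() < noise else ch for ch in text)

for level in (1, 3, 5):
    print(level, corrupt("the quick brown fox", level, seed=7))
```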
For I-athlon events, the generation algorithms
must also be able to produce any number of distinct
test scenarios, but at different levels of sophistication
that will require different levels of intelligence to
succeed, four levels for human achievement and a
fifth for superhuman. It would also be important for
the generation algorithms to produce identical tests
from a given seed value. This would allow for
efficient documentation of the tests generated as well
as provide for experimental repeatability by different
researchers. We anticipate that the definition of each event, the design of its standard test generator,
and the scoring system and levels will be active areas
of research and debate. We include in this article a
brief outline for several events to demonstrate the
idea. Since the goal of the I-athlon is continual coordinated progress toward the goals of AI, all this effort
adds significantly to our understanding of intelligence as well as our ability to add intelligence to
computer systems.
To support automatic test generation and scoring
for an event, the key is to construct the test so that a
small number of variables can programmatically
drive a large number of variant test cases that directly map to clear levels of intelligent human ability.
Human baselines for these events can be obtained through crowdsourcing, incentivizing large numbers of humans to take the tests, probably through mobile apps. This raises the requirement for
an I-athlon event to provide appropriate interfaces
for both human and machine contestants.
Some examples include events that involve simple
planning, video understanding, embodiment, and
object identification.
A Simple Planning Event
For example, consider an I-athlon event for planning
based on a blocks world. An entire genre of two-dimensional physics-based mobile apps already generates puzzles of this type for humans.4 Size, shape,
initial location, and quantity of blocks for each test
can be varied, along with the complexity of the environment (gravity, wind, earthquakes) and of the goal state. For a blocks world test, the goal would likely be
reaching a certain height or shape with the available
blocks, with extra points given for using fewer blocks
to reach the goal in fewer attempts. Providing a completed structure as a goal might be too easy, unless
the ability to manipulate blocks through some virtual
device is also a part of the test. Automatic scoring
could be based on the test environment reaching a
state that passes the constraints of the goal, which
could be straightforward programming for a blocks
world but likely more challenging for other aspects
of intelligence. The test interface could be a touch-based graphical interface for humans and a REST API
for machines.
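A minimal sketch of what such automatic scoring might look like, assuming an illustrative goal constraint (a target height) and assumed bonus weights for economy of blocks and attempts:

    def score_blocks_test(final_heights, target_height, blocks_used,
                          blocks_available, attempts, max_attempts=10):
        """Score one attempt: zero unless the goal constraint is met, then
        a base score plus bonuses for using fewer blocks and attempts."""
        if max(final_heights, default=0) < target_height:
            return 0.0  # the final state fails the goal constraint
        base = 100.0
        block_bonus = 50.0 * (1 - blocks_used / blocks_available)
        attempt_bonus = 50.0 * (1 - (attempts - 1) / max_attempts)
        return base + block_bonus + attempt_bonus

    # Example: a tower of height 5 built with 6 of 10 blocks, second attempt.
    print(score_blocks_test([3, 5], target_height=5, blocks_used=6,
                            blocks_available=10, attempts=2))  # 165.0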
A Video Understanding Event
Given a set of individual video frames in random
order, discover the original order by analyzing content and context. Vary the “chunk size” of ordered
frames randomized to produce the test. Decimate the
quality of the video by masking or adding noise.
Scoring could be based on the fraction of frames correctly assembled in order within a time limit, or the
total time to complete the task.
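A minimal sketch of generation and scoring for this event, under assumed details the description leaves open (chunked shuffling, position-match scoring):

    import random

    def make_test(n_frames, chunk_size, seed):
        """Shuffle frame indices in chunks; smaller chunks are harder."""
        rng = random.Random(seed)
        chunks = [list(range(i, min(i + chunk_size, n_frames)))
                  for i in range(0, n_frames, chunk_size)]
        rng.shuffle(chunks)
        return [f for chunk in chunks for f in chunk]  # presented order

    def score(answer, n_frames):
        """Fraction of frames the contestant placed at their true positions."""
        return sum(1 for pos, f in enumerate(answer) if pos == f) / n_frames

    shuffled = make_test(n_frames=12, chunk_size=3, seed=7)
    print(score(sorted(shuffled), n_frames=12))  # a perfect answer scores 1.0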
An Embodiment Event
Given a sensor/effector API to an embodied agent in
a virtual environment, complete a task in the environment using the sensory/motor abilities of the
agent. Vary the number and kinds of sensors and
effectors. Vary the complexity of the task and the
nature of the environment. Environments could be as limited as ChipWits5 or as open ended as MineCraft.6 A more sophisticated event would include potential identification and use of tools, or the ability to adapt to gaining or losing sensors and effectors during the test.
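A minimal sketch of the kind of sensor/effector API a test harness might expose; the interface and names are assumptions, not a published specification:

    from abc import ABC, abstractmethod

    class EmbodiedAgentAPI(ABC):
        """Illustrative interface between a test harness and a contestant."""

        @abstractmethod
        def sense(self, sensor_name: str) -> dict:
            """Return the current reading of one granted sensor."""

        @abstractmethod
        def act(self, effector_name: str, command: dict) -> bool:
            """Send a command to one granted effector; False if refused."""

        @abstractmethod
        def capabilities(self) -> dict:
            """List currently granted sensors and effectors; the harness
            may change this mid-test in the adaptation variant."""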
An Object-Identification Event
Given recent advances applying DNNs to object
recognition, one might think this event would not be
interesting. But human visual intelligence allows us
to recognize millions of distinct objects in many
thousands of classes, and the breadth of this ability is
important for general intelligence. This event would
generate test images by mixing and overlaying partial images from a very large collection of sources.
Scoring would be based on the number of correctly
identified objects per image and the time required
per image and per test.
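A minimal sketch of such per-image scoring, with an assumed linear time discount (the event description fixes no formula):

    def score_image(predicted, ground_truth, seconds, time_limit=30.0):
        """Score one composite image: the fraction of objects correctly
        identified, weighted up when the answer arrives quickly."""
        correct = len(set(predicted) & set(ground_truth))
        recall = correct / len(ground_truth)
        time_factor = max(0.0, 1.0 - seconds / time_limit)
        return recall * (1.0 + time_factor)

    # Example: 3 of 4 overlaid objects found in 12 seconds.
    print(score_image({"cat", "cup", "car"},
                      {"cat", "cup", "car", "tree"}, seconds=12.0))  # 1.2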
Competition Framework
and Ecosystem
Our goal to motivate coordinated effort toward the
goal of AI requires not only a standard set of events,
test generators, and scorers, but also an overall frame-
work for public competition and comparison of
results in an unbiased manner. Given the large number of successful industry-wide competitions in different areas of computer science and engineering, we propose taking key aspects of each and combining them into a shared platform of ongoing I-athlon events.
Sites like Graph 5007 provide an excellent model
for test generation and common scoring. A common
cloud-hosted platform for developing and running
events and for archiving tests and results will be
required, even if competitors run their systems on
their own resources. A central location for running
the competitions would help limit bias and would
also provide wider publicity for successes. Having
such a persistent platform along with automated test
generation and scoring would support the concept
of continuous competition, allowing new entrants at
any time with an easy on-ramp to the AI research
community. Continuous competitions can prequalify participants in head-to-head playoffs held concurrently with major AI conferences, similar to the
RoboCup8 competitions.
In addition to the professional and graduate-level
research communities, such a framework could support competitions at undergraduate and secondary
school levels. Extensive programming and engineering communities have been created using this
approach, with TopCoder9 and First Robotics10 as
prime examples. These not only serve a valuable
mentoring role in the development of skills, but also
recruit high-potential students into the AI research community.
Incentives beyond eminence and skill building
also have proven track records for motivating
progress. The X-Prize11 approach has proven to be
highly successful in focusing research attention, as
have the DARPA Challenges12 for self-driving vehicles and legged robots. Presenting a unified, organized framework for progress in AI would go a long
way to attract this kind of incentive funding.
The division of labor made possible by the proposed approach could fit nicely within the research
agendas of numerous universities at all levels, supporting the development of common AI curricula and
supporting research programs targeted at different
aspects of the I-athlon ecosystem.
Call to Action
We welcome feedback and collaboration from the
broad research community to develop and administer a continuing series of I-athlon events according
to the model proposed in this article. Our ultimate
goal is to motivate the AI research community to
understand and develop research agendas that get to
the core of general machine intelligence. As we know
from the history of AI, this is such a complex problem with so many yet-unknown dimensions, that
the only way to make measurable progress is to develop rigorous, practical, yet flexible tests that require
the use of multiple dimensions. The tests themselves
can evolve, as we understand the nature of intelligence. We look forward to making progress in the AI
field through such an activity.
1. See, for example, the Kapor-Kurzweil bet:
4. For example,,
Turing, A. 1950. Computing Machinery and Intelligence. Mind 59(236): 433–460.
von Ahn, L.; Blum, M.; Hopper, N.; and Langford, J. 2003. CAPTCHA: Using Hard AI Problems for Security. In Proceedings of the Annual International Conference on the Theory and Applications of Cryptographic Techniques (EUROCRYPT-03). Carson City, NV: International Association for Cryptologic Research.
Sam S. Adams ([email protected]) works for IBM
Research and was appointed one of IBM’s first distinguished
engineers in 1996. His far-ranging contributions include
founding IBM’s first object technology practice, authoring
IBM’s XML technical strategy, originating the concept of
service-oriented architecture, pioneering work in self-configuring and autonomic systems, artificial general intelligence, end-user mashup programming, massively parallel
many-core programming, petascale analytics, and data-centered systems. Adams is currently working on cloud-scale
cognitive architectures for the Internet of Things, and has
particular interests in artificial consciousness and autocognitive systems.
Guruduth Banavar, as vice president of cognitive computing at IBM Research, currently leads a global team of
researchers creating the next generation of IBM’s Watson
systems — cognitive systems that learn, reason, and interact
naturally with people to perform a variety of knowledge-based tasks. Previously, as the chief technical officer of IBM's
Smarter Cities initiative, he designed and implemented big
data and analytics-based systems to make cities more livable
and sustainable. Prior to that, he was the director of IBM
Research in India, which he helped establish as a preeminent center for services research and mobile computing. He
has published extensively, holds more than 25 patents, and
his work has been featured in the New York Times, the Wall
Street Journal, the Economist, and other international media.
He received a Ph.D. from the University of Utah before joining IBM’s Thomas J. Watson Research Center in 1995.
Murray Campbell is a principal research staff member at
the IBM Thomas J. Watson Research Center in Yorktown
Heights, NY. He was a member of the team that developed
Deep Blue, the first computer to defeat the human world
chess champion in a match. Campbell has conducted
research in artificial intelligence and computer chess, with
numerous publications and competitive victories, including
eight computer chess championships. This culminated in
the 1997 victory of the Deep Blue chess computer, for which
he was awarded the Fredkin Prize and the Allen Newell
Research Excellence Medal. He has a Ph.D. in computer science from Carnegie Mellon University, and is an ACM Distinguished Scientist and a Fellow of the Association for the
Advancement of Artificial Intelligence. He currently manages the AI and Optimization Department at IBM Research.
Software Social Organisms:
Implications for Measuring
AI Progress
Kenneth D. Forbus
In this article I argue that achieving
human-level AI is equivalent to learning how to create sufficiently smart software social organisms. This implies
that no single test will be sufficient to
measure progress. Instead, evaluations
should be organized around showing
increasing abilities to participate in our
culture, as apprentices. This provides
multiple dimensions within which
progress can be measured, including
how well different interaction modalities can be used, what range of domains
can be tackled, what human-normed
levels of knowledge they are able to
acquire, as well as others. I begin by
motivating the idea of software social
organisms, drawing on ideas from other areas of cognitive science, and provide
an analysis of the substrate capabilities
that are needed in social organisms in
terms closer to what is needed for computational modeling. Finally, the implications for evaluation are discussed.
Today's AI systems can be remarkably effective. They can
solve planning and scheduling problems that are
beyond what unaided people can accomplish, sift
through mountains of data (both structured and unstructured) to help us find answers, and robustly translate speech
and handwriting into text. But these systems are carefully
crafted for specific purposes, created and maintained by
highly trained personnel who are experts in artificial intelligence and machine learning. There has been much less
progress on building general-purpose AI systems, which
could be trained and tasked to handle multiple jobs. Indeed,
in my experience, today’s general-purpose AI systems tend to
skate a very narrow line between catatonia and attention
deficit disorder.
People and other mammals, by contrast, are not like that.
Consider dogs. A dog can be taught to do tasks like shaking
hands, herding sheep, guarding a perimeter, and helping a
blind person maneuver through the world. Instructing dogs
can be done by people who don’t have privileged access to
the internals of their minds. Dogs don’t blue screen. What if
AI systems were as robust, trainable, and taskable as dogs?
That would be a revolution in artificial intelligence.
In my group’s research on the companion cognitive architecture (Forbus et al. 2009), we are working toward such a
revolution. Our approach is to try to build software social
organisms. By that we mean four things:
First, companions should be able to work with people
using natural interaction modalities. Our focus so far has
been on natural language (for example, learning by reading
[Forbus et al. 2007; Barbella and Forbus 2011]) and sketch
understanding (Forbus et al. 2011).
Second, companions should be able to learn and adapt
over extended periods of time. This includes formulating
their own learning goals and pursuing them, in order to
improve themselves.
Third, companions should be able to maintain themselves. This does not mean a 24-hour, 7-day-a-week operation
— even people need to sleep, to consolidate learning. But
they should not need AI experts peering into their
internal operations just to keep them going.
Fourth, people should be able to relate to companions as collaborators, rather than tools. This
requires companions to learn about the people that
they are working with, and build relationships with
them that are effective over the long term.
Just to be clear, our group is a long way from
achieving these goals. And this way of looking at the
problems is far from standard in AI today. Consider for
example IBM's Watson. While extremely adept at factoid question answering, Watson would not be
considered an organism by these criteria. It showed a
groundbreaking ability to do broad natural language
processing, albeit staying at a fairly shallow, syntactic
level much of the time. But it did not formulate its
own learning goals nor maintain itself. It required a
team of AI experts inspecting its internals constantly
through development, adding and removing by hand
component algorithms and input texts (Baker 2011).
Other examples are cognitive architectures that started as models of skill learning, like ACT-R (Anderson
and Lebiere 1998) or SOAR (Laird 2012). Such architectures have done an impressive job at modeling a
variety of psychological phenomena, and have also
been used successfully in multiple performance-oriented systems. However, using them typically involves
generating by hand a model of a specific cognitive
phenomenon, such as learning to solve algebraic
equations. The model is typically expressed in the rule
language of the architecture, although for some experiments simplified English is used to provide declarative knowledge that the system itself proceduralizes.
The model is run multiple times to satisfy the conditions of the experiment, and then is turned off. More
ambitious uses (for example, as pilots in simulated
training exercises [Laird et al. 1998], or as
coaches/docents [Swartout et al. 2013]) work in narrow domains, for short periods of time, and with most
of the models being generated by hand. Creating systems that live and learn over extended periods of time
on their own is beyond the state of the art today.
Recently, more people are starting to work on
aspects of this. Research on interactive task learning
(Hinrichs and Forbus 2014, Kirk and Laird 2014) is
directly concerned with the first two criteria above,
and to some degree the third. Interactive task learning is a sweet spot in this research path. But I think
the importance of the shift from treating software as
tools versus collaborators should not be underestimated, both for scientific and for practical reasons.
The scientific reasons are explained below. As a practical matter, the problems humanity faces are growing more complex, while human cognitive capacities
remain constant. Working together fluently in teams
with systems that are enough like us to be trusted,
and have complementary strengths and weaknesses,
could help us solve problems that are beyond our
reach today.
The companion cognitive architecture incorporates two other scientific hypotheses. The first is that
analogical reasoning and learning, over structured,
relational representations, is ubiquitous in human
cognition (Gentner 2003). There is evidence that the
comparison process defined by Gentner’s structuremapping theory (Gentner 1983) operates across a
span of phenomena that includes high-level vision
and auditory processing, inductive reasoning, problem solving, and conceptual change. The second
hypothesis is that qualitative representations are central in human cognition. They provide a level of
description that is appropriate for commonsense reasoning, grounding for professional knowledge of
continuous systems (for example, scientists, engineers, analysts), and a bridge between perception and
cognition (Forbus 2011). These two hypotheses are
synergistic, for example, qualitative representations
provide excellent grist for analogical learning and reasoning.
These two specific hypotheses might be correct or
might be wrong. But independent of them, I think
the concept of software social organisms is crucial, a
way of reframing what we mean by human-level AI,
and does so in a way that suggests better measurements than we have been using. So let us unpack this
idea further.
Why Software Social Organisms?
I claim that human-level AI is equivalent to sufficiently smart software social organisms. I start by
motivating the construction of organisms, then
argue that they need to be social organisms. A specification for the substrate capabilities that are needed
to be a social organism is proposed, based on evidence from the cognitive science literature.
Why Build Organisms?
There are two main reasons for thinking about building AI systems in terms of constructing software
organisms. The first is autonomy. We have our own
goals to pursue, in addition to those provided by others. We take those external goals as suggestions,
rather than as commands that we run as programs in
our heads. This is a crucial difference between people
and today’s AI systems. Most AI systems today can’t
be said to have an inner life, a mix of internally and
externally generated plans and goals, whose pursuit
depends on its estimation of what it should be doing.
The ability to punt on an activity that is fruitless, and
to come up with better things to do, is surely part of
the robustness that mammals exhibit. There has been
some promising work on metacognition that is starting to address these issues (Cox and Raja 2011), but
the gap between human abilities and AI systems
remains wide.1
Another aspect of autonomy is the separation of
internal versus external representations. We do not
have direct access to the internal representations of
children or our collaborators. (Cognitive science
would be radically simpler if we did.) Instead, we
communicate through a range of modalities, including natural language, sketching, gesture, and physical demonstrations. These work because the recipient
is assumed to have enough smarts to figure them out.
The imperfections of such communications are well
known, that is, the joint construction of context in
natural language dialogue involves a high fraction of
exchanges that are diagnosing and repairing miscommunications. To be sure, there are strong relationships between internal and external representations: Vygotsky (1962), for example, argues that
much of thought is inner speech, which is learned
from external speech. But managing that relationship for itself is one of the jobs of an intelligent organism.
The second reason for building organisms is adaptation. Organisms adapt. We learn incrementally and
incidentally in everyday life constantly. We learn
about the world, including learning on the job. We
learn things about the people around us, both people
we work and play with and people who are part of
our culture that we have never interacted with and
likely never will (for example, political figures,
celebrities). We learn about ourselves as well: what we
like and dislike, how to optimize our daily routines,
what we are good at, bad at, and where we’d like to
improve. We build up this knowledge over days,
weeks, months, and years. We are remarkably good
at this, adapting stably — very few people go off the
rails into insanity. I know of no system that learns in
a broad range of domains over even days without
human supervision by people who understand its
internals. That is radically different from people, who
get by with feedback from the world and from other
people who have no privileged access to their internals.
Having autonomy and adaptability covers the second and third desiderata, and can be thought of as an
elaboration of what is involved in achieving them.
Communication through natural modalities is
implied by both, thereby covering the first at least
partly. But to complete the argument for the first,
and to handle the fourth (collaborators), we need to
consider why we want social organisms.
Why Social Organisms?
People are social animals. It has been proposed (for
example, Tomasello [2001]) that, in evolutionary
terms, being social provides a strong selection bias
toward intelligence. Social animals have to track the
relationships between themselves and others of their
species. Being social requires figuring out who are
your friends and allies, versus your competitors and
enemies. Relationships need to be tracked over time,
which involves observing how others are interacting
to build and maintain models of their relationships.
Sociality gives rise to friendship and helping, as well
as to deceit and competition. These cognitive challenges seem to be strong drivers toward intelligence,
as most social creatures tend to be more intelligent
than those that are not, with dolphins, crows, and
dogs being well-known examples.
A second reason for focusing on social organisms
is that much of what people learn is from interactions with other people and their culture (Vygotsky
1962). To be sure, we learn much about the basic
properties of materials and objects through physical
manipulation and other experiences in the world.
But we can all think about things that we have never experienced. No one reading this lived through the
American Revolutionary War, for example, nor did
they watch the Galápagos Islands form with their
own eyes. Yet we all can have reasonably good models of these things. Moreover, even our knowledge of
the physical world has substantial contributions
from our culture: how we carve the mechanisms
underlying events into processes is enshrined in natural language, as well as aspects of how we carve visual scenes up into linguistic descriptions (for example,
Coventry and Garrod [2004]).
A number of AI researchers have proposed that
stories are central to human intelligence (Schank
1996, Winston 2012). The attraction and power of
stories is that they can leverage the same cognitive
capacities that we use to understand others, and provide models that can be used to handle novel situations. Moral instruction, for example, often relies on
stories. Other AI researchers have directly tackled
how to build systems that can cooperate and collaborate with people (Allen et al. 2007; Grosz, Hunsberger, and Kraus 1999). These lines of research provide important ingredients for building social
organisms, but much work remains to be done.
Hence my claim that human-level AI systems will
simply be sufficiently smart software social organisms. By sufficiently smart, I mean capable of learning to perform a broad range of tasks that people perform, with similar amounts of input data and
instruction, arriving at the same or better levels of
performance. Does it have to be social? If not, it
could not discuss its plans, goals, or intentions, and
could not learn from people using natural interaction
modalities. Does it have to be an organism? If not, it
will not be capable of maintaining itself, which is
something that people plainly do.
Substrate Capabilities for Social Organisms
This equivalence makes understanding what is needed to create social organisms more urgent. To that
end, here is a list of substrate capacities that I believe
will be needed to create human-level social organisms. These are all graded dimensions, which means
that incremental progress measures can be formulated and used as dimensions for evaluation.
(1) Autonomy. They will have their own needs, drives,
and capabilities for acting and learning. What should those needs and
drives be? That will vary, based on the
niche that an organism is operating
in. But if we are wise, we will include
in their makeup the desire to be good
moral actors, as determined by the
culture they are part of, and that they
will view having good relationships
with humans as being important to
their own happiness.
(2) Operates in environments that
support shared focus. That is, each
participant has some information
about what others can sense, and participants can make their focus of
attention known to each other easily.
People have many ways of drawing
attention to people, places, or things,
such as talking, pointing, gesturing,
erecting signs, and winking. But even
with disembodied software, there are
opportunities for shared focus, for
example, selection mechanisms commonly used in GUIs, as well as speech
and text. Progress in creating virtual
humans (for example, Bohus and
Horvitz [2011] and Swartout et al.
[2013]) is increasing the interactive
bandwidth, as is progress in human-robotics interaction (for example,
Scheutz et al. [2013]).
(3) Natural language understanding
and generation capabilities sufficient
to express goals, plans, beliefs, desires,
and hypotheticals. Without this capability, building a shared understanding of a situation and formulating
joint plans becomes much more difficult.
(4) Ability to build models of the
intentions of others. This implies
learning the types of goals they can
have, and how available actions feed
into those goals. It also requires models of needs and drives as the wellsprings of particular goals. This is the
basis for modeling social relationships.
(5) Strong interest in interacting with
other social organisms (for example,
people), especially including helping
and teaching. Teaching well requires
building up models of what others
know and tracking their progress.
There is ample evidence that other
animals learn by observation and imitation. The closest thing to teaching
in other animals found so far is that,
in some species, parents bring increasingly challenging prey to their
young as they grow. By contrast,
human children will happily help
adults, given the opportunity (for
example, Liszkowski, Carpenter, and
Tomasello [2008]).
This list provides a road map for
developing social organisms of varying
degrees of complexity. Simpler environmental niches require less in terms
of reference to shared focus, and
diminished scope for beliefs, plans,
and goals, thereby providing more
tractable test beds for research. I view
Allen's TRIPS system (Ferguson and
Allen 1998), along with virtual
humans research (Bohus and Horvitz
2011, Swartout et al. 2013), as examples of such test beds. As AI capabilities
increase, so can the niches, until ultimately the worlds they operate in are
coextensive with our own.
Implications for Measuring Progress
This model for human-level AI has several implications for measuring
progress. First, it should be clear that
no single test will work. No single test
can measure adaptability and breadth.
Single tests can be gamed, by systems
that share few of the human characteristics above. Believability, which is
what the Turing test is about, is particularly problematic since people tend to
treat things as social beings (Reeves
and Nass 2003).
What should we do instead? I
believe that the best approach is to
evaluate AI systems by their ability to
participate in our culture. This means
having AI systems that are doing some
form of work, with roles and responsibilities, interacting with people appropriately. While doing this, they need to adapt and learn, about their work, about others, and about themselves. And they need to do so without AI experts constantly fiddling with their internals.
I believe the idea of apprenticeship
is an extremely productive approach
for framing such systems. Apprenticeship provides a natural trajectory for
bringing people into a role. They start
as a student, with lots of book learning
and interaction. There are explicit lessons and tests to gauge learning. But
there is also performance, at first with
simple subtasks. As an apprentice
learns, their range of responsibilities is
expanded to include joint work, where
roles are negotiated. Finally, the
apprentice graduates to autonomous
operation within a community, performing well on its own, but also interacting with others at the same level.
Apprentices do not have to be perfect:
They can ask for help, and help others
in turn. And in time, they start training their own apprentices.
Apprenticeship can be used in a
wide variety of settings. For example,
we are using this approach in working
with companions in a strategy game,
where the game world provides a rich
simulation and source of problems and
decisions to make (Hinrichs and Forbus 2015). Robotics-oriented researchers might use assembly tasks or
flying survey or rescue drones in environments of ever-increasing complexity.
An example of a challenge area for
evaluating AIs is science learning and
teaching. The scientific method and its
products are one of the highest
achievements of human culture. Ultimately, one job of AIs should be helping people learn science, in any
domain and at any level. The Science
Test working group2 has proposed the
following trajectory, as a way of incrementally measuring progress. First,
evaluate the ability of AI systems to
answer questions about science, using
standardized human-normed tests,
such as the New York Regents Science
Tests, which are available for multiple
years and multiple levels. Second, evaluate the ability of AI systems to learn
new scientific concepts, by reading,
watching videos, and interacting with
people. Third, evaluate the ability of AI
systems to communicate what they
know about science across multiple
domains and at multiple levels. We
conjecture that this provides a scalable
trajectory for evaluating AI systems,
with the potential for incremental and
increasing benefits for society as
progress is made.
This challenge illustrates how useful
the apprenticeship approach can be for
evaluation. The first phases are aimed
at evaluating systems as students,
ensuring that they know enough to
contribute. The middle phase focuses
on being able to contribute, albeit in a
limited way. The final phase is focused
on AIs becoming practitioners. Notice
that in each phase there are multiple
dimensions of scalability: number of
domains, level of knowledge (for
example, grade level), and modalities
needed to communicate. (We return to
the question of scalable evaluation
dimensions more generally below.)
Progress across these dimensions need
not be uniform: some groups might
focus entirely on maximizing domain
coverage, while others might choose to
stick with a single domain but start to
focus early on tutoring within that
domain. This provides a rich tapestry
of graded challenges. Moreover, incremental progress will lead to systems
that could improve education.
Scalable Evaluation Dimensions
A productive framework should provide a natural set of dimensions along
which progress can be made and measured. Here are some suggestions
implied by the software social organism approach.
Natural Interaction Modalities
Text, speech, sketching, vision, and
mobility are all capabilities that can be
evaluated. Text can be easier than
speech, and sketching can be viewed as
a simplified form of vision.
Initial Knowledge Endowment
How much of what a system knows is
learned by the system itself, versus
what it has to begin with? What the
absolute minimum initial endowment
might be is certainly a fascinating scientific question, but it is probably best
answered by starting out with substantially more knowledge and learning
how human-level capabilities can be
reached. Understanding those pathways should better enable us to understand what minimal subsets can work.
It is very seductive to start from
scratch, and perhaps easier, if it could
be made to work. But the past 50 years
of research suggests that this is much
harder than it seems: Look at the various “robot baby” projects that have
tried that. Arguably, given that IBM’s
Watson used more than 900 million
syntactic frames as part of its knowledge base, the 5 million facts encoded
in ResearchCyc might well be considered a small starting endowment.
Level of Domain Knowledge and Skill
Prior work on learning apprentices (for
example, Mitchell et al. [1994]) focused
on systems that helped people perform
better in particular domains. They started with much of the domain knowledge that they would need, and learned
more about how to operate in that
domain. In qualitative reasoning, many
systems have been built that incorporate expert-level models for particular
domains (Forbus 2011). Breadth is now
the challenge. Consider what fourth
graders know about science (Clark
2015), and the kinds of social interactions they can have with people. AI systems are still far from that level of
accomplishment, nor can they grow
into expertise by building on their
everyday knowledge, as people seem to
do (Forbus and Gentner 1997).
Range of Tasks the System Is Responsible For
Most AI systems have focused on single tasks. Being able to accomplish
multiple tasks with the same system
has been one of the goals of research
on cognitive architecture, and with
interactive task learning, the focus is
shifting to being able to instruct systems in new tasks, an important step
toward building systems capable
enough to be apprentices.
Learning Abilities
Software social organisms need to
learn about their jobs, the organisms
(people and machines) that they work
with, and about themselves. While
some problems may well require massive amounts of data and deep learning
(for example, speech recognition
[Graves, Mohamed, and Hinton
2013]), people are capable of learning
many things with far fewer examples.
Office assistants who required, for
example, 10,000 examples of how to
fill out a form before being able to do
it themselves would not last long in
any reasonable organization. There are
many circumstances where children
learn rapidly (for example, fast mapping in human word learning [Carey
2010]), and understanding when this
can be done, and how to do it, is an
important question.
I have argued that the goal of human-level AI can be equivalently expressed
as creating sufficiently smart software
social organisms. This equivalence is
useful because the latter formulation
makes strong suggestions about how
such systems should be evaluated. No
single test is enough, something which
has become very apparent from the
limitations of Turing’s test, which
brought about the workshop that
motivated the talk that this article was
based on. More positively, it provides a
framework for organizing a battery of
tests, namely the apprenticeship trajectory. An apprentice is initially a student, learning from instructors
through carefully designed exercises.
Apprentices start working as assistants
to a mentor, with increasing responsibility as they learn. Eventually they
start working autonomously, communicating with others at their same level, and even taking on their own
apprentices. If we can learn how to
build AI systems with these capabilities, it would be revolutionary. I hope
the substrate capabilities for social
organisms proposed here will encourage others to undertake this kind of research.
The fantasy of the Turing test, and
many of its proposed replacements, is
that a single simple test can be found
for measuring progress toward human-level AI. Part of the attraction of this
view is that the alternative is both difficult and expensive. Many tests,
involving multiple capabilities and
interactions over time with people, all
require substantial investments in
research, engineering, and evaluation.
But given that we are tackling one of
the deepest questions ever asked by
humanity, that is, what is mind, this
should not be too surprising. And I
believe it will be an extraordinarily
productive investment.
I thank Dedre Gentner, Tom Hinrichs,
Mike Tomasello, Herb Clark, and the
Science Test Working Group for many
helpful discussions and suggestions.
This research is sponsored by the
Socio-Cognitive Architectures and the
Machine Learning, Reasoning, and
Intelligence Programs of the Office of
Naval Research and by the Computational and Machine Intelligence Program of the Air Force Office of Scientific Research.
1. Part of the gap, I believe, is the dearth of
broad and rich representations in most AI
systems, exacerbated by our failure as a field
to embrace existing off-the-shelf resources
such as ResearchCyc.
2. The Science Test Working Group includes
Peter Clark, Barbara Grosz, Dragos Margineantu, Christian Lebiere, Chen Liang, Jim
Spohrer, Melanie Swan, and myself. It is one
of several groups working on tests that, collectively, should provide better ways of
measuring progress in AI.
Allen, J.; Chambers, N.; Ferguson, G.; Galescu, L.; Jung, H.; Swift, M.; and Taysom, W.
2007. PLOW: A Collaborative Task Learning
Agent. In Proceedings of the Twenty-Second
AAAI Conference on Artificial Intelligence. Palo
Alto, CA: AAAI Press.
Anderson, J. R., and Lebiere, C. 1998. The
Atomic Components of Thought. Mahwah, NJ: Lawrence Erlbaum Associates.
Baker, S. 2011. Final Jeopardy! Man Versus
Machine and the Quest to Know Everything.
New York: Houghton Mifflin Harcourt.
Barbella, D., and Forbus, K. 2011. Analogical Dialogue Acts: Supporting Learning by
Reading Analogies in Instructional Texts. In
Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence, 1429–1435.
Palo Alto, CA: AAAI Press.
Bohus, D., and Horvitz, E. 2011. Multiparty
Turn Taking in Situated Dialog: Study, Lessons, and Directions. In Proceedings of the
SIGDIAL 2011 Conference, The 12th Annual
Meeting of the Special Interest Group on Discourse and Dialogue. Stroudsburg, PA: Association for Computational Linguistics.
Carey, S. 2010. Beyond Fast Mapping. Language Learning and Development 6(3): 184–205.
Clark, P. 2015. Elementary School Science
and Math Tests as a Driver for AI: Take the
Aristo Challenge! In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, 4019–4021. Palo Alto, CA: AAAI Press.
Coventry, K., and Garrod, S. 2004. Saying,
Seeing, and Acting: The Psychological Semantics of Spatial Prepositions. London: Psychology Press.
Cox, M., and Raja, A. 2011. Metareasoning:
Thinking about Thinking. Cambridge, MA:
The MIT Press.
Ferguson, G., and Allen, J. 1998. TRIPS: An
Intelligent Integrated Problem-Solving
Assistant. In Proceedings of the Fifteenth
National Conference on Artificial Intelligence
and Tenth Innovative Applications of Artificial
Intelligence Conference, 567–573. Menlo Park, CA: AAAI Press.
Forbus, K. 2011. Qualitative Modeling.
Wiley Interdisciplinary Reviews: Cognitive Science 2(4) (July/August): 374–391.
Forbus, K., and Gentner, D. 1997. Qualitative Mental Models: Simulations or Memories? Paper presented at the Eleventh International Workshop on Qualitative
Reasoning, Cortona, Italy, June 3–6.
Forbus, K.; Riesbeck, C.; Birnbaum, L.; Livingston, K.; Sharma, A.; and Ureel, L. 2007.
Integrating Natural Language, Knowledge
Representation and Reasoning, and Analogical Processing to Learn by Reading. In Proceedings of the Twenty-Second Conference on
Artificial Intelligence. Palo Alto, CA: AAAI
Forbus, K. D.; Klenk, M.; and Hinrichs, T.
2009. Companion Cognitive Systems:
Design Goals and Lessons Learned So Far.
IEEE Intelligent Systems 24(4): 36–46.
Forbus, K. D.; Usher, J.; Lovett, A.; Lockwood, K.; and Wetzel, J. 2011. Cogsketch:
Sketch Understanding for Cognitive Science
Research and for Education. Topics in Cognitive Science 3(4): 648–666.
Gentner, D. 1983. Structure-Mapping: A
Theoretical Framework for Analogy. Cognitive Science 7(2): 155–170.
Gentner, D. 2003. Why We’re So Smart. In
Language in Mind, ed. D. Gentner and S.
Goldin-Meadow. Cambridge, MA: The MIT Press.
Graves, A.; Mohamed, A.; and Hinton, G.
2013. Speech Recognition with Deep Recurrent Neural Networks. In Proceedings of the
2013 IEEE International Conference on
Acoustic Speech and Signal Processing. Piscataway, NJ: Institute for Electrical and Electronics Engineers.
Grosz, B.; Hunsberger, L.; and Kraus, S.
1999. Planning and Acting Together. AI
Magazine 20(4): 23–34.
Hinrichs, T., and Forbus, K. 2014. X Goes
First: Teaching Simple Games through Multimodal Interaction. Advances in Cognitive
Systems 3: 31–46.
Hinrichs, T., and Forbus, K. 2015. Qualitative Models for Strategic Planning. Proceedings of the Third Annual Conference on
Advances in Cognitive Systems, Atlanta, GA.
Kirk, J., and Laird, J. 2014. Interactive Task
Learning for Simple Games. Advances in
Cognitive Systems 3: 13–30.
Laird, J. 2012. The SOAR Cognitive Architecture. Cambridge, MA: The MIT Press.
Laird, J.; Coulter, K.; Jones, R.; Kenney, P.;
Koss, F.; and Nielsen, P. 1998. Integrating Intelligent Computer Generated Forces in Dis-
tributed Simulations: TacAir-SOAR in
STOW-97. Paper presented at the Simulation Interoperability Workshop, 9–13
March, Orlando, FL.
Liszkowski, U.; Carpenter, M.; and Tomasello, M. 2008. Twelve-Month-Olds Communicate Helpfully and Appropriately for
Knowledgeable and Ignorant Partners. Cognition 108(3): 732–739.
Mitchell, T.; Caruana, R.; Freitag, D.; McDermott, J.; and Zabowski, D. 1994. Experience
with a Personal Learning Assistant. Communications of the ACM 37(7): 80–91.
Reeves, B., and Nass, C. 2003. The Media
Equation: How People Treat Computers, Television, and New Media Like Real People and
Places. Palo Alto, CA: Center for the Study of
Language and Information, Stanford University.
Schank, R. 1996. Tell Me a Story: Narrative
and Intelligence. Evanston, IL: Northwestern
University Press.
Scheutz, M.; Briggs, G.; Cantrell, R.; Krause,
E.; Williams, T.; and Veale, R. 2013. Novel
Mechanisms for Natural Human-Robot
Interactions in the DIARC Architecture. In
Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence. Palo Alto, CA:
AAAI Press.
Swartout, W.; Artstein, R.; Forbell, E.; Foutz,
S.; Lane, H.; Lange, B.; Morie, J.; Noren, D.;
Rizzo, A.; Traum, D. 2013. Virtual Humans
for Learning. AI Magazine 34(4): 13–30.
Tomasello, M. 2001. The Cultural Origins of
Human Cognition. Cambridge, MA: Harvard
University Press.
Vygotsky, L. 1962. Thought and Language.
Cambridge, MA: The MIT Press.
Winston, P. 2012. The Right Way. Advances
in Cognitive Systems 1: 23–36.
Kenneth D. Forbus is the Walter P. Murphy
Professor of Computer Science and Professor of Education at Northwestern University. His research interests include qualitative
reasoning, analogy, spatial reasoning and
learning, sketch understanding, natural language understanding, cognitive architecture, reasoning system design, intelligent
educational software, and the use of AI in
interactive entertainment. He is a fellow of
AAAI, ACM, and the Cognitive Science Society.
Principles for Designing
an AI Competition, or Why the
Turing Test Fails as an Inducement Prize
Stuart M. Shieber
If the artificial intelligence research
community is to have a challenge problem as an incentive for research, as
many have called for, it behooves us to
learn the principles of past successful
inducement prize competitions. Those
principles argue against the Turing test
proper as an appropriate task, despite its
appropriateness as a criterion (perhaps
the only one) for attributing intelligence
to a machine.
There has been a spate recently of calls for replacements
for the Turing test. Gary Marcus in The New Yorker asks
“What Comes After the Turing Test?” and wants “to
update a sixty-four-year-old test for the modern era” (Marcus
2014). Moshe Vardi in his Communications of the ACM article
“Would Turing Have Passed the Turing Test?” opines that “It’s
time to consider the Imitation Game as just a game” (Vardi
2014). The popular media recommends that we “Forget the
Turing Test” and replace it with a “better way to measure
intelligence” (Locke 2014). Behind the chorus of requests is
an understanding that the test has served the field of artificial intelligence poorly as a challenge problem to guide research.
This shouldn't be surprising: The test wasn't proposed by
Turing to serve that purpose. Turing’s Mind paper (1950) in
which he defines what we now call the Turing test concludes
with a short discussion of research strategy toward machine
intelligence. What he says is this:
We may hope that machines will eventually compete with men
in all purely intellectual fields. But which are the best ones to
start with? Even this is a difficult decision. . . . I do not know
what the right answer is, but I think [various] approaches
should be tried. (Turing [1950], page 460)
What he does not say is that we should be running Turing tests.
Perhaps Turing saw that his test is not at all suitable for this purpose, as I will argue in more detail
here. But that didn't stop some with an entrepreneurial spirit from staging Turing-test–inspired competitions. Several, including myself (Shieber 1994) and
Hayes and Ford (1995), argued such stunts to be misguided and inappropriate. The problem with misapplication of the Turing test in this way has been exacerbated by the publicity around a purported case of a
chatbot in June 2014 becoming “the first machine to
pass the Turing test” (The Guardian 2014), when of
course no such feat took place (Shieber 2014a). (It is
no coincidence that all of the articles cited in the first
paragraph came out in June 2014.)
It is, frankly, sad to see the Turing test besmirched
by its inappropriate application as a challenge problem for AI. But at least this set of events has had the
salutary effect of focusing the AI research community on the understanding that if the Turing test isn’t a
good challenge problem for guiding research toward
new breakthroughs, we should attend to devising
more appropriate problems to serve that role. These
calls to replace the pastime of Turing-test-like competitions are really pleas for a new inducement prize contest.
Inducement prize contests are award programs
established to induce people to solve a problem of
importance by directly rewarding the solver, and the
idea has a long history in other research fields — navigation, aviation, and autonomous vehicles, for
instance. If we are to establish an inducement prize
contest for artificial intelligence, it behooves us to
learn from the experience of the previous centuries
of such contests to design our contest in a way that
is likely to have the intended effect. In this article, I
adduce five principles that an inducement prize contest for AI should possess: occasionality of occurrence, flexibility of award, transparency of result,
absoluteness of criteria, and reasonableness of goal.
Any proposal for an alternative competition, moving
“beyond the Turing test” in the language of the January 2015 Association for the Advancement of Artificial Intelligence workshop,1 ought to be evaluated
according to these principles.
The Turing test itself fails the reasonableness principle, and its implementations to date in various
competitions have failed the absoluteness, occasionality, flexibility, and transparency principles, a clean
sweep of inappropriateness for an AI inducement
prize contest. Creative thinking will be needed to
generate a contest design satisfying these principles.
Inducement Prize Contests
There is a long history of inducement prizes in a
broad range of areas, including: navigation (the 1714
Longitude Prize), chemistry (the 1783 French Academy of Sciences prize for soda ash production), automotive transportation (the 1895 Great Chicago Auto
Race), aviation (numerous early 20th century prizes
culminating in the 1919 Orteig Prize for nonstop
transatlantic flight; the 1959 Kremer Prize for
human-powered flight), space exploration (the 1996
Ansari X Prize for reusable manned spacecraft), and
autonomous vehicles (the 2004 DARPA Grand Challenge). Inducement prizes are typically offered on the
not unreasonable assumption that they provide a
highly financially leveraged method for achieving
progress in the award area. Estimates of the leverage
have ranged up to a factor of 50 (Schroeder 2004).
There have been two types of competitions related to AI,2 though neither type serves well as an
inducement prize contest.
The first type of competition comprises regularly
scheduled enactments of (or at least inspired by) the
Turing test. The most well known is the Loebner Prize
Competition, held annually, though other similar
competitions have been held, such as the June 2014
Royal-Society-sponsored competition in London,
whose organizers erroneously claimed that entrant
Eugene Goostman had passed the Turing test
(Shieber 2014a). Although Hugh Loebner billed his
eponymous prize as a curative for the astonishing
claim that “People had been discussing AI, but
nobody was doing anything about it” (Lindquist
1991), his competition is not set up to provide appropriate incentives and has not engendered any
progress in the area so far as I can tell (Shieber 1994).
In the second type of competition, research funders, especially U.S. government funders like
DARPA, NSF, and NIST, have funded regular (typically annual) “bakeoffs” among funded research groups
working on particular applications — speech recognition, message understanding, question answering,
and so forth. These competitions have been spectacularly successful at generating consistent incremental
progress on the measured objectives, speech recognition error rate reduction, for instance. Such competitions are evidently effective at generating improvements on concrete engineering tasks over time. They
have, however, had the perverse effect of reducing
the diversity of approaches pursued and generally
increasing risk aversion among research projects.
An inducement prize contest for AI has the potential
to promote research on hard AI problems without the
frailties of these previous competitions. We, the AI
community, would like a competition to promote
creativity, reward risk, and curtail incrementalism.
This requires careful attention to the principles
underlying the competition, and it behooves us to
attend to history. We should look to previous successful inducement prize contests in other research
fields in choosing a task and competition structure
that obey the principles that made those competitions successful. These principles include the following: (1) The competition should be occasional, occurring only when plausible entrants exist. (2) The
awarding process should be flexible, so awards follow
the spirit of the competition rather than the letter of
the rules. (3) The results should be transparent, so that
any award is given only for systems that are open and
replicable in all aspects. (4) The criteria for success
should be based on absolute milestones, not relative
progress. (5) The milestones should be reasonable,
that is, not so far beyond current capability that their
achievement is inconceivable in any reasonable time.
The first three of these principles concern the rules
of the contest, while the final two concern the task
being posed. I discuss them seriatim, dispensing
quickly with the rule-oriented principles to concentrate on the more substantive and crucial task-related ones.
The competition should be occasional, occurring only
when plausible entrants exist.
The frequency of testing entrants should be determined by the availability of plausible entrants, not
by an artificially mandated schedule. Once one stipulates that a competition must be run, say, every
year, one is stuck with the prospect of awarding a
winner whether any qualitative progress has been
made or not, essentially forcing a quantitative incremental notion of progress that leads to the problems
of incrementalism noted above.
Successful inducement prize contests are structured so that actual tests of entrants occur only when
an entrant has demonstrated a plausible chance of
accomplishing the qualitative criterion. The current
Kremer Prize (the 1988 Kremer International
Marathon Competition) stipulates that it is run only
when an entrant officially applies to make an
attempt under observation by the committee. Even
then, any successful attempt must be ratified by the
committee based on extensive documentation provided by the entrant. Presumably to eliminate frivolous entries, entrants are subject to a nominal fee of
£100, as well as the costs to the committee of observing the attempt (The Royal Aeronautical Society 1988).
This principle is closely connected to the task-related principle of absoluteness, which will be discussed a little later.
The awarding process should be flexible, so awards follow the spirit of the competition rather than the letter
of the rules.
The goal of an inducement prize contest is to generate real qualitative progress. Any statement of evaluative criteria is a means to that end, not the end in
itself. It is therefore useful to include in the process
flexibility in the criteria, to make sure that the spirit,
and not the letter, of the law are followed. For
instance, the DARPA Grand Challenge allowed for
disqualifying entries “that cannot demonstrate intelligent autonomous behavior” (Schroeder [2004], p.
14). Such flexibility in determining when evaluation
of an entrant is appropriate and successful allows useful wiggle room to drop frivolous attempts or gaming
of the rules. For this reason, the 1714 Longitude Prize
placed awarding of the prize in the hands of an illustrious committee chaired by Isaac Newton, Lucasian
Professor of Mathematics. Similarly, the Kremer Prize
places “interpretation of these Regulations and Conditions . . . with the Society’s Council on the recommendation of the Organisers” (The Royal Aeronautical Society [1988], p. 6).
The results should be transparent, so that any award is
given only for systems that are open and replicable in
all aspects.
The goal of establishing an inducement prize in AI
is to expand knowledge for the public good. We
therefore ought to require entrants (not to mention
awardees) to make available sufficient information to
allow replication of their awarded event: open-source
code and any required data, open access to all documentation. It may even be useful for any award to
await an independent party replicating and verifying
the award. There should be no award for secret systems.
The downside of requiring openness is that potential participants may worry that their participation
could poison the market for their technological
breakthroughs, and therefore they would avoid participation. But to the extent that potential participants believe that there is a large market for their satisfying the award criteria, there is no reason to
motivate them with the award in the first place.
The criteria for success should be based on absolute
milestones, not relative progress.
Any competition should be based on absolute
rather than relative criteria. The criterion for awarding the prize should be the satisfaction of specific
milestones rather than mere improvement on some
figure of merit. For example, the 1714 Longitude Act
established three separate awards based on specific milestones:
That the first author or authors, discover or discoverers of any such method, his or their executors, administrators, or assigns, shall be entitled to, and have such
reward as herein after is mentioned; that is to say, to a
reward or sum of ten thousand pounds, if it determines the said longitude to one degree of a great circle, or sixty geographical miles; to fifteen thousand
pounds if it determines the same to two thirds of that
distance; and to twenty thousand pounds, if it determines the same to one half of that same distance.
(British Parliament 1714)
Aviation and aeronautical prizes specify milestones
as well. The Orteig prize, first offered in
1919, specified a transatlantic crossing
in a single airplane flight, achieved by
Charles Lindbergh in 1927. The Ansari
X Prize required a nongovernmental
organization to perform two launches
to 100 kilometers within two weeks of
a reusable manned spacecraft, a
requirement fulfilled by Burt Rutan’s
SpaceShipOne eight years after the
prize’s 1996 creation.
If a winner is awarded merely on the
basis of having the best current performance on some quantitative metric,
entrants will be motivated to incrementally outperform the previous best,
leading to “hill climbing.” This is
exactly the behavior we see in funder
bakeoffs. If the prevailing approach sits
in some mode of the research search
space with a local optimum, a strategy
of trying qualitatively different
approaches to find a region with a
markedly better local optimum is
unlikely to be rewarded with success
the following year. Prospective
entrants are thus given incentive to
work on incremental quantitative
progress, leading to reduced creativity
and low risk. We see this phenomenon
as well in the Loebner Competition;
some two decades of events have used
exactly the same techniques, essentially those of Weizenbaum’s (1966) Eliza
program. If, by contrast, a winner is
awarded only upon hitting a milestone
defined by a sufficiently large quantum of improvement, one that the
organizers believe requires a qualitatively different approach to the problem, local optimization ceases to be a
winning strategy, and examination of
new approaches becomes more likely
to be rewarded.
The milestones should be reasonable,
that is, not so far beyond current capability that their achievement is inconceivable in any reasonable time.
Although an absolute criterion
requiring qualitative advancement
provides incentive away from incrementalism, it runs the risk of driving
off participation if the criterion is too
difficult. We see this in the qualitative
part of the Loebner Prize Competition.
The competition rules specify that (in
addition to awarding the annual prize
to whichever computer entrant performs best on the quantitative score) a
gold medal would be awarded and the
competition discontinued if an entrant
passes a multimodal extension of the
Turing test. The task is so far beyond
current technology that it is safe to say
that this prize has incentivized no one.
Instead, the award criterion should
be beyond the state of the art, but not
so far that its achievement is inconceivable in any reasonable time. Here
again, successful inducement prizes are
revealing. The first Kremer prize specified a human-powered flight over a figure eight course of half a mile. It did
not specify a transatlantic flight, as the
Orteig Prize for powered flight did.
Such a milestone would have been
unreasonable. Frankly, it is the difficulty of designing a criterion that walks
the fine line between a qualitative
improvement unamenable to hill
climbing and a reasonable goal in the
foreseeable future that makes designing an inducement prize contest so
tricky. Yet without finding a
Goldilocks-satisfying test (not too
hard, not too easy, but just right), it is
not worth running a competition. The
notion of reasonableness is well captured by the XPRIZE Foundation’s target
of “audacious but achievable” (The
Economist 2015).
The reasonableness requirement
leads to a further consideration in
choosing tasks where performance is
measured on a quantitative scale. The
task must have headroom. Consider
again human-powered flight, measured against a metric of staying aloft
over a prescribed course for a given distance. Before the invention of the airplane, human-powered flight distances
would have been measured in feet,
using technologies like jumping, poles,
or springs. True human-powered flight
— at the level of flying animals like
birds and bats — is measured in distances that are, for all practical purposes, unlimited when compared to that
human performance. The task of
human-powered flight thus has plenty
of headroom. We can set a milestone
of 50 feet or half a mile, far less than
the ultimate goal of full flight, and still
expect to require qualitative progress
on human-powered flight.
By comparison, consider the task of
speech recognition as a test for intelligence. It has long been argued that
speech recognition is an AI-complete
task. Performance at human levels can
require arbitrary knowledge and reasoning abilities. The apocryphal story
about the sentence “It’s hard to wreck
a nice beach” (a near-homophone of “It's hard to recognize speech”) makes an important point: the speech signal underdetermines the correct transcription. Arbitrary knowledge and reasoning — real
intelligence — may be required in the
most subtle cases. It might be argued,
then, that we could use speech transcription error rate in an inducement
prize contest to promote breakthroughs in AI. The problem is that the
speech recognition task has very little
headroom. Although human-level performance may require intelligence,
near-human-level performance does
not. The difference in error rate
between human speech recognition
and computer speech recognition may
be only a few percentage points. Using
error rate is thus a fragile compass for
directing research.
Indeed, this requirement of reasonableness may be the hardest one to satisfy for challenges that incentivize
research that leads to machine intelligence. Traditionally, incentive prize
contests have aimed at breakthroughs
in functionality, but intelligence short
of human level is notoriously difficult
to define in terms of functionality; it
seems intrinsically intensional. Merely
requiring a particular level of performance on a particular functionality falls
afoul of what might be called Montaigne's misconception. Michel de Montaigne, in arguing for the intelligence of animals, notes the abilities of
individual animals at various tasks:
Take the swallows, when spring
returns; we can see them ferreting
through all the corners of our houses;
from a thousand places they select
one, finding it the most suitable place
to make their nests: is that done without judgement or discernment? . . .
Why does the spider make her web
denser in one place and slacker in
another, using this knot here and that
knot there, if she cannot reflect, think,
or reach conclusions?
We are perfectly able to realize how
superior they are to us in most of their
works and how weak our artistic skills
are when it comes to imitating them.
Our works are coarser, and yet we are
aware of the faculties we use to construct them: our souls use all their
powers when doing so. Why do we
not consider that the same applies to
animals? Why do we attribute to some
sort of slavish natural inclination
works that surpass all that we can do
by nature or by art? (de Montaigne
1987 [1576], 19–20)
Of course, an isolated ability does
not intelligence make. It is the generality of cognitive performance that we
attribute intelligence to. Montaigne
gives each type of animal credit for the
cognitive performances of all others.
Swallows build, but they do not weave.
Spiders weave, but they do not play
chess. People, our one uncontroversial
standard of intelligent being, do all of
these. Turing understood this point in
devising his test. He remarked that the
functionality on which his test is
based, verbal behavior, is “suitable for
introducing almost any one of the
fields of human endeavour that we
wish to include.” (Turing [1950]).
Any task based on an individual
functionality that does not allow
extrapolation to a sufficiently broad
range of additional functionalities is
not adequate as a basis for an inducement prize contest for AI, however useful the functionality happens to be.
(That is not to say that such a task
might not be appropriate for an
inducement prize contest for its own
sake.) There is tremendous variety in
the functionalities on which particular
computer programs surpass people,
many of which require and demonstrate intelligence in humans. Chess
programs play at the level of the most
elite human chess players, players who
rely on highly trained intelligence to
obtain their performance. Neural networks recognize faces at human levels
and far surpassing human speeds.
Computers can recognize spoken
words under noise conditions that
humans find baffling. But like Montaigne’s animals, each program excels
at only one kind of work. It is the generalizability of the Turing test task that
results in its testing not only a particular functionality, but the flexibility we
take to indicate intelligence. Furthermore, the intensional character of
intelligence, that the functionality be
provided “in the right way,” and not
by mere memorization or brute computation, is also best tested by examining the flexibility of behavior of the
subject under test.
It is a tall order to find a task that
allows us to generalize from performance on a single functionality to performance on a broad range of functionalities while, at the same time,
being not so far beyond current capability that its achievement is inconceivable in any reasonable time. It may
well be that there are no appropriate
prize tasks in the intersection of audacious and achievable.
Application of the Principles
How do various proposals for tasks fare
with respect to these principles? The
three principles of flexibility, occasionality, and transparency are properties
of the competition rules, not the competition task, so we can assume that an
enlightened organizing body would
establish them appropriately. But what
of the task properties — absoluteness
and reasonableness? For instance,
would it be reasonable to use that most
famous task for establishing intelligence in a machine, the Turing test, as
the basis for an inducement prize contest for AI?
The short answer is no. I am a big
fan of the Turing test. I believe, and
have argued in detail (Shieber 2007),
that it works exceptionally well as a
conceptual sufficient condition for
attributing intelligence to a machine,
which was, after all, its original purpose. However, just because it works as
a thought experiment addressing that
philosophical question does not mean
that it is appropriate as a concrete task
for a research competition.
As an absolute criterion, the test as
described by Turing is fine (though it
has never been correctly put in place in
any competition to date). But the Turing test is far too difficult to serve as
the basis of a competition. It fails the
reasonableness principle.3 Passing a
full-blown Turing test is so far beyond
the state of the art that it is as silly to
establish that criterion in an inducement prize competition as it is to
establish transatlantic human-powered
flight. It should go without saying that
watered-down versions of the Turing
test based on purely relative performance among entrants are a nonstarter.
The AI XPRIZE rules have not yet
been established, but the sample criteria that Chris Anderson has proposed
(XPRIZE Foundation 2014) also fail our
principles. The first part, presentation
of a TED Talk on one of a set of one
hundred predetermined topics, can be
satisfied by a “memorizing machine”
(Shieber 2014b) that has in its repertoire one hundred cached presentations. The second part, responding to
some questions put to it on the topic
of its presentation, is tantamount to a
Turing test, and therefore fails the reasonableness criterion.4
What about special cases of the Turing test, in which the form of the
queries presented to the subject under
test is more limited than open-ended
natural language communication, yet
still requires knowledge and reasoning
indicative of intelligence? The Winograd Schema Challenge (Levesque,
Davis, and Morgenstern 2012) is one
such proposal. The test involves determining pronoun reference in sentences of the sort first proposed by
Winograd (1972, p. 33): “The city
councilmen refused the demonstrators
a permit because they feared violence.”
Determining whether the referent of
they is the city councilmen or the
demonstrators requires not only a
grasp of the syntax and semantics of
the sentence but an understanding of
and reasoning about the bureaucratic
roles of governmental bodies and
social aims of activists. Presumably,
human-level performance on Winograd schema queries requires human-level intelligence. The problem with
the Winograd Schema Challenge may
well be a lack of headroom. It might be
the case that simple strategies could
yield performance quite close to (but
presumably not matching) human level. Such a state of affairs would make
the Winograd Schema Challenge problematic as a guide for directing
research toward machine intelligence.5
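To make the headroom worry concrete, consider how a knowledge-free baseline fares on a Winograd schema pair. The Python sketch below is purely illustrative; the data layout and function names are hypothetical, and only the example sentence itself comes from the challenge literature.

```python
# A Winograd schema pair (Levesque, Davis, and Morgenstern 2012): the
# special word flips the correct referent. The layout here is hypothetical.
schema = {
    "sentence": "The city councilmen refused the demonstrators a permit "
                "because they {verb} violence.",
    "candidates": ["the city councilmen", "the demonstrators"],
    "answers": {"feared": "the city councilmen",
                "advocated": "the demonstrators"},
}

def naive_baseline(item, verb):
    """A knowledge-free strategy: always pick the first-mentioned
    candidate. On a set balanced across schema pairs this scores about
    50 percent; the headroom worry above is that only slightly smarter
    statistical strategies might creep much closer to human level."""
    return item["candidates"][0]

for verb, gold in schema["answers"].items():
    guess = naive_baseline(schema, verb)
    print(schema["sentence"].format(verb=verb))
    print("  guess:", guess, "| correct:", guess == gold)
```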
Are there better proposals? I hope
so, though I fear there may not be any
combination of task domain and
award criterion that has the required
properties. Intelligence may be a phenomenon about which we know sufficiently little that substantial but reasonable goals elude us for the moment.
There is one plausible alternative, however. We might wait on establishing an
AI inducement prize contest until such
time as the passing of the Turing test
itself seems audacious but achievable.
That day might be quite some time away.
I am indebted to Barbara Grosz and
Todd Zickler for helpful discussions on
the subject of this article, as well as the
participants in the AAAI “Beyond the
Turing Test” workshop in January 2015
for their thoughtful comments.
Notes
2. The XPRIZE Foundation, in cooperation with TED, announced on March 20, 2014, the intention to establish the AI XPRIZE presented by TED, described as “a modern-day Turing test to be awarded to the first A.I. to walk or roll out on stage and present a TED Talk so compelling that it commands a standing ovation from you, the audience” (XPRIZE Foundation 2014). The competition has yet to be finalized, however.
3. As an aside, it is unnecessary, and therefore counterproductive, to propose tasks that are strict supersets of the Turing test for a prize competition. For instance, tasks that extend the Turing test by requiring nontextual inputs to be handled as well — audition or vision, say — or nontextual behaviors to be generated — robotic manipulations of objects, for instance — complicate the task, making it even less reasonable than the Turing test itself already is.
4. Anderson proposes that the system answer only one or two questions, which may seem like a simplification of the task. But to the extent that it is, it can be criticized on the same grounds as other topic- and time-limited Turing tests (Shieber 1994).
5. There are practical issues with the Winograd Schema Challenge as well. Generating appropriate challenge sentences is a specialized and labor-intensive process that may not provide the number of examples required for operating an incentive prize contest.
References
British Parliament. 1714. An Act for Providing a Publick Reward for Such Person or Persons as Shall Discover the Longitude at Sea. London: John Baskett, Printer to the Queens most Excellent Majesty, and by the assigns of Thomas Newcomb and Henry Hills.
de Montaigne, M. 1987 [1576]. An Apology for Raymond Sebond. Translated and edited with an introduction and notes by M. A. Screech. New York: Viking Penguin.
The Economist. 2015. The X-Files. The Economist, Science and Technology section, 6 May.
The Guardian. 2014. Computer Simulating 13-Year-Old Boy Becomes First to Pass Turing Test. The Guardian, Monday, June 9.
Hayes, P., and Ford, K. 1995. Turing Test Considered Harmful. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence. San Francisco: Morgan Kaufmann Publishers.
Levesque, H. J.; Davis, E.; and Morgenstern, L. 2012. The Winograd Schema Challenge. In Proceedings of the 13th International Conference on Principles of Knowledge Representation and Reasoning, 552–561. Palo Alto, CA: AAAI Press.
Lindquist, C. 1991. Quest for Machines That Think. Computerworld.
Locke, S. 2014. Forget the Turing Test. This Is a Better Way to Measure Artificial Intelligence. Vox Technology, November 30.
Marcus, G. 2014. What Comes After the Turing Test? The New Yorker, June 9.
The Royal Aeronautical Society, Human Powered Flight Group. 1988. Human Powered Flight: Regulations and Conditions for the Kremer International Marathon Competition. Information Sheet, August 1988. London: The Royal Aeronautical Society.
Schroeder, A. 2004. The Application and Administration of Inducement Prizes in Technology. Technical Report IP-11-2004, Independence Institute, Golden, CO.
Shieber, S. M. 1994. Lessons from a Restricted Turing Test. Communications of the ACM 37(6): 70–78.
Shieber, S. M. 2007. The Turing Test as Interactive Proof. Noûs 41(4): 686–713.
Shieber, S. M. 2014a. No, the Turing Test Has Not Been Passed. The Occasional Pamphlet on Scholarly Communication, June 10.
Shieber, S. M. 2014b. There Can Be No Turing-Test-Passing Memorizing Machines. Philosophers’ Imprint 14(16): 1–13.
Turing, A. M. 1950. Computing Machinery and Intelligence. Mind 59(236): 433–460.
Vardi, M. Y. 2014. Would Turing Have Passed the Turing Test? Communications of the ACM 57(9): 5.
Weizenbaum, J. 1966. ELIZA — A Computer Program for the Study of Natural Language Communication between Man and Machine. Communications of the ACM 9(1): 36–45.
Winograd, T. 1972. Understanding Natural Language. Boston: Academic Press.
XPRIZE Foundation. 2014. A.I. XPRIZE Presented by TED. March 20. Los Angeles, CA: XPRIZE Foundation, Inc.
Stuart M. Shieber is the James O. Welch, Jr.,
and Virginia B. Welch professor of computer science in the School of Engineering and
Applied Sciences at Harvard University. His
research focuses on computational linguistics and natural language processing. He is a
fellow of AAAI and ACM, the founding
director of the Center for Research on Computation and Society, and a faculty codirector of the Berkman Center for Internet and Society.
WWTS (What Would Turing Say?)
Douglas B. Lenat
I Turing’s Imitation Game was a brilliant early proposal for a test of machine
intelligence — one that is still compelling today, despite the fact that in the
hindsight of all that we’ve learned in
the intervening 65 years we can see the
flaws in his original test. And our field
needs a good “Is it AI yet?” test more
than ever today, with so many of us
spending our research time looking
under the “shallow processing of big
data” lamppost. If Turing were alive
today, what sort of test might he propose?
WTDS (What Turing Did/Didn’t Say)
If you are reading these words, surely you are already familiar with the Imitation Game proposed by Alan Turing (1950).
Or are you?
Turing was heavily influenced by the World War II “game”
of allied and axis pilots and ground stations each trying to
fool the enemy into thinking they were friendlies. So his
imagined test for AI involved an interrogator being told that
he or she was about to interview a man and woman over a
teletype, both of whom would be pretending to be the
woman; the task was to guess which one was lying. If a
machine could fool interrogators as often as a typical man,
then one would have to conclude that that machine, as programmed, was as intelligent as a person (well, as intelligent
as men).1 As Judy Genova (1994) puts it, Turing’s originally proposed game involves not a question of species, but one of gender.
The current version, where the interrogator is told he or
she needs to distinguish a person from a machine, is (1) much more difficult for a program to pass, even though (2) almost all the added difficulties are largely irrelevant to intelligence!
And it’s possible to muddy the waters even more by some
programs appearing to do well at it due to various tricks, such
as having the interviewee program claim to be a 13-year-old
Ukrainian who doesn’t speak English well (University of
Reading 2014), and hence having all its wrong or bizarre
responses excused due to cultural, age, or language issues.
Going into more detail here about why the current version
of the Turing test is inadequate and distracting would be a
digression from my main point, so I’ve included that
discussion as a sidebar to this article.
Here, let it suffice for me to point out that one
improvement would be simply to go back to his originally proposed test, or some variant of it. I’m imagining here a game similar to the TV program To Tell
the Truth. Panelists (the interrogators) are told that
they are talking to three people who will all be claiming that some fact is true about them (for example,
they treat sick whales; they ate their brother’s bug
collection; and others) and that two of the people are
lying and one is telling the truth; their job is to ask
questions to pick out the truth teller.
In my imagined game, the interrogator is told he
or she will be interviewing three people online, all
claiming X, and her or his task is to pick out the one
truth teller. Then we measure whether our supposed
AI fools the interrogator at least as often as the
human “liars” are able to. Averaged over lots of interrogators, lots of claims, and lots of liars, this might be
an improvement over today’s Turing test.
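As a sketch of how such a measurement might be scored (a hypothetical harness and data format, not part of any actual competition), the pass criterion could be tallied like this in Python:

```python
def passes_imitation_variant(judgments):
    """Score the To-Tell-the-Truth variant described above. `judgments`
    is a hypothetical log of (liar_kind, fooled) pairs, where liar_kind
    is "ai" or "human" and fooled is True when the interrogator picked
    that liar as the truth teller. The AI passes if, averaged over many
    interrogators, claims, and liars, it fools interrogators at least
    as often as the human liars do."""
    def rate(kind):
        outcomes = [fooled for k, fooled in judgments if k == kind]
        return sum(outcomes) / len(outcomes) if outcomes else 0.0
    return rate("ai") >= rate("human")

# Toy log: the AI fooled two of three interrogators; the human liars
# fooled one of three.
log = [("ai", True), ("ai", True), ("ai", False),
       ("human", True), ("human", False), ("human", False)]
print(passes_imitation_variant(log))  # True
```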
Does that go far enough? It still smacks of a challenge one might craft for a magician. I can imagine
programs doing well at that task through tricks, but
then clearly (through subsequent failed attempts to
apply them) revealing themselves not to be generally intelligent after all. So let’s rethink the test from
the top down.
WTMS (What Turing Might Say)
So what might Turing say today, if he were alive to
propose a new test for machine intelligence? He was
able to state the original test in one paragraph; he
might first try to find an equally terse and compelling
modern version.
Mathematics revolutionized physics in the late
nineteenth and early twentieth centuries, and “softer” sciences like psychology and sociology and AI
have been yearning not to be left behind. That type
of physics envy has all too often led to premature formalization, holding back progress in AI at least as
much as helping it. To quote economist Robert Heilbroner, “Mathematics has given economics rigor, but
alas, also mortis.”
I don’t quite have enough presumption to claim
that Turing would come up with the same test that
I’m about to discuss, but I do believe that he’d recoil
a bit at some of the tricks-based chatbots crafted in
his name, and think twice before tossing off a new
glib two-sentence-long test for AI.
My test, like his original Imitation Game, is one for
recognizing AI when it’s here. Instead of focusing on
one computer program being examined for intelligence, what matters is that human beings synergizing with the AI exhibit what from our 2016 point of
view would be superhuman intelligence.
The way to test for that, in turn, will be to look for
the many and dramatic impacts that state of affairs
would have on us, on our personal and professional
lives, and on the way that various aspects of society
and economy work. Some of the following are no
doubt wrong, and will seem naïve and even humorous 65 years from now, but I’d be genuinely surprised3 if real AI — from now on let’s just call that RAI
— didn’t engender most of the following.
Almost everyone has a cradle-to-grave general personal digital assistant (PDA) application that builds up an integrated model of the person’s preferences, abilities,
interests, modes of learning, idiosyncratic use of
terms and expressions, experiences (to analogize to),
goals, plans, beliefs. Siri and Cortana are indicators
of how much demand there is for such PDAs. The real
test for this having “arrived” will be not just its universal adoption but metalevel phenomena including
legislation surrounding privacy and access by law
enforcement; the rise of standards, and of applications using those standards, to broker communication between multiple individuals’ PDAs; and marketing directed at the PDAs that will be making most
of the mundane purchasing decisions in their ratava’s (the inverse of “avatar”) life.
The popularity of massive open online courses
(MOOCs) and the Khan Academy are early indicators
of how much demand there is even for non-AI-based
education courseware. When AI is here, we will see
widespread individualized (using — and feeding back
to — one’s PDA) education to the point where in
effect everyone is home schooled, “schools” continuing to exist in some form to meet the infrastructure,
extracurricular, and social needs of the students. We may also see a return to something like the monitorial system, where much of the student’s time is spent emulating not so much a sponge (trying to absorb concepts and skills, as is true today) as a teacher, a tutor,
since — I think we’ve all experienced this — we often
really understand something only after we’ve had to
teach or explain it to someone else. In this case, the
human (let’s refer to her or him as the tutor) will be
tutoring one or more tutees who will likely be AIs,
not other human beings. Those “tutee” AIs will be
constantly assessing the tutor and deciding what mistakes to make, what confusions to have, what apparent learning (and forgetting) to exhibit, based on
what will best serve that tutor pedagogically, what
will be motivated by situations in that person’s real
life (teaching you new things in situations where
they would be useful and timely for you to know),
based on the AI reasoning about what will be fun and
entertaining to the person, and similar concerns that
in effect blur the boundaries of what education is,
compared with today.
Health Care
The previous two impacts ripple over to this — your
PDA watching out for you and helping you become a
more accurately and more fully informed consumer
of health-care products and services, calling attention
to things in ways and at times that will make a difference in your life. From the other direction,
though, RAI will enable much more individualized
diagnosis and treatment; for an early step along that
line, see DARPA’s Big Mechanism project, which has
just begun, whose goal is to use AI to read and integrate large amounts of cancer research literature,
which (coupled with patient-specific information)
will enable plausible hypotheses to be formed about
the pathways that your cancer is taking to grow and
metastasize, and plausible treatments that might
only be effective or even safe for you and a tiny sliver of other individuals. RAI (coupled with robotics
only slightly more advanced than the current state of
the art) will also revolutionize elderly care, given
almost limitless patience, ability to recognize what
their “patient”/companion is and isn’t doing (for
example, exercise-wise), and so on. This will later
spread to nursing care for wider populations of
patients. I fear that extending this all the way to child
and infant care will be one of the last applications of
AI in health care due to the public’s and the media’s
intolerance of error in that activity.
Commerce is currently based on atoms (goods), services
involving atoms, and information treated as a commodity. The creation and curation of knowledge is,
by contrast, done for free — given away in return for
your exposure to online advertising and as a gateway to other products and services. I believe that
RAI will change that, profoundly, and that people
will not hesitate to be charged some tiny amount (a
penny, let’s say) for each useful alert, useful answer,
useful suggestion. That in turn will fuel a knowledge
economy in which contributors of knowledge are
compensated in micropayment shares of that penny. Once this engine is jump-started, widespread
vocation and avocation as knowledge contributors
will become the norm. Some individuals will want
and will receive the other sort of credit (citation
credit) in addition or instead of monetary credit,
possibly pseudonymously. Moreover, as we increase
our trust in our PDA (above), it will be delegated
increasing decision-making and spending authority;
the old practice of items being sent to individuals
“on approval” will return and human attention
being paid to shopping may be relegated to hobby
status, much as papermaking or home gardening
today. Advertising will have to evolve or die, once
consumers are better educated and increasingly the
buying decisions are being made by their PDAs anyway. And ever-improving translation and (not using
AI particularly) three-dimensional printing technologies will make the consumer’s uncorrected
physical location almost as unimportant as his or
her uncorrected vision is today.
The flip side of the impact of AI on the economy is
that a very small fraction of the population will be
needed to grow the world’s food and produce the
world’s goods, as robots reliably amplify the ability of
a relatively few people to meet that worldwide
demand. This will lead to something that many critics will no doubt label universal socialism in their then
vastly greater free time.
Democracy and Government
RAI will probably have a dramatic effect in this area,
pummeling the status quo of these institutions from
multiple directions: for example, more effective education will result in a voting public better able to perform critical thinking and to detect and correct for
attempts at manipulation and at revising history.
Lawmakers and the public will be able to generate
populations of plausible scenarios that enable them
to better assess alternative proposed policies and
courses of action. Fraud and malfeasance will become
more and more difficult to carry out, with multiple
independent AI watchdogs always awake and alert.
Government functions currently drowning in red
tape, due to attempts to be frugal through standardization, may be catalyzed or even automated by RAI,
which can afford to — which will inevitably — know
and treat everyone as an individual.
Our Personal Experience
By this I mean to include various sorts of phenomena that will go from unheard of to ubiquitous once
RAI arrives. These include the following.
Weak Telepathy
You formulate an intent, and have barely started to
make a gesture to act on it, when the AI understands
what you have in mind and why, and completes that
action (or a better one that accomplishes your actual
goal) for you; think of an old married couple finishing each other’s sentences, raised to the nth power.
This isn’t of course real telepathy — hence the word
weak — but functionally is almost indistinguishable
from it.
Weak Immortality
Your PDA’s cradle-to-grave model of you is good
enough that, even after your death, it can continue
to interact with loved ones, friends, and business associates, carry on conversations, carry out assigned tasks,
and others; eventually this will be almost as though
you never died (well, to everyone except you, of
course, hence the word weak).
The Current Turing Test Is Hard in Ways Both Unintended and Irrelevant
At AAAI 2006, I went through
this at length (Lenat 2008),
but the gist is that Turing’s
game had a human interrogator talking through a teletype
with a man and a woman,
both pretending that they
were the woman. The experimenter measures what percentage of the time the average interrogator is wrong —
identifies the wrong interviewee as being the woman. Turing’s proposed test, then, was
to see if a computer could be
programmed to fool the interrogator (who was still told
that they were talking to a
human man and a human
woman!) into guessing incorrectly about which interrogatee was the woman at least as
often as men were able to fool
the interrogator. One could
argue then that such a computer, as programmed, was
intelligent. Well, at least as
intelligent the typical human
Why is the revised genderneutral version harder to pass
and less reflective of human
intelligence? If the interrogator is told that the task is to
distinguish a computer from a
person, then they can draw
on his or her array of facts,
experiences, visual and aural
and olfactory and tactile capabilities, current events and
history, expectations about
how accurately and completely the average person remembers Shakespeare, and so on,
to ask things they never
would have asked under Turing’s original test, when they
thought they were trying to
distinguish a human man
from a human woman
through a teletype.
Our vast storehouse of
common sense also makes it
more difficult to pass the
“neutered” Turing test than
the original version. Every
time we see or hear a sentence
with a pronoun, or an
ambiguous word, we draw on
that reservoir to decode what
the author or speaker encoded
into that shorthand. Most of
the examples I’ve used in my
talks and articles for the last
40 years (such as disambiguating the word pen in “the box
is in the pen” versus “the pen
is in the box”) have been borrowed and reborrowed from
Bar-Hillel, Chomsky, Schank,
Winograd, Woods, and — surprisingly often and effectively
— from Burns and Allen.
Almost all of these disambiguatings are gender neutral
— men perform them about
as well as women perform
them — hence they simply
wouldn’t come up or figure
into the original Turing test,
only the modern, neutered version.
The previous two paragraphs listed various ways in
which the gender-neutral Turing test is made vastly more
difficult because of human
beings’ gender-independent
general knowledge and reasoning capabilities. The next
few paragraphs list a few ways
in which the gender-neutral
Turing test is made more difficult because of gender-independent human foibles and biases.
Human beings exhibit
dozens of translogical behaviors: illogical but predictable
wrong decisions that most
people make, incorrect but
predictable wrong answers to
queries. Since they are so predictable, an interrogator in
today’s “neutered” Turing test
could use these to separate
human from nonhuman
interrogatees, since that’s
what they are told their task
is. As I said in 2008 (Lenat
2008): “Some of these are very
obvious and heavy-handed,
hence uninteresting, but still
work a surprising fraction of
the time — ‘work’ meaning,
here, to enable the interrogator instantly to unmask many
of the programs entered into a
Turing test competition as
programs and not human
beings: slow and errorful typing; 7 +/– 2 short-term memory size; forgetting (for example, what day of the week was
April 7, 1996? What day of
the week was yesterday?);
wrong answers to math problems (some wrong answers
being more ‘human’ than
others: 93 – 25 = 78 is more
understandable than if the
program pretends to get a
wrong answer of 0 or –9998
for that subtraction problem.
[Brown and van Lehn 1980]).
… Asked to decide which is
more likely, ‘Fred S. just got
lung cancer.’ or ‘Fred S.
smokes and just got lung cancer,’ most people say the latter. People worry more about
dying in a hijacked flight than
the drive to the airport. They
see the ‘face’ on Mars. They
hold onto a losing stock too
long because of ego. If a
choice is presented in terms of
rewards, they opt for a different alternative than if it’s presented in terms of risks. They
are swayed by ads.”
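The wrong answer 78 in that quotation is recognizable as a classic “repair” from the theory cited there: the student borrows into the units column but forgets to decrement the tens. The Python below is a speculative reconstruction for illustration, not code from Brown and VanLehn; it assumes nonnegative inputs with the minuend at least as large as the subtrahend.

```python
def borrow_no_decrement(minuend, subtrahend):
    """Column-by-column subtraction with one bug: when a column needs a
    borrow, add 10 to the top digit but never decrement the next column.
    For 93 - 25 this produces the 'human' wrong answer 78 (correct: 68)."""
    result, place = 0, 1
    while minuend or subtrahend:
        a, b = minuend % 10, subtrahend % 10
        if a < b:
            a += 10  # borrow ...
            # ... but the next column is never decremented (the bug)
        result += (a - b) * place
        minuend //= 10
        subtrahend //= 10
        place *= 10
    return result

print(borrow_no_decrement(93, 25))  # 78
```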
When faced with a difficult
decision, human beings often
select the alternative of inaction — if it is available to
them — rather than action.
One example of this is the
startling statistic that in those
European countries that ask
driver’s license applicants to
“check this box to opt in” to
organ donation, there is only
a 15 percent enrollment,
whereas in neighboring, culturally similar countries
where the form says “check
this box to opt out” there is
an 85 percent organ donor
enrollment. That is, 85 percent don’t check the box no
matter what it says! This isn’t
because this decision is
beneath their notice, quite
the contrary: they care very
deeply about the issue, but
they are ambivalent, and thus
their reaction is to make the
choice that doesn’t require
them to do anything, not
even check a box on a piece of
paper. Another, even more
tragic, example of this “omission bias” (Ritov and Baron
1990) involves American parents’ widespread reluctance to
have their children vaccinated.
For more examples of these
sorts of irrational yet predictable human behaviors,
see, for example, Tversky and
Kahneman (1983).
As an exercise, imagine that
an extraterrestrial lands in
Austin, Texas, and wants to
find out how Microsoft Word
works, the program I am currently running as I type these
words. The alien carefully
measures the cooling fan air
outflow rate and temperature,
and the disk-seeking sounds
that my computer makes as I
type these words, and then
spends 65 years trying to
mimic those air-heatings and
clicking noises so precisely
that no one can distinguish
them from the sounds my
Dell PC is making right now.
Absurd! Pathetic! But isn’t
that in effect what the
“neutered” Turing test proponents are requiring we do,
requiring that our program do
if it is to be adjudged to pass
their test? Are we really so
self-enthralled that we think
it’s wise to spend our precious
collective AI research time
getting programs to mimic
the latency delays, error rates,
limited short-term memory
size, omission bias, and others, of human beings? Those
aren’t likely to be intimately
tied up with intelligence, but
rather just unfortunate artifacts of the platform on which
human intelligence runs.
They are about as relevant to
intelligence as my Dell PC’s
cooling fan and disk noises
are to understanding how
Microsoft Word works.
Weak “Cloning”
The quotation marks refer to the science-fiction type of duplication of you instantly as you are now, able to be in several places at once, attending to several things at once, with your one “real” biological consciousness and (through VR) awareness flitting to whichever of your simulated selves needs you the most at that moment.
Arbitrarily Augmented Reality
This includes real-time correction for what is being
said around and to you, so almost no one ever mishears or misunderstands any more. It includes superimposing useful details onto what you see, so you
have the equivalent of X-ray and telescopic vision,
and the sort of “important objects glow” effects seen
in video games, paths of glowing particles to guide
you, reformulation of objects you’d prefer to see differently (but with physical boundaries and edges preserved for safety).
Better-Than-Life Games and Entertainment
This is of course potentially dangerous and addictive,
and — like many of the above predicted indicators —
may herald very serious brand new problems, not
just solutions to old ones.4
I’ll close here, on that cautionary note. My purpose is
not to provide answers, or even make predictions
(though I seem to have done that), but rather to stimulate discussion about how we’ll know when RAI has
arrived: not through some Turing test Mark II but
because the world will change almost overnight if or
when superhuman aliens arrive — and real AI making
its appearance is likely to be the one and only time
that happens.
The following individuals provided comments and
suggestions that have helped make this article more
accurate and on-point, but they should not be held
accountable for any remaining inaccuracies, omissions, commissions, or inflammations: Paul Cohen,
Ed Feigenbaum, Elaine Kant, Elaine Rich, and Mary
Notes
1. Creepily, many people today in effect play this game
online every day: men trying to “crash” women-only chats
and forums, pedophiles pretending to be 10 year olds,
MMO players lying about their gender or age, and others.
2. There remains some ambiguity (given his dialogue examples) about what Turing was proposing. But there is no
ambiguity in the fact that the gender-neutral version is how
the world came to recall what Turing wrote, by the time of
the 1956 Dartmouth AI Summer Project, and ever since.
3. Alan Kay says that the best way to predict the future is to
invent it. In that sense, these “predictions” could be recast
as challenge problems for AI, a point of view consonant
with Feigenbaum (2003) and Cohen (2006).
4. For example, while most of us will use AI to help us see
multiple sides of an issue, to see reality more accurately and
completely, AI could also be used for the opposite purpose,
to filter out parts of the world that disagree with how we
want to believe it to be.
5. He then gives some dialogue examples that make his
intent somewhat ambiguous, but after that he returns to his
main point about the computer pretending to be a man;
and then discusses various possible objections to a computer ever being considered intelligent.
References
Brown, J. S., and VanLehn, K. 1980. Repair Theory: A Generative Theory of Bugs in Procedural Skills. Cognitive Science
4(4): 379–426.
Cohen, P. 2006. If Not Turing’s Test, Then What? AI Magazine 26(4): 61–67.
Feigenbaum, E. A. 2003. Some Challenges and Grand Challenges for Computational Intelligence. Journal of the Association for Computing Machinery 50(1): 32–40.
Genova, J. 1994. Turing’s Sexual Guessing Game. Social Epistemology 8(4): 313–326.
Lenat, D. B. 2008. The Voice of the Turtle: Whatever Happened to AI? AI Magazine 29(2): 11–22.
Ritov, I., and Baron, J. 1990. Reluctance to Vaccinate: Omission Bias and Ambiguity. Journal of Behavioral Decision Making 3(4): 263–277.
Turing, A. M. 1950. Computing Machinery and Intelligence.
Mind 59(236): 433–460.
Tversky, A., and Kahneman, D. 1983. Extensional Versus
Intuitive Reasoning: The Conjunction Fallacy in Probability Judgment. Psychological Review 90(4): 293–315.
University of Reading. 2014. Turing Test Success Marks Milestone in Computing History. Press Release, June 8, 2014.
Communications Office, University of Reading, Reading, UK.
Doug Lenat, a prolific author and pioneer in artificial intelligence, focuses on applying large amounts of structured
knowledge to information management tasks. As the head
of Cycorp, Lenat leads groundbreaking research in software
technologies, including the formalization of common
sense, the semantic integration of — and efficient inference
over — massive information sources, the use of explicit contexts to represent and reason with inconsistent knowledge,
and the use of existing structured knowledge to guide and
strengthen the results of automated information extraction
from unstructured sources. He has worked in diverse parts
of AI — natural language understanding and generation,
automatic program synthesis, expert systems, machine
learning, and so on — for more than 40 years now. His 1976
Stanford Ph.D. dissertation, AM, demonstrated that creative
discoveries in mathematics could be produced by a computer program (a theorem proposer, rather than a theorem
prover) guided by a corpus of hundreds of heuristic rules for
deciding which experiments to perform and judging “interestingness” of their outcomes. That work earned him the
IJCAI Computers and Thought Award and sparked a renaissance in machine-learning research. Lenat was on the computer science faculties at Carnegie Mellon University and
Stanford, was one of the founders of Teknowledge, and was
in the first batch of AAAI Fellows. He worked with Bill Gates
and Nathan Myhrvold to launch Microsoft Research Labs,
and to this day he remains the only person to have served
on the technical advisory boards of both Apple and
Microsoft. He is on the technical advisory board of TTI Vanguard, and his interest and experience in national security
has led him to regularly consult for several U.S. agencies and
the White House.
Competition Reports
Summary Report of the
First International Competition
on Computational Models
of Argumentation
Matthias Thimm, Serena Villata, Federico Cerutti,
Nir Oren, Hannes Strass, Mauro Vallati
I We review the First International
Competition on Computational Models
of Argumentation (ICCMA’15). The
competition evaluated submitted
solvers’ performance on four different
computational tasks related to solving
abstract argumentation frameworks.
Each task evaluated solvers in ways
that pushed the edge of existing performance by introducing new challenges. Despite this being the first competition in the area, the high number of competitors that entered, and the differences in their results, suggest that the competition will
help shape the landscape of ongoing
developments in argumentation theory.
Computational models of argumentation form an active research discipline within artificial intelligence that
has grown since the beginning of the 1990s (Dung
1995). While the field is still young compared to areas such as SAT solving and logic programming, the argumentation
community is very active, with a conference series (COMMA,
which began in 2006) and a variety of workshops and special
issues of journals. Argumentation has also worked its way
into a variety of applications. For example, Williams et al.
(2015) described how argumentation techniques are used for
recommending cancer treatments, while Toniolo et al. (2015)
detail how argumentation-based techniques can support critical thinking and collaborative scientific inquiry or intelligence analysis.
Many of the problems that argumentation deals with are
computationally difficult, and applications utilizing argumentation therefore require efficient solvers. To encourage
this line of research, we organised the First International
Competition on Computational Models of Argumentation
(ICCMA), with the intention of assessing and promoting
state-of-the-art solvers for abstract argumentation problems,
and to identify families of challenging benchmarks for such solvers.
The objective of ICCMA’15 is to allow researchers
to compare the performance of different solvers systematically on common benchmarks and under common rules.
Moreover, as witnessed by competitions in other AI
disciplines such as planning and SAT solving, we see
ICCMA as a new pillar of the community, which provides information and insights on the current state
of the art and highlights future challenges and developments.
This report summarizes the first ICCMA held in
2015 (ICCMA’15). In this competition, solvers were
invited to address standard decision and enumeration problems of abstract argumentation frameworks
(Dunne and Wooldridge 2009). Solvers’ performance
was evaluated based on the time taken to provide a
correct solution for a problem; incorrect results were
discarded. More information about the competition,
including complete results and benchmarks, can be
found on the ICCMA website.1
In abstract argumentation (Dung 1995), a directed
graph (A, R) is used as knowledge representation formalism, where the set of nodes A are identified with
the arguments under consideration and R represents
a conflict-relation between arguments, that is, aRb for
a, b ∈ A if a is a counterargument for b. The framework is abstract because the content of the arguments
is left unspecified. They could, for example, consist of
a chain of logical deductions from logic programming with defeasible rules (Simari 1992); a proof for
a theorem in classical logic (Besnard and Hunter
2007); or an informal presumptive reason in favour
of some conclusion (Walton, Reed, and Macagno
2008). The notion of conflict then depends on the
chosen formalization. Irrespective of the precise formalization used, one can identify a subset of arguments that can be collectively accepted given interargument conflicts. Such a subset is referred to as an
extension, and Dung (1995) defined four commonly
used argumentation semantics — namely the complete (CO), preferred (PR), grounded (GR), and stable
(ST) semantics — each of which defines an extension
differently. More precisely, a complete extension is a
set of arguments that do not attack each other,2 and
in which arguments defend each other; a preferred
extension is a maximal (with regard to set inclusion)
complete extension; the grounded extension is the
minimal (with regard to set inclusion) complete
extension; and a stable extension is a complete
extension such that each argument not in the extension is attacked by at least one argument within the extension.
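For concreteness, the grounded semantics admits a simple fixed-point computation. The Python sketch below is illustrative only, with toy data structures chosen for exposition rather than anything resembling an ICCMA solver:

```python
from typing import Set, Tuple

def grounded_extension(args: Set[str],
                       attacks: Set[Tuple[str, str]]) -> Set[str]:
    """Compute the grounded (GR) extension of a framework (A, R) as the
    least fixed point of F(S) = {a : every attacker of a is attacked by
    some member of S}. Unattacked arguments enter first."""
    attackers = {a: {b for (b, c) in attacks if c == a} for a in args}
    s: Set[str] = set()
    while True:
        defended = {a for a in args
                    if all(any((c, b) in attacks for c in s)
                           for b in attackers[a])}
        if defended == s:
            return s
        s = defended

# Toy framework: a attacks b, b attacks c.
# a is unattacked; c is defended by a; so GR = {a, c}.
print(grounded_extension({"a", "b", "c"}, {("a", "b"), ("b", "c")}))
```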
The competition was organized around four computational tasks of abstract argumentation: (1) Given
an abstract argumentation framework, determine
some extension (SE). (2) Given an abstract argumentation framework, determine all extensions (EE). (3)
Given an abstract argumentation framework and
some argument, decide whether the given argument
is contained in some extension (DC). (4) Given an
abstract argumentation framework and some argument, decide whether the given argument is contained in all extensions (DS).
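To illustrate the tasks, here is a deliberately naive Python sketch of EE and DC under the stable semantics; its exponential subset enumeration is exactly the kind of approach the harder benchmark instances described below were designed to defeat:

```python
from itertools import chain, combinations

def stable_extensions(args, attacks):
    """Brute-force task EE under the stable (ST) semantics: yield every
    conflict-free set that attacks all arguments outside it.
    Exponential in |args|."""
    arglist = sorted(args)
    for subset in chain.from_iterable(
            combinations(arglist, k) for k in range(len(arglist) + 1)):
        s = set(subset)
        conflict_free = not any((a, b) in attacks for a in s for b in s)
        attacks_rest = all(any((a, b) in attacks for a in s)
                           for b in args - s)
        if conflict_free and attacks_rest:
            yield s

framework = ({"a", "b", "c"}, {("a", "b"), ("b", "c")})
exts = list(stable_extensions(*framework))
print(exts)                           # [{'a', 'c'}]
# Task DC-ST: is argument "c" in some stable extension?
print(any("c" in e for e in exts))    # True
```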
Combining these four different tasks with the four
semantics discussed above yields a total of 16 tracks
that constituted ICCMA’15. Each submitted solver
was free to support any number of these tracks.
The competition received 18 solvers from research
groups in Austria, China, Cyprus, Finland, France,
Germany, Italy, Romania, and the UK, of which 8 were
submitted to all tracks. The solvers used a variety of
approaches and programming languages to solve the
competition tasks. In particular, 5 solvers were based
on transformations of argumentation problems to
SAT, 3 on transformations to ASP, 2 on CSP, and 8
were built on tailor-made algorithms. Seven solvers
were implemented in C/C++, 4 in Java, 2 used shell scripts for translations to other formalisms, and the
remaining solvers were implemented in Haskell, Lisp,
Prolog, Python, and Go.
All participants were required to submit the source
code of their solver, which was made freely available
after the competition, to foster independent evaluation and exploitation in research or real-world scenarios, and to allow for further refinements. Submitted solvers were required to support the probo
(Cerutti et al. 2014)3 command-line interface, which
was specifically designed for running and comparing
solvers within ICCMA.
Performance Evaluation
Each solver was evaluated over N different argumentation graph instances within each track (N = 192 for
SE and EE, and 576 for DC and DS). Instances were
generated with the intention of being challenging —
one group of instances was generated so as to contain
a large grounded extension and few extensions in the
other semantics. This group’s graphs were large (1224
to 9473 arguments), and challenged solvers that
scaled poorly (that is, those that used combinatorial
approaches for computing extensions). A second
group of instances was smaller (141 to 400 arguments), but had a rich structure of stable, preferred,
and complete extensions (up to 159 complete extensions for the largest graphs) and thus provided combinatorial challenges for solvers relying on simple
search-based algorithms. A final group contained
medium-sized graphs (185 to 996 arguments) and
featured many strongly connected components with
many extensions. This group was particularly challenging for solvers not able to decompose the graph
into smaller components.
Each solver was given 10 minutes to solve an
instance. For each instance solved correctly within the time limit, the solver received one point, and a ranking
for each track was obtained based on points scored
on all its instances. Ties were broken by considering
total run time on all instances. Additionally, a global
ranking of the solvers across all tracks was generated
by computing the Borda count of all solvers in all tracks.
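As a sketch of that aggregation (the report does not spell out the exact point scheme, so the scoring below is a standard Borda variant with hypothetical inputs, not necessarily the one used):

```python
def borda_global_ranking(track_rankings):
    """Aggregate per-track rankings into a global ranking via Borda
    count: in a track with n solvers, the solver in place i (0-based,
    best first) earns n - 1 - i points. Ties here fall back on
    insertion order; ICCMA's own tie-breaking is not modeled."""
    scores = {}
    for ranking in track_rankings:      # each ranking: best-to-worst list
        n = len(ranking)
        for place, solver in enumerate(ranking):
            scores[solver] = scores.get(solver, 0) + (n - 1 - place)
    return sorted(scores, key=scores.get, reverse=True)

# Two toy tracks with three solvers each:
print(borda_global_ranking([["CoQuiAAS", "ArgSemSAT", "LabSATSolver"],
                            ["ArgSemSAT", "CoQuiAAS", "LabSATSolver"]]))
```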
Results and Concluding Remarks
The obtained rankings for all 16 tracks can be found
on the competition website.4 The global ranking
identified the following top three solvers: (1)
CoQuiAAS, (2) ArgSemSAT, and (3) LabSATSolver.
Another solver, Cegartix, participated in only three
tracks (SE-PR, EE-PR, DS-PR), but came top in all of
these. It is interesting to note that these four solvers
are based on SAT-solving techniques. Additionally, an
answer set programming–based solver (ASPARTIX-D)
came first in the four tracks related to the stable
semantics; there is a strong relationship between
these semantics and the answer set semantics, which
probably explains its strength in these tracks. Information on the solvers and their authors can also be
found on the home page of the competition.
Given the success of the competition, a second iteration will take place in 2017 with an extended number of tracks.
Notes
2. S ⊆ A defends a if for every b ∈ A with bRa there is some c ∈ S such that cRb, that is, all attackers of a are counterattacked by S.
3. See also F. Cerutti, N. Oren, H. Strass, M. Thimm, and M. Vallati. 2015. The First International Competition on Computational Models of Argumentation (ICCMA’15): Supplementary Notes on probo.
References
Besnard, P., and Hunter, A. 2007. Elements of Argumentation.
Cambridge, MA: The MIT Press.
Cerutti, F.; Oren, N.; Strass, H.; Thimm, M.; and Vallati, M.
2014. A Benchmark Framework for a Computational Argumentation Competition. In Proceedings of the 5th International Conference on Computational Models of Argument, 459–
460. Amsterdam: IOS Press.
Dung, P. M. 1995. On the Acceptability of Arguments and
Its Fundamental Role in Nonmonotonic Reasoning, Logic
Programming, and n-Person Games. Artificial Intelligence
77(2): 321–357.
Dunne, P. E., and Wooldridge, M. 2009. Complexity of
Abstract Argumentation. In Argumentation in AI, ed. I. Rahwan and G. Simari, chapter 5, 85–104. Berlin: Springer-Verlag.
Simari, G. 1992. A Mathematical Treatment of Defeasible Reasoning and Its Implementation. Artificial Intelligence.
Toniolo, A.; Norman, T. J.; Etuk, A.; Cerutti, F.; Ouyang, R.
W.; Srivastava, M.; Oren, N.; Dropps, T.; Allen, J. A.; and Sullivan, P. 2015. Agent Support to Reasoning with Different
Types of Evidence in Intelligence Analysis. In Proceedings of
the 14th International Conference on Autonomous Agents and
Multiagent Systems (AAMAS 2015), 781–789. Richland, SC:
International Foundation for Autonomous Agents and Multiagent Systems.
Walton, D. N.; Reed, C.; and Macagno, F. 2008. Argumentation Schemes. New York: Cambridge University Press.
Williams, M.; Liu, Z. W.; Hunter, A.; and Macbeth, F. 2015.
An Updated Systematic Review of Lung Chemo-Radiotherapy Using a New Evidence Aggregation Method. Lung Cancer
Matthias Thimm is a senior lecturer at the Universität
Koblenz-Landau, Germany. His main research interests are
in knowledge representation and reasoning, particularly on
aspects of uncertainty and inconsistency.
Serena Villata is a researcher at CNRS, France. Her main
research interests are in knowledge representation and reasoning, particularly in argumentation theory, normative
systems, and the semantic web.
Federico Cerutti is a lecturer at Cardiff University, UK. His
main research interests are in knowledge representation and
reasoning, and in computational models of trust.
Nir Oren is a senior lecturer at the University of Aberdeen,
UK. His research interests lie in the area of agreement technologies, with specific interests in argumentation, normative reasoning, and trust and reputation systems.
Hannes Strass is a postdoctoral researcher at Leipzig University, Germany. His main research interest is in logicbased knowledge representation and reasoning.
Mauro Vallati is a research fellow at the PARK research
group of the University of Huddersfield, United Kingdom.
His main research interest is in AI planning. He was coorganiser of the 2014 edition of the International Planning
Competition (IPC).
A Report on the Ninth
International Web
Rule Symposium
Adrian Paschke
I The annual International Web Rule
Symposium (RuleML) is an international conference on research, applications,
languages, and standards for rule technologies. RuleML is a leading conference to build bridges between academe
and industry in the field of rules and their
applications, especially as part of the
semantic technology stack. It is devoted
to rule-based programming and rule-based systems, including production rule systems, logic programming rule
engines, and business rule engines/business rule management systems; semantic web rule languages and rule standards; rule-based event-processing
languages (EPLs) and technologies; and
research on inference rules, transformation rules, decision rules, production
rules, and ECA rules. The Ninth International Web Rule Symposium
(RuleML 2015) was held in Berlin, Germany, August 2–5. This report summarizes the events of that conference.
The Ninth International Web Rule Symposium (RuleML
2015) was held in Berlin, Germany, from August 2–5.
The symposium was organized by Adrian Paschke (general chair), Fariba Sadri (program cochair), Nick Bassiliades
(program cochair), and Georg Gottlob (program cochair). A total of 94 papers were submitted, from which 22 full
papers, 1 short paper, 2 keynote papers, 3 track papers, 4 tutorial papers, 6 industry papers, 6 challenge papers, 3 competition papers, 5 Ph.D. papers and 3 poster papers were selected. The papers were presented in multiple tracks on complex
event processing, existential rules and Datalog+/–, industry
applications, legal rules and reasoning, and rule learning. Following the precedent set in earlier years, RuleML also hosted
the Fifth RuleML Doctoral Consortium and the Ninth International Rule Challenge as well as the RuleML Competition,
which this year was dedicated to rule-based recommender
systems on the web of data. A highlight of this year’s event
was the industry track, which introduced six papers describing research work in innovative companies. New this year
was also the joint RuleML / Reasoning Web tutorial day on the
first day of the symposium, with four tutorials — TPTP World
by Geoff Sutcliffe, PSOA RuleML by Harold Boley, Rulelog by
Benjamin Grosof, and OASIS LegalRuleML by Tara Athan.
This year’s symposium featured three invited
keynote talks. Michael Genesereth of Stanford University, USA, presented the Herbrand Manifesto.
Thom Fruehwirth of the University of Ulm, Germany, presented an overview of constraint-handling
rules, while Avigdor Gal of the Technion – Israel Institute of Technology presented a framework for mining the rules that guide event creation.
A special feature this year was the rich set of subevents and colocated events. A total of 138 registered participants attended the main
RuleML 2015 symposium and affiliated subevents,
including the colocated Conference on Web Reasoning and Rule Systems (RR 2015), the Reasoning Web
Summer School (RW 2015), and the Workshop on
Formal Ontologies meet Industry (FOMI). Additionally, the Conference on Automated Deduction
(CADE 2015) celebrated its 25th meeting with more
than 200 participants. This “Berlin on Rules” colocation provided a great opportunity for the rule-based
community to meet with the automated deduction
community at one of the several joint social events,
including the joint reception at the Botanic Garden
on Monday, August 3, the joint keynote by Michael
Genesereth, the poster session on Tuesday, August 4,
and the joint conference dinner at the Fischerhuette
restaurant at Lake Schlachtensee on Wednesday,
August 5. The welcome address at the reception was
given by Ute Finckh-Krämer (Berlin, SPD, member of
the German Parliament) followed by Wolfgang Bibel
(University of Darmstadt) who was the invited speaker. The dinner speech at the Fischerhuette was given
by Jörg Siekmann (University of Saarbrücken).
The poster session, consisting of 18 posters and
demos, was jointly organized as a get-together with
the Berlin Semantic Web Meetup. At the session,
wine, beer, and finger food were provided in the
greenhouses of the Computer Science Department at
the Freie Universität Berlin. The organizers also used
this unique opportunity to hold a joint public
RuleML and RR business meeting as well as an invited dinner with all chairs and invited keynote speakers of RuleML, RR, RW, FOMI, and CADE. The additional rich social program, with a bus sightseeing
tour to east, west, and downtown Berlin on Saturday,
August 1, a boat sightseeing tour from Lake Wannsee to the Reichstag on Sunday, August 2, the CADE exhibitions on Wednesday, and plenty of visits to the various beer gardens, made it a memorable stay in the
capital of Germany for the participants.
The RuleML 2015 Best Paper Award was given to
Thomas Lukasiewicz, Maria Vanina Martinez, Livia
Predoiu, and Gerardo I. Simari for their paper Existential Rules and Bayesian Networks for Probabilistic
Ontological Data Exchange. The Ninth International
Rule Challenge Award went to Jean-François Baget,
Alain Gutierrez, Michel Leclère, Marie-Laure Mugnier, Swan Rocher, and Clément Sipieter, for their
paper Datalog+, RuleML, and OWL 2: Formats and
Translations for Existential Rules. The winners of the
RuleML 2015 Competition Award were Marta Vomlelova, Michal Kopecky, and Peter Vojtas, for their
paper Transformation and Aggregation Preprocessing
for Top-k Recommendation GAP Rules Induction.
As in previous years, RuleML 2015 was also a place
for presentations and face-to-face meetings about
rule technology standardizations, which this year
covered OASIS LegalRuleML, RuleML 1.02 (Consumer+Deliberation+Reaction), OMG API4KB, OMG
SBVR, ISO Common Logic, ISO PSL, and TPTP.
We would like to thank our sponsors, whose contributions allowed us to cover the costs of student
participants and invited keynote speakers. We would
also like to thank all the people who have contributed to the success of this year’s special RuleML
2015 and colocated events, including the organization chairs, PC members, authors, speakers, and participants.
The next RuleML symposium will be held at Stony
Brook University, in New York, USA, from July 5–8,
2016.
Adrian Paschke is a professor and head of the Corporate Semantic Web group (AG-CSW) at the Institute of Computer Science, Department of Mathematics and Computer Science, Freie Universitaet Berlin (FUB). He is also director of
the Data Analytics Center (DANA) at Fraunhofer FOKUS
and director of RuleML Inc. in Canada.
Fifteenth International
Conference on Artificial
Intelligence and Law (ICAIL 2015)
Katie Atkinson, Jack G. Conrad, Anne Gardner, Ted Sichelman
I The 15th International Conference
on AI and Law (ICAIL 2015) was held
in San Diego, California, USA, June 8–
12, 2015, at the University of San
Diego, at the Kroc Institute, under the
auspices of the International Association for Artificial Intelligence and Law
(IAAIL), an organization devoted to promoting research and development in the
field of AI and law with members
throughout the world. The conference is
held in cooperation with the Association for the Advancement of Artificial
Intelligence (AAAI) and with ACM
SIGAI (the Special Interest Group on
Artificial Intelligence of the Association
for Computing Machinery).
The 15th International Conference on AI and Law
(ICAIL 2015) was held in San Diego, California, on June
8–12, 2015 and broke all prior attendance records. The
conference has been held every two years since 1987, alternating between North America and (usually) Europe. The
program for ICAIL 2015 included three days of plenary sessions and two days of workshops, tutorials, and related
events. Attendance reached a total of 179 participants from
23 countries. Of the total, 95 were registered for the full conference and 84 for one or two days.
The work reported at the ICAIL conferences has always had
two thrusts: using law as a rich domain for AI research, and
using AI techniques to develop legal applications. That duality continued this year, with an increased emphasis on the
applications side. Workshop topics included (1) discovery of
electronically stored information, (2) law and big data, (3)
automated semantic analysis of legal texts, and (4) evidence
in the law. There were also two sessions for which attorneys
could obtain Continuing Legal Education credit, one on AI
techniques for intellectual property analytics and the other
on trends in legal search and software.
The program also contained events intended to reach out
to a variety of communities and audiences. There was a multilingual
workshop for AI and Law researchers from
non-English-speaking countries, and a successful
doctoral consortium was held to welcome and
encourage student researchers. Two well-attended
tutorials were offered for those new to the field, an
introduction to AI and law and an examination of
legal ontologies.
The talks given by the invited speakers of the conference each had a different focal point: Jan Becker
(Robert Bosch LLC) reported on progress in self-driving vehicles and how these vehicles obey traffic
rules; Jack Conrad (Thomson Reuters), in his IAAIL
Presidential Address, reflected upon past developments within AI and law and commented on current
and upcoming challenges facing researchers in the
field and the means to address them; Jerry Kaplan
(Stanford University) explored the attribution of
rights and responsibilities to AI systems under the
law; Michael Luck (King’s College London) discussed
electronic contracts in agent-based systems and the
emergence of norms within these systems.
For this 15th edition of ICAIL, 58 contributions
were submitted. Of these submissions, 15 were
accepted as full papers (10 pages) and 15 were accepted as research abstracts (5 pages). Four additional submissions were accepted as abstracts of system demonstrations, and these systems were showcased in a
lively demo session.
In addition to the long-standing award for the best
student paper, three new awards were presented at
ICAIL 2015. The awards and their winners follow.
The Donald Berman best student paper prize was
awarded to Sjoerd Timmer (Utrecht University), for A
Structure-Guided Approach to Capturing Bayesian
Reasoning about Legal Evidence in Argumentation.
The paper was coauthored by John-Jules Ch. Meyer,
Henry Prakken, Silja Renooij, and Bart Verheij. The
Peter Jackson best innovative application paper prize
was awarded to Erik Hemberg (Massachusetts Institute of Technology), Jacob Rosen (Massachusetts
Institute of Technology), Geoff Warner (MITRE Corporation), Sanith Wijesinghe (MITRE Corporation),
and Una-May O’Reilly (Massachusetts Institute of
Technology), for their paper Tax Non-Compliance
Detection Using Co-Evolution of Tax Evasion Risk
and Audit Likelihood. The Carole Hafner best paper
prize, memorializing an ICAIL founder who passed
away in 2015, was awarded to Floris Bex (Utrecht
University), for An Integrated Theory of Causal Stories and Evidential Arguments. Finally, the award for
the best doctoral consortium student paper was presented to Jyothi Vinjumur (University of Maryland),
for Methodology for Constructing Test Collections
using Collaborative Annotation.
The conference was held at the University of San
Diego, at the Joan B. Kroc Institute for Peace and Justice. Conference sponsors were the International
Association for Artificial Intelligence and Law, Thomson Reuters, the University of San Diego Center for IP
Law & Markets, Davis Polk & Wardwell LLP, TrademarkNow, and Legal Robot. The conference was held in cooperation with both AAAI and ACM SIGAI. Conference officials were
Katie Atkinson (program chair), Ted Sichelman (conference chair), and Anne Gardner (secretary/treasurer).
Further information about the conference is available on the conference website. The proceedings were published
by the Association for Computing Machinery and are
available in the ACM Digital Library.
Katie Atkinson is a professor and head of the Department
of Computer Science at the University of Liverpool. She
gained her Ph.D. in computer science from the University of
Liverpool, and her research interests concern computational models of argument, with a particular focus on how these
can be applied in the legal domain.
Jack G. Conrad is a lead research scientist with Thomson
Reuters Corporate Research and Development group. He
applies his expertise in information retrieval, natural language processing, data mining, and machine learning to
meet the technology needs of the company’s businesses,
including coverage of the legal domain, to develop capabilities for products such as WestlawNext.
Anne Gardner is an independent scholar with a longstanding interest in artificial intelligence and law. Her law degree
and her Ph.D. in computer science are both from Stanford University.
Ted Sichelman is a professor of law at the University of San
Diego. He teaches and writes in the areas of intellectual
property, law and entrepreneurship, empirical legal studies,
law and economics, computational legal studies, and tax law.
Spring News from the
Association for the Advancement
of Artificial Intelligence
AAAI Announces
New Senior Member!
AAAI congratulates Wheeler Ruml
(University of New Hampshire, USA)
on his election to AAAI Senior Member
status. This honor was announced at
the recent AAAI-16 Conference in
Phoenix. Senior Member status is
designed to recognize AAAI members
who have achieved significant accomplishments within the field of artificial
intelligence. To be eligible for nomination for Senior Member, candidates
must be consecutive members of AAAI
for at least five years and have been
active in the professional arena for at
least ten years.
Congratulations to
the 2016 AAAI Award Winners!
Tom Dietterich, AAAI President,
Manuela Veloso, AAAI Past President
and Awards Committee Chair, and Rao
Kambhampati, AAAI President-Elect,
presented the AAAI Awards in February
at AAAI-16 in Phoenix.
AAAI Classic Paper Award
The 2016 AAAI Classic Paper Award
was given to the authors of the two
papers deemed most influential from
the Fifteenth National Conference on
Artificial Intelligence, held in 1998 in
Madison, Wisconsin, USA. The 2016
recipients of the AAAI Classic Paper
Award were:
The Interactive Museum Tour-Guide
Robot (Wolfram Burgard, Armin B.
Cremers, Dieter Fox, Dirk Hähnel,
Gerhard Lakemeyer, Dirk Schulz, Walter Steiner, and Sebastian Thrun)
Boosting Combinatorial Search through Randomization (Carla P. Gomes, Bart Selman, and Henry Kautz)
Burgard and colleagues were honored
for significant contributions to probabilistic robot navigation and the integration with high-level planning
methods, while Gomes, Selman, and
Kautz were recognized for their significant contributions to the area of automated reasoning and constraint solving through the introduction of
randomization and restarts into complete solvers. Wolfram Burgard and
Carla Gomes presented invited talks
during the conference in recognition
of this honor.
For more information about nominations for AAAI 2017 Awards, please
contact Carol Hamilton at [email protected]
AAAI-16 Outstanding
Paper Awards
This year, AAAI's Conference on Artificial Intelligence honored the following
two papers, which exemplify high
standards in technical contribution
and exposition by regular and student authors.
AAAI-16 Outstanding Paper Award
Bidirectional Search That Is Guaranteed to Meet in the Middle (Robert C.
Holte, Ariel Felner, Guni Sharon,
Nathan R. Sturtevant)
AAAI-16 Outstanding
Student Paper Award
Toward a Taxonomy and Computational Models of Abnormalities in
Images (Babak Saleh, Ahmed Elgammal, Jacob Feldman, Ali Farhadi)
IAAI-16 Innovative
Application Awards
Each year the AAAI Conference on
Innovative Applications selects the
recipients of the IAAI Innovative
Application Award. These deployed
application case study papers must
describe deployed applications with
measurable benefits that include some
aspect of AI technology. The application needs to have been in production use by its final end users long enough that the experience in use can be meaningfully collected and
reported. The 2016 winners were as follows:
Deploying PAWS: Field Optimization
of the Protection Assistant for Wildlife
Security (Fei Fang, Thanh H. Nguyen,
Rob Pickles, Wai Y. Lam, Gopalasamy
R. Clements, Bo An, Amandeep Singh,
Milind Tambe, Andrew Lemieux)
Ontology Re-Engineering: A Case
Study from the Automotive Industry
(Nestor Rychtyckyj, Baskaran Sankaranarayanan, P. Sreenivasa Kumar, Deepak Khemani, Venkatesh Raman)
Deploying nEmesis: Preventing Foodborne Illness by Data Mining Social
Media (Adam Sadilek, Henry Kautz,
Lauren DiPrete, Brian Labus, Eric Portman, Jack Teitel, Vincent Silenzio)
Please Join Us for ICWSM-16
in Cologne, Germany!
The Tenth International AAAI Conference on Web and Social Media will be
held May 17–20 at Maternushaus and GESIS - Leibniz Institute for the Social
Sciences in Cologne, Germany. This interdisciplinary conference is a forum
for researchers in computer science and social science to come together to
share knowledge, discuss ideas, exchange information, and learn about cutting-edge research in diverse fields with the common theme of online social
media. This overall theme includes research in new perspectives in social
theories, as well as computational algorithms for analyzing social media.
ICWSM is a singularly fitting venue for research that blends social science
and computational approaches to answer important and challenging questions about human social behavior through social media while advancing
computational tools for vast and unstructured data.
ICWSM-16 will include a lively program of technical talks and posters,
invited presentations, and keynote talks by Lise Getoor (University of California, Santa Cruz) and Amir Goldberg (Stanford Graduate School of Business).
Workshops and Tutorials
The ICWSM Workshop program will continue in 2016, and the Tutorial Program will return. Both will be held on the first day of the conference, May
17. For complete details about the workshop program, please see the ICWSM-16 website.
Registration Is Now Open!
Registration information is available at the ICWSM-16 website
(icwsm.org). The early registration deadline is March 25, and the late registration deadline is April 15.
For full details about the conference program, please visit the ICWSM-16
website (icwsm.org) or write to [email protected]
Special Computing Community Consortium (CCC) Blue Sky Awards
AAAI-16, in cooperation with the CRA
Computing Community Consortium
(CCC), honored three papers in the
Senior Member track that presented
ideas and visions that can stimulate
the research community to pursue new
directions, such as new problems, new
application domains, or new methodologies. The recipients of the 2016 Blue
Sky Idea travel awards, sponsored by
the CCC, were as follows:
Indefinite Scalability for Living Computation (David H. Ackley)
Embedding Ethical Principles in Collective Decision Support Systems
(Joshua Greene, Francesca Rossi, John
Tasioulas, Kristen Brent Venable, and Brian Williams)
Five Dimensions of Reasoning in the
Wild (Don Perlis)
2016 AI Video Competition
The tenth annual AI video competition was held during AAAI-16 and several winning videos were honored during the awards presentation. Videos
were nominated for awards in six categories, and winners received a
“Shakey” award during a special award
ceremony at the conference. Our
thanks go to Sabine Hauert and
Charles Isbell for all their work on this
event. The award winners were as follows:
Best Video
Machine Learning Techniques for
Reorchestrating the European Anthem
(François Pachet, Pierre Roy, Mathieu
Ramona, Marco Marchini, Gaetan
Hadjeres, Emmanuel Deruty, Benoit
Carré, Fiammetta Ghedini)
Best Robot Video
A Sea of Robots (Anders Lyhne Christensen, Miguel Duarte, Vasco Costa,
Tiago Rodrigues, Jorge Gomes, Fernando Silva, Sancho Oliveira)
Best Student Video
Deep Neural Networks Are Easily Fooled
(Anh Nguyen, Jason Yosinski, Jeff Clune)
Congratulations to the 2016 AAAI Fellows!
Each year a small number of fellows are recognized for their unusual distinction in the profession and for their sustained contributions to the field for a decade or more. An official dinner and ceremony were held in their honor
during AAAI-16 in Phoenix, Arizona.
Giuseppe De Giacomo (University of Rome La Sapienza, Italy)
For significant contributions to the field of knowledge representation and reasoning, and applications to data integration, ontologies, planning, and process synthesis and verification.
Daniel D. Lee (University of Pennsylvania, USA)
For significant contributions to machine learning and robotics, including algorithms for perception, planning, and
motor control.
Bing Liu (University of Illinois at Chicago, USA)
For significant contributions to data mining and development of widely used sentiment analysis, opinion spam
detection, and Web mining algorithms.
Maja J. Mataric (University of Southern California, USA)
For significant contributions to the advancement of multirobot coordination, learning in human-robot systems, and
socially assistive robotics.
Eric Poe Xing (Carnegie Mellon University, USA)
For significant contributions to statistical machine learning,
its theoretical analysis, new algorithms for learning probabilistic models, and applications of these to important problems in biology, social network analysis, natural language
processing and beyond; and to the development of new
architecture, system platform, and theory for distributed
machine learning programs on large scale applications.
Zhi-Hua Zhou (Nanjing University, China)
For significant contributions to ensemble methods and
learning from multi-labeled and partially-labeled data.
The 2016 AAAI Distinguished Service Award
The 2016 AAAI Distinguished Service Award recognizes one individual for extraordinary service to the AI community. The AAAI Awards Committee is pleased to announce that this year's recipient is Maria Gini (University
of Minnesota). Professor Gini is being recognized for her outstanding contributions to the field of artificial intelligence through sustained service leading AI societies, journals, and conferences; mentoring colleagues; and working to increase participation of women in AI and computing.
Maria Gini is a professor in the Department of Computer Science and Engineering at
the University of Minnesota. She studies decision making for autonomous agents in a
variety of applications and contexts, ranging from distributed methods for allocation
of tasks to robots, to methods for robots to explore an unknown environment, teamwork for search and rescue, and navigation in dense crowds. She is a Fellow of the Association for the Advancement of Artificial Intelligence. She is coeditor in chief of Robotics and Autonomous Systems, and is on the editorial board of numerous journals,
including Artificial Intelligence, and Autonomous Agents and Multi-Agent Systems.
SPRING 2016 111
AAAI/EAAI 2016 Outstanding Educator Award
The inaugural AAAI/EAAI Outstanding Educator Award was established in 2016 to recognize a person (or group of people) who has
(have) made major contributions to AI education that provide
long-lasting benefits to the AI community. Examples might
include innovating teaching methods, providing service to the
AI education community, generating pedagogical resources,
designing curricula, and educating students outside of higher
education venues (or the general public) about AI. AAAI is
pleased to announce the first corecipients of this award, Peter
Norvig (Google) and Stuart Russell (University of California,
Berkeley), who are being honored for their definitive text, “Artificial Intelligence: A Modern Approach,” which systematized the field of artificial intelligence and inspired a new generation of scientists and engineers throughout the world, as well as for their individual contributions to education in artificial intelligence. This award is jointly sponsored by AAAI and the Symposium on Educational Advances in Artificial Intelligence.
Peter Norvig and Stuart Russell accept the AAAI/EAAI Outstanding Educator Award.
2016 Fall Symposium Series
November 17–19
Mark Your Calendars! The 2016 AAAI Fall Symposium Series will be held
Thursday through Saturday, November 17–19, at the Westin Arlington
Gateway in Arlington, Virginia, adjacent to Washington, DC. Proposals are
due April 8, and accepted symposia will be announced in late April. Submissions will be due July 29, 2016.
For more information, please see the 2016 Fall Symposium Series website
(www.aaai.org/Symposia/Fall/fss16.php).
Most Entertaining Video
Finding Linda — A Search and Rescue
Mission by SWARMIX (Mahdi Asadpour, Gianni A. Di Caro, Simon Egli,
Eduardo Feo-Flushing, Danka Csilla,
Dario Floreano, Luca M. Gambardella,
Yannick Gasser, Linda Gerencsér,
Anna Gergely, Domenico Giustiniano,
Gregoire Heitz, Karin A. Hummel, Barbara Kerekes, Ádám Miklósi, Attila
David Molnar, Bernhard Plattner,
Maja Varga, Gábor Vásárhelyi, Jean-Christophe Zufferey)
Best Application of AI
Save the Wildlife, Save the Planet: Protection Assistant for Wildlife Security (Fei
Fang, Debarun Kar, Dana Thomas,
Nicole Sintov, Milind Tambe)
People's Choice
AI for Liveable Cities (Zhengxiang Pan, Han Yu, Chunyan Miao, Cyril Leung, Daniel Wei Quan Ng, Kian Khang Ong, Bo Huang, and Yaming Zhang)
AAAI gratefully acknowledges the Bristol Robotics Laboratory for help with the manufacturing of the awards. Congratulations to all the winners!
AAAI-16 Student
Abstract Awards
Two awards were presented to participants in the AAAI-16 Student Abstract
Program, including the Best Student 3-Minute Presentation and the Best Student Poster. Fifteen finalists in the Best
Student 3-Minute Presentation category presented one-minute oral spotlight
presentations during the second day of
the technical conference, followed that
evening by their poster presentations.
Votes for both awards were cast by senior program committee members and
students. The winners were as follows:
Best Student 3-Minute Presentation
Towards Structural Tractability in
Hedonic Games (Dominik Peters)
Honorable Mention: Best Student 3-Minute Presentation
Epitomic Image Super-Resolution
(Yingzhen Yang, Zhangyang Wang,
Zhaowen Wang, Shiyu Chang, Ding
Liu, Honghui Shi, and Thomas S. Huang)
AAAI President Tom Dietterich delivers his Presidential Address,
Steps Toward Robust Artificial Intelligence, on Sunday, February 14 at AAAI-16.
Best Student Poster
Efficient Collaborative Crowdsourcing
(Zhengxiang Pan, Han Yu, Chunyan
Miao, and Cyril Leung)
Join Us in New Orleans
for AAAI-17
The Thirty-First AAAI Conference on
Artificial Intelligence (AAAI-17) and
the Twenty-Ninth Conference on
Innovative Applications of Artificial
Intelligence (IAAI-17), will be held in
New Orleans, Louisiana, USA, during
the mid-January to mid-February timeframe. Final dates will be available by
March 31, 2016. The technical conference will continue its 3.5-day schedule, either preceded or followed by
the workshop and tutorial programs.
AAAI-17 will arrive in New Orleans just
prior to Mardi Gras and festivities will
already be underway. Enjoy legendary jazz music, the French Quarter filled with lively clubs and restaurants, world-class museums, and signature architecture. New Orleans’ multicultural and diverse communities will make your choices and experience in the Big Easy unique. The 2017 Call for Papers will be available soon on the AAAI website. Please join us in 2017 in NOLA for a memorable AAAI!
AAAI President-Elect and
Executive Council Elections
Please watch your mailboxes for an announcement of the 2016 AAAI Election. The link to the electronic version of the annual AAAI Ballot will be mailed to all regular individual AAAI members in the spring. The membership will vote for a new President-Elect (two-year term, followed by two years as President, and two additional years as Past President), as well as four new councilors, who will each serve three-year terms. The online voting system is expected to close on June 10. Please note that the ballot will be available via the online system only. If you have not provided AAAI with an up-to-date email address, please do so immediately by writing to [email protected]
Robert S. Engelmore Memorial Lecture Award
The Robert S. Engelmore Memorial Lecture Award was established in 2003 to honor Dr. Robert S. Engelmore's
extraordinary service to AAAI, AI Magazine, and the AI applications community, and his contributions to applied AI.
The annual keynote lecture is presented at the Innovative Applications of Artificial Intelligence Conference. Topics
encompass Engelmore's wide interests in AI, and each lecture is linked to a subsequent article published upon approval
by AI Magazine. The lecturer and, therefore, the author for the magazine article, are chosen jointly by the IAAI Program Committee and the Editor of AI Magazine.
AAAI congratulates the 2016 recipient of this award, Reid G. Smith of i2k Connect, who was
honored for pioneering research contributions and high-impact applications in knowledge
management and for extensive contributions to AAAI, including educating and inspiring the
broader community about AI through AITopics. Smith presented his award lecture, “A Quarter Century of AI Applications: What We Knew Then versus What We Know Now,” at the
Innovative Applications of Artificial Intelligence Conference in Phoenix.
Reid G. Smith is cofounder and chief executive officer of i2k Connect, an AI technology
company that transforms unstructured documents into structured data enriched with subject matter expert knowledge. Formerly, he was vice president of research and knowledge
management at Schlumberger, enterprise content management director at Marathon Oil, and
senior vice president at Medstory, a vertical search company purchased by Microsoft. He
holds a Ph.D. in electrical engineering from Stanford University and is a Fellow of AAAI. He
has served as AAAI Councilor, AAAI-88 program cochair, IAAI-91 program chair, and program committee member for
IAAI from its inception in 1989. He is coeditor of AITopics.
Join Us in Austin for HCOMP-16
The Fourth AAAI Conference on
Human Computation and Crowdsourcing will be held October 30 –
November 3, 2016 at the AT&T Executive Education and Conference Center
on the University of Texas at Austin
campus. HCOMP-16 will be co-located
with EMNLP 2016, the 2016 Conference on Empirical Methods in Natural
Language Processing. HCOMP is the
premier venue for disseminating the
latest research findings on crowdsourcing and human computation. While
artificial intelligence (AI) and human-computer interaction (HCI) represent
traditional mainstays of the conference, HCOMP believes strongly in
inviting, fostering, and promoting
broad, interdisciplinary research. This
field is unique in the diversity of disciplines it draws upon and contributes to, ranging from human-centered qualitative studies and HCI
design, to computer science and artificial intelligence, economics and the
social sciences, all the way to law and
policy. We promote the exchange of
scientific advances in human computation and crowdsourcing not only
among researchers, but also engineers
and practitioners, to encourage dialogue across disciplines and communities of practice.
Submissions are due May 17, 2016.
For more information, please visit the HCOMP-16 website, or write to
[email protected]
AAAI Member News
Nick Jennings Receives New
Year Honor
Nick Jennings, professor of computer
science at the University of Southampton, has been made Companion of the
Order of the Bath in the Queen’s New
Year Honours List for his services to
computer science and national security science. Jennings, who is head of
Electronics and Computer Science
(ECS) at the University, has been recognized for his pioneering contributions to the fields of artificial intelligence, autonomous systems and
agent-based computing. He is the UK’s
only Regius Professor in Computer Science, a prestigious title awarded to the
University by HM The Queen to mark
her Diamond Jubilee. Jennings just
completed a six-year term of office as
the Chief Scientific Advisor to the UK
Government in the area of National
Security. Jennings is also a successful
entrepreneur and is chief scientific officer for Aerogility, a 20-person start-up
that develops advanced software solutions for the aerospace and defense sectors.
Bill Clancey Named NAI Fellow
William “Bill” Clancey, a senior
research scientist with the Florida
Institute for Human and Machine Cognition (IHMC), was named a Fellow of
the National Academy of Inventors
(NAI). The Tampa-based Academy
named a total of 168 Fellows this week,
bringing the total number of Fellows to
582. This is the fourth year that Fellows have been named. Clancey is
best known for developing a
work practice modeling and simulation system called Brahms, a tool for
comprehensive design of work systems, relating people and automation.
Using the Brahms modeling system,
scientists study the flow of information and communications in real-world work settings, and the effect of
automated systems. One important
practical application is the coordination among air traffic controllers, pilots, and automated systems during flights.
Nick Bostrom of Oxford University addresses AAAI-16 in his talk “What
We Should Think about Regarding the Future of Machine Intelligence.”
As part of a series of events addressing ethical issues and AI, AAAI-16 held a debate on AI’s Impact on Labor Markets, with participants (left to right) Erik Brynjolfsson (MIT), Moshe Vardi (Rice University), Nick Bostrom (Oxford University), and Oren Etzioni
(Allen Institute for AI). The panel was moderated by Toby Walsh (Data61) (far left).
Marvin Minsky, 1927 – 2016
AAAI is deeply saddened to note the death of Marvin Minsky on 25 January 2016, at the age of 88. One of the
founders of the discipline of artificial intelligence, Minsky was a professor emeritus at the Massachusetts Institute of Technology, which he joined in 1958. With John McCarthy, Minsky cofounded the MIT Artificial Intelligence Laboratory. Minsky was also a founding member of the MIT Media Lab, a founder of Logo Computer Systems and Thinking Machines Corporation, and AAAI’s third president. Minsky’s research spanned an enormous
range of fields, including mathematics, computer science, the theory of computation, neural networks, artificial
intelligence, robotics, commonsense reasoning, natural language processing and psychology. An accomplished
musician, Minsky had boundless energy and creativity. He was not only a scientific pioneer and leader, but also
a mentor and teacher to many.
Minsky’s impact on many leaders in our community was documented in the Winter 2007 issue of AI Magazine
article “In Honor of Marvin Minsky’s Contributions on his 80th Birthday” written by Danny Hillis, John
McCarthy, Tom M. Mitchell, Erik T. Mueller, Doug Riecken, Aaron Sloman, and Patrick Henry Winston. His prolific writing career also included many articles within our pages.
Minsky’s passing is a tragic landmark for the many AI scientists he influenced personally, for the many more that
he inspired intellectually, as well as for the history of the discipline. AI Magazine will celebrate his many contributions in a future issue.
The NAI Fellows will be inducted on April 15, 2016, as part of the
Fifth Annual Conference of the
National Academy of Inventors at the
United States Patent and Trademark
Office (USPTO) in Alexandria, Virginia.
Ken Ford Named AAAS Fellow
The American Association for the
Advancement of Science (AAAS) has
elected Ken Ford, director and chief
executive officer of the Florida Institute for Human and Machine Cognition (IHMC), as a Fellow. Ford is one of
347 scientists who have been named
Fellows this year. The AAAS Council elects people “whose efforts on behalf of the advancement of science or its
applications are scientifically or socially distinguished.” Ford was selected
“for founding and directing the IHMC,
for his scientific contributions to artificial intelligence and human-centered
computing, and for service to many
federal agencies.” IHMC, which Ford
founded in 1990, is known for its
groundbreaking research in the field of
artificial intelligence. Ford has served
on the National Science Board, chaired
the NASA Advisory Council and served
on the U.S. Air Force Science Advisory
Board and the Defense Science Board.
Ford received an official certificate and
a gold and blue rosette pin on Saturday, February 13 at the AAAS Fellows
Forum during the 2016 Annual Meeting in Washington, D.C.
AAAI congratulates all three of these
AAAI Fellows for their honors!
AAAI Executive
Council Meeting Minutes
The AAAI Executive Council met via
teleconference on September 25, 2015.
Attending: Tom Dietterich, Sonia
Chernova, Vincent Conitzer, Boi Faltings, Ashok Goel, Carla Gomes, Eduard
Hovy, Julia Hirschberg, Charles Isbell,
Rao Kambhampati, Sven Koenig,
David Leake, Henry Lieberman, Diane
Litman, Jen Neville, Francesca Rossi,
Ted Senator, Steve Smith, Manuela
Veloso, Kiri Wagstaff, Shlomo Zilberstein, and Carol Hamilton
Not Attending: Sylvie Thiebaux, Toby
Walsh, Brian Williams
Tom Dietterich convened the meeting at 6:05 AM PDT and welcomed the
newly-elected councilors. The new
councilors gave brief statements about
their interests and goals while serving
on the Executive Council. The retiring
councilors were also given an opportunity to offer their advice about future
priorities for AAAI. All agreed that outreach to other research communities,
international outreach, and encouraging diversity should be paramount.
They urged the Council to concentrate
on the direction of the field and AAAI,
and not get bogged down in the
mechanics. Dietterich thanked the
retiring councilors for their contributions to the Council and to AAAI, and
encouraged them to stay involved.
Standing Committee Reports
Awards/Fellows/Nominating: Manuela
Veloso reviewed the current nomination process for Fellows, Senior Members, Distinguished Service, and Classic
Paper. She asked the Council to
encourage their colleagues to nominate people for all of these honors. She
noted that it is fine to ask a Fellow to
nominate you, and in the case of Senior Members, self-nomination with
accompanying references is the normal process. She noted that there is a
new award this year, the Outstanding
Educator Award, which will be cosponsored by AAAI and EAAI. Sylvie
Thiebaux volunteered to serve on the
selection committee, representing the
Council. Veloso reported that the
Nominating Committee completed its
selections for the recent ballot earlier
in the summer, and welcomed the new
councilors, thanking them for their
thoughtful ballot statements.
Conference: Shlomo Zilberstein
reported that Michael Wellman and
Dale Schuurmans, AAAI-16 program
cochairs, are doing a great job with
the AAAI-16 conference. Another record
number of submissions was received
on September 15, which indicates that
the timing of the conference is continuing to work well for the community.
While the conference is very prestigious, many do not realize that AAAI is
more than the conference, so raising
the visibility of the Association and
what it does is important. Sandip Sen is
spearheading sponsor recruitment,
and has been working with Carol
Hamilton on the annual AI Journal proposal for support of student activities,
as well as other corporate sponsorships. Zilberstein noted that he is in
the process of recruiting chairs for
2017. The committee discussed an
overview of venues circulated earlier,
and decided that New Orleans was
their top choice, with Albuquerque
next in line. Hamilton will enter into
negotiations with New Orleans. Finally, Zilberstein noted that the Council
should establish a written policy
regarding its decision to make conference attendance for young families
more accessible. The Council discussed
the importance of continuing the initiatives started in 2015, such as outreach, ethics panels, and demos of new
competitions. Zilberstein will follow
up with the 2016 program chairs to be
sure these issues are being addressed.
Conference Outreach: Henry Lieberman noted that the conference outreach program continues to offer sister
conferences publicity opportunities
through various AAAI outlets. However, more conferences need to be
encouraged to take advantage of this.
The committee recently added the
10th IEEE International Conference on
Semantic Computing to the list.
Ethics: Francesca Rossi reported that
the Ethics Committee has been
formed, and noted the recent open letter on Autonomous Weapons signed
by many members of the AI community. The Committee has not yet formed
a proposal for a code of ethics, but has
started gathering samples from other organizations.
Finance: Ted Senator reported that
the Association investment portfolio is
now over $9,000,000.00, and that programs have continued to operate on
budget for 2015. Due to the larger surplus from AAAI-15, it is likely that a
smaller operating deficit will be realized for 2015. Senator noted that there
will be another meeting of the Council
in November to approve the 2016
budget. (Ed note: This meeting was
rescheduled to February 2016 due to
extenuating circumstances.)
International: Tom Dietterich reported on behalf of Toby Walsh that AAAI
recently sponsored a “Lunch with an
AAAI Fellow” event at IJCAI, and this
AAAI President Tom Dietterich (far left) thanks members of the AAAI-16 Conference Committee
for all their efforts in making the conference a great success, including AAAI-16 Program Cochairs,
Michael Wellman (second from left) and Dale Schuurmans (fourth from left).
program proved to be very successful.
He would like to see more of this type
of outreach in the future. Sven Koenig
noted that AAAI could have a stronger
presence via town hall meetings, a
booth, or panels and talks at IJCAI or
other events. This is a strong possibility in 2016 because of the North American location of IJCAI. This would raise
the visibility of AAAI and encourage
the international community to join.
All agreed that AAAI presidents should
pursue a presence at IJCAI on a permanent basis.
Policy/Government Relations: The
newly formed Policy/Government
Relations committee will be conducting a survey, and will convene its first
meeting after that survey is complete.
Membership: Ed Hovy reported that
reduced membership fees for weak currency countries have been instituted,
and reminded the Council to review
this program every year. The program
will be up for renewal in three years. At
that time, the Council should decide if
membership fees need to be adjusted.
However, an annual review will help
avoid any long-term problems. He also
reminded the Council that the other
component of this program — developing strong local representatives in
international locales — is equally
important, and should be finalized in
2016. AAAI is continuing to offer free
memberships to attendees of cooperating conferences who are new to AAAI.
Several conferences have taken advantage of this opportunity. Hovy noted
that the most important challenge facing AAAI is the retention of members
through the development of programs
that serve the needs of its members.
Tom Dietterich thanked Hovy for his
service as chair of the Membership
Committee, and noted that he is in the
process of revamping the committee
assignments in light of the recent
Council transition.
Publications: David Leake reviewed all
of the publishing activities of the Association during the past nine months,
including AI Magazine, several proceedings, and workshop and symposium
technical reports. He noted that Ashok
Goel has agreed to join AI Magazine as
associate editor, and welcomed Goel to
the group. Goel thanked Leake and
Tom Dietterich for this opportunity,
and said he looked forward to contributing to the magazine.
Funding Requests
CRA-W: Tom Dietterich reported that
AAAI had received a request from CRA-W for $15,000.00 in support of their
efforts. He proposed that AAAI contribute $5,000.00, primarily in support
of outreach and diversity for women in research. During the following discussion, it was noted that CRA-W has had
difficulty of late securing funding from
traditional sources, and would like to
attract 100–200 more women at their
conference, which supports women in
their first three years of graduate study.
The Council suggested that CRA-W be
encouraged to collocate their conference with AAAI, but also would like to
see an increased effort to establish a
AAAI subgroup of women. Julia
Hirschberg moved to support this
request at the $5,000.00 level, and
Steve Smith seconded the motion. The
motion passed. Charles Isbell will follow up with CRA-W to offer complimentary one-year memberships to
award recipients.
The Arizona State University’s robot learning class demonstrated its autonomous robot arm at AAAI-16.
AI Topics: Tom Dietterich noted that
Bruce Buchanan and Reid Smith would
like to continue AI Topics, but fund it
via membership revenue, rather than an
NSF grant, as it has been for the past several years. While this may be possible, it
was decided that a clearer proposal
needs to be developed to establish what
the exact dollar amount would be per
member. The Council also noted that
the AI Alert mailing list should be opt-out rather than opt-in. An article in AI Magazine would help raise the visibility of the site, which has far better curation than Wikipedia.
Focus Group Report
Tom Dietterich reported that he ran a
focus group during AAAI to explore
new directions for AAAI. The group
helped develop a questionnaire reflecting their discussions, and this was subsequently circulated to the AAAI membership. The top-ranked item by the
membership was the pursuit of education initiatives, including summer
schools for college freshmen, development of course materials, and continuing education opportunities for AAAI
members. As a result, Dietterich is seeking to establish a Committee on Education, and asking the members to
study and implement one of the ideas
detailed in the survey. He is seeking
volunteers from the Council. Kiri
Wagstaff volunteered to help with this
committee, and moved to approve the
formation of this committee as an ad
hoc committee of the Executive Council. Francesca Rossi seconded the
motion, and it passed unanimously. In
addition, the Council moved to
appoint Wagstaff as chair of the committee, which also passed with no
opposing votes.
Media Committee
Tom Dietterich would like to create an
ad hoc media committee, which would
advise on and create social media
opportunities for members, including
a blog, and other features. Advice on
appropriate members of the committee
is needed. This committee would be a
subcommittee of the Publications
Committee, and might also oversee
the AI Topics website. Dietterich will
work on finding an appropriate chair
for the committee. Ted Senator moved
to create the ad hoc committee, which
was seconded by Steve Smith. The
motion passed.
Tom Dietterich thanked everyone
for their participation, and the meeting was adjourned at 8:01 AM PDT.
AAAI Conferences Calendar
This page includes forthcoming AAAI sponsored conferences,
conferences presented by AAAI Affiliates, and conferences held
in cooperation with AAAI. AI Magazine also maintains a calendar listing that includes nonaffiliated conferences.
The Tenth International AAAI Conference on Web and Social Media
ICWSM-16 will be held May 17–20,
2016 in Cologne, Germany.
15th International Conference on
Principles of Knowledge Representation and Reasoning (KR 2016) KR
2016 will be held April 25–29, 2016 in
Cape Town, South Africa.
18th International Conference on
Enterprise Information Systems
ICEIS 2016 will be held April 27–30,
2016, in Rome, Italy.
Twelfth AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment. AIIDE-16 will be
held in October in the San Francisco
Bay Area.
Twenty-Ninth International Florida
AI Research Society Conference.
FLAIRS-2016 will be held May 16–18,
2016 in Key Largo, Florida, USA.
Fourth AAAI Conference on Human
Computation and Crowdsourcing.
HCOMP-16 will be held October
30–November 3 in Austin, Texas, USA.
AAAI Fall Symposium. The AAAI Fall
Symposium Series will be held November 17–19 in Arlington, Virginia, adjacent to Washington, DC, USA.
14th International Conference on
Practical Applications of Agents and
Multi-Agent Systems. PAAMS-2016
will be held June 1–3, 2016, in Sevilla, Spain.
The 26th International Conference
on Automated Planning and Scheduling. ICAPS-16 will be held June
12–17, 2016 in London, UK.
25th International Joint Conference
on Artificial Intelligence. IJCAI-16
will be held July 9–15, 2016 in New
York, New York USA.
9th Conference on Artificial General
Intelligence. AGI-16 will be held
July 16–19, 2016, in New York, New York, USA.
Twenty-Ninth International Conference on Industrial, Engineering, and
Other Applications of Applied Intelligent Systems. IEA/AIE-2016 will be
held August 2–4, 2016, in Morioka, Japan.
Thirty-First AAAI Conference on
Artificial Intelligence. AAAI-17 will be
held in January–February in New
Orleans, Louisiana, USA.
Twenty-Ninth Innovative Applications of Artificial Intelligence Conference. IAAI-17 will be held in January–February in New Orleans, Louisiana, USA.
Visit AAAI on Facebook!
We invite all interested individuals to
check out the Facebook site by searching for AAAI. We welcome your feedback
at [email protected]