AI Magazine, Volume 37, Number 1, Spring 2016. ISSN 0738-4602. An official publication of the Association for the Advancement of Artificial Intelligence (aimagazine.org). Special issue: Beyond the Turing Test. Guest editors: Gary Marcus, Francesca Rossi, and Manuela Veloso. Cover: Cognitive Orthoses, by Giacomo Marchesi, Brooklyn, New York.

Contents

Editorial Introduction: Beyond the Turing Test (Gary Marcus, Francesca Rossi, Manuela Veloso)
My Computer Is an Honor Student — But How Intelligent Is It? (Peter Clark, Oren Etzioni)
How to Write Science Questions That Are Easy for People and Hard for Computers (Ernest Davis)
Toward a Comprehension Challenge, Using Crowdsourcing as a Tool (Praveen Paritosh, Gary Marcus)
The Social-Emotional Turing Challenge (William Jarrold, Peter Z. Yeh)
Artificial Intelligence to Win the Nobel Prize and Beyond (Hiroaki Kitano)
Planning, Executing, and Evaluating the Winograd Schema Challenge (Leora Morgenstern, Ernest Davis, Charles L. Ortiz, Jr.)
Why We Need a Physically Embodied Turing Test (Charles L. Ortiz, Jr.)
Measuring Machine Intelligence Through Visual Question Answering (C. Lawrence Zitnick, Aishwarya Agrawal, Stanislaw Antol, Margaret Mitchell, Dhruv Batra, Devi Parikh)
Turing++ Questions: A Test for the Science of (Human) Intelligence (Tomaso Poggio, Ethan Meyers)
I-athlon: Toward a Multidimensional Turing Test (Sam S. Adams, Guruduth Banavar, Murray Campbell)
Software Social Organisms: Implications for Measuring AI Progress (Kenneth D. Forbus)
Principles for Designing an AI Competition (Stuart M. Shieber)
WWTS (What Would Turing Say?) (Douglas B. Lenat)
Competition Report: The First International Competition on Computational Models of Argumentation (Matthias Thimm, Serena Villata, Federico Cerutti, Nir Oren, Hannes Strass, Mauro Vallati)
Report: The Ninth International Web Rule Symposium (Adrian Paschke)
Report: Fifteenth International Conference on Artificial Intelligence and Law (Katie Atkinson, Jack G. Conrad, Anne Gardner, Ted Sichelman)
AAAI News
AAAI Conferences Calendar
Editorial Introduction to the Special Articles in the Spring Issue: Beyond the Turing Test
Gary Marcus, Francesca Rossi, Manuela Veloso

The articles in this special issue of AI Magazine include those that propose specific tests and those that look at the challenges inherent in building robust, valid, and reliable tests for advancing the state of the art in AI.

Alan Turing's renowned test of intelligence, commonly known as the Turing test, is an inescapable signpost in AI. To people outside the field, the test — which hinges on the ability of machines to fool people into thinking that they (the machines) are people — is practically synonymous with the quest to create machine intelligence.
Within the field, the test is widely recognized as a pioneering landmark, but it is also now seen as a distraction: designed over half a century ago and too crude to really measure intelligence. Intelligence is, after all, a multidimensional variable, and no single test could ever definitively measure it. Moreover, the original test, at least in its standard implementations, has turned out to be highly gameable, arguably an exercise in deception rather than a true measure of anything especially correlated with intelligence. The much ballyhooed 2014 Turing test winner Eugene Goostman, for instance, pretends to be a thirteen-year-old foreigner and proceeds mainly by ducking questions and returning canned one-liners; it cannot see, it cannot think, and it is certainly a long way from genuine artificial general intelligence.

Our hope is to see a new suite of tests, part of what we have dubbed the Turing Championships, each designed in some way to move the field forward, toward previously unconquered territory. Most of the articles in this special issue stem from our first workshop toward creating such an event, held during the AAAI Conference on Artificial Intelligence in January 2015 in Austin, Texas.

The articles in this special issue can be broadly divided into those that propose specific tests and those that look at the challenges inherent in building robust, valid, and reliable tests for advancing the state of the art in artificial intelligence. In the article My Computer Is an Honor Student — But How Intelligent Is It? Standardized Tests as a Measure of AI, Peter Clark and Oren Etzioni argue that standardized tests developed for children offer one starting point for testing machine intelligence. Ernest Davis, in his article How to Write Science Questions That Are Easy for People and Hard for Computers, proposes an alternative test called SQUABU (science questions appraising basic understanding) that aims to ask questions that are easy for people but hard for computers. In Toward a Comprehension Challenge, Using Crowdsourcing as a Tool, Praveen Paritosh and Gary Marcus propose a crowdsourced comprehension challenge, in which machines will be asked to answer open-ended questions about movies, YouTube videos, podcasts, and stories. The article The Social-Emotional Turing Challenge, by William Jarrold and Peter Z. Yeh, considers the importance of social-emotional intelligence and proposes a methodology for designing tests that assess the ability of machines to infer things like motivations and desires (often referred to in the psychological literature as theory of mind). In Artificial Intelligence to Win the Nobel Prize and Beyond: Creating the Engine for Scientific Discovery, Hiroaki Kitano urges the field to build AI systems that can make significant, even Nobel-worthy, scientific discoveries. In Planning, Executing, and Evaluating the Winograd Schema Challenge, Leora Morgenstern, Ernest Davis, and Charles L. Ortiz, Jr., describe the Winograd Schema Challenge, a test of commonsense reasoning that is set in a linguistic context. In the article Why We Need a Physically Embodied Turing Test and What It Might Look Like, Charles L. Ortiz, Jr., argues for tests, such as a construction challenge (build something given a bag of parts), that focus on four aspects of intelligence: language, perception, reasoning, and action.
Measuring Machine Intelligence Through Visual Question Answering, by C. Lawrence Zitnick, Aishwarya Agrawal, Stanislaw Antol, Margaret Mitchell, Dhruv Batra, and Devi Parikh, argues for using visual question answering as an essential part of a multimodal challenge to measure intelligence. Tomaso Poggio and Ethan Meyers, in Turing++ Questions: A Test for the Science of (Human) Intelligence, which also focuses on visual questions, propose to develop tests where competitors must not only match human behavior but also do so in a way that is consistent with human physiology, in this way aiming to use a successor to the Turing test to bridge the fields of neuroscience, psychology, and artificial intelligence. The article I-athlon: Toward a Multidimensional Turing Test, by Sam Adams, Guruduth Banavar, and Murray Campbell, proposes a methodology for designing a test that consists of a series of varied events, in order to test several dimensions of intelligence. Kenneth Forbus also argues for testing multiple dimensions of intelligence in his article Software Social Organisms: Implications for Measuring AI Progress. In the article Principles for Designing an AI Competition, or Why the Turing Test Fails as an Inducement Prize, Stuart Shieber discusses several desirable features for an inducement prize contest, contrasting them with the current Turing test. Douglas Lenat, in WWTS (What Would Turing Say?), takes a step back and focuses instead on synergy between human and machine, and the development of conjoint superhuman intelligence.

While the articles included in this issue propose and discuss several kinds of tests, and we hope to see many of them deployed very soon, they should be considered merely a starting point for the AI community. Challenge problems, well chosen, can drive not only media interest in the field but also scientific progress. We hope, therefore, that many AI researchers participate actively in formalizing and refining the initial proposals described in these articles and discussed at our initial workshops. In the meantime, we have created a website1 with pointers to presentations, discussions, and, most importantly, ways for interested researchers to get involved, contribute, and participate in these successors to the Turing test.

Note
1. www.math.unipd.it/~frossi/btt.

Gary Marcus is founder and chief executive officer of Geometric Intelligence and a professor of psychology and neural science at New York University. Francesca Rossi is a research scientist at the IBM T. J. Watson Research Center (on leave from the University of Padova). Manuela Veloso is the Herbert A. Simon University Professor in the Computer Science Department at Carnegie Mellon University.

My Computer Is an Honor Student — But How Intelligent Is It? Standardized Tests as a Measure of AI
Peter Clark, Oren Etzioni

Given the well-known limitations of the Turing test, there is a need for objective tests to both focus attention on, and measure progress toward, the goals of AI. In this paper we argue that machine performance on standardized tests should be a key component of any new measure of AI, because attaining a high level of performance requires solving significant AI problems involving language understanding and world modeling — critical skills for any machine that lays claim to intelligence.
In addition, standardized tests have all the basic requirements of a practical test: they are accessible, easily comprehensible, clearly measurable, and offer a graduated progression from simple tasks to those requiring deep understanding of the world. Here we propose this task as a challenge problem for the community, summarize our state-of-the-art results on math and science tests, and provide supporting data sets (www.allenai.org).

Alan Turing (Turing 1950) approached the abstract question can machines think? by replacing it with another, namely can a machine pass the imitation game (the Turing test). In the years since, this test has been criticized as a poor replacement for the original enquiry (for example, Hayes and Ford [1995]), which raises the question: what would a better replacement be? In this article, we argue that standardized tests are an effective and practical assessment of many aspects of machine intelligence, and should be part of any comprehensive measure of AI progress.

While a crisp definition of machine intelligence remains elusive, we can enumerate some general properties we might expect of an intelligent machine. The list is potentially long (for example, Legg and Hutter [2007]), but should at least include the ability to (1) answer a wide variety of questions, (2) answer complex questions, (3) demonstrate commonsense and world knowledge, and (4) acquire new knowledge scalably. In addition, a suitable test should be clearly measurable, graduated (have a variety of levels of difficulty), not gameable, ambitious but realistic, and motivating. There are many other requirements we might add (for example, capabilities in robotics, vision, and dialog), and thus any comprehensive measure of AI is likely to require a battery of different tests. However, standardized tests meet a surprising number of requirements, including the four listed, and thus should be a key component of a future battery of tests. As we will show, the tests require answering a wide variety of questions, including those requiring commonsense and world knowledge. In addition, they meet all the practical requirements, a huge advantage for any component of a future test of AI.

Science and Math as Challenge Areas
Standardized tests have been proposed as challenge problems for AI before, for example, by Bringsjord and Schimanski (2003), Bringsjord (2011), Bayer et al. (2005), and Fujita et al. (2014), as they appear to require significant advances in AI technology while also being accessible, measurable, understandable, and motivating. They also enable us to easily compare AI performance with that of humans. In our own work, we have chosen to focus on elementary and high school tests (for 6–18 year olds) because the basic language-processing requirements are surmountable, while the questions still present formidable challenges. Similarly, we are focusing on science and math tests, and have recently achieved some baseline results on these tasks (Seo et al. 2015; Koncel-Kedziorski et al. 2015; Khot et al. 2015; Li and Clark 2015; Clark et al. 2016). Other groups have attempted higher-level exams, such as the Tokyo entrance exam (Strickland 2013), and more specialized psychometric tests such as SAT word analogies (Turney 2006), GRE word antonyms (Mohammad et al. 2013), and TOEFL synonyms (Landauer and Dumais 1997).
We also stipulate that the exams are taken exactly as written (no reformulation or rewording), so that the task is clear, standard, and cannot be manipulated or gamed. Typical questions from the New York Regents 4th grade (9–10 year olds) science exams, SAT math questions, and more are shown in the next section. We have also made a larger collection of challenge questions, drawn from these and other exams, available on our web site.1 We propose to leverage standardized tests, rather than synthetic tests such as the Winograd schema (Levesque, Davis, and Morgenstern 2012) or MCTest (Richardson, Burges, and Renshaw 2013), because they provide a natural sample of problems and more directly suggest real-world applications in the areas of education and science.

Exams and Intelligence
One pertinent question concerning the suitability of exams is whether they are gameable, that is, answerable without requiring any real understanding of the world. For example, questions might be answered with a simple search-engine query or through simple corpus statistics, without requiring any understanding of the underlying material. Our experience is that while some questions are answerable in this way, many are not. There is a continuum from (computationally) easy to difficult questions, where more difficult questions require increasingly sophisticated internal models of the world. This continuum is highly desirable, as it means that there is a low barrier to entry, allowing researchers to make initial inroads into the task, while significant AI challenges need to be solved to do well in the exam. The diversity of questions also ensures a variety of skills are tested for, and guards against finding a simple shortcut that may answer them all without requiring any depth of understanding. (This contrasts with the more homogeneous Winograd schema challenge [Levesque, Davis, and Morgenstern 2012], where the highly stylized question format risks producing specialized solution methods that have little generality.) We illustrate these properties throughout this article. In addition, 45–65 percent of the Regents science exam questions (depending on the exam), and virtually all SAT geometry questions, contain diagrams that are necessary for solving the problem. Similarly, the answers to algebraic word problems are typically four numbers (see, for example, table 1). In all these cases, a Google search or simple corpus statistics will not answer these questions with any degree of reliability.

A second important question, raised by Davis in his critique of standardized tests for measuring AI (Davis 2014), is whether the tests are measuring the right thing. He notes that standardized tests are authored for people, not machines, and thus will be testing for skills that people find difficult to master, skipping over things that are easy for people but challenging for machines. In particular, Davis conjectures that "standardized tests do not test knowledge that is obvious for people; none of this knowledge can be assumed in AI systems." However, our experience is generally contrary to this conjecture: although questions do not typically test basic world knowledge directly, basic commonsense knowledge is frequently required to answer them. We will illustrate this in detail throughout this article.
The New York Regents Science Exams
One of the most interesting and appealing aspects of elementary science exams is their graduated and multifaceted nature: different questions explore different types of knowledge and vary substantially in difficulty (for a computer), from a simple lookup to those requiring extensive understanding of the world. This allows incremental progress while still demanding significant advances for the most difficult questions. Information retrieval and bag-of-words methods work well for a subset of questions but eventually reach a limit, leaving a collection of questions requiring deeper understanding. We illustrate some of this variety here, using (mainly) the multiple-choice part of the New York Regents 4th Grade Science exams2 (New York State Education Department 2014). For a more detailed analysis, see Clark, Harrison, and Balasubramanian (2013). A similar analysis can be made of exams at other grade levels and in other subjects.

Basic Questions
Part of the New York Regents exam tests for relatively straightforward knowledge, such as taxonomic ("isa") knowledge, definitional (terminological) knowledge, and basic facts about the world. Example questions include the following.
(1) Which object is the best conductor of electricity? (A) a wax crayon (B) a plastic spoon (C) a rubber eraser (D) an iron nail
(2) The movement of soil by wind or water is called (A) condensation (B) evaporation (C) erosion (D) friction
(3) Which part of a plant produces the seeds? (A) flower (B) leaves (C) stem (D) roots
This style of question is amenable to solution by information-retrieval methods and/or use of existing ontologies or fact databases, coupled with linguistic processing.

Simple Inference
Many questions are unlikely to have answers explicitly written down anywhere, ranging from questions requiring a relatively simple leap from what might be already known to questions requiring complex modeling and understanding. An example requiring (simple) inference follows:
(4) Which example describes an organism taking in nutrients? (A) a dog burying a bone (B) a girl eating an apple (C) an insect crawling on a leaf (D) a boy planting tomatoes in the garden
Answering this question requires knowledge that eating involves taking in nutrients, and that an apple contains nutrients.

More Complex World Knowledge
Many questions appear to require both richer knowledge of the world, and appropriate linguistic knowledge to apply it to a question. As an example, consider the following question:
(5) Fourth graders are planning a roller-skate race. Which surface would be the best for this race? (A) gravel (B) sand (C) blacktop (D) grass
Strong cooccurrences between sand and surface, grass and race, and gravel and graders (road-smoothing machines) throw off information-retrieval-based guesses. Rather, a more reliable answer requires knowing that a roller-skate race involves roller skating, that roller skating is done on a surface, that skating is best on a smooth surface, and that blacktop is smooth. Obtaining these fragments of world knowledge and integrating them correctly is a substantial challenge. As a second example, consider the following question:
(6) A student puts two identical plants in the same type and amount of soil. She gives them the same amount of water. She puts one of these plants near a sunny window and the other in a dark room. This experiment tests how the plants respond to (A) light (B) air (C) water (D) soil
Again, information-retrieval methods and word correlations do poorly. Rather, a reliable answer requires recognizing a model of experimentation (perform two tasks, differing in only one condition), knowing that being near a sunny window will expose the plant to light, and that a dark room has no light in it. As a third example, consider this question:
(7) A student riding a bicycle observes that it moves faster on a smooth road than on a rough road. This happens because the smooth road has (A) less gravity (B) more gravity (C) less friction (D) more friction
A reliable processing of this question requires envisioning and comparing two different situations, overlaying a simple qualitative model on the situations described (smoother → less friction → faster). It also requires basic knowledge that bicycles move, and that riding propels a bicycle.
All the aforementioned examples require general knowledge of the world, as well as simple science knowledge. In addition, some questions more directly test basic commonsense knowledge, such as the following:
(8) A student reaches one hand into a bag filled with smooth objects. The student feels the objects but does not look into the bag. Which property of the objects can the student most likely identify? (A) shape (B) color (C) ability to reflect light (D) ability to conduct electricity
This question requires, among other things, knowing that touch detects shape, and that sight detects color.
Some questions require selecting the best explanation for a phenomenon, requiring a degree of metareasoning. For example, consider the following question:
(9) Apple trees can live for many years, but bean plants usually live for only a few months. This statement suggests that (A) different plants have different life spans (B) plants depend on other plants (C) plants produce many offspring (D) seasonal changes help plants grow
This requires not just determining whether the statement in each answer option is true (here, several of them are), but whether it explains the statement given in the body of the question. Again, this kind of question would be challenging for a retrieval-based solution.
As a final example, consider the following question from the Texas Assessment of Knowledge and Skills exam3 (Texas Education Agency 2014):
(10) Which of these mixtures would be easiest to separate? (A) Fruit salad (B) Powdered lemonade (C) Hot chocolate (D) Instant pudding
This question requires a complex interplay of basic world knowledge and language to answer correctly.
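To make the contrast concrete, the following is a minimal sketch, in Python, of the kind of information-retrieval baseline discussed above. It is not Aristo or any published system; the toy corpus, stopword list, and scoring rule are illustrative assumptions made up for this example.

```python
# A minimal sketch of a bag-of-words co-occurrence baseline for multiple-choice
# questions. This is NOT Aristo or any published system; the toy corpus and the
# scoring rule below are illustrative assumptions only.

import re

TOY_CORPUS = [
    "Erosion is the movement of soil by wind or water.",
    "Condensation is the change of water vapor into liquid water.",
    "An iron nail is a good conductor of electricity.",
    "Rubber and plastic are poor conductors of electricity.",
]

STOPWORDS = {"the", "of", "by", "or", "is", "a", "an", "and", "to", "which", "called"}

def tokens(text):
    """Lowercase word tokens with stopwords removed."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS

def cooccurrence_score(question, option):
    """Sum, over corpus sentences mentioning the option, of the question-word overlap."""
    q, o = tokens(question), tokens(option)
    score = 0
    for sentence in TOY_CORPUS:
        s = tokens(sentence)
        if o & s:                 # the sentence mentions the answer option
            score += len(q & s)   # reward overlap with the question words
    return score

def answer(question, options):
    """Pick the option with the highest co-occurrence score (ties broken arbitrarily)."""
    return max(options, key=lambda opt: cooccurrence_score(question, opt))

if __name__ == "__main__":
    q1 = "Which object is the best conductor of electricity?"
    print(answer(q1, ["a wax crayon", "a plastic spoon", "a rubber eraser", "an iron nail"]))
    q2 = "The movement of soil by wind or water is called"
    print(answer(q2, ["condensation", "evaporation", "erosion", "friction"]))
```

A baseline of this kind handles questions such as (1) and (2), but it has no way to tell the misleading co-occurrences in question (5) (sand/surface, grass/race, gravel/graders) from the relevant ones, and it cannot represent the experimental setup in question (6); that is exactly where the deeper world knowledge discussed above is needed.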
Diagrams
A common feature of many elementary grade exams is the use of diagrams in questions. We choose to include these in the challenge because of their ubiquity in tests, and because spatial interpretation and reasoning is such a fundamental aspect of intelligence. Diagrams introduce several new dimensions to question-answering, including spatial interpretation and correlating spatial and textual knowledge. Diagrammatic (nontextual) entities in elementary exams include sketches, maps, graphs, tables, and diagrammatic representations (for example, a food chain). Reasoning requirements include sketch interpretation, correlating textual and spatial elements, and mapping diagrammatic representations (graphs, bar charts, and so on) to a form supporting computation. Again, while there are many challenges, the level of difficulty varies widely, allowing a graduated plan of attack. Two examples are shown. The first, question 11 (figure 1), requires sketch interpretation, part identification, and label/part correlation. The second, question 12 (figure 2), requires recognizing and interpreting a spatial representation.

[Figure 1. Question 11: Which letter in the diagram points to the plant structure that takes in water and nutrients? A sketch of a plant with structures labeled A through D.]

[Figure 2. Question 12: Which diagram correctly shows the life cycle of some insects? Three candidate cycles, labeled A through C, over the stages egg, larva, pupa, and adult.]
Mathematics and Geometry
We also include elementary mathematics in our challenge scope, as these questions intrinsically require mapping to mathematical models, a key requirement for many real-world tasks. These questions are particularly interesting as they combine elements of language processing, (often) story interpretation, mapping to an internal representation (for example, algebra), and symbolic computation. For example (from ixl.com):
(13) Molly owns the Wafting Pie Company. This morning, her employees used 816 eggs to bake pumpkin pies. If her employees used a total of 1339 eggs today, how many eggs did they use in the afternoon?
Such questions clearly cannot be answered by information retrieval, and instead require symbolic processing and alignment of textual and algebraic elements (for example, Hosseini et al. 2014; Koncel-Kedziorski et al. 2015; Seo et al. 2014, 2015) followed by inference. Additional examples are shown in table 1. Note that, in addition to simple arithmetic capabilities, some capacity for world modeling is often needed. Consider, for example, the following two questions:
(14) Sara's high school won 5 basketball games this year. They lost 3 games. How many games did they play in all?
(15) John has 8 orange balloons, but lost 2 of them. How many orange balloons does John have now?
Both questions use the word lost, but the first question maps to an addition problem (5 + 3) while the second maps to a subtraction problem (8 – 2). This illustrates how modeling the entities, events, and event sequences is required, in addition to basic algebraic skills.

Table 1. Examples of Problems Solved by ALGES, with the Returned Equation (from Koncel-Kedziorski et al. [2015]).
Problem: John had 20 stickers. He bought 12 stickers from a store in the mall and got 20 stickers for his birthday. Then John gave 5 of the stickers to his sister and used 8 to decorate a greeting card. How many stickers does John have left? Equation: ((20 + ((12 + 20) – 8)) – 5) = x
Problem: Maggie bought 4 packs of red bouncy balls, 8 packs of yellow bouncy balls, and 4 packs of green bouncy balls. There were 10 bouncy balls in each package. How many bouncy balls did Maggie buy in all? Equation: x = (((4 + 8) + 4) * 10)
Problem: Sam had 79 dollars to spend on 9 books. After buying them he had 16 dollars. How much did each book cost? Equation: 79 = ((9 * x) + 16)
Problem: Fred loves trading cards. He bought 2 packs of football cards for $2.73 each, a pack of Pokemon cards for $4.01, and a deck of baseball cards for $8.95. How much did Fred spend on cards? Equation: ((2 * 2.73) + (4.01 + 8.95)) = x
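The final inference step in this pipeline is routine once the equation has been extracted. The sketch below (plain Python with the sympy library; it is not ALGES) solves the equations corresponding to questions 13–15 and the third row of table 1; the text-to-equation mappings themselves are taken as given, since producing them is precisely the hard part.

```python
# A minimal sketch of the *last* step of an algebra-word-problem solver: once a
# question has been mapped to an equation (the hard part, handled by systems such
# as ALGES), the remaining inference is routine symbolic computation.

from sympy import Eq, Symbol, solve

x = Symbol("x")

# Question 13 (Molly's pies): 816 eggs in the morning, 1339 eggs in total.
print(solve(Eq(816 + x, 1339), x))   # [523]

# Table 1, third row: "Sam had 79 dollars to spend on 9 books. After buying
# them he had 16 dollars."  79 = 9*x + 16
print(solve(Eq(79, 9 * x + 16), x))  # [7]

# Questions 14 and 15 both use the word "lost" but map to different equations:
games_played  = Eq(x, 5 + 3)   # "won 5 ... lost 3 ... how many did they play?"
balloons_left = Eq(x, 8 - 2)   # "has 8 ... lost 2 ... how many now?"
print(solve(games_played, x), solve(balloons_left, x))  # [8] [6]
```

The point of the sketch is the asymmetry it makes visible: the arithmetic is trivial, while choosing between the two "lost" equations requires the event modeling described above.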
Finally, we also include geometry questions, as these combine both arithmetic and diagrammatic reasoning in challenging ways. For example, question 16 (figure 3) requires multiple skills (text processing, diagram interpretation, arithmetic, and aligning evidence from both text and diagram together). Although very challenging, there has been significant progress in recent years on this kind of problem (for example, Koncel-Kedziorski et al. [2015]). Examples of problems that current systems have been able to solve are shown in table 2.

[Figure 3. Question 16: In the diagram, AB intersects circle O at D, AC intersects circle O at E, AE = 4, AC = 24, and AB = 16. Find AD.]

Table 2. Examples of Problems That Current Systems Have Solved. Questions (left) and interpretations (right) leading to the correct solution by GEOS (from Seo et al. [2015]). The accompanying diagrams are not reproduced here.
(a) Question: In the diagram at the left, circle O has a radius of 5, and CE = 2. Diameter AC is perpendicular to chord BD. What is the length of BD? (a) 12 (b) 10 (c) 8 (d) 6 (e) 4. Interpretation: IsCircle(O), Equals(RadiusOf(O), 5), Equals(LengthOf(CE), 2), IsDiameter(AC), IsChord(BD), Perpendicular(AC, BD), Equals(what, LengthOf(BD)). Correct answer: 8.
(b) Question: In isosceles triangle ABC at the left, lines AM and CM are the angle bisectors of angles BAC and BCA. What is the measure of angle AMC? (The diagram marks a 40° angle at B.) (a) 110 (b) 115 (c) 120 (d) 125 (e) 130. Interpretation: IsIsoscelesTriangle(ABC), IsLine(AM), BisectsAngle(AM, BAC), CC(AM, CM), CC(BAC, BCA), IsAngle(BAC), IsAngle(AMC), Equals(what, MeasureOf(AMC)). Correct answer: 110.
(c) Question: In the figure at left, the bisector of angle BAC is perpendicular to BC at point D. If AB = 6 and BD = 3, what is the measure of angle BAC? (a) 15 (b) 30 (c) 45 (d) 60 (e) 75. Interpretation: IsAngle(BAC), BisectsAngle(line, BAC), Perpendicular(line, BC), Equals(LengthOf(AB), 6), Equals(LengthOf(BD), 3), Equals(what, MeasureOf(BAC)). Correct answer: 60.

Testing for Commonsense
Possessing and using commonsense knowledge is a central property of intelligence (Davis and Marcus 2015). However, Davis (2014) and Weston et al. (2015) have both argued that standardized tests do not test "obvious" commonsense knowledge, and hence are less suitable as a test of machine intelligence. For instance, using their examples, the following questions are unlikely to occur in a standardized test: Can you make a watermelon fit into a bag by folding the watermelon? If you look at the moon then shut your eyes, can you still see the moon? If John is in the playground and Bob is in the office, then where is John? Can you make a salad out of a polyester shirt?
However, although such questions may not be directly posed in standardized tests, many questions indirectly require at least some of this commonsense knowledge in order to answer. For example, question (6) (about plants) in the previous section requires knowing (among other things) that if you put a plant near X (a window), then the plant will be near X. This is a flavor of blocks-world-style knowledge very similar to that tested in many of Weston et al.'s examples. Similarly, question (8) (about objects in a bag) requires knowing that touch detects shape, and that not looking implies not being able to detect color. It also requires knowing that a bag filled with objects contains those objects; a smooth object is smooth; and if you feel something, you touch it. These commonsense requirements are similar in style to many of Davis's examples. In short, at least some of the standardized test questions seem to require the kind of obvious commonsense knowledge that Davis and Weston et al. call for in order to derive an answer, even if the answers themselves are less obvious.
Conversely, if one authors a set of synthetic commonsense questions, there is a significant risk of biasing the set toward one's own preconceived notions of what commonsense means, ignoring other important aspects. (This has been a criticism sometimes made of the Winograd schema challenge.)
For this reason we feel that the natural diversity present in standardized tests, as illustrated here, is highly beneficial, along with their other advantages.

Other Aspects of Intelligence
Standardized tests clearly do not test all aspects of intelligence, for example, dialog, physical tasks, or speech. However, besides question answering and reasoning, there are some less obvious aspects of intelligence they also push on: explanation, learning and reading, and dealing with novel problems.

Explanation
Tests (particularly at higher grade levels) typically include questions that ask not only for answers but also for explanations of those answers. So, at least to some degree, the ability to explain an answer is required.

Learning and Reading
Reddy (1996) proposed the grand AI challenge of reading a chapter of a textbook and answering the questions at the end of the chapter. While standardized tests do not directly test textbook reading, they do include question comprehension, including sometimes long story questions. In addition, acquiring the knowledge necessary to pass a test will arguably require breakthroughs in learning and machine reading; attempts to encode the requisite knowledge by hand have to date been unsuccessful.

Dealing with Novel Problems
As our examples illustrate, test taking is not a monolithic skill. Rather, it requires a battery of capabilities and the ability to deploy them in potentially novel and unanticipated ways. In this sense, test taking requires, to some degree, the versatility and ability to handle new and surprising problems that we would expect of an intelligent machine.

State of the Art on Standardized Tests
How well do current systems perform on these tests? While any performance figure will be exam specific, we can provide some example data points from our own research. On nondiagram, multiple-choice (NDMC) science questions, our Aristo system currently scores on average 75 percent (4th grade), 63 percent (8th grade), and 41 percent (12th grade) on (previously unseen) New York Regents science exams (NDMC questions only, typically four-way multiple choice). As can be seen, questions become considerably more challenging at higher grade levels. On a broader multistate collection of 4th grade NDMC questions, Aristo scores 65 percent (unseen questions). The data sets are available at allenai.org/aristo.html. Note that these are the easier questions (no diagrams, multiple choice); other question types pose additional challenges, as we have described. No system to date comes even close to passing a full 4th grade science exam. On algebraic story problems such as those in table 1, our ALGES system scores over 70 percent accuracy on story problems that translate into single equations (Koncel-Kedziorski et al. 2015). Kushman et al. (2014) report results on story problems that translate to simultaneous algebraic equations. On geometry problems such as those in table 2, our GEOS system achieves a 49 percent score on (previously unseen) official SAT questions, and a score of 61 percent on a data set of (previously unseen) SAT-like practice questions. The relevant questions, data, and software are available on the Allen Institute's website.4
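To make the geometry pipeline concrete, here is a minimal sketch (in Python, not GEOS itself) of the final step for table 2, problem (a): once a question has been interpreted into declarative constraints of the kind shown in the table, ordinary coordinate arithmetic pins down the answer. The coordinate placement below is an illustrative assumption, not part of any published system.

```python
# A minimal sketch (not GEOS) showing that the interpretation of table 2,
# problem (a) -- a circle O of radius 5, CE = 2, and diameter AC perpendicular
# to chord BD at E -- determines the answer by ordinary coordinate arithmetic.

import math

radius = 5.0     # Equals(RadiusOf(O), 5)
ce = 2.0         # Equals(LengthOf(CE), 2)

# Place O at the origin with diameter AC along the x-axis and C at (radius, 0).
# BD is perpendicular to AC and meets it at E, so E = (radius - ce, 0).
oe = radius - ce                              # distance from the center to the chord
half_bd = math.sqrt(radius**2 - oe**2)        # right triangle O-E-B: EB^2 = r^2 - OE^2
print(2 * half_bd)                            # 8.0 -> answer choice (c)
```

The hard part, of course, is producing the interpretation from the text and the diagram in the first place, which is what the GEOS scores quoted above measure.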
Summary
If a computer were able to pass standardized tests, would it be intelligent? Not necessarily, but it would demonstrate that the computer had several critical skills we associate with intelligence, including the ability to answer sophisticated questions, handle natural language, and solve tasks requiring extensive commonsense knowledge of the world. In short, it would mark a significant achievement in the quest toward intelligent machines.
Despite the successes of data-driven AI systems, it is imperative that we make progress in these broader areas of knowledge, modeling, reasoning, and language if we are to make the next generation of knowledgeable AI systems a reality. Standardized tests can help to drive and measure progress in this direction, as they present many of these challenges yet are also accessible, comprehensible, incremental, and easily measurable. To help with this, we are releasing data sets related to this challenge. In addition, in October 2015 we launched the Allen AI Science Challenge,5 a competition run on kaggle.com to build systems to answer eighth-grade science questions. The competition attracted over 700 participating teams, and scores jumped from 32.5 percent initially to 58.8 percent by the end of January 2016. Although the winner is not yet known at press time, this level of engagement demonstrates the efficacy of standardized tests in focusing attention and research on these important AI problems.
Of course, some may claim that existing data-driven techniques are all that is needed, given enough data and computing power; if that were so, that in itself would be a startling result. Whatever your bias or philosophy, we encourage you to prove your case, and take these challenges! AI2's data sets are available on the Allen Institute's website.6

Notes
1. www.allenai.org.
2. www.nysedregents.org/Grade4/Science/home.html.
3. tea.texas.gov/student.assessment/taks/released-tests/.
4. www.allenai.org/euclid.html.
5. www.allenai.org/2015-science-challenge.html.
6. www.allenai.org/data.html.

References
Bayer, S.; Damianos, L.; Doran, C.; Ferro, L.; Fish, R.; Hirschman, L.; Mani, I.; Riek, L.; and Oshika, B. 2005. Selected Grand Challenges in Cognitive Science. MITRE Technical Report 05-1218. Bedford, MA: The MITRE Corporation.
Bringsjord, S. 2011. Psychometric Artificial Intelligence. Journal of Experimental and Theoretical Artificial Intelligence 23(3): 271–277. dx.doi.org/10.1080/0952813X.2010.502314
Bringsjord, S., and Schimanski, B. 2003. What Is Artificial Intelligence? Psychometric AI as an Answer. In Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence, 887–893. San Francisco: Morgan Kaufmann Publishers.
Clark, P.; Harrison, P.; and Balasubramanian, N. 2013. A Study of the Knowledge Base Requirements for Passing an Elementary Science Test. In AKBC'13: Proceedings of the 2013 Workshop on Automated Knowledge Base Construction. New York: Association for Computing Machinery. dx.doi.org/10.1145/2509558.2509565
Clark, P.; Etzioni, O.; Khashabi, D.; Khot, T.; Sabharwal, A.; Tafjord, O.; and Turney, P. 2016. Combining Retrieval, Statistics, and Inference to Answer Elementary Science Questions. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence. Menlo Park, CA: AAAI Press.
Davis, E. 2014. The Limitations of Standardized Science Tests as Benchmarks for AI Research. Technical Report, New York University. arXiv preprint arXiv:1411.1629. Ithaca, NY: Cornell University Library.
Davis, E., and Marcus, G. 2015. Commonsense Reasoning and Commonsense Knowledge in Artificial Intelligence. Communications of the ACM 58(9): 92–103. dx.doi.org/10.1145/2701413
Fujita, A.; Kameda, A.; Kawazoe, A.; and Miyao, Y. 2014. Overview of Todai Robot Project and Evaluation Framework of Its NLP-Based Problem Solving. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2014). Paris: European Language Resources Association.
Hayes, P., and Ford, K. 1995. Turing Test Considered Harmful. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence. San Francisco: Morgan Kaufmann Publishers.
Hosseini, M.; Hajishirzi, H.; Etzioni, O.; and Kushman, N. 2014. Learning to Solve Arithmetic Word Problems with Verb Categorization. In EMNLP 2014: Proceedings of the Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA: Association for Computational Linguistics. dx.doi.org/10.3115/v1/D14-1058
Khot, T.; Balasubramanian, N.; Gribkoff, E.; Sabharwal, A.; Clark, P.; and Etzioni, O. 2015. Exploring Markov Logic Networks for Question Answering. In EMNLP 2015: Proceedings of the Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA: Association for Computational Linguistics. dx.doi.org/10.18653/v1/D15-1080
Koncel-Kedziorski, R.; Hajishirzi, H.; Sabharwal, A.; Ang, S. D.; and Etzioni, O. 2015. Parsing Algebraic Word Problems into Equations. Transactions of the Association for Computational Linguistics 3.
Kushman, N.; Artzi, Y.; Zettlemoyer, L.; and Barzilay, R. 2014. Learning to Automatically Solve Algebra Word Problems. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: Association for Computational Linguistics. dx.doi.org/10.3115/v1/P14-1026
Landauer, T. K., and Dumais, S. T. 1997. A Solution to Plato's Problem: The Latent Semantic Analysis Theory of the Acquisition, Induction, and Representation of Knowledge. Psychological Review 104(2): 211–240. dx.doi.org/10.1037/0033-295X.104.2.211
Legg, S., and Hutter, M. 2007. A Collection of Definitions of Intelligence. In Advances in Artificial General Intelligence: Concepts, Architectures, and Algorithms, Frontiers in Artificial Intelligence and Applications Volume 157. Amsterdam, The Netherlands: IOS Press.
Levesque, H.; Davis, E.; and Morgenstern, L. 2012. The Winograd Schema Challenge. In Principles of Knowledge Representation and Reasoning: Proceedings of the Thirteenth International Conference (KR2012), 552–561. Palo Alto, CA: AAAI Press.
Li, Y., and Clark, P. 2015. Answering Elementary Science Questions via Coherent Scene Extraction from Knowledge Graphs. In EMNLP 2015: Proceedings of the Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA: Association for Computational Linguistics. dx.doi.org/10.18653/v1/D15-1236
Mohammad, S. M.; Dorr, B. J.; Hirst, G.; and Turney, P. D. 2013. Computing Lexical Contrast. Computational Linguistics 39(3): 555–590. dx.doi.org/10.1162/COLI_a_00143
New York State Education Department. 2014. The Grade 4 Elementary-Level Science Test. Albany, NY: University of the State of New York.
Reddy, R. 1996. To Dream the Possible Dream. Communications of the ACM 39(5): 105–112. dx.doi.org/10.1145/229459.233436
Richardson, M.; Burges, C.; and Renshaw, E. 2013. MCTest: A Challenge Dataset for the Open-Domain Machine Comprehension of Text. In EMNLP 2013: Proceedings of the Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA: Association for Computational Linguistics.
Seo, M.; Hajishirzi, H.; Farhadi, A.; and Etzioni, O. 2014. Diagram Understanding in Geometry Questions. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence. Palo Alto, CA: AAAI Press.
Seo, M.; Hajishirzi, H.; Farhadi, A.; Etzioni, O.; and Malcolm, C. 2015. Solving Geometry Problems: Combining Text and Diagram Interpretation. In EMNLP 2015: Proceedings of the Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA: Association for Computational Linguistics.
Strickland, E. 2013. Can an AI Get into the University of Tokyo? IEEE Spectrum, 21 August. dx.doi.org/10.1109/mspec.2013.6587172
Texas Education Agency. 2014. Texas Assessment of Knowledge and Skills. Austin, TX: State of Texas Education Agency.
Turing, A. 1950. Computing Machinery and Intelligence. Mind 59(236): 433–460. dx.doi.org/10.1093/mind/LIX.236.433
Turney, P. D. 2006. Similarity of Semantic Relations. Computational Linguistics 32(3): 379–416. dx.doi.org/10.1162/coli.2006.32.3.379
Weston, J.; Bordes, A.; Chopra, S.; Mikolov, T.; and Rush, A. 2015. Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks. arXiv preprint arXiv:1502.05698v6. Ithaca, NY: Cornell University Library.

Peter Clark is the senior research manager for Project Aristo at the Allen Institute for Artificial Intelligence. His work focuses on natural language processing, machine reasoning, and large knowledge bases, and the interplay among these three areas. He has received several awards, including an AAAI Best Paper award (1997), a Boeing Associate Technical Fellowship (2004), and AAAI Senior Member status (2014).

Oren Etzioni is chief executive officer of the Allen Institute for Artificial Intelligence. Beginning in 1991, he was a professor at the University of Washington's Computer Science Department. He has received several awards, including the Robert Engelmore Memorial Award (2007), the IJCAI Distinguished Paper Award (2005), AAAI Fellow (2003), and a National Young Investigator Award (1993).

How to Write Science Questions That Are Easy for People and Hard for Computers
Ernest Davis

As a challenge problem for AI systems, I propose the use of hand-constructed multiple-choice tests, with problems that are easy for people but hard for computers. Specifically, I discuss techniques for constructing such problems at the level of a fourth-grade child and at the level of a high school student. For the fourth-grade-level questions, I argue that questions that require the understanding of time, of impossible or pointless scenarios, of causality, of the human body, or of sets of objects, and questions that require combining facts or require simple inductive arguments of indeterminate length, can be chosen to be easy for people and are likely to be hard for AI programs in the current state of the art. For the high school level, I argue that questions that relate the formal science to the realia of laboratory experiments or of real-world observations are likely to be easy for people and hard for AI programs. I argue that these are more useful benchmarks than existing standardized tests such as the SATs or New York Regents tests. Since the questions in standardized tests are designed to be hard for people, they often leave many aspects of what is hard for computers but easy for people untested.
The fundamental paradox of artificial intelligence is that many intelligent tasks are extremely easy for people but extremely difficult to get computers to do successfully. This is universally known as regards basic human activities such as vision, natural language, and social interaction, but it is true of more specialized activities, such as scientific reasoning, as well. As everyone knows, computers can carry out scientific computations of staggering complexity and can hunt through immense haystacks of data looking for minuscule needles of insight or subtle, complex correlations. However, as far as I know, no existing computer program can answer the question, "Can you fold a watermelon?"

Perhaps that doesn't matter. Why should we need computer programs to do things that people can already do easily? For the last 60 years, we have relied on a reasonable division of labor: computers do what they do extremely well — calculations that are either extremely complex or require an enormous, unfailing memory — and people do what they do well — perception, language, and many forms of learning and of reasoning. However, the fact that computers have almost no commonsense knowledge and rely almost entirely on quite rigid forms of reasoning ultimately imposes a serious limitation on the capacity of science-oriented applications, including question answering; design, robotic execution, and evaluation of experiments; retrieval, summarization, and high-quality translation of scientific documents; science educational software; and sanity checking of the results of specialized software (Davis and Marcus 2016). A basic understanding of the physical and natural world at the level of common human experience, and an understanding of how the concepts and laws of formal science relate to the world as experienced, is thus a critical objective
(2005) suggest developing a program that can pass the SATs. Clark, Harrison, and Balasubramanian (2013) propose a project of passing the New York State Regents Science test for 4th graders. Strickland (2013) proposes developing an AI that can pass the entrance exams for the University of Tokyo. Ohlsson et al. (2013) evaluated the performance of a system based on ConceptNet (Havasi, Speer, and Alonso 2007) on a preprocessed form of the Wechsler Preschool and Primary Scale of Intelligence test. Barker et al. (2004) describe the construction of a knowledge-based system that (more or less) scored a 3 (passing) on two sections of the high school chemistry advanced placement test. The GEOS system (Seo et al. 2015), which answers geometry problems from the SATs, scored 49 percent on official problems and 61 percent on a corpus of practice problems. The pros and cons of using standardized tests will be discussed in detail later on in this article. For the moment, let us emphasize one specific issue: standardized tests were written to test people, not to test AIs. What people find difficult and what AIs find difficult are extremely different, almost opposite. Standardized tests include many questions that are hard for people and practically trivial for computers, such as remembering the meaning of technical terms or performing straightforward mathematical calculation. Conversely, these tests do not test scientific 14 AI MAGAZINE knowledge that “every [human] fool knows”; since everyone knows it, there is no point in testing it. However, this is often exactly the knowledge that AIs are missing. Sometimes the questions on standardized tests do test this kind of knowledge implicitly; but they do so only sporadically and with poor coverage. Another possibility would be to automate the construction of questions that are easy for people and hard for computers. The success of CAPTCHA (von Ahn et al. 2003) shows that it is possible automatically to generate images that are easy for people to interpret and hard for computers; however, that is an unusual case. Weston et al. (2015) propose to build a system that uses a world model and a linguistic model to generate simple narratives in commonsense domains. However, the intended purpose of this set of narratives is to serve as a labeled corpus for an endto-end machine-learning system. Having been generated by a well-understood world model and linguistic model, this corpus certainly cannot drive work on original, richer, models of commonsense domains, or of language, or of their interaction. Having tabled the suggestion of using existing standardized tests and having ruled out automatically constructed tests, the remaining option is to use manually designed test problems. To be a valid test for AI, such problems must be easy for people. Otherwise the test would be in danger of running into, or at least being accused of, the superhuman human fallacy, in which we set benchmarks that AI cannot attain because they are simply impossible to attain. At this point, we have reached, and hopefully to some extent motivated, the proposal of this article. I propose that it would be worthwhile to construct multiple-choice tests that will measure progress toward developing AIs that have a commonsense understanding of the natural world and an understanding of how formal science relates to the commonsense view; tests that will be easy for human subjects but difficult for existing computers. 
Moreover, as far as possible, that difficulty should arise from issues inherent to commonsense knowledge and commonsense reasoning rather than specifically from difficulties in natural language understanding or in visual interpretation, to the extent that these can be separated. These tests will collectively be called science questions appraising basic understanding — or SQUABU (pronounced skwaboo). In this article we will consider two specific tests. SQUABU-Basic is a test designed to measure commonsense understanding of the natural world that an elementary school child can be presumed to know, limited to material that is not explicitly taught in school because it is too obvious. The questions here should be easy for any contemporary child of 10 in a developed country. SQUABU-HighSchool is a test designed to measure how well an AI can integrate concepts of high school chemistry and physics with a commonsense understanding of the natural world. The questions here are designed to be reasonably easy for a student who has completed high school physics, though some may require a few minutes’ thought. The knowledge of the subject matter is intended to be basic; the problems are intended to require a conceptual understanding of the domain, qualitative reasoning about mathematical relations, and basic geometry, but do not require memory for fine details or intricate exact calculations. These two particular levels were chosen in part because the 4th grade New York Regents exam and the physics SATs are helpful points of contrast. By commonsense knowledge I emphatically do not mean that I am considering AIs that will replicate the errors, illusions, and flaws in physical reasoning that are well known to be common in human cognition. I am here interested only in those aspects of commonsense reasoning that are valid and that enhance or underlie formal scientific thinking. Because of the broad scope of the questions involved, it would be hard to be very confident, for any particular question, that AI systems will find it difficult. This is in contrast to the Winograd schema challenge (Levesque, Davis, and Morgenstern 2012), in which both the framework and the individual questions have been carefully designed, chosen, and tuned so that, with fair confidence, each individual question will be difficult for an automated system. I do not see any way to achieve that level of confidence for either level of SQUABU; there may be some questions that can be easily solved. However, I feel quite confident that at most a few questions would be easily solved. It is also difficult to be sure that an AI program will get the right answer on specific questions in the categories I’ve marked below as “easy”; AI programs have ways of getting confused or going on the wrong track that are very hard to anticipate. (An example is the Toronto problem that Watson got wrong [Welty, undated].) However, AI programs exist that can answer these kinds of questions with a large degree of accuracy. I will begin by discussing the kinds of problems that are easy for the current generation of computers; these must be avoided in SQUABU. Then I will discuss some general rules and techniques for developing questions for SQUABU-Basic and SQUABU-HighSchool. After that I will return to the issue of standardized tests, and their pros and cons for this purpose, and finally, will come the conclusion.
Problems That Are Easy for Computers As of the date of writing (May 2015), the kinds of problems that tend to arise on standardized tests that are “easy for computers” (that is, well within the state of the art) include terminology, taxonomy, and exact calculation. Terminology Retrieving the definition of (for human students) obscure jargon. For example, as Clark (2015) remarks, the following problem from the New York State 4th grade Regents Science test is easy for AI programs: The movement of soil by wind or water is known as (A) condensation (B) evaporation (C) erosion (D) friction If you query a search engine for the exact phrase “movement of soil by wind and water,” it returns dozens of pages that give that phrase as the definition of erosion. Taxonomy Constructing taxonomic hierarchies of categories and individuals organized by subcategory and instance relations can be considered a solved problem in AI. Enormous, quite accurate hierarchies of this kind have been assembled through web mining; for instance, Wu et al. (2012) report that the Probase project had 2.6 million categories and 20.7 million isA pairs, with an accuracy of 92.8 percent. Finding the features of these categories, and carrying out inheritance, particularly overridable inheritance, is certainly a less completely solved problem, but is nonetheless sufficiently solved that problems based on inheritance must be considered as likely to be easy for computers. For example, a question such as the following may well be easy: Which of the following organs does a squirrel not have: (A) a brain (B) gills (C) a heart (D) lungs? (This does require an understanding of not, which is by no means a feature of all IR programs; but it is well within the scope of current technology.) Exact Calculation Problems that involve retrieving standard exact physical formulas, and then using them in calculations, either numerical or symbolic, are easy. For example, questions such as the following SAT-level physics problems are probably easy (Kaplan [2013], p. 294): A 40 Ω resistor in a closed circuit has 20 volts across it. The current flowing through the resistor is (A) 0.5 A; (B) 2 A; (C) 20 A; (D) 80 A; (E) 800 A. A horizontal force F acts on a block of mass m that is initially at rest on a floor of negligible friction. The force acts for time t and moves the block a displacement d. The change in momentum of the block is (A) F/t; (B) m/t; (C) Fd; (D) Ft; (E) mt. The calculations are simple, and, for examples like these, finding the standard formula that matches the word problem can be done with high accuracy using standard pattern-matching techniques. One might be inclined to think that AI programs would have trouble with the kind of brain teaser in which the naïve brute-force solution is horribly complicated but there is some clever way of looking at the problem that makes it simple. However, these probably will not be effective challenges for AI. The AI program will, indeed, probably not find the clever approach; however, like John von Neumann in the well-known anecdote,1 the AI program will be able to do the brute-force calculation faster than ordinary people can work out the clever solution. SQUABU-Basic What kind of science questions, then, are easy for people and hard for computers? In this section I will consider this question in the context of SQUABU-Basic, which does not rely on book learning.
Later, I will consider the question in the context of SQUABU-HighSchool, which tests the integration of high school science with commonsense reasoning. Time In principle, representing temporal information in AI systems is almost entirely a solved problem, and carrying out temporal reasoning is largely a solved problem. The known representational systems for temporal knowledge (for example, those discussed in Reiter (2001) and in Davis (1990, chapter 5)) suffice for all but a handful of the situations that arise in temporal reasoning;2 almost all of the purely temporal inferences that come up can be justified in established temporal theories; and most of these can be carried out reasonably efficiently, though not all, and there is always room for improvement. However, in practical terms, time is often seriously neglected in large-scale knowledge-based systems, although CYC (Lenat, Prakash, and Shepherd 1986) is presumably an exception. Mitchell et al. (2015) specifically mention temporal issues as unaddressed in NELL, and systems like ConceptNet (Havasi, Speer, and Alonso 2007) seem to be entirely unsystematic in how they deal with temporal issues. More surprisingly, the abstract meaning representation (AMR),3 a recent project to manually annotate a large body of text with a formal representation of its meaning, has decided to exclude temporal information from its representation. (Frankly, I think this may well be a short-sighted decision, which will be regretted later.) Thus, there is a common impression that temporal information is either too difficult or not important enough to deal with in AI systems. Therefore, if a temporal fact is not stated explicitly, then it is likely to be hard for existing AI systems to derive. Examples include the following: Problem B.1 Sally’s favorite cow died yesterday. The cow will probably be alive again (A) tomorrow; (B) within a week; (C) within a year; (D) within a few years; (E) The cow will never be alive again. Problem B.2 Malcolm Harrison was a farmer in Virginia who died more than 200 years ago. He had a dozen horses on his farm. Which of the following is most likely to be true: (A) All of Harrison’s horses are dead. (B) Most of Harrison’s horses are dead, but a few might be alive. (C) Most of Harrison’s horses are alive, but a few might have died. (D) Probably all of Harrison’s horses are alive. Problem B.3 Every week during April, Mike goes to school from 9 AM to 4 PM, Monday through Friday. Which of the following statements is true (only one)? (A) Between Monday 9 AM and Tuesday 4 PM, Mike is always in school. (B) Between Monday 9 AM through Tuesday 4 PM, Mike is never in school. (C) Between Monday 4 PM and Friday 9 AM, Mike is never in school. (D) Between Saturday 9 AM and Monday 8 AM, Mike is never in school. (E) Between Sunday 4 PM and Tuesday 9 AM, Mike is never in school. (F) It depends on the year. With regard to question B.2, the AI can certainly find the lifespan of a horse on Wikipedia or some similar source. However, answering this question requires combining this with the additional facts that lifespan measures the time from birth to death, and that if person P owns horse H at time T, then both P and H are alive at time T. This connects to the feature “combining multiple facts” discussed later. This seems like it should be comparatively easy to do; I would not be very surprised if AI programs could solve this kind of problem 10 years from now. On the other hand, I am not aware of much research in this direction.
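To make concrete the kind of fact combination that problem B.2 requires, the following is a minimal illustrative sketch, in Python, of the inference chain; it is not drawn from any existing system. An upper bound on horse lifespan retrieved from a reference source, combined with the rule that owner and horse must both be alive at the moment of ownership, bounds the latest time at which any of the horses could still have been alive. The function name and the particular lifespan bound are assumptions made for the example.

```python
# Illustrative sketch only: the temporal/fact-combination inference behind Problem B.2.
HORSE_MAX_LIFESPAN_YEARS = 62  # generous upper bound; the oldest recorded horse lived about 62 years

def horses_could_still_be_alive(owner_died_years_ago: float,
                                max_lifespan_years: float = HORSE_MAX_LIFESPAN_YEARS) -> bool:
    """Combine two commonsense facts:
    1. If person P owned horse H at time T, then P and H were both alive at T,
       so every horse was alive (hence already born) no later than the owner's death.
    2. A horse dies at most max_lifespan_years after its birth.
    Hence every horse was dead by (owner's death + max_lifespan_years).
    """
    latest_possible_death = -owner_died_years_ago + max_lifespan_years  # years relative to now
    return latest_possible_death >= 0

# Harrison died more than 200 years ago, so answer (A) follows: all of his horses are dead.
print(horses_could_still_be_alive(owner_died_years_ago=200))  # False
```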
Inductive Arguments of Indeterminate Length AI programs tend to be bad at arguments about sequences of things of an indeterminate number. In the software verification literature, there are techniques for this, but these have hardly been integrated into the AI literature. Examples include the following: Problem B.4 Mary owns a canary named Paul. Does Paul have any ancestors who were alive in the year 1750? (A) Definitely yes. (B) Definitely no. (C) There is no way to know. Problem B.5 Tim is on a stony beach. He has a large pail. He is putting small stones one by one into the pail. Which of the following is true: (A) There will never be more than one stone in the pail. (B) There will never be more than three stones in the pail. (C) Eventually, the pail will be full, and it will not be possible to put more stones in the pail. (D) There will be more and more stones in the pail, but there will always be room for another one. Impossible and Pointless Scenarios If you cook up a scenario that is obviously impossible for no very interesting reason, then it is quite likely that no one has gone to the trouble of stating on the web that it is impossible, and that the AI cannot figure that out. Of course, if all the questions of this form have the answer “this is impossible,” then the AI or its designer will soon catch on to that fact. So these have to be counterbalanced by questions about scenarios that are in fact obviously possible, but so pointless that no one will have bothered to state that they are possible or that they occurred. Examples include the following: Problem B.6 Is it possible to fold a watermelon? Problem B.7 Is it possible to put a tomato on top of a watermelon? Problem B.8 Suppose you have a tomato and a whole watermelon. Is it possible to get the tomato inside the watermelon without cutting or breaking the watermelon? Problem B.9 Which of the following is true: (A) A female eagle and a male alligator could have a baby. That baby could either be an eagle or an alligator. (B) A female eagle and a male alligator could have a baby. That baby would definitely be an eagle. (C) A female eagle and a male alligator could have a baby. That baby would definitely be an alligator. (D) A female eagle and a male alligator could have a baby. That baby would be half an alligator and half an eagle. (E) A female eagle and a male alligator cannot have a baby. Problem B.10 If you brought a canary and an alligator together to the same place, which of the following would be completely impossible: (A) The canary could see the alligator. (B) The alligator could see the canary. (C) The canary could see what is inside the alligator’s stomach. (D) The canary could fly onto the alligator’s back. Putting Facts Together Questions that require combining facts that are likely to be expressed in separate sources are likely to be difficult for an AI. As already discussed, B.2 is an example. Another example: Problem B.15 George accidentally poured a little bleach into his milk. Is it OK for him to drink the milk, if he’s careful not to swallow any of the bleach? This requires combining the facts that bleach is a poison, that poisons are dangerous even when diluted, that bleach and milk are liquids, and that it is difficult to separate two liquids that have been mixed. Human Body Of course, people have an unfair advantage here. Problem B.16 Can you see your hand if you hold it behind your head?
Problem B.17 If a person has a cold, then he will probably get well (A) In a few minutes. (B) In a few days or a couple of weeks. (C) In a few years. (D) He will never get well. Problem B.18 If a person cuts off one of his fingers, then he will probably grow a new finger (A) In a few minutes. (B) In a few days or a couple of weeks. (C) In a few years. (D) He will never grow a new finger. Causality Many causal sequences that are either familiar or obvious are unlikely to be discussed in the corpus available. Problem B.11 Suppose you have two books that are identical except that one has a white cover and one has a black cover. If you tear a page out of the white book, what will happen? (A) The same page will fall out of the black book. (B) Another page will grow in the black book. (C) The page will grow back in the white book. (D) All the other pages will fall out of the white book. (E) None of the above. Spatial Properties of Events Basic spatial properties of events may well be difficult for an AI to determine. Problem B.12 When Ed was born, his father was in Boston and his mother was in Los Angeles. Where was Ed born? (A) In Boston. (B) In Los Angeles. (C) Either in Boston or in Los Angeles. (D) Somewhere between Boston and Los Angeles. Problem B.13 Joanne cut a chunk off a stick of cheese. Which of the following is true? (A) The weight of the stick didn’t change. (B) The stick of cheese became lighter. (C) The stick of cheese became heavier. (D) After the chunk was cut off, the stick no longer had a measurable weight. Problem B.14 Joanne stuck a long pin through the middle of a stick of cheese, and then pulled it out. Which of the following is true? (A) The stick remained the same length. (B) The stick became shorter. (C) The stick became longer. (D) After the pin is pulled out, the stick no longer has a length. Sets of Objects Physical reasoning programs are good at reasoning about problems with fixed numbers of objects, but not as good at reasoning about problems with indeterminate numbers of objects. Problem B.19 There is a jar right-side up on a table, with a lid tightly fastened. There are a few peanuts in the jar. Joe picks up the jar and shakes it up and down, then puts it back on the table. At the end, where, probably, are the peanuts? (A) In the jar. (B) On the table, outside the jar. (C) In the middle of the air. Problem B.20 There is a jar right-side up on a table, with a lid tightly fastened. There are a few peanuts on the table. Joe picks up the jar and shakes it up and down, then puts it back on the table. At the end, where, probably, are the peanuts? (A) In the jar. (B) On the table, outside the jar. (C) In the middle of the air. SQUABU-HighSchool The construction of SQUABU-HighSchool is quite different from SQUABU-Basic. SQUABU-HighSchool relies largely on the same gaps in an AI’s understanding that we have described earlier for SQUABU-Basic. However, since the object is to appraise the AI’s understanding of the relation between formal science and commonsense reasoning, the choice of domain becomes critical; the domain must be one where the relation between the two kinds of knowledge is both deep and evident to people. Figure 1: A Chemistry Experiment. One fruitful source of these kinds of domains is simple high school level science lab experiments. On the one hand, experiments draw on or illustrate concepts and laws from formal science; on the other hand, understanding the experimental setup often requires commonsense reasoning that is not easily formalized.
Experiments also must be physically manipulable by human beings and their effects must be visible (or otherwise perceptible) to human beings; thus, the AI’s understanding of human powers of manipulation and perception can also be tested. Often, an effective way of generating questions is to propose some change in the setup; this may either create a problem or have no effect. I have also found basic astronomy to be a fruitful domain. Simple astronomy involves combining general principles, basic physical knowledge, elementary geometric reasoning, and order-of-magnitude reasoning. A third category of problem is problems in everyday settings where formal scientific analysis can be brought to bear. One general caveat: I am substantially less confident that high school students would in fact do well on my sample questions for SQUABU-HighSchool than that fourth-graders would do well on the sample questions for SQUABU-Basic. I feel certain that they should do well, and that something is wrong if they do not do well, but that is a different question. Chemistry Experiment Read the following description of a chemistry experiment,4 illustrated in figure 1. A small quantity of potassium chlorate (KClO3) is heated in a test tube, and decomposes into potassium chloride (KCl) and oxygen (O2). The gaseous oxygen expands out of the test tube, goes through the tubing, bubbles up through the water in the beaker, and collects in the inverted beaker over the water. Once the bubbling has stopped, the experimenter raises or lowers the beaker until the level of the top of water inside and outside the beaker are equal. At this point, the pressure in the beaker is equal to atmospheric pressure. Measuring the volume of the gas collected over the water, and correcting for the water vapor that is mixed in with the oxygen, the experimenter can thus measure the amount of oxygen released in the decomposition. Problem H.1: If the right end of the U-shaped tube were outside the beaker rather than inside, how would that change things? (A) The chemical decomposition would not occur. (B) The oxygen would remain in the test tube. (C) The oxygen would bubble up through the water in the basin to the open air and would not be collected in the beaker. (D) Nothing would change. The oxygen would still collect in the beaker, as shown. Problem H.2: If the beaker had a hole in the base (on top when inverted as shown), how would that change things? (A) The oxygen would bubble up through the beaker and out through the hole. (B) Nothing would change. The oxygen would still collect in the beaker, as shown. (C) The water would immediately flow out from the inverted beaker into the basin and the beaker would fill with air coming in through the hole. Problem H.3 If the test tube, the beaker, and the U-tube were all made of stainless steel rather than glass, how would that change things? (A) Physically it would make no difference, but it would be impossible to see and therefore impossible to measure. (B) The chemical decomposition would not occur. (C) The oxygen would seep through the stainless steel beaker. (D) The beaker would break. (E) The potassium chloride would accumulate in the beaker. Problem H.4 Suppose the stopper in the test tube were removed, but that the U-tube has some other support that keeps it in its current position. How would that change things? (A) The oxygen would stay in the test tube.
(B) All of the oxygen would escape to the outside air. (C) Some of the oxygen would escape to the outside air, and some would go through the U-shaped tube and bubble up to the beaker. So the beaker would get some oxygen but not all the oxygen. Problem H.5 The experiment description says, “The experimenter raises or lowers the beaker until the level of the top of water inside and outside the beaker are equal. At this point, the pressure in the beaker is equal to atmospheric pressure.” More specifically: Suppose that after the bubbling has stopped, the level of water in the beaker is higher than the level in the basin (as seems to be shown in the right-hand picture). Which of the following is true: (A) The pressure in the beaker is lower than atmospheric pressure, and the beaker should be lowered. (B) The pressure in the beaker is lower than atmospheric pressure, and the beaker should be raised. (C) The pressure in the beaker is higher than atmospheric pressure, and the beaker should be lowered. (D) The pressure in the beaker is higher than atmospheric pressure, and the beaker should be raised. Problem H.6 Suppose that instead of using a small amount of potassium chlorate, as shown, you put in enough to nearly fill the test tube. How will that change things? (A) The chemical decomposition will not occur. (B) You will generate more oxygen than the beaker can hold. (C) You will generate so little oxygen that it will be difficult to measure. Problem H.7 In addition to the volume of the gas in the beaker, which of the following are important to measure accurately? (A) The initial mass of the potassium chlorate. (B) The weight of the beaker. (C) The diameter of the beaker. (D) The number and size of the bubbles. (E) The amount of liquid in the beaker. Problem H.8 The illustration shows a graduated beaker. Suppose instead you use an ungraduated glass beaker. How will that change things? (A) The oxygen will not collect properly in the beaker. (B) The experimenter will not know whether to raise or lower the beaker. (C) The experimenter will not be able to measure the volume of gas. Problem H.9 At the start of the experiment, the beaker needs to be full of water, with its mouth in the basin below the surface of the water in the basin. How is this state achieved? (A) Fill the beaker with water rightside up, turn it upside down, and lower it upside down into the basin. (B) Put the beaker rightside up into the basin below the surface of the water; let it fill with water; turn it upside down keeping it underneath the water; and then lift it upward, so that the base is out of the water, but keeping the mouth always below the water. (C) Put the beaker upside down into the basin below the surface of the water; and then lift it back upward, so that the base is out of the water, but keeping the mouth always below the water. (D) Put the beaker in the proper position, and then splash water upward from the basin into it. (E) Put the beaker in its proper position, with the mouth below the level of the water; break a small hole in the base of the beaker; suction the water up from the basin into the beaker using a pipette; then fix the hole. Millikan Oil-Drop Experiment Problem H.10: In the Millikan oil-drop experiment, a tiny oil drop charged with a single electron was suspended between two charged plates (figure 2). The charge on the plates was adjusted until the electric force on the drop exactly balanced its weight. How were the plates charged? (A) Both plates had a positive charge. 
(B) Both plates had a negative charge. (C) The top plate had a positive charge, and the bottom plate had a negative charge. (D) The top plate had a negative charge, and the bottom plate had a positive charge. (E) The experiment would work the same, no matter how the plates were charged. Problem H.11: If the oil drop started moving upward, Millikan would (A) Increase the charge on the plates. (B) Reduce the charge on the plates. (C) Increase the charge on the drop. (D) Reduce the charge on the drop. (E) Make the drop heavier. (F) Make the drop lighter. (G) Lift the bottom plate. Problem H.12: If the oil drop fell onto the bottom plate, Millikan would (A) Increase the charge on the plates. (B) Reduce the charge on the plates. (C) Increase the charge on the drop. (D) Reduce the charge on the drop. (E) Start over with a new oil drop. Problem H.13: The experiment demonstrated that the charge is quantized; that is, the charge on an object is always an integer multiple of the charge of the electron, not a fractional or other noninteger multiple. To establish this, Millikan had to measure the charge on (A) One oil drop. (B) Two oil drops. (C) Many oil drops. Astronomy Problems Problem H.14: Does it ever happen that there is an eclipse of the sun one day and an eclipse of the moon the next? Problem H.15: Does it ever happen that someone on Earth sees an eclipse of the moon shortly after sunset? Problem H.16: Does it ever happen that someone on Earth sees an eclipse of the moon at midnight? Problem H.17: Does it ever happen that someone on Earth sees an eclipse of the moon at noon? Problem H.18: Does it ever happen that one person on Earth sees a total eclipse of the moon, and at exactly the same time another person sees the moon uneclipsed? Problem H.19: Does it ever happen that one person on Earth sees a total eclipse of the sun, and at exactly the same time another person sees the sun uneclipsed? Problem H.20: Suppose that you are standing on the moon, and Earth is directly overhead. How soon will Earth set? (A) In about a week. (B) In about two weeks. (C) In about a month. (D) Earth never sets. Problem H.21: Suppose that you are standing on the moon, and the sun is directly overhead. How soon will the sun set? (A) In about a week. (B) In about two weeks. (C) In about a month. (D) The sun never sets. Problem H.22: You are looking in the direction of a particular star on a clear night. The planet Mars is on a direct line between you and the star. Can you see the star? Problem H.23: You are looking in the direction of a particular star on a clear night. A small planet orbiting the star is on a direct line between you and the star. Can you see the star? Problem H.24: Suppose you were standing on one of the moons of Jupiter. Ignoring the objects in the solar system, which of the following is true: (A) The pattern of stars in the sky looks almost identical to the way it looks on Earth. (B) The pattern of stars in the sky looks very different from the way it looks on Earth. Problem H.25: Nearby stars exhibit parallax due to the annual motion of Earth. If a star is nearby, and is in the plane of Earth’s revolution, and you track its relative motion against the background of very distant stars over the course of a year, what figure does it trace? (A) A straight line. (B) A square. (C) An ellipse. (D) A cycloid.
Problem H.26: If a star is nearby, and the line from Earth to the star is perpendicular to the plane of Earth’s revolution, and you track its relative motion against the background of very distant stars over the course of a year, what figure does it trace? (A) A straight line. (B) A square. (C) An ellipse. (D) A cycloid. Problems in Everyday Settings Problem H.27: Suppose that you have a large closed barrel. Empty, the barrel weighs 1 kg. You put into the barrel 10 gm of water and 1 gm of salt, and you dissolve the salt in the water. Then you seal the barrel tightly. Over time, the water evaporates into the air in the barrel, leaving the salt at the bottom. If you put the barrel on a scale after everything has evaporated, the weight will be (A) 1000 gm (B) 1001 gm (C) 1010 gm (D) 1011 gm (E) Water cannot evaporate inside a closed barrel. Problem H.28: Suppose you are in a room where the temperature is initially 62 degrees. You turn on a heater, and after half an hour, the temperature throughout the room is now 75 degrees, so you turn off the heater. The door to the room is closed; however, there is a gap between the door and the frame, so air can go in and out. Assume that the temperature and pressure outside the room remain constant over the time period. Comparing the air in the room at the start to the air in the room at the end, which of the following is true: (A) The pressure of the air in the room has increased. (B) The air in the room at the end occupies a larger volume than the air in the room at the beginning. (C) There is a net flow of air into the room during the half hour period. (D) There is a net flow of air out of the room during the half hour period. (E) Impossible to tell from the information given. Problem H.29: The situation is the same as in problem H.28, except that this time the room is sealed, so that no air can pass in or out. Which of the following is true: (A) The pressure of the air in the room has increased. (B) The pressure of the air in the room has decreased. (C) The air in the room at the end occupies a larger volume than the air in the room at the beginning. (D) The air in the room at the end occupies a smaller volume than the air in the room at the beginning. (E) The ideal gas constant is larger at the end than at the beginning. (F) The ideal gas constant is smaller at the end than at the beginning. Problem H.30: You blow up a toy rubber balloon, and tie the end shut. The air pressure in the balloon is: (A) Lower than the air pressure outside. (B) Equal to the air pressure outside. (C) Higher than the air pressure outside. Apparent Advantages of Standardized Tests An obvious alternative to creating our own SQUABU test is to use existing standardized tests. However, it seems to me that the apparent advantages of using standardized tests as benchmarks are mostly either minor or illusory. The advantages that I am aware of are the following: Standardized Tests Exist Standardized tests exist, in large number; they do not have to be created. This “argument from laziness” is not entirely to be sneezed at. The experience of the computational linguistics community shows that, if you take evaluation seriously, developing adequate evaluation metrics and test materials requires a very substantial effort. However, the experience of the computational linguistics community also suggests that, if you take evaluation seriously, this effort cannot be avoided by using standardized tests.
No one in the computational linguistics community would dream of proposing that progress in natural language processing (NLP) should be evaluated in terms of scores on the English language SATs. Investigator Bias Entrusting the issue of evaluation measures and benchmarks to the same physical reasoning community that is developing the programs to be evaluated is putting the foxes in charge of the chicken coops. The AI researchers will develop problems that fit their own ideas of how the problems should be solved. This is certainly a legitimate concern; but I expect in practice much less distortion will be introduced this way than by taking tests developed for testing people and applying them to AI. Vetting and Documentation Standardized tests have been carefully vetted and the performance of the human population on them is very extensively documented. On the first point, it is not terribly difficult to come up with correct tests. On the second point, there is no great value to the AI community in knowing how well humans of different ages, training, and so on do on this problem. It hardly matters which questions can be solved by 5 year olds, which by 12 year olds, and which by 17 year olds, since, for the foreseeable future, all AI programs of this kind will be idiot savants (when they are not simply idiots), capable of superhuman calculations at one minute, and subhuman confusions at the next. There is no such thing as the mental age of an AI program; the abilities and disabilities of an AI program do not correspond to those of any human being who has ever existed or could ever exist. Public Acceptance Success on standardized tests is easily accepted by the public (in the broad sense, meaning everyone except researchers in the area), whereas success on metrics we have defined ourselves requires explanation, and will necessarily be suspect. This, it seems to me, is the one serious advantage of using standardized tests. Certainly the public is likely to take more interest in the claim that your program has passed the SAT, or even the fourth-grade New York Regents test, than in the claim that it has passed a set of questions that you yourself designed and whose most conspicuous feature is that they are spectacularly easy. However, this is a double-edged sword. The public can easily jump to the conclusion that, since an AI program can pass a test, it has the intelligence of a human that passes the same test. For example, Ohlsson et al. (2013) titled their paper “Verbal IQ of a Four-Year Old Achieved by an AI System.”5 Unfortunately, this title was widely misinterpreted as a claim about verbal intelligence or even general intelligence. Thus, an article in ComputerWorld (Gaudin 2013) had the headline “Top Artificial Intelligence System Is As Smart As a 4-Year-Old;” the Independent published an article “AI System Found To Be as Clever as a Young Child after Taking IQ Test;” and articles with similar titles were published in many other venues. These headlines are of course absurd; a four-year old can make up stories, chat, occasionally follow directions, invent words, learn language at an incredible pace; ConceptNet (the AI system in question) can do none of these. Unpublished Finally, some standardized tests, including the SATs, are not published and are available to researchers only under stringent nondisclosure agreements. It seems to me that AI researchers should under no circumstances use such a test with such an agreement. 
The loss from the inability to discuss the program’s behavior on specific examples far outweighs the gain from using a test with the imprimatur of the official test designer. This applies equally to Halloun and Hestenes’ (1985) well-known basic physics test; in any case, it would seem from the published information that that test focuses on testing understanding of force and energy rather than testing the relation of formal physics to basic world knowledge. The same applies to the restrictions placed by kaggle.com on the use of their data sets. Standardized tests carry an immense societal burden and must meet a wide variety of very stringent constraints. They are taken by millions of students annually under very plain testing circumstances (no use of calculators, let alone Internet). They bear a disproportionate share in determining the future of those students. They must be fair across a wide range of students. They must conform to existing curricula. They must maintain a constant level of difficulty, both across the variants offered in any one year, and from one year to the next. They are subject to intense scrutiny by large numbers of critics, many of them unfriendly. These constraints impose serious limitations on what can be asked and how exams can be structured. In developing benchmarks for AI physical reasoning, we are subject to none of these constraints. Why tie our own hands, by confining ourselves to standardized tests? Why not take advantage of our freedom? Conclusion I have not worked out all the practical issues that would be involved in actually offering one of the SQUABU tests as an AI challenge, but I feel confident that it can be done, if there is any interest in it. The kind of knowledge tested in SQUABU is, of course, only a small part of the knowledge of science that a K–12 student possesses; however, it is one of the fundamental bases underlying all scientific knowledge. An AI system for general scientific knowledge that cannot pass the SQUABU challenge, no matter how vast its knowledge base and how powerful its reasoning engine, is built on sand. Acknowledgements Thanks to Peter Clark, Gary Marcus, and Andrew Sundstrom for valuable feedback. Notes 1. See Nasar (1998), p. 80. 2. There may be some unresolved issues in the theory of continuously branching time. 3. amr.isi.edu. 4. Do not attempt to carry out this experiment based on the description here. Potassium chlorate is explosive, and safety precautions, not described here, must be taken. 5. They have since changed the title to Measuring an Artificial Intelligence System’s Performance on a Verbal IQ Test for Young Children. References Barker, K.; Chaudhri, V. K.; Chaw, S. Y.; Clark, P.; Fan, J.; Israel, D.; Mishra, S.; Porter, B.; Romero, P.; Tecuci, D.; and Yeh, P. 2004. A Question-Answering System for AP Chemistry: Assessing KR&R Technologies. In Principles of Knowledge Representation and Reasoning: Proceedings of the Ninth International Conference. Menlo Park, CA: AAAI Press. Brachman, R.; Gunning, D.; Bringsjord, S.; Genesereth, M.; Hirschman, L.; and Ferro, L. 2005. Selected Grand Challenges in Cognitive Science. MITRE Technical Report 051218. Bedford, MA: The MITRE Corporation. Brown, T. L.; LeMay, H. E.; Bursten, B.; and Burdge, J. R. 2003. Chemistry: The Central Science, ninth edition. Upper Saddle River, NJ: Prentice Hall. Clark, P., and Etzioni, O. 2016. My Computer Is an Honor Student — But How Intelligent Is It? Standardized Tests as a Measure of AI. AI Magazine 37(1). Clark, P.; Harrison, P.; and Balasubramanian, N. 2013. A Study of the Knowledge Base Requirements for Passing an Elementary Science Test. In AKBC’13: Proceedings of the 2013 Workshop on Automated Knowledge Base Construction. New York: Association for Computing Machinery. Davis, E. 1990. Representations of Commonsense Knowledge. San Mateo, CA: Morgan Kaufmann. Davis, E., and Marcus, G. 2016. The Scope and Limits of Simulation in Automated Reasoning. Artificial Intelligence 233(April): 60–72. dx.doi.org/10.1016/j.artint.2015.12.003 Gaudin, S. 2013. Top Artificial Intelligence System Is as Smart as a 4-Year Old. Computerworld, July 15, 2013. Halloun, I., and Hestenes, D. 1985. The Initial Knowledge State of College Physics Students. American Journal of Physics 53(11): 1043–1055. dx.doi.org/10.1119/1.14030 Havasi, C.; Speer, R.; and Alonso, J. 2007. ConceptNet 3: A Flexible Multilingual Semantic Network for Common Sense Knowledge. Paper presented at the Recent Advances in Natural Language Processing Conference, Borovets, Bulgaria, September 27–29. Kaplan. 2013. Kaplan SAT Subject Test: Physics 2013–2014. New York: Kaplan Publishing. Lenat, D.; Prakash, M.; and Shepherd, M. 1986. CYC: Using Common Sense Knowledge to Overcome Brittleness and Knowledge Acquisition Bottlenecks. AI Magazine 6(4): 65–85. Levesque, H.; Davis, E.; and Morgenstern, L. 2012. The Winograd Schema Challenge. In Principles of Knowledge Representation and Reasoning: Proceedings of the Thirteenth International Conference. Palo Alto, CA: AAAI Press. Mitchell, T.; Cohen, W.; Hruschka, E.; Talukdar, P.; Betteridge, J.; Carlson, A.; Dalvi, B.; Gardner, M.; Kisiel, B.; Krishnamurthy, J.; Lao, N.; Mazaitis, K.; Mohamed, T.; Nakashole, N.; Platanios, E.; Ritter, A.; Samadi, M.; Settles, B.; Wang, R.; Wijaya, D.; Gupta, A.; Chen, X.; Saparov, A.; Greaves, M.; and Welling, J. 2015. Never-Ending Learning. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence. Palo Alto, CA: AAAI Press. Nasar, S. 1998. A Beautiful Mind: The Life of Mathematical Genius and Nobel Laureate John Nash. New York: Simon and Schuster. New York State Education Department. 2014. The Grade 4 Elementary-Level Science Test. Albany, NY: University of the State of New York. Ohlsson, S.; Sloan, R. H.; Turán, G.; and Urasky, A. 2013. Verbal IQ of a Four-Year Old Achieved by an AI System. Paper presented at the Eleventh International Symposium on Logical Foundations of Commonsense Reasoning, Ayia Napa, Cyprus, 27–29 May. Reiter, R. 2001. Knowledge in Action: Logical Foundations for Specifying and Implementing Dynamical Systems. Cambridge, Mass.: The MIT Press. Seo, M.; Hajishirzi, H.; Farhadi, A.; Etzioni, O.; and Malcolm, C. 2015. Solving Geometry Problems: Combining Text and Diagram Interpretation. In EMNLP 2015: Proceedings of the Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA: Association for Computational Linguistics. dx.doi.org/10.18653/v1/D15-1171 Strickland, E. 2013. Can an AI Get into the University of Tokyo? IEEE Spectrum, 21 August. dx.doi.org/10.1109/mspec.2013.6587172 von Ahn, L.; Blum, M.; Hopper, N.; and Langford, J. 2003. CAPTCHA: Using Hard AI Problems for Security. In Proceedings of the Annual International Conference on the Theory and Applications of Cryptographic Techniques (EUROCRYPT-03). Carson City, NV: International Association for Cryptologic Research. dx.doi.org/10.1007/3-540-39200-9_18 Welty, C. undated. Why Toronto? Unpublished MS.
Weston, J.; Bordes, A.; Chopra, S.; Mikolov, T.; and Rush, A. 2015. Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks. arXiv preprint arXiv:1502.05698v6. Ithaca, NY: Cornell University Library. Wu, W.; Li, H.; Wang, H.; and Zhu, K. Q. 2012. Probase: A Probabilistic Taxonomy for Text Understanding. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, 481–492. New York: Association for Computing Machinery. dx.doi.org/10.1145/2213836.2213891 Ernest Davis is a professor of computer science at New York University. His research area is automated commonsense reasoning, particularly commonsense spatial and physical reasoning. He is the author of Representing and Acquiring Geographic Knowledge (1986), Representations of Commonsense Knowledge (1990), and Linear Algebra and Probability for Computer Science Applications (2012); and coeditor of Mathematics, Substance and Surmise: Views on the Meaning and Ontology of Mathematics (2015). Toward a Comprehension Challenge, Using Crowdsourcing as a Tool Praveen Paritosh, Gary Marcus Human readers comprehend vastly more, and in vastly different ways, than any existing comprehension test would suggest. An ideal comprehension test for a story should cover the full range of questions and answers that humans would expect other humans to reasonably learn or infer from a given story. As a step toward these goals we propose a novel test, the crowdsourced comprehension challenge (C3), which is constructed by repeated runs of a three-person game, the Iterative Crowdsourced Comprehension Game (ICCG). ICCG uses structured crowdsourcing to comprehensively generate relevant questions and supported answers for arbitrary stories, whether fiction or nonfiction, presented across a variety of media such as videos, podcasts, and still images. Artificial Intelligence (AI) has made enormous advances, yet in many ways remains superficial. While the AI scientific community had hoped that by 2015 machines would be able to read and comprehend language, current models are typically superficial, capable of understanding sentences in limited domains (such as extracting movie times and restaurant locations from text) but without the sort of wide-coverage comprehension that we expect of any teenager. Comprehension itself extends beyond the written word; most adults and children can comprehend a variety of narratives, both fiction and nonfiction, presented in a wide variety of formats, such as movies, television and radio programs, written stories, YouTube videos, still images, and cartoons. They can readily answer questions about characters, setting, motivation, and so on. No current test directly investigates such a variety of questions or media. The closest things that one might find are tests like the comprehension questions in a verbal SAT, which only assess reading (video and other formats are excluded) and tend to emphasize tricky questions designed to discriminate between strong and weak human readers. Basic questions that would be obvious to most humans — but perhaps not to a machine — are excluded. Yet it is hard to imagine an adequate general AI that could not comprehend with at least the same sophistication and breadth as an average human being, and easy to imagine that progress in building machines with deeper comprehension could radically alter the state of the art.
Machines that could comprehend with the sophistication and breadth of humans could, for instance, learn vastly more than current systems from unstructured texts such as Wikipedia and the daily news. How might one begin to test broad-coverage comprehension in a machine? In principle, the classic Turing test might be one way to assess the capacity of a computer to comprehend a complex discourse, such as a narrative. In practice, the Turing test has proved to be highly gameable, especially as implemented in events such as the Loebner competitions, in which the tests are too short (a few minutes) to allow any depth (Shieber 1994; Saygin, Cicekli, and Akman 2003). Furthermore, empirical experimentation has revealed that the best way to “win” the Turing test is to evade most questions, answering with jokes and diversionary tactics. This winds up teaching us little about the capacity of machines to comprehend narratives, fictional or otherwise. As part of the Turing Championships, we (building on Marcus [2014]) would like to see a richer test of comprehension, one that is less easily gamed, and one that probes more deeply into the capacity of machines to understand materials that might be read or otherwise perceived. We envision that such a challenge might be structured into separate tracks for audio, video, still images, images with captions, and so forth, including both fiction and nonfiction. But how might one generate the large number of questions that provide the requisite breadth and depth? Li et al. (forthcoming) suggest one strategy, focused on generating “journalist-style” questions (who, what, when, where, why) for still images.1 Poggio and Meyers (2016) and Zitnick et al. (2016) suggest approaches aimed at testing question answering from still images. Here, we suggest a more general procedure, suitable for a variety of media and a broad range of questions, using crowdsourcing as the primary engine. In the remainder of this article we briefly examine what comprehension consists of, discuss some existing approaches to assessing it, present desiderata for a comprehension challenge, and then turn toward crowdsourcing and how it can help define a meaningful comprehension challenge. What Is Human Comprehension? Human comprehension entails identifying the meaning of a text as a connected whole, beyond a series of individual words and sentences (Kintsch and van Dijk 1978, Anderson and Pearson 1984, Rapp et al. 2007). Comprehension reflects the degree to which appropriate, meaningful connections are established between elements of text and the reader’s prior knowledge. Referential and causal/logical relations are particularly important in establishing coherence, by enabling readers to keep track of objects, people, events, and the relational information connecting facts and events mentioned in the text. These relations that readers must infer are not necessarily obvious. They can be numerous and complex; extend over long spans of the text; involve extensive background commonsense, social, cultural, and world knowledge; and require coordination of multiple pieces of information. Human comprehension involves a number of different cognitive processes.
Davis (1944), for instance, describes a still-relevant taxonomy of different skills tested in reading comprehension tests, and shows empirical evidence regarding performance variance across these nine different skills: knowledge of word meanings; ability to select the appropriate meaning for a word or phrase in light of its particular contextual setting; ability to follow the organization of a passage and to identify antecedents and references in it; selecting the main thought of a passage; answering questions that are specifically answered in a passage; answering questions that are answered in a passage but not in the words in which the question is asked; drawing inferences from a passage about its content; recognition of literary devices used in a passage and determination of its tone and mood; inferring a writer’s purpose, intent, and point of view. Subsequent research into comprehension examining long-term performance data of humans shows that comprehension is not a single gradable dimension, but comprises many distinct skills (for example, Keenan, Betjemann, and Olson [2008]). Most extant work examines small components of comprehension, rather than the capacity of machines to comprehend a complete discourse in its entirety. Existing Approaches for Measuring Machine Comprehension How can we test progress in this area? In this section, we summarize current approaches to measuring machine comprehension. AI has a wide variety of evaluations in the form of shared evaluations and competitions, many of which bear on the question of machine comprehension. For example, TREC-8 (Voorhees 1999) introduced the question-answering track in which the participants were given a collection of documents and asked to answer factoid questions such as “How many calories are in a Big Mac?” or “Where is the Taj Mahal?” This led to a body of research in applying diverse techniques in information retrieval and structured databases to question answering and comprehension tasks (Hirschman and Gaizauskas 2001). The Recognizing Textual Entailment (RTE) Challenge (Dagan, Glickmann, and Magnini 2006) is another competition with relevance to comprehension. Given two text fragments, the task requires recognizing whether the meaning of one text is entailed by (can be inferred from) the other text. From 2004 to 2013, eight RTE Challenges were organized with the aim of providing researchers with concrete data sets on which to evaluate and compare their approaches. Neither the TREC nor the RTE competition, however, addresses the breadth and depth of human comprehension we seek.
2016), for instance, can be seen as comprehension in a microcosm: a single story in a single sentence or very short passage with a single binary question that can in principle be reliably answered only by a system that has some commonsense knowledge. In each question there is a special word, such as that underlined in the following example, that can be replaced by an alternative word in a way that fundamentally changes the sentence’s meaning. The trophy would not fit into the brown suitcase because it was too big/small. What was too big/small? Answer 0: the trophy Answer 1: the suitcase In each example, the reader’s challenge is to disambiguate the passage. By design, clever tricks involving word order or other features of words or groups of words will not work. In the example above, contexts where “big” can appear are statistically quite similar to those where “small” can appear, and yet the answer must change. The claim is that doing better than guessing requires readers to figure out what is going on; for example, a failure to fit is caused by one of the objects being too big and the other being too small, and readers must determine which is which. SQUABU, for “science questions appraising basic understanding” (Davis 2016), generalizes this approach into a test-construction methodology and presents a series of test materials for machines at fourth-grade and high school levels. Unlike the human counterparts of such tests, which focus on academic material, these tests focus on commonsense knowledge such as the understanding of time, causality, impossible or pointless scenarios, the human body, combining facts, making simple inductive arguments of indeterminate length, relating formal science to the real world, and so forth. Here are two example questions from SQUABU for fourth-grade level: Sally’s favorite cow died yesterday. The cow will probably be alive again (A) tomorrow; (B) within a week; (C) within a year; (D) within a few years; (E) The cow will never be alive again. Is it possible to fold a watermelon? Winograd schemas and SQUABU demonstrate some areas where standardized tests lack coverage for testing machines. Both tests, however, are entirely generated by experts and are difficult to scale to large numbers of questions and domains; neither is directed at broad-coverage comprehension. Desiderata for a Comprehension Challenge In a full-coverage test of comprehension, one might want to be able to ask a much broader range of questions. Suppose, for example, that a candidate software program is confronted with a just-published spy thriller, for which there are no web-searchable CliffsNotes yet written. An adequate system (Marcus 2014, Schank 2013) should be able to answer questions such as the following: Who did what to whom? Who was the protagonist? Was the CIA director good or evil? Which character leaked the secrets? What were those secrets? What did the enemy plan to do with those secrets? Where did the protagonist live? Why did the protagonist fly to Moscow? How does the story make the reader/writer feel? And so forth. A good comprehension challenge should evaluate the full breadth and depth of human comprehension, not just knowledge of common sense. To our knowledge, no previous test or challenge has tried to do this in a general way. Another concern with existing test-construction methodology for putative comprehension challenges is the lack of transparency in the test creation and curation process.
Namely, why does a test favor some questions and certain formulations over others? There is a central, often-unspoken role of the test curator in choosing the questions to ask, which is a key aspect of the comprehension task. Given a news article, story, movie, podcast, novel, radio program, or photo — referred to as a document from this point forward — an adequate test should draw from a full breadth of all document-relevant questions with document-supported answers that humans can infer. We suggest that the coverage goal of the comprehension challenge can be phrased as an empirical statement: A comprehension test should cover the full range of questions and answers that humans would expect other humans to reasonably learn or infer from a given document. How can we move toward this goal? The C3 Test We suggest that the answer begins with crowdsourcing. Previous work has shown that crowdsourcing can be instrumental in creating large-scale shared data sets for evaluation and benchmarking. The major benefits of crowdsourcing are enabling scaling to broader coverage (for example, of domains, languages), building significantly larger data sets, and capturing broader sets of answers (Arroyo and Welty 2014), as well as gathering empirical data regarding reliability and validity of the test (Paritosh 2012). Imagenet (Deng et al. 2009), for example, is a large-scale crowdsourced image database consisting of 14 million images with over a million human annotations, organized by the Wordnet lexicon; it has been a catalyst for recent computer vision research with deep convolutional networks (Krizhevsky, Sutskever, and Hinton 2012). Freebase (Bollacker et al. 2008) is a large database of human-curated structured knowledge that has similarly sparked research on fact extraction (Mintz et al. 2009; Riedel, Yao, and McCallum 2010). Christoforaki and Ipeirotis (2014) present a methodology for crowdsourcing the construction of tests using the questions and answers on the community question-answering site Stack Overflow.2 This work shows that open-ended question and answer content can be turned into multiple-choice questions using crowdsourcing. Using item response theory on crowdsourced performance on the test items, they were able to identify the relative difficulty of each question. MCTEST (Richardson, Burges, and Renshaw 2013) is a crowdsourced comprehension test corpus that consists of approximately 600 fictional stories written by Amazon Mechanical Turk crowd workers. Additionally, the crowd workers generated multiple-choice questions and their correct answers, as well as plausible but incorrect answers. The workers were given guidelines regarding the story, questions, and answers, such as that they should ask questions that make use of information in multiple sentences. The final test corpus was produced by manual curation of the resulting stories, questions, and answers. This approach is promising, as it shows that it is possible to generate comprehension tests using crowdsourcing. However, much like the standardized and commonsense tests, the test-curation process here is neither entirely transparent nor generalizable to other types of documents and questions. The question at hand is whether we can design reliable processes for crowdsourcing the construction of comprehension tests that provide us with measurable signals and guarantees of quality, relevance, and coverage, not just whether we can design a test.
As an alternative, and as a starting point for further discussion, we propose here a crowdsourced comprehension challenge (C3). At the root is a document-focused imitation game, which we call the iterative crowdsourcing comprehension game (ICCG), the goal of which is to generate a systematic and comprehensive set of questions and validated answers relevant to any given document (video, text story, podcast, or other). Participants are incentivized to explore questions and answers exhaustively, until the game terminates with an extensive set of questions and answers. The C3 is then produced by aggregating and curating questions and answers generated from multiple iterations of the ICCG.

The structure, which necessarily depends on cooperative yet independent judgments from multiple humans, is inspired partly by Luis von Ahn's work. For example, in the two-player ESP game (von Ahn and Dabbish 2004) for image labeling, the goal is to guess what label your partner would give to the image. Once both players have typed the exact same string, they win the round, and a new image appears. This game and others in the games with a purpose series (von Ahn 2006) introduced the methodology of input agreement (Law and von Ahn 2009), where the goal of the participants is to try to agree on the input, encouraging them to model the other participant. The ICCG extends this to a three-person imitation game, itself partially in the spirit of Turing's original test (Turing 1950).

The Iterative Crowdsourcing Comprehension Game

The iterative crowdsourcing comprehension game (ICCG) is a three-person game. Participants are randomly assigned to fill one of three roles in each run of the game: reader (R), guesser (G), or judge (J). Players are sampled from a norming population of interest (for example, one might make tests at the second-grade level or college level). They should not know each other and should be identified only by anonymized screen names that are randomly assigned afresh in each round. They cannot communicate with each other besides the allowed game interactions. Only the judge and the reader have access to the document (as defined earlier, text, image, video, podcast, and others); the guesser is never allowed to see it. The only thing readers and judges have in common is this document that they can both comprehend. The purpose of the game is to generate a comprehensive set of document-relevant questions (with corresponding document-supported answers) as an outcome. The judge's goal is to identify who is the genuine document holder. The reader's goal is to prove possession of the document, by asking document-relevant questions and by providing document-supported answers. The guesser's goal is to appear to possess the document, by learning from prior questions and answers.

A game consists of a sequence of rounds, as depicted in figure 1. A shared whiteboard is used for keeping track of questions and answers, which are published at the end of each round. The whiteboard is visible to all participants and allows the guesser to learn about the content of the document as the game proceeds. (Part of the fun for the guesser lies in making leaps from the whiteboard in order to make educated guesses about new questions.) Each round begins with randomly assigning either the reader or the guesser to play the questioner for the round. The questioner writes down a question for this round.
The reader's goal, while playing questioner, is to ask novel questions that have reliable document-supported answers. As the game proceeds, the reader is incentivized to exhaust the space of document-supported questions to be distinguished from the guesser. The reader, as questioner, does not earn points for asking questions that the guesser could answer correctly using nondocument knowledge or conclude from prior questions and answers on the whiteboard. When the questioner is the guesser, their goal is to ask revealing questions to learn as much about the story as quickly as possible.

At this point we have a question, from either the reader or guesser. The question is shared with the other participant,3 who independently answers. The judge is presented with both the question and the two answers with authors anonymized and attempts to identify which one is the reader. This anonymization is done afresh for the next round. The objective of both the reader and guesser is to be chosen as the reader by the judge, so both are incentivized to ask questions and generate answers that will convince the judge that they are in possession of the document. The round is scored using this simple rubric: The judge earns a point for identifying the reader correctly, and the reader or guesser earns a point for being identified as the document holder by the judge. At the end of each round, the question and the reader's and guesser's answers are published on the whiteboard.

The reader's job is exhaustively to ask document-relevant questions, without generating questions that the guesser could extract from the accumulated whiteboard notes; the guesser's job is to glean as much information as possible to improve at guessing. Initially, it is very easy for the judge to identify the reader. However, roughly every other round the guesser (when chosen to be the questioner) gets to ask a question and learn the reader's and judge's answers to that question. The main strategic goal of the guesser is to erode their disadvantage, the lack of access to the document, as quickly as possible. For example, the guesser might begin by asking basic information-gathering questions: who, what, where, when, why, and how questions.4 The increased knowledge of the document revealed through the questions and answers should improve guessing performance over rounds.

[Figure 1. The Iterative Crowdsourcing Comprehension Game. Flowchart: START: randomly assign participants to the roles of reader, guesser, and judge; reader and judge review the document. Questioner generates question: the questioner is randomly chosen from reader and guesser and generates a question. Reader and guesser answer: reader and guesser answer the question independently. Judge attempts to identify the reader: the judge is presented the question and both answers, with anonymous screen names refreshed per round. Round is scored: judge, a point for identifying the reader correctly; reader, a point if identified by the judge; guesser, a point if identified by the judge. Whiteboard is updated: the round's question, reader and guesser answers, and judge's score are published to the whiteboard. Judge better than chance? Yes: play another round. No: STOP; the whiteboard contains a comprehensive set of questions and answers for the given document.]

The game concludes when all attempts at adding further questions fail to discriminate between the guesser and reader. This implies that the corpus of questions and answers collected on the whiteboard is a comprehensive set, that is, sufficient to provide an understanding comparable to having read the document.
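To make the round protocol concrete, here is a minimal sketch of one ICCG round. It is our own Python illustration of the rules described above: the roles, the anonymized presentation to the judge, the scoring rubric, and the whiteboard record come from the text, while the function signatures (ask, answer, identify) stand in for human players and are assumptions.

```python
import random
from dataclasses import dataclass, field

@dataclass
class RoundRecord:
    questioner: str        # "reader" or "guesser"
    question: str
    reader_answer: str
    guesser_answer: str
    judge_pick: str        # role the judge believed was the reader
    scores: dict

@dataclass
class Whiteboard:
    rounds: list = field(default_factory=list)

def play_round(reader, guesser, judge, whiteboard):
    """reader, guesser, and judge are objects standing in for human players."""
    # Randomly assign the questioner for this round.
    questioner = random.choice(["reader", "guesser"])
    asker = reader if questioner == "reader" else guesser
    question = asker.ask(whiteboard)

    # Reader and guesser answer independently.
    r_ans = reader.answer(question, whiteboard)
    g_ans = guesser.answer(question, whiteboard)

    # The judge sees both answers under fresh anonymous labels.
    labels = ["A", "B"]
    random.shuffle(labels)
    shown = {labels[0]: r_ans, labels[1]: g_ans}
    picked_label = judge.identify(question, shown)   # judge returns "A" or "B"
    judge_pick = "reader" if picked_label == labels[0] else "guesser"

    # Scoring rubric from the text: the judge earns a point for a correct
    # identification; whoever the judge picks as the reader also earns a point.
    scores = {"judge": int(judge_pick == "reader"),
              "reader": int(judge_pick == "reader"),
              "guesser": int(judge_pick == "guesser")}

    whiteboard.rounds.append(RoundRecord(questioner, question, r_ans,
                                         g_ans, judge_pick, scores))
    return scores
```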
There can be many different sets of questions, due to sequence effects and variance in participants. We repeat the ICCG many times to collect the raw material for the construction of the crowdsourced comprehension challenge. Figure 2 depicts an example whiteboard after several rounds of questioning for a simple document, a six-word novel attributed to Ernest Hemingway.

Figure 2. An Example Whiteboard. Created for the document "For sale: baby shoes, never worn."
Round 1 (questioner: guesser). Question: Is it a happy story? Reader answer: No. Guesser answer: Yes. Judge identification: Reader, +1. Judge answer: No.
Round 2 (questioner: reader). Question: What's for sale? Reader answer: Shoes. Guesser answer: Jewelry. Judge identification: Reader, +1. Judge answer: Shoes.
Round 3 (questioner: reader). Question: Who were shoes for? Reader answer: A baby. Guesser answer: Protagonist. Judge identification: Reader, +1. Judge answer: Nobody.
Round 4 (questioner: guesser). Question: How many characters are in the story? Reader answer: One. Guesser answer: One. Judge identification: Guesser, +1. Judge answer: One.
Round 5 (questioner: reader). Question: What's happening to the shoes? Reader answer: Being sold. Guesser answer: Being bought. Judge identification: Reader, +0. Judge answer: Being sold.
Round 6 (questioner: guesser). Question: When were the shoes worn? Reader answer: Never. Guesser answer: Once. Judge identification: Reader, +1. Judge answer: Never.

Constructing the Crowdsourced Comprehension Challenge

Given a document, each run of the game above produces a set of document-relevant questions and document-validated answers, ultimately producing a comprehensive (or at least extensive) set of questions. By aggregating across multiple iterations of the game with the same document, we obtain a large corpus of document-relevant questions and validated answers. This is the raw data for constructing the comprehension test. Finalizing the test requires further aggregation, deduplication, and filtering using crowdsourced methods, for example, the Find/Fix/Verify methodology (Bernstein et al. 2010).

This approach suggests that comprehension must be considered relative to a population. This turns our original goal for the challenge — the full range of questions and answers that humans would expect other humans to reasonably learn or infer from a given document — into an empirical and crowdsourceable goal. Additionally, this allows us to design testing instruments tailored across skill levels, ages, or domains, as well as adaptable to a wide swath of cultural contexts, by sampling participants from different populations. Figure 3 depicts the process of constructing the final test, which features the crowdsourced collection of the question-answer pairs.

[Figure 3. Crowdsourced Comprehension Challenge Generation. A document (such as a story, news article, image, or video) is played through multiple runs of the game; the {question, answer} pairs from each run's whiteboard are aggregated and curated into the crowdsourced comprehension challenge, C3.]

Using the C3, a broad-coverage comprehension challenge can be constructed using crowdsourcing. By varying the population, we can construct comprehension tests that reveal the comprehension of second graders or doctors. In addition, by varying the format of questions and answers (open-ended, multiple choice, Boolean, and others), or restricting allowable questions to be of a certain type, we can construct different challenges.
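A first pass at the aggregation and deduplication step can be sketched in a few lines. The code below is our own simplification, and an assumption throughout: it clusters near-duplicate questions by string similarity and keeps those produced in several independent runs, whereas the article proposes crowdsourced Find/Fix/Verify passes rather than automatic matching.

```python
from difflib import SequenceMatcher

def aggregate_whiteboards(whiteboards, min_games=2, sim_threshold=0.8):
    """whiteboards: one list of (question, reader_answer) pairs per ICCG run
    on the same document. Returns curated (question, answers) pairs supported
    by similar questions from at least min_games independent runs."""
    def canon(text):
        return " ".join(text.lower().strip(" ?.!").split())

    clusters = []   # each cluster: {"question": str, "answers": set, "games": set}
    for game_id, board in enumerate(whiteboards):
        for question, answer in board:
            q = canon(question)
            for cluster in clusters:
                if SequenceMatcher(None, q, cluster["question"]).ratio() >= sim_threshold:
                    cluster["answers"].add(answer)
                    cluster["games"].add(game_id)
                    break
            else:
                clusters.append({"question": q, "answers": {answer}, "games": {game_id}})

    return [(c["question"], sorted(c["answers"]))
            for c in clusters if len(c["games"]) >= min_games]

# Example: two runs on the Hemingway six-word story.
run1 = [("What's for sale?", "Shoes"), ("When were the shoes worn?", "Never")]
run2 = [("What is for sale?", "Baby shoes"), ("Is it a happy story?", "No")]
print(aggregate_whiteboards([run1, run2]))
```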
Conclusions and Future Work

Improved machine comprehension would be a vital step toward more general artificial intelligence and could potentially have enormous benefits for humanity, if machines could integrate medical, scientific, and technological information in ways that were humanlike. Here we propose C3, the crowdsourced comprehension challenge, and one candidate technique for generating such tests, the ICCG, which yield a comprehensive, relevant, and human-validated corpus of questions and answers for arbitrary content, fiction or nonfiction, presented in a variety of forms. The game also produces human-level performance data for constructing tests, which with suitable participants (such as second graders or adult native speakers of a certain language) could be used to yield a range of increasingly challenging benchmarks. It could also be tailored to specific areas of knowledge and inference (for example, the domain of questions could be restricted to commonsense understanding, to science or medicine, or to cultural and social understanding). Unlike specific tests of expertise, this is a general test-generation procedure whose scope is all questions that can be reliably answered by humans (either in general, or drawn from a population of interest) holding the document.

Of course, more empirical and theoretical work is needed to implement, validate, and refine the ideas proposed here. Variations of the ICCG might be useful for different data-collection processes (for example, Paritosh [2015] explores a version where the individual reader and guesser are replaced by samples of readers and guessers). An important area of future work is the design of incentives to make the game more engaging and useful (for example, Prelec [2004]). We believe that crowdsourced processes for the design of human-level comprehension tests will be an invaluable addition to the arsenal of assessments of machine intelligence and will spur research in deep understanding of language.

Acknowledgments

The authors would like to thank Ernie Davis, Stuart Shieber, Peter Norvig, Ken Forbus, Doug Lenat, Nancy Chang, Eric Altendorf, David Huynh, David Martin, Nick Hay, Matt Klenk, Jutta Degener, Kurt Bollacker, and participants and reviewers of the Beyond the Turing Test Workshop at AAAI 2015 for insightful comments and suggestions on the ideas presented here.

Notes

1. This is part of the VisualGenome corpus, visualgenome.org.
2. stackoverflow.com.
3. One might also secure an answer from the judge, as a validity check and to gain a broader range of acceptable answers (for example, shoes or baby shoes might both work for a question about the Hemingway story shown in figure 2).
4. The popular Twenty Questions game is a much simpler version, where the guesser tries to identify an object within twenty yes/no questions. Questions such as "Is it bigger than a breadbox?" or "Does it involve technology for communications, entertainment, or work?" allow the questioner to cover a broad range of areas using a single question.

References

Anderson, R. C., and Pearson, P. D. 1984. A Schema-Theoretic View of Basic Processes in Reading Comprehension. In Handbook of Reading Research Volume 1, 255–291. London: Routledge.
Barker, K.; Chaudhri, V. K.; Chaw, S. Y.; Clark, P.; Fan, J.; Israel, D.; Mishra, S.; Porter, B.; Romero, P.; Tecuci, D.; and Yeh, P. 2004. A Question-Answering System for AP Chemistry: Assessing KR&R Technologies, 488–497. In Principles of Knowledge Representation and Reasoning: Proceedings of the Ninth International Conference. Menlo Park, CA: AAAI Press.
Bernstein, M. S.; Little, G.; Miller, R. C.; Hartmann, B.; Ackerman, M. S.; Karger, D. R.; Crowell, D.; and Panovich, K. 2010. Soylent: A Word Processor with a Crowd Inside. In Proceedings of the 23rd Annual ACM Symposium on User Interface Software and Technology, 313–322. New York: Association for Computing Machinery. dx.doi.org/10.1145/1866029.1866078
Bollacker, K.; Evans, C.; Paritosh, P.; Sturge, T.; and Taylor, J. 2008.
Freebase: A Collaboratively Created Graph Database for Structuring Human Knowledge, 1247–1250. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2008. New York: Association for Computing Machinery. dx.doi.org/10.1145/1376616.1376746
Christoforaki, M., and Ipeirotis, P. 2014. STEP: A Scalable Testing and Evaluation Platform. In Proceedings of the Second AAAI Conference on Human Computation and Crowdsourcing. Palo Alto, CA: AAAI Press.
Clark, P., and Etzioni, O. 2016. My Computer Is an Honor Student — But How Intelligent Is It? Standardized Tests as a Measure of AI. AI Magazine 37(1).
Dagan, I.; Glickman, O.; and Magnini, B. 2006. The PASCAL Recognising Textual Entailment Challenge. In Machine Learning Challenges, Lecture Notes in Computer Science Volume 3944, 177–190. Berlin: Springer. dx.doi.org/10.1007/11736790_9
Davis, E. 2016. How to Write Science Questions That Are Easy for People and Hard for Computers. AI Magazine 37(1).
Davis, F. B. 1944. Fundamental Factors of Comprehension in Reading. Psychometrika 9(3): 185–197. dx.doi.org/10.1007/BF02288722
Deng, J.; Dong, W.; Socher, R.; Li, L. J.; Li, K.; and Fei-Fei, L. 2009. ImageNet: A Large-Scale Hierarchical Image Database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 248–255. Piscataway, NJ: Institute of Electrical and Electronics Engineers. dx.doi.org/10.1109/CVPR.2009.5206848
Hirschman, L., and Gaizauskas, R. 2001. Natural Language Question Answering: The View from Here. Natural Language Engineering 7(4): 275–300. dx.doi.org/10.1017/S1351324901002807
Keenan, J. M.; Betjemann, R. S.; and Olson, R. K. 2008. Reading Comprehension Tests Vary in the Skills They Assess: Differential Dependence on Decoding and Oral Comprehension. Scientific Studies of Reading 12(3): 281–300. dx.doi.org/10.1080/10888430802132279
Kintsch, W., and van Dijk, T. A. 1978. Toward a Model of Text Comprehension and Production. Psychological Review 85(5): 363. dx.doi.org/10.1037/0033-295X.85.5.363
Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012, 1097–1105. La Jolla, CA: Neural Information Processing Systems Foundation, Inc.
Law, E., and von Ahn, L. 2009. Input-Agreement: A New Mechanism for Collecting Data Using Human Computation Games. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 1197–1206. New York: Association for Computing Machinery. dx.doi.org/10.1145/1518701.1518881
Levesque, H.; Davis, E.; and Morgenstern, L. 2012. The Winograd Schema Challenge. In Principles of Knowledge Representation and Reasoning: Proceedings of the Thirteenth International Conference (KR2012), 552–561. Palo Alto, CA: AAAI Press.
Marcus, G. 2014. What Comes After the Turing Test? New Yorker (June 9).
Mintz, M.; Bills, S.; Snow, R.; and Jurafsky, D. 2009. Distant Supervision for Relation Extraction Without Labeled Data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics, 1003–1011.
Stroudsburg, PA: Association for Computational Linguistics. dx.doi.org/10.3115/1690219.1690287
Morgenstern, L.; Davis, E.; and Ortiz, C. L. Jr. 2016. Planning, Executing, and Evaluating the Winograd Schema Challenge. AI Magazine 37(1).
Paritosh, P. 2012. Human Computation Must Be Reproducible. In CrowdSearch 2012: Proceedings of the First International Workshop on Crowdsourcing Web Search, CEUR Workshop Proceedings Volume 842. Aachen, Germany: RWTH Aachen University.
Paritosh, P. 2015. Comprehensive Comprehension: A Document-Focused, Human-Level Test of Comprehension. Paper presented at Beyond the Turing Test, AAAI Workshop W06, Austin, TX, January 25.
Poggio, T., and Meyers, E. 2016. Turing++ Questions: A Test for the Science of (Human) Intelligence. AI Magazine 37(1).
Prelec, D. 2004. A Bayesian Truth Serum for Subjective Data. Science 306(5695): 462–466. dx.doi.org/10.1126/science.1102081
Rapp, D. N.; Broek, P. V. D.; McMaster, K. L.; Kendeou, P.; and Espin, C. A. 2007. Higher-Order Comprehension Processes in Struggling Readers: A Perspective for Research and Intervention. Scientific Studies of Reading 11(4): 289–312. dx.doi.org/10.1080/10888430701530417
Richardson, M.; Burges, C. J.; and Renshaw, E. 2013. MCTest: A Challenge Dataset for the Open-Domain Machine Comprehension of Text. In EMNLP 2013: Proceedings of the Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA: Association for Computational Linguistics.
Riedel, S.; Yao, L.; and McCallum, A. 2010. Modeling Relations and Their Mentions Without Labeled Text. In Machine Learning and Knowledge Discovery in Databases: Proceedings of the European Conference, ECML PKDD 2010, Lecture Notes in Artificial Intelligence Volume 6322, 148–163. Berlin: Springer. dx.doi.org/10.1007/978-3-642-15939-8_10
Saygin, A. P.; Cicekli, I.; and Akman, V. 2003. Turing Test: 50 Years Later. In The Turing Test: The Elusive Standard of Artificial Intelligence, ed. J. H. Moor, 23–78. Berlin: Springer. dx.doi.org/10.1007/978-94-010-0105-2_2
Schank, R. P. 2013. Explanation Patterns: Understanding Mechanically and Creatively. London: Psychology Press.
Shieber, S. M. 1994. Lessons from a Restricted Turing Test. Communications of the ACM 37(6): 70–78. dx.doi.org/10.1145/175208.175217
Turing, A. M. 1950. Computing Machinery and Intelligence. Mind 59(236): 433–460. dx.doi.org/10.1093/mind/LIX.236.433
von Ahn, L. 2006. Games with a Purpose. Computer 39(6): 92–94. dx.doi.org/10.1109/MC.2006.196
von Ahn, L., and Dabbish, L. 2004. Labeling Images with a Computer Game. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 319–326. New York: Association for Computing Machinery. dx.doi.org/10.1145/985692.985733
Voorhees, E. M. 1999. The TREC-8 Question Answering Track Report. In Proceedings of The Eighth Text Retrieval Conference, TREC 1999, 77–82. Gaithersburg, MD: National Institute of Standards and Technology.
Zitnick, C. L.; Agrawal, A.; Antol, S.; Mitchell, M.; Batra, D.; and Parikh, D. 2016. Measuring Machine Intelligence Through Visual Question Answering. AI Magazine 37(1).

Praveen Paritosh is a senior research scientist at Google leading research in the areas of human and machine intelligence. He designed the large-scale human-machine curation systems for Freebase and the Google Knowledge Graph.
He was the co-organizer and chair for the SIGIR WebQA 2015 workshop, the Crowdsourcing at Scale 2013, the shared task challenge at HCOMP 2013, and Connecting Online Learning and Work at HCOMP 2014, CSCW 2015, and CHI 2016, toward the goal of galvanizing research at the intersection of crowdsourcing, natural language understanding, knowledge representation, and rigorous evaluations for artificial intelligence.

Gary Marcus is a professor of psychology and neural science at New York University and chief executive officer and founder of Geometric Intelligence, Inc. He is the author of four books, including Kluge: The Haphazard Evolution of the Human Mind and Guitar Zero, and numerous academic articles in journals such as Science and Nature. He writes frequently for The New York Times and The New Yorker, and is coeditor of the recent book, The Future of the Brain: Essays By The World's Leading Neuroscientists.

The Social-Emotional Turing Challenge
William Jarrold, Peter Z. Yeh

Social-emotional intelligence is an essential part of being a competent human and is thus required for human-level AI. When considering alternatives to the Turing test it is therefore a capacity that is important to test. We characterize this capacity as affective theory of mind and describe some unique challenges associated with its interpretive or generative nature. Mindful of these challenges we describe a five-step method along with preliminary investigations into its application. We also describe certain characteristics of the approach such as its incremental nature, and countermeasures that make it difficult to game or cheat.

The ability to make reasonably good predictions about the emotions of others is an essential part of being a socially functioning human. Without it we would not know what actions will most likely make others around us happy versus mad or sad. Our abilities to please friends, placate enemies, inspire our children, and secure cooperation from our colleagues would suffer. For these reasons a truly intelligent human-level AI will need the ability to reason about other agents' emotions in addition to intellectual capabilities embodied in other tasks such as the Winograd schema challenge, textbook reading and question answering (Gunning et al. 2010, Clark 2015), image understanding, or task planning. Thinking at the human level also requires the ability to have reasonable hunches about other agents' emotions.

Social-Emotional Intelligence as Affective Theory of Mind

The ability to predict and understand another agent's emotional reactions is subsumed by a cognitive capacity that goes by various names including folk psychology, naïve psychology, mindreading, empathy, and theory of mind. We prefer the last of these terms, considering it is more precise and is more frequently used by psychologists nowadays. Theory of mind encompasses the capacity to attribute and explain the mental states of others such as beliefs, desires, intentions, and emotions. In this article, we focus on affective theory of mind because it restricts itself to emotions. We further restrict ourselves to consensual affective theory of mind (AToM) to rule out idiosyncratic beliefs of particular individuals.

Is There a Logic to Emotion?

Each of us humans has our own oftentimes unique affective reaction to a given situation.
Although we live in the same world, our emotional interpretations of it are multitudinous. Does this mean that emotion is an "anything goes" free-for-all? In spite of the extreme variability in our affective evaluations, there nonetheless seems to be a rationality, a logic, of what constitutes a viable, believable, or sensible emotional response to a given situation. Sometimes, when we hear of someone's emotional reaction to a situation, we think to ourselves, "I would have responded the same way." For other reactions, we might say, "That would not be my reaction, but I can certainly understand why he or she would feel that way." At still other times, another's actual emotional reaction may vary far afield of our prediction and we say, "I cannot make any sense out of his or her reaction." For these reasons there does appear to be some sort of "logic" to emotion. Yet, how do we resolve the tension between the extreme possible richness and variability in emotional response and the sense that only certain reactions are sensible, legitimate, or understandable? In the next two sections, we show how the concepts of falsifiability — the possibility of proving an axiom or prediction incorrect (for example, all swans are white is disproven by finding a black swan [Popper 2005]) — and generativity — the capacity of a system to be highly productive and original — play an important role in the resolution of this tension. Later, in the Proposed Framework section, we shall see how these two concepts influence the methods we propose for assessing machine social-emotional intelligence.

Falsifiability and AToM

In our approach to assessing affective theory of mind, we take the term theory seriously. Prominent philosophers of science claim that scientific theories are, by definition, falsifiable (Popper 2005). Although an optimistic agent may view a situation with a glass-half-full bias and pessimistic agents may tend to view the very same situations with a glass-half-empty bias, they can still both be correct. How then do we demonstrate the falsifiability of affective theory of mind? The answer comes when one considers a predicted emotion paired with the explanation of this prediction. If we consider both together then we have a theory that is falsifiable. Consider the following situation and the following predictions:

Situation: Sue and Mary notice it is raining.
Appraisal U1: Sue feels happy because she expects the sun will come out tomorrow.
Appraisal U2: Mary feels sad because she hates rain and it will probably keep on raining.

Although some of us may tend to agree more with one or the other's reaction, virtually all of us will judge both of these replies as potentially valid (modulo some relatively minor assumptions about normal personality differences). By contrast, consider what happens if we invert the emotions felt by each character:

Appraisal R1: Mary feels sad because she expects the sun will come out tomorrow.
Appraisal R2: Sue feels happy because she hates rain and it will probably keep on raining.

We take it as a given that the vast majority of typical humans representative of a given cultural group will judge the immediately above appraisals as invalid or extremely puzzling. In sum, emotion is not an anything-goes phenomenon — we have demonstrated that some appraisals violate our intuitions about what makes sense.
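The point that an emotion label and its explanation are falsifiable only as a pair can be illustrated with a toy check. This is our own sketch, not part of the authors' framework: it encodes one crude rule (an explanation that anticipates a desired outcome should pair with a positive emotion) just to show how a paired appraisal, unlike the label alone, can be judged invalid.

```python
POSITIVE = {"happy"}
NEGATIVE = {"sad"}

# Toy lexicon: cues in an explanation suggesting the appraiser expects a good
# or a bad outcome. A real model would need far richer commonsense knowledge.
GOOD_OUTCOME_CUES = ["sun will come out", "got what", "will stop raining"]
BAD_OUTCOME_CUES = ["hates rain", "keep on raining", "did not get"]

def plausible(emotion, explanation):
    """Return True if the (emotion, explanation) pair is coherent under the
    toy rule; neither element alone can be falsified."""
    text = explanation.lower()
    expects_good = any(cue in text for cue in GOOD_OUTCOME_CUES)
    expects_bad = any(cue in text for cue in BAD_OUTCOME_CUES)
    if expects_good and not expects_bad:
        return emotion in POSITIVE
    if expects_bad and not expects_good:
        return emotion in NEGATIVE
    return True   # no clear cue (or mixed cues): this rule cannot falsify it

# U1/U2 pass; the inverted R1/R2 fail, mirroring the examples in the text.
print(plausible("happy", "she expects the sun will come out tomorrow"))          # True
print(plausible("sad", "she hates rain and it will probably keep on raining"))   # True
print(plausible("sad", "she expects the sun will come out tomorrow"))            # False
print(plausible("happy", "she hates rain and it will probably keep on raining")) # False
```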
Although there are a multitude of different emotions that could make sense, falsifiability is demonstrable when one considers the predicted emotion label along with its explanation (Jarrold 2004). As will be described next, falsifiability of AToM is important in the context of Turing test alternatives.

A Generative AToM

Leaving falsifiability aside, there remains the need to provide an account for the multitude of potential emotional appraisals of a situation. The need is addressed by viewing appraisal not as an inference but rather as a generative process. Generative processes are highly productive, able to produce novel patterns of outputs, such as cellular automata, generative grammars, and fractals such as the Mandelbrot or Julia sets. Ortony (2001) posited that generative capacity is critical to computational accounts of emotion. As a demonstration of this generativity, consider the range of appraisals obtained from "college sophomore" participants in Jarrold (2004) (see table 1).

Table 1. Five Human Appraisals of a Simple Scenario.
Scenario: Tracy wants a banana. Mommy gives Tracy an apple.
Question: How will Tracy feel? (Choose from happy, sad, or indifferent.)
Appraisals:
1. Happy: She'll feel happy even though she didn't get exactly what she wanted; it is still something.
2. Indifferent: Because nonetheless she still has something to eat just not exactly what she wanted.
3. Indifferent: She will feel indifferent as long as she likes apples too. It isn't exactly what she wanted, but she was probably just hungry and if she likes apples then she would be satisfied because it would do the same thing as a banana.
4. Sad: Because she was probably excited about eating the banana that day and when mom gave her an apple instead she probably felt disappointed and wondered why her mom wouldn't give her what she wanted.
5. Sad: She did not get what she wanted.

Although research subjects were presented with a very simple scenario, answers ranged from happy to indifferent to sad. The explanations for a given emotion also varied in terms of assumptions, focus, and complexity. Note that the inferences in explanations are often not deductions derived strictly from scenario premises. They can contain abductions or assumptions (for example, in table 1, row 3, "she is probably just hungry") and a series of subappraisals (for example, row 4, excitement yielding to disappointment). Furthermore, note that the above data were generated in response to very simple scenarios derived from an autism therapy workbook (Howlin, Baron-Cohen, and Hadwin 1999). Imagine the generative diversity attainable in real-world appraisals where the scenarios can include N preceding chapters in a novel or a person's life history. Typical humans predict and explain another's emotions and find it easy to generate, understand, and evaluate the full range of appraisal phenomena described above. For this reason it is important that human-level AI models of emotion be able to emulate this generative capacity.

Outline

In the remainder of this article, we will first describe how test items are involved in a five-stage framework or methodology for conducting an evaluation of computational social-emotional intelligence. Challenges to the integrity of the test are anticipated and countermeasures are described. Finally, issues with the specifics of implementing this framework are addressed.
Proposed Framework

Each of the framework's five stages (see figure 1) is described: first, developing the test items; second, obtaining ground truth; third, computational modeling; and, finally, two stages of evaluation. In these last two evaluation stages, models are judged on the basis of two corresponding tasks: (1) generating appraisals (stage 4) and (2) their ability to evaluate others' appraisals — some of which have been manipulated (stage 5).

[Figure 1. High Level Schematic of the Framework's Five Stages. Stage 1: generate scenario items. Stage 2: humans generate appraisals. Stage 3: develop models. Stage 4: humans evaluate appraisals, model versus human. Stage 5: models evaluate appraisals.]

Test Items

The framework revolves around the ability of a system to predict the emotions of agents in particular situations in a human-like way across a sufficiently large number of test items. As will be explained in detail, test items are questions posed to examinees (both humans and machines). They require the examinee to generate appraisals (answers to the questions). Machine-generated appraisals are evaluated in terms of how well they compare to the human-generated ones. Items have the following structural elements: (1) a scenario that is posed to the human or machine examinee and that consists of (1a) a target character whose emotion is to be predicted and (1b) a scenario involving the target (and possibly other characters); and (2) a two-part emotion question that prompts the examinee to (2a) select through multiple choice an emotion descriptor that best matches the emotion he, she, or it predicts will likely be felt by the target character, and (2b) explain why the character might feel that way.

Stage 1: Generate Scenario Items

The purpose of stage one is to produce a set of scenario items that can be used later in the evaluation. The range of scenarios circumscribes the breadth of the modeling task. In the early years of the competition, we will focus on simple scenarios (for example, "Eric wanted to ride the train but his father took him in the car. Was he happy or sad?") and in later years, move to ever more complex material, from brief stories to, much later, entire novels.

Stage 2: Obtain Human-Generated Appraisals

The overall goal of this stage is to obtain a ground truth for the test. Concretely, the goal of this stage is to task a group of human participants to generate at least one appraisal for items produced in stage 1. Generating an appraisal involves choosing an emotion to answer the emotion question and producing an explanation for that answer. Given the generativity of emotional appraisal, we expect a wide range of responses even for a single scenario instance. Recall the example of appraisal data derived from the simple scenario in table 1. The range of distinct appraisals should increase with the range of possible emotions from which to choose, the length of the allowable explanation, and the number of participants. That said, the increase at some point will level off because the themes of the nth participant's appraisal will start to overlap with those of earlier participants. While the number of different scenario instances may circumscribe the generative breadth we require our computational models to cover, one might also say that the generative depth of the model is circumscribed by the number of distinct appraisals generated for each scenario. Some of the resulting human-generated appraisals can be passed to the next stage as training data for modeling. The remainder are sequestered as a test set to be used during evaluation phases.
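A minimal sketch of the stage 1 and stage 2 artifacts, as we read them from the item description above. The field names, the example appraisals, and the even/odd train-test split are our own illustrative assumptions, not part of the proposed contest materials.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ScenarioItem:
    """Stage 1 output: a scenario with a target character and a two-part
    emotion question (multiple-choice label plus free-text explanation)."""
    target_character: str
    scenario: str
    emotion_choices: List[str]          # e.g., ["happy", "sad"]
    explanation_limit_words: int = 50

@dataclass
class Appraisal:
    """Stage 2 output: one participant's answer to an item."""
    item: ScenarioItem
    emotion: str
    explanation: str
    source: str = "human"               # later also "model"

item = ScenarioItem(
    target_character="Eric",
    scenario="Eric wanted to ride the train but his father took him in the car.",
    emotion_choices=["happy", "sad"],
)
ground_truth = [
    Appraisal(item, "sad", "He wanted the train and did not get it."),
    Appraisal(item, "happy", "He likes car rides with his father too."),
]
# Some appraisals become training data for modelers; the rest are sequestered
# as a held-out test set, as described for stage 2.
train, test = ground_truth[::2], ground_truth[1::2]
```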
Stage 3: Develop Appraisal Models

The contestants, computational modelers, are challenged to develop a model that for any given scenario instance can (1) predict an appropriate emotion label for the target scenario character (for example, happy, sad, and so on); and (2) generate an appropriate natural language (NL) explanation for this prediction. Appropriate is judged by human raters in stage 4 in reference to human-generated appraisals. Contestants are given a sample of scenario instances and the corresponding human-generated appraisals to train or engineer their models.

Stage 4: Evaluate Appraisals: Model Versus Human

The purpose of this stage is to obtain an evaluation of how well a given model performs appraisal in comparison to humans. This is achieved by a new group of human participants serving as raters. The input to this process is a set of appraisals including human-generated ones from stage 2 and model-generated ones from stage 3.

Valence Reversal

Before being submitted to a human judge, each appraisal has a 50 percent chance of being subject to an experimental manipulation known as valence reversal. Operationally, this means replacing the emotion label of a given appraisal with a different label of preferably "opposite" emotional valence. Under such a manipulation, happy would be replaced with sad, and sad with happy. For example:

Situation: Eric wants a train ride and his father gives him one.
Unreversed Appraisal: Eric feels happy because he got what he wanted.
Reversed Appraisal: Eric feels sad because he got what he wanted.

Reversal provides a contrast variable. We expect the statistical effect of reversal on appraisal quality to be strong. In contrast, if the model's appraisals are adequate, then among unreversed appraisals there should be no significant difference between human- versus model-generated appraisals. This methodology was successfully used in Jarrold (2004), and this article is essentially a scaling up of that approach.

Submission to Human Evaluators

Either the reversed or unreversed version of each appraisal is administered to at least one judge. The judges are to rate appraisals independently according to some particular subjective measure(s) of quality such as commonsensicality, believability, novelty, and so on. The measure is specified by the contest organizers. Judges are blinded to the reversal status — reversed or unreversed — and source — human or machine — of each item.
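The reversal manipulation and the blinded presentation to judges can be made concrete with a short sketch. This is our own illustration of the procedure described above; the opposite-valence table and the 50 percent coin flip come from the text, while the record fields and function names are assumptions.

```python
import random

OPPOSITE = {"happy": "sad", "sad": "happy"}   # extended as more emotions are added

def maybe_reverse(appraisal, p=0.5):
    """appraisal: dict with 'emotion' and 'explanation'. With probability p,
    swap the emotion label for its opposite valence; record the status so the
    organizers (but not the judges) know which items were manipulated."""
    reversed_flag = random.random() < p and appraisal["emotion"] in OPPOSITE
    shown = dict(appraisal)
    if reversed_flag:
        shown["emotion"] = OPPOSITE[appraisal["emotion"]]
    return shown, reversed_flag

def present_to_judge(appraisal):
    """Judges see only the scenario, emotion, and explanation: no source
    (human or machine) and no reversal status."""
    return {"scenario": appraisal["scenario"],
            "emotion": appraisal["emotion"],
            "explanation": appraisal["explanation"]}

item = {"scenario": "Eric wants a train ride and his father gives him one.",
        "emotion": "happy",
        "explanation": "he got what he wanted",
        "source": "model"}
shown, was_reversed = maybe_reverse(item)
print(present_to_judge(shown), was_reversed)
```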
Stage 5: Model and Evaluate Human Meta-Appraisal

The purpose of this stage is to evaluate a model's ability not to generate but rather to validate appraisals. This capacity is important because human-level AToM involves not just the capacity to make one decent prediction and explanation of another agent's emotions in a given situation. It also involves breadth, the ability to assess the validity of any of the multitude of the generatable appraisals of that situation. If a model's pattern of quality ratings for all the stage 4 appraisals — be they model or human generated, reversed or unreversed — matches the pattern of ratings given by stage 4 human judges, then it demonstrates the full generative breadth of understanding. The capacity for validating appraisals is important for another reason — detecting the authenticity of an emotional reaction. Consider the following:

Bob: How are you today?
Fred: Deeply depressed — no espresso.

People know that Fred is kidding. A deep depression is not a believable or commonsensical appraisal of a situation in which one is missing one's espresso.

The input to stage 5 is the output of stage 4, that is, human evaluations of appraisals. The appraisals evaluated include all manner of appraisals generated in prior stages: that is, both human and machine generated, both unreversed and reversed. These rated appraisals are segregated by the organizers into two groups, a training set and a test set. Modelers are given the training set and tasked with enhancing their preexisting models by giving them the ability to evaluate the validity of others' appraisals. Once modeling is completed, the organizers evaluate the enhanced self-reflective models against the test data. Model appraisal ratings should be similar to human ratings — unreversed appraisals should receive high-quality ratings, and reversed ones, poorer ratings. This phase may add new layers of model complexity and may be too difficult for the early years. Thus, for reasons of incrementality we consider it a stage that is phased in gradually over successive years.

Issues in Implementation

In this section we discuss specific issues associated with actually running the experiments and competitions.

Incrementality

Hector Levesque (2011) described the benefits of an incremental staged approach. Any challenge should be matched to existing capabilities. If too easy, the challenge will be neither discriminative nor exciting enough to attract developers. If too hard, solutions will fail to generalize and developers will be discouraged. In addition, systems advance every year. In view of all of these needs, it is best to have a test for which it is easy to raise or lower the bar. How can incrementality be implemented within the framework? As will be explained in the next section, parameterization of scenarios provides one relatively low-effort means of adapting the difficulty of the test.

Parameterization of Test Scenarios

It is important to be able to have a lot of test scenarios. More scenarios mean more training data, a more fine-grained evaluation, and a greater guarantee of comprehensive coverage. Cohen et al. (1998) used parameterization to create numerous natural language test questions that deviate from sample questions in specific controlled ways. The space of variation within a given parameterization can be combinatorially large, thus ensuring the ability to cover a broad range of materials. Parameterization was successfully used by Sosnovsky, Shcherbinina, and Brusilovsky (2003) to produce large numbers of training and test items for human education with relatively low effort. A parameterized scenario is essentially a scenario template. Such templates can be created by taking an existing scenario and replacing particular objects in the scenario with variables of the appropriate type. Consider the following scenario instance:

Scenario: Tracy wants a banana. Mommy gives Tracy an apple for lunch.
Choose from: <range of emotion terms / levels> Explanation: <answer constraints — length, vocabulary, and others> The range for each parameter is specified by the test administrator. For example the range for <object1> could include any object within the vocabulary of a four year old (for example, banana, lump of coal, chocolate, napkin). Additional item instances are instantiated by choosing values for the parameters of a given template. If parameters can take on a large set of values, a very large set of items can be generated. To meet the needs of incrementality, one can increase (or decrease) the level of difficulty by increasing the range of values that scenario parameters may take on. Alternatively one can add more templates. How the Framework Prevents Gaming Evaluation Like any contest, it can be gamed by clever trickery that violates the spirit of the rules and evades constructive progress in the field. We describe a variety of gaming tactics and how the Framework prevents them. Bag of Words to Predict Emotion A bag of words (BOW) classifier assigns an input document to one of a predefined set of categories based on weighted word frequencies. Thus, one “cheat” is to use this simple technique to predict the correct emotion label. One problem is that such classifiers ignore word order — thus “John loves Mary” and “Mary loves John” would assign the same emotion to Mary. Further, they are not generative and thus unable to produce novel explanations necessary in stage 4. In stage 5, it is hard to imagine how such a shallow approach would do well in evaluating the match between a scenario plus the appraisal emotion and explanation. Chatbots In stage 5, a chatbot will not do well because the task involves no NL generation — it just involves producing scores rating the quality of an appraisal. In stage 4, the case against the chatbot is more involved. A chatbot hack for this stage would be to 36 AI MAGAZINE chose an arbitrary emotion and generate explanation through a chatbot. Chatty or snarky explanations might sound human but contain no specific content. Such explanations would intentionally be a form of empty speech hand-crafted by the modeler to go with any chosen emotion. For example, a Eugene Goostman-like agent could chose happy or sad and provide the same explanation, “Tracy feels that way just because that’s the way she is.” A related but slightly more sophisticated tactic is always to chose the same emotion but devise a handcrafted appraisal that could go with virtually any scenario. For example, “Tracy feels happy because she has a very upbeat personality — no matter what happens she’s always looking on the bright side.” There are several reasons a chatbot will likely fail. First, we expect chatbots may be detectable through the human ratings. Although humans may sometimes provide answers like the above, more often than not, we expect their answers to exhibit greater specificity to the scenario and emotion chosen. We suspect that direct answers will generally receive higher ratings than chatty ones. Unlike the Turing test, there is no chance to build conversational rapport because there is no conversation and thus little for the chat bot to hide behind. If necessary, contest administrators can give specific instructions to human judges to penalize appraisals that are ironic, chatty, not specific to the scenario, and so on. 
How the Framework Prevents Gaming Evaluation

Like any contest, it can be gamed by clever trickery that violates the spirit of the rules and evades constructive progress in the field. We describe a variety of gaming tactics and how the framework prevents them.

Bag of Words to Predict Emotion

A bag of words (BOW) classifier assigns an input document to one of a predefined set of categories based on weighted word frequencies. Thus, one "cheat" is to use this simple technique to predict the correct emotion label. One problem is that such classifiers ignore word order — thus "John loves Mary" and "Mary loves John" would assign the same emotion to Mary. Further, they are not generative and thus unable to produce the novel explanations necessary in stage 4. In stage 5, it is hard to imagine how such a shallow approach would do well in evaluating the match between a scenario plus the appraisal emotion and explanation.

Chatbots

In stage 5, a chatbot will not do well because the task involves no NL generation — it just involves producing scores rating the quality of an appraisal. In stage 4, the case against the chatbot is more involved. A chatbot hack for this stage would be to choose an arbitrary emotion and generate an explanation through a chatbot. Chatty or snarky explanations might sound human but contain no specific content. Such explanations would intentionally be a form of empty speech hand-crafted by the modeler to go with any chosen emotion. For example, a Eugene Goostman-like agent could choose happy or sad and provide the same explanation, "Tracy feels that way just because that's the way she is." A related but slightly more sophisticated tactic is always to choose the same emotion but devise a hand-crafted appraisal that could go with virtually any scenario. For example, "Tracy feels happy because she has a very upbeat personality — no matter what happens she's always looking on the bright side."

There are several reasons a chatbot will likely fail. First, we expect chatbots may be detectable through the human ratings. Although humans may sometimes provide answers like the above, more often than not, we expect their answers to exhibit greater specificity to the scenario and emotion chosen. We suspect that direct answers will generally receive higher ratings than chatty ones. Unlike the Turing test, there is no chance to build conversational rapport because there is no conversation and thus little for the chatbot to hide behind. If necessary, contest administrators can give specific instructions to human judges to penalize appraisals that are ironic, chatty, not specific to the scenario, and so on. These considerations could be woven into a single overall judgment score per appraisal or by allowing for additional rating scales (for example, one dimension might be believability, another could be specificity, and so on). Elaborating the instructions in this way demands more training of judges and raises some issues associated with interrater reliability and multidimensional scoring.

The second countermeasure leverages falsifiability and the valence reversal manipulation done to all appraisals (machine as well as human generated) in stage 4. A chatbot lacks an (affective) theory of mind and thus does not know what kind of emotion goes with what kind of explanation in an appraisal. There should therefore be little to no dependency between its emotion labels and explanations. Put another way, being "theory free," chatbot "predictions" about other agents' appraisals are not falsifiable. Thus, valence-reversed appraisals from a chatbot will likely not be judged worse than their unreversed counterparts. Thus, if a given appraisal and its reversed counterpart score about as well, this should factor negatively in that contestant's overall score.

Contest Evolution

An attractive design feature of this method is the number of contest configuration variables that can be readjusted each year in response to advancing technology, pitfalls, changing goals, or emphasis. If organizers want to maximize the generative productivity of contestants' models, they can use fewer scenario instances; involve more human participants to generate more appraisals at stage 2; allow longer appraisal explanations with a larger vocabulary; and/or reward models that generate multiple appraisals per scenario. By contrast, to maximize the breadth of appraisal domains, organizers can have more scenario templates, more parameters in a template, or more parameter values for a given parameter; or adjust the size of the vocabulary allowed for a scenario. To increase the sophistication required of appraisal algorithms, one can increase the number of characters in each scenario, increase the number of emotions to choose between, or allow multiple or mixed emotions to be chosen. The first contests should involve a small handful of emotions because Jarrold (2004) demonstrated there is a tremendous amount of complexity yet to be modeled to simply distinguish between happy and sad. Affective reasoning requires a substantial body of commonsense knowledge. To bound the amount of such background knowledge required and focus efforts on affective reasoning, organizers can decrease the diversity of scenario characters (for example, human children ages 3 to 5); narrow the range of scenario parameters to a focused knowledge domain; or restrict the vocabulary or length allowed in explanations. In later contest years, there may be rater disagreement for some of the more nuanced or subtle scenario or appraisal pairs due to differing cultural or social-demographic representativeness factors. A variety of options present themselves: make rater "cultural group" a contextual variable; increase the cultural homogeneity of the human raters; or remove appraisals with low interrater reliability from the contest.

Crowdsourcing

It is possible that considerable numbers of participants will be required at certain stages. For example, modelers may desire a large number of appraisals to be generated in stage 2 as training data. Prior work in dialog systems (Yang et al.
2010) or the creation of ImageNet (Su, Deng, and Fei-Fei 2012) (to pick just two of many crowdsourced studies) has shown that large numbers of people can be recruited online (for example, through Amazon Mechanical Turk) as a form of crowdsourcing. It is hoped that over successive years a large library of scenarios each with a large number of appraisals and associated human ratings could be collected in this way over time to compose an emotion-oriented ImageNet analog. Public Interest Newsworthiness and public excitement are important because prior competitive challenges such as Robocup, IBM Watson, and Deep Blue have demonstrated how these factors drive talented individuals and other resources to attack a problem. One factor helping the social-emotional Turing challenge is that emotional content has mass appeal and may be less dry than other challenges such as chess. Stage 5, where machine- and human-generated appraisals are judged side by side, may be the most accessible media-worthy part of the framework. Prior stages may be reserved for a qualifying round, which may be of more scientific interest. Akin to the Watson competition, both human and machine contestants may be placed side by side while scenarios are presented to them in real time. Judges will score each appraisal blind to whether it was human versus machine generated. Scores can be read off one by one akin to a gymnastics competition. Conclusion We argue for the importance of assessing social-emotional intelligence among Turing test alternatives. We focus on a specific aspect of this capacity, affective theory of mind, which enables prediction and explanation of others’ emotional reactions to situations. We explain how a generative logic can account for the diversity yet specificity of predicted affective reactions. The falsifiability of these predictions is leveraged in a five-stage framework for assessing the degree to which computer models can emulate this behavior. Issues in implementation are discussed including the importance of incremental challenge, parameterization, and resisting hacks. It is hoped that over successive years a large set of scenarios, appraisals, and ratings would accrue and compose a kind of affective version of ImageNet. Acknowledgement We would like to thank Deepak Ramachandran for some helpful discussions. References Clark, P. 2015. Elementary School Science and Math Tests as a Driver for AI: Take the Aristo Challenge! In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, 4019–4021. Palo Alto, CA: AAAI Press. Cohen, P. R.; Schrag, R.; Jones, E.; Pease, A.; Lin, A.; Starr, B.; Gunning, D.; and Burke, M. 1998. The DARPA High-Performance Knowledge Bases Project. AI Magazine 19(4): 25. Gunning, D.; Chaudhri, V. K.; Clark, P. E.; Barker, K.; Chaw, S.-Y.; Greaves, M.; Grosof, B.; Leung, A.; McDonald, D. D.; Mishra, S.; Pacheco, J.; Porter, B.; Spaulding, A.; Tecuci, D.; and Tien, J. 2010. Project Halo Update — Progress Toward Digital Aristotle. AI Magazine 31(3): 33–58. Howlin, P.; Baron-Cohen, S.; and Hadwin, J. 1999. Teaching Children with Autism to Mind-Read: A Practical Guide for Teachers and Parents. Chichester, NY: J. Wiley & Sons. Jarrold, W. 2004. Towards a Theory of Affective Mind. Ph.D. Dissertation, Department of Educational Psychology, University of Texas at Austin, Austin, TX. SPRING 2016 37 Articles AI in Industry Columnists Wanted! AI Magazine is soliciting contributions for a column on AI in industry. 
Contributions should inform AI Magazine’s readers about the kind of AI technology that has been created or used in the company, what kinds of problems are addressed by the technology, and what lessons have been learned from its deployment (including successes and failures). Prospective columns should allow readers to understand what the current AI technology is and is not able to do for the commercial sector and what the industry cares about. We are looking for honest assessments (ideally tied carefully to the current state of the art in AI research) — not product ads. Articles simply describing commercially available products are not suitable for the column, although descriptions of interesting, innovative, or high impact uses of commercial products may be. Questions should be discussed with the column editors. Columns should contain a title, names of authors, affiliations and email addresses (and a designation of one author as contact author), a 2–3 sentence abstract, and a brief bibliography (if appropriate). The main text should be brief (600–1,000 words) and provide the reader with high-level information about how AI is used in their companies (we understand the need to protect proprietary information), trends in AI use there, as well as an assessment of the contribution. Larger companies might want to focus on one or two suitable projects so that the description of their development or use of AI technology can be made sufficiently detailed. The column should be written for a knowledgeable audience of Al researchers and practitioners. Reports go through an internal review process (acceptance is not guaranteed). The column editors and the AI Magazine editorin-chief are the sole reviewers of summaries. All articles will be copyedited, and authors will be required to transfer copyright of their columns to AAAI. If you are interested in submitting an article to the AI in Industry column, please contact column editors Sven Koenig ([email protected]) and Sandip Sen ([email protected]) before submission. 38 AI MAGAZINE Levesque, H. J. 2011. The Winograd Schema Challenge. In Logical Formalizations of Commonsense Reasoning: Papers from the 2011 AAAI Spring Symposium, 63–68. Palo Alto, CA: AAAI Press. Ortony, A. 2001. On Making Believable Emotional Agents Believable. In Emotions in Humans and Artifacts, ed. R. Trappl, P. Petta, and S. Payr, 189–213. Cambridge, MA: The MIT Press. Popper, K. 2005. The Logic of Scientific Discovery. New York: Routledge / Taylor & Francis. Sosnovsky, S.; Shcherbinina, O.; and Brusilovsky, P. 2003. Web-Based Parameterized Questions as a Tool for Learning. In Proceedings of E-Learn 2003: World Conference on E-Learning in Corporate, Government, Healthcare, and Higher Education, 309–316. Waynesville, NC: Association for the Advancement of Computing in Education. Su, H.; Deng, J.; and Fei-Fei, L. 2012. Crowdsourcing Annotations for Visual Object Detection. In Human Computation: Papers from the 2012 AAAI Workshop. AAAI Technical Report WS-12-08, 40–46. Palo Alto, CA: AAAI Press. Yang, Z.; Li, B.; Zhu, Y.; King, I.; Levow, G.; and Meng, H. 2010. Collection of User Judgments on Spoken Dialog System with Crowdsourcing. In 2010 IEEE Spoken Language Technology Workshop (SLT 2010), 277–282. Piscataway, NJ: Institute of Electrical and Electronics Engineers. William Jarrold is a senior scientist at Nuance Communications. 
His research in intelligent conversational assistants draws upon expertise in ontology, knowledge representation and reasoning, natural language understanding, and statistical natural language processing (NLP). Throughout his career he has developed computational models to augment and understand human cognition. In prior work at the University of California, Davis and the SRI Artificial Intelligence Lab he has applied statistical NLP to the differential diagnosis of neuropsychiatric conditions. At SRI and the University of Texas he developed ontologies for intelligent tutoring (HALO) and cognitive assistants (CALO). Early in his career he worked at MCC and Cycorp developing ontologies to support commonsense reasoning in Cyc — a large general-purpose knowledge-based system. His Ph.D. is from the University of Texas at Austin and his BS is from the Massachusetts Institute of Technology. Peter Z. Yeh is a senior principal research scientist at Nuance Communications. His research interests lie at the intersection of semantic technologies, data and web mining, and natural language understanding. Prior to joining Nuance, Yeh was a research lead at Accenture Technology Labs where he was responsible for investigating and applying AI technologies to various enterprise problems ranging from data management to advanced analytics. Yeh is currently working on enhancing interpretation intelligence within intelligent virtual assistants and automatically constructing large-scale knowledge repositories necessary to support such interpretations. He received his Ph.D. in computer science from The University of Texas at Austin. Articles Artificial Intelligence to Win the Nobel Prize and Beyond: Creating the Engine for Scientific Discovery Hiroaki Kitano I This article proposes a new grand challenge for AI: to develop an AI system that can make major scientific discoveries in biomedical sciences and that is worthy of a Nobel Prize. There are a series of human cognitive limitations that prevent us from making accelerated scientific discoveries, particularity in biomedical sciences. As a result, scientific discoveries are left at the level of a cottage industry. AI systems can transform scientific discoveries into highly efficient practices, thereby enabling us to expand our knowledge in unprecedented ways. Such systems may outcompute all possible hypotheses and may redefine the nature of scientific intuition, hence the scientific discovery process. W hat is the single most significant capability that artificial intelligence can deliver? What pushes the human race forward? Our civilization has advanced largely by scientific discoveries and the application of such knowledge. Therefore, I propose the launch of a grand challenge to develop AI systems that can make significant scientific discoveries. As a field with great potential social impacts, and one that suffers particularly from information overflow, along with the limitations of human cognition, I believe that the initial focus of this challenge should be on biomedical sciences, but it can be applied to other areas later. 
The challenge is "to develop an AI system that can make major scientific discoveries in biomedical sciences and that is worthy of a Nobel Prize and far beyond." While recent progress in high-throughput "omics" measurement technologies has enabled us to generate vast quantities of data, scientific discoveries themselves still depend heavily upon individual intuition, and researchers are often overwhelmed by the sheer amount of data, as well as by the complexity of the biological phenomena they are seeking to understand. Even now, scientific discovery remains something akin to a cottage industry, but a great transformation seems to have begun. This is an ideal domain, and the ideal timing, for AI to make a difference. I anticipate that, in the near future, AI systems will make a succession of discoveries that have immediate medical implications, saving millions of lives and totally changing the fate of the human race.

Grand Challenges as a Driving Force in AI Research

Throughout the history of research into artificial intelligence, a series of grand challenges have been significant driving factors. Advances in computer chess demonstrated that a computer can exhibit human-level intelligence in a specific domain. In 1997, IBM's chess computer Deep Blue defeated the human world champion Garry Kasparov (Hsu 2004). Various search algorithms, parallel computing techniques, and other computing techniques originating from computer chess research have been applied in other fields. IBM took on another challenge when it set the new goal of building a computer that could win the TV quiz show Jeopardy! In this task, which involved the real-time answering of open-domain questions (Ferrucci et al. 2010, Ferrucci et al. 2013), IBM's Watson computer outperformed human quiz champions. IBM is currently applying technology from Watson as part of its business in a range of industrial and medical fields. In an extension of prior work on computer chess, Japanese researchers have even managed to produce a machine capable of beating human grand masters of Shogi, a Japanese chess variant with a significantly larger number of possible moves.

RoboCup is a grand challenge, founded in 1997, that spans the fields of robotics and soccer. The aim of this initiative is to promote the development, by the year 2050, of a team of fully autonomous humanoid robots that is able to beat the most recent winners of the FIFA World Cup (Kitano et al. 1997). This is a task that requires both an integrated, collective intelligence and exceptionally high levels of physical performance. Since the inaugural event, the scheme has already given birth to a series of technologies that have been deployed in the real world. For example, KIVA Systems, a technology company formed largely on the basis of technologies from Cornell University's team for RoboCup's Small Size League, provided a highly automated warehouse management system and was acquired by Amazon.com in 2012. Various robots that were developed for the Rescue Robot League, the part of RoboCup focused on disaster rescue, have been deployed in real-world situations, including search and rescue operations at New York's World Trade Center in the aftermath of the 9/11 terror attacks, as well as surveillance missions following the accident at the Fukushima Daiichi Nuclear Power Plant.
These grand challenges present a sharp contrast with the Turing test: they aim at the development of superhuman capabilities, whereas the Turing test attempts to answer the question "Can machines think?" by creating a machine that can generate humanlike responses in natural language dialogues (Turing 1950). These differing approaches present different scientific challenges, and, while going forward we may expect some cross-fertilization between them, this article focuses on the grand challenge of building superhuman capabilities.

History provides many insights into changes over time in the technical approaches to these challenges. In the early days of AI research, it was widely accepted that a brute-force approach would not work for chess, and that heuristic programming was essential for very large and complex problems (Feigenbaum and Feldman 1963). Actual events, however, confounded this expectation. Among the features critical for computer chess were the massive computing capability required to search millions of moves, vast memory to store a record of all past games, and a learning mechanism to evaluate the quality of each move and adjust search paths accordingly. Computing power, memory, and learning have proven to hold the winning formula, overcoming sophisticated heuristics. The 1990s saw a similar transformation of approach in speech recognition, where rule-based systems were outperformed by data- and computing-driven systems based on hidden Markov models (Lee 1988). Watson, the IBM computer that won the Jeopardy! quiz show, added new dimensions of massively parallel heterogeneous inference and real-time stochastic reasoning. Coordination of multiple different reasoning systems is also key when it comes to Shogi. Interestingly, similar technical features are also critical in bioinformatics problems (Hase et al. 2013; Hsin, Ghosh, and Kitano 2013). Elements currently seen as critical include massively parallel heterogeneous computing, real-time stochastic reasoning, limitless access to information throughout the network, and sophisticated multistrategy learning. Recent progress in computer Go added a combination of deep learning, reinforcement learning, and tree search to the winning formula (Silver et al. 2016).

Challenges such as those described have been highly effective in promoting AI research. By demonstrating the latest advances in AI, and by creating high-impact industrial applications, they continue to contribute to the progress of AI and its applications.

The Scientific Discovery Grand Challenge

It is time to make an even greater stride, by imagining and initiating a new challenge that may change our very principles of intelligence and civilization. While scientific discovery is not the only driving force of our civilization, it has been one of the most critical factors. Creating AI systems with a very high capability for scientific discovery will have a profound impact, not only in the fields of AI and computer science, but also in the broader realms of science and technology. It is a commonly held perception that scientific discoveries take place after years of dedicated effort or at a moment of great serendipity. The process of scientific discovery as we know it today is considered unpredictable and inefficient, and yet this is blithely accepted. I would argue, however, that the practice of scientific discovery is stuck at a level akin to that of a cottage industry.
I believe that the productivity and fundamental modalities of the scientific discovery process can be dramatically improved. The real challenge is to trigger a revolution in science equivalent to the industrial revolution. It should be noted that machine discovery, or discovery informatics (Gil et al. 2014, Gil and Hirsh 2012), has long been a major topic for AI research. BACON (Langley and Simon 1987), DENDRAL (Lindsay et al. 1993), AM, and EURISKO (Lenat and Brown 1984) are just some of the systems of this nature developed to date.

We must aim high. What distinguishes the proposed challenge from past efforts is its focus on biomedical sciences in the context of dramatic increases in the amount of information and data available, along with levels of interconnection of experimental devices that were unavailable in the past. It is also set apart by its extremely ambitious goal: facilitating major scientific discoveries in the biomedical sciences that may go on to earn the Nobel Prize in Physiology or Medicine, or achieve even more profound results. This is the moonshot in AI. Just as the Apollo project's goal went beyond the moon (Kennedy 1961, 1962), the goals of this project go far beyond the Nobel Prize. The goal is to promote a revolution in scientific discovery and to enable the fastest-possible expansion of the knowledge base of mankind. The development of AI systems with such a level of intelligence would have a profound impact on the future of humanity.

Human Cognitive Limitations in Biomedical Sciences

There are fundamental difficulties in biomedical research that overwhelm the cognitive capabilities of humans. This problem became even more pronounced with the emergence of systems biology (Kitano 2002a, 2002b). Some of the key problems are outlined below.

First, there is the information horizon problem. Biomedical research is flooded with data and publications at a rate of production that goes far beyond human information-processing capabilities. Over 1 million papers are published each year, and this rate is increasing rapidly. Researchers are already overwhelmed by the flood of papers and data, some of which may be contradictory, inaccurate, or misused. It is simply not possible for any researcher to read, let alone comprehend, such a deluge of information in order to maintain consistent and up-to-date knowledge. The amount of experimental data is exploding at an even faster pace, with widespread use of high-throughput measurement systems. Just as the rapidly expanding universe creates a cosmic event horizon that prevents even light emitted in the distant past from reaching us, thus rendering it unobservable, the never-ending abundance of publications and data creates an information horizon that prevents us from observing a whole picture of what we have discovered and what data we have gathered. It is my hope that, with the progress instigated by the challenge I am proposing, AI systems will be able to compile a vast body of intelligence in order to mitigate this problem (Gil et al. 2014).

Second, there is the problem of an information gap. Papers are written in language that frequently involves ambiguity, inaccuracy, and missing information. Efforts to develop a large-scale comprehensive map of molecular interactions (Caron et al. 2010, Matsuoka et al. 2013, Oda and Kitano 2006, Oda et al. 2005), or any other form of biological knowledge base, will encounter this problem (see sidebar).
Our interpretation, and hence human-based knowledge extraction, largely depends on subjectively filling in the gaps using the reader's own knowledge, or on representing the knowledge with its details missing; the result is an arbitrary interpretation of the knowledge in the text. Obviously, solving this problem goes beyond the information-conveying capacity of the language of a given text (Li, Liakata, and Rebholz-Schuhmann 2014). It also involves actively searching for missing information: discerning what is missing and how to find it. It is important to capture the details of the interactions within a process, rather than merely an abstracted overview, because researchers are well aware of the overall interactions and expect such a knowledge base, or map, to provide a consistent and comprehensive yet in-depth description of each interaction. Similar issues exist when it comes to understanding images from experiments. They include how to interpret images, checking consistency with the sum of past data, identifying differences and the reasons for them, and recovering missing information on experimental conditions and protocols.

Sidebar: An Example of Missing Information in a Biological Statement

Biomedical science is a knowledge-intensive and empirical science. Currently, knowledge is embedded in the text and images of publications. Take the example of the following typical sentence from a biology paper: "In contrast, in response to mating pheromones, the Far1-Cdc24 complex is exported from the nucleus by Msn5" (taken from the abstract by Shimada, Gulli, and Peter [2000]). We can extract knowledge about a specific molecular interaction involving the Far1-Cdc24 complex and Msn5 and represent this graphically. The sentence itself does not, however, describe where the Far1-Cdc24 complex is exported to, nor where Msn5 is located (from the nucleus to where? is Msn5 within the nucleus?). In such cases, researchers can fill in the conceptual gaps from their own biological knowledge. However, it is not clear whether all forms of the Far1-Cdc24 complex become the subject of this interaction, nor whether all forms of Msn5 can conduct this export process. In this case, the general biological knowledge of researchers will generally prove insufficient to fill in such gaps, thereby necessitating either the inclusion of a specific clarifying statement elsewhere in the paper, or a search of other papers and databases to fill the gap.
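To make the sidebar's point concrete, here is a minimal sketch, not taken from the article, of how such an extracted statement might be represented with its unknowns left explicit rather than silently filled in by the reader; the class and slot names are illustrative assumptions, not an established schema.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class TransportEvent:
    """One extracted statement about molecular transport, with unknowns left explicit."""
    cargo: str                                      # for example, "Far1-Cdc24 complex"
    transporter: str                                # for example, "Msn5"
    source: Optional[str] = None                    # "nucleus" in the example sentence
    destination: Optional[str] = None               # not stated in the sentence
    cargo_forms: Optional[List[str]] = None         # which forms of the cargo participate
    transporter_forms: Optional[List[str]] = None   # which forms of the transporter can do this
    evidence: str = ""                              # citation for the supporting sentence

    def missing_slots(self) -> List[str]:
        """Return the slot names a curator or AI system still needs to resolve."""
        return [name for name, value in vars(self).items() if value is None]

# The sentence from Shimada, Gulli, and Peter (2000) fills only some of the slots.
event = TransportEvent(
    cargo="Far1-Cdc24 complex",
    transporter="Msn5",
    source="nucleus",
    evidence="Shimada, Gulli, and Peter (2000), abstract",
)
print(event.missing_slots())
# ['destination', 'cargo_forms', 'transporter_forms']
```

Making the gaps explicit in this way turns the information gap into a concrete to-do list for literature search or experimentation, rather than something each reader fills in differently.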
Third, there is the problem of phenotyping inaccuracy. The word phenotyping refers to the representation and categorization of biological anomalies such as diseases, the effects of genetic mutations, and developmental defects. Phenotyping is generally performed on the basis of the subjective interpretation and consensus of medical practitioners and biologists, and is described using terms that are relatively easy to understand. This practice itself is tightly linked with human cognitive limitations. Biomedical sciences have to deal with complex biological systems that are highly nonlinear and multidimensional. Naïve delineation of observations into coarse categories can create significant inaccuracies and lead to misdiagnosis and inaccurate understanding of biological phenomena (figure 1a). This is a practical clinical problem, as shown by rare disease cases that took decades for patients to be diagnosed and had an initial misdiagnosis rate of almost 40 percent (EURORDIS 2007). Clinical diagnosis is a process of observation, categorization of observed results, and hypothesis generation about a patient's disease status. Misdiagnosis leads to inappropriate therapeutic interventions. Identification of the proper feature combinations for each axis, the proper dimensionality of the representation space, and the proper granularity of categorization would significantly improve diagnosis, and hence therapeutic efficacy (figure 1b). Extremely complex feature combinations for each axis, extremely high-dimensional representation spaces, and extremely fine-grained categorization, which can be termed extreme classification, would dramatically improve the accuracy of diagnosis. Since many diseases are constellations of very large numbers of disease subtypes, such extreme classification would enable us to properly identify specific patient subgroups that cannot at present be identified as isolated groups, and would lead to specific therapeutic options. An emerging problem, however, is that humans may not be able to comprehend what exactly each category means in relation to their own biomedical knowledge, which was developed on the basis of the current coarse, low-dimensional categorization.

[Figure 1. Problems in the Representation and Categorization of Biological Objects and Processes. Left figure modified from Kitano (1993). Figure 1a is an example of an attempt to represent a nonlinear-boundary object, assumed to be a simplification of a phenotype, in a simple two-feature space with coarse categories such as Low, Mid, and High. The object is mostly covered by the condition "feature A = Mid and feature B = Mid," but this inevitably results in inaccuracy (false positives and false negatives). Improving the accuracy of nonlinear object coverage requires the proper choice of the feature combination for each axis (for example, feature A versus f(feature A, feature B, feature D)), the proper dimension of the representational space, and the proper choice of categorization granularity, from coarse (Low, Mid, High) to fine-grained (figure 1b).]

Another closely related problem is that of cognitive bias. Due to the unavoidable use of language and symbols in our processes of reasoning and communication, our thought processes are inevitably biased. As discussed previously, natural language does not properly represent biological reality. Alfred Korzybski's statement that "the map is not the territory" (Korzybski 1933) is especially true in the biomedical sciences (figure 2). The vast knowledge of the field comes in the form of papers that are full of such biases. Our ability to ignore inaccuracies and ambiguity facilitates our daily communication, yet poses serious limitations on scientific inquiry.

Then there is the minority report problem. Biology is an empirical science, meaning that knowledge is accumulated on the basis of experimental findings. Due to the complexity of biological systems, the diversity of individuals, the uncertainty of experimental conditions, and other factors, there are substantial deviations and errors in research outcomes. While consensus among a majority of reports can be considered to portray the most probable reality regarding a specific aspect of a biological system, there exist reports that are not consistent with this majority (figure 3). Whether such minority reports can be discarded as errors or false reports is debatable. While some will naturally fall into this category, others may be correct, and may even report unexpected biological findings that could lead to a major discovery. How can we distinguish between erroneous reports and those with the potential to facilitate major discoveries?

[Figure 2. The Same Reality Can Be Expressed Differently, or the Same Linguistic Expressions May Represent Different Realities.]

[Figure 3. Should Minority Reports Be Discarded? Or Might They Open Up Major Discoveries? The figure contrasts the frequency of majority reports clustered around an average value with isolated minority reports.]
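One way to make that distinction operational, offered here only as an illustration and not as the article's method, is to flag reports that deviate strongly from the consensus value and route them for scrutiny rather than discard them; the robust z-score, the 3.5 cutoff, and the sample data are assumptions.

```python
import statistics

def flag_minority_reports(values, threshold=3.5):
    """Split reported measurements into consensus and minority reports.

    Uses a robust z-score based on the median and the median absolute
    deviation (MAD), so a few extreme reports do not distort the consensus.
    The 3.5 cutoff is a common heuristic, not a rule from the article.
    """
    center = statistics.median(values)
    mad = statistics.median(abs(v - center) for v in values)
    if mad == 0:
        return list(values), []      # no spread: nothing to flag
    consensus, minority = [], []
    for v in values:
        z = 0.6745 * (v - center) / mad
        (minority if abs(z) > threshold else consensus).append(v)
    return consensus, minority

# Hypothetical reported values for the same quantity from different papers.
reports = [1.1, 0.9, 1.0, 1.2, 0.95, 3.8]
consensus, minority = flag_minority_reports(reports)
print(minority)   # [3.8]: worth a second look, not automatic rejection
```

The point of such a filter is triage, not verdict: a flagged report may be an error, or it may be the unexpected finding that leads to a major discovery.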
Are We Ready to Embark on This Challenge?

I have described some of the human cognitive limitations that act as obstacles to efficient biomedical research, and that AI systems may be able to resolve during the course of the challenge I am proposing. Interestingly, there are a few precedents that may provide a useful starting point.

Among the early efforts to mitigate the information horizon problem, research using IBM's Watson computer is currently focused on the medical domain. The intention is to compile the vast available literature and present it in a coherent manner, in contrast to human medical practitioners and researchers, who cannot read and digest the entire available corpus of information. Watson was used in a collaboration between IBM, Baylor College of Medicine, and MD Anderson Cancer Center that led to the identification of novel modification sites of p53, an important protein for cancer suppression (Spangler et al. 2014). A recent DARPA program, the Big Mechanism project, aims at the automated extraction of large-scale molecular interactions related to cancer (Cohen 2014).

With regard to problems of phenotyping inaccuracy, progress in machine learning, as exemplified by deep learning, may enable us to resolve some cognitive issues. There are particular hopes that computers may learn to acquire proper features for representing complex objects (Bengio 2009; Bengio, Courville, and Vincent 2013; Hinton 2011). Deep phenotyping is an attempt to develop much finer-grained and more in-depth phenotyping than current practice provides, in order to establish highly accurate diagnoses, patient classification, and precise clinical decisions (Frey, Lenert, and Lopez-Campos 2014; Robinson 2012), and some pioneering researchers are using deep learning for this purpose (Che et al. 2015). Combining deep phenotyping with personal genomics and other comprehensive measurements would lead to dramatically more accurate diagnoses and effective therapeutic interventions, as well as improved drug discovery efficiency.
For generating hypotheses and verifying them, Ross King and his colleagues have developed a robot scientist that can infer possible biological hypotheses and design simple experiments, using a defined-protocol automated system, to analyze orphan genes in budding yeast (King et al. 2009a, 2009b; King et al. 2004). While this brought only a moderate level of discovery within the defined context of budding yeast genes, the study represented an integration of bioinformatics-driven hypothesis generation and automated experimental processes. Such an automated experimental system has great potential for expansion and could become a driving force for research in the future. Most experimental devices these days are highly automated and connected to networks. In the near future, it is likely that many will be supplemented by high-precision robotics systems, enabling AI systems not only to access digital information but also to design and execute experiments. That would mean that every detail of experimental results, including incomplete or erroneous data, could be stored and made accessible. Such progress would have a dramatic impact on the issues of long-tail distributions and dark data in science (Heidorn 2008).

Crowdsourcing of science, or citizen science, offers many interesting opportunities and great potential for integration with AI systems. The protein-folding game Foldit, released in 2008, demonstrated that with proper redefinition of a scientific problem, ordinary citizens can contribute to the process of scientific discovery (Khatib et al. 2011). The patient-powered research network PatientsLikeMe is another example of how motivated ordinary people can contribute to science (Wicks et al. 2015, Wicks et al. 2011). While successful deployment of community-based science requires carefully designed missions, clear definitions of problems, and the implementation of appropriate user interfaces (Kitano, Ghosh, and Matsuoka 2011), crowdsourcing may offer an interesting opportunity for AI-based scientific discovery. This is because, with proper redefinition of a problem, a system may also help to facilitate the best use of human intelligence.

There are efforts to develop platforms that can connect a broad range of software systems, devices, databases, and other necessary resources. The Garuda platform is an effort to develop an open application programming interface (API) platform aimed at attaining a high level of interoperability among biomedical and bioinformatics analysis tools, databases, devices, and other resources (Ghosh et al. 2011). The Pegasus and Wings system is another example, one that focuses on sharing the workflows of scientific activities (Gil et al. 2007). A large-scale collection of workflows from the scientific community, directing possible sequences of analyses and experiments that AI systems could use and reformulate, would be a powerful knowledge asset. With globally interconnected high-performance computing systems such as InfiniCortex (Michalewicz et al. 2015), we are now getting ready to undertake this new and formidable challenge. Such research could form a partial basis for this challenge. At the same time, we still require a clear game plan, or at the very least an initial hypothesis.

Scientific Discovery as a Search Problem: Deep Exploration of Knowledge Space

What is the essence of discovery? To rephrase the question, what could be the engine for scientific discovery? Consistent and broad-ranging knowledge is essential, but does not automatically lead to new discoveries.
When I talk about this initiative, many scientists ask whether AI can be equipped with the necessary intuition for discovery. In other words, can AI systems be designed to ask the "right" questions that may lead to major scientific discoveries? While this certainly appears to be a valid question, let us think more deeply here. Why is asking the right question important? It may be due to resource constraints (such as the time for which researchers can remain active in their professional careers), budget, competition, and other limitations. Efficiency is, therefore, the critical factor for the success of this challenge. When time and resources are abundant, the importance of asking the right questions is reduced. One might arrive at important findings after detours, so the route is not of particular significance. At the same time, science has long relied to a certain extent on serendipity, where researchers made a major discovery by accident. Thinking about such observations, it is possible to arrive at a hypothesis that the critical aspect of scientific discovery is how many hypotheses can be generated and tested, including ones that may seem highly unlikely. This indicates the potential of a brute-force approach to scientific discovery, in which AI systems generate and verify as many hypotheses as possible. Such an approach may differ from the way in which scientists traditionally conduct their research, but could become a computational alternative to the provision of scientific insights. It should be stressed that while the goal of the grand challenge is to make major scientific discoveries, this does not necessarily mean those discoveries should be made as if by human scientists. The brute-force approach empowered by machine learning and heterogeneous inference has already provided the basis of success for a number of grand challenges to date. As long as a hypothesis can be verified, scientific discovery can also incorporate computing to search for probably correct hypotheses from among the full range of possible ones. The fundamental thrust should be toward massive combinatorial hypothesis generation, the maintenance of a consistent repository of global knowledge, and perhaps a number of other fundamental principles that we may not be aware of at present.

[Figure 4. Bootstrapping of Scientific Discovery and Knowledge Accumulation. Hypotheses are generated from knowledge extracted from papers and databases, some require experimental verification (experiments may add errors and noise), and newly verified hypotheses are checked for consistency with current knowledge; papers, databases, and dark data contain errors, inconsistencies, and even fabrications, and portions of knowledge believed to be correct may in fact be false. Correct and incorrect knowledge, data, and experimental results are involved throughout this process, though some may be ambiguous. Scientific discovery requires an iterative cycle aimed at expanding our knowledge on this fragile ground. The aim is to compute, verify, and integrate every possible hypothesis, thereby building a consistent body of knowledge.]
Thus, using computing to generate and verify, as quickly as possible, the full range of logically possible hypotheses would mitigate resource constraints and enable us to examine even unexpected or seemingly far-fetched ideas. Such an approach would significantly reduce the need to ask the right questions, thereby rendering scientific intuition obsolete, and perhaps even enabling us to explore computational serendipity. The engine of discovery should be a closed-loop system of hypothesis generation and verification, knowledge maintenance, knowledge integration, and so on (figure 4), and should integrate a range of technologies (figure 5). Fundamentally speaking, hypotheses, along with the constraints imposed on hypothesis generation and the initial validation process, would be derived from the vast body of knowledge to be extracted from publications, databases, and automatically executed experiments. Successfully verified hypotheses would be added to the body of knowledge, enabling the bootstrapping process to continue.

[Figure 5. Evolution of Key Elements in Grand Challenges and Possible Elements of the Scientific Discovery Grand Challenge. Computing, memory, and learning have long been key elements in computer chess. Further techniques, such as massively parallel heterogeneous processing, real-time stochastic reasoning, and limitless information access over the Internet, originated from the application of computers to the quiz show Jeopardy! To facilitate scientific discovery, an even more complex and sophisticated range of functions is required, including massive hypothesis generation and verification, distributed and heterogeneous computing, twilight-zone reasoning, cyberphysical and crowd integration with active data acquisition, and multistrategy adaptive and heterogeneous cross-domain learning. The term twilight-zone reasoning refers to the parsing of data and publications that may be highly ambiguous, error-prone, or faulty. The elements introduced here represent general ideas on how to approach the scientific discovery grand challenge, rather than originating from a precise technical analysis of the necessary functionalities.]

It is crucial to recognize that not all papers and data to emerge from the scientific community are correct or reliable; they contain substantial errors, missing information, and even fabrications. It may be extremely difficult to reproduce published experimental results, and some may prove impossible to re-create (Prinz, Schlange, and Asadullah 2011). At the same time, major progress is continually being made in the field of biomedical science. How can this be possible if such a high proportion of papers present results that are false or not reproducible? While individual reports may contain a range of problems, collective knowledge has the potential to uncover truths even from an error-prone scientific process. This is a twilight zone of scientific discovery, and AI systems need to be able to reason in the twilight zone. The proposed challenge would shed light on this conundrum.
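Purely as an illustration of the closed loop just described, and not as a description of any existing system, the following sketch enumerates candidate hypotheses from a toy knowledge base, verifies them with a stand-in check, and folds confirmed hypotheses back into the knowledge base; the entities, edges, and the toy verify step are all assumptions.

```python
import itertools

# Hypothetical toy knowledge base: known "X activates Y" relations.
known_edges = {("A", "B"), ("B", "C")}
entities = ["A", "B", "C", "D"]

def generate_hypotheses(entities, known_edges):
    """Enumerate every directed 'X activates Y' claim not already known."""
    for x, y in itertools.permutations(entities, 2):
        if (x, y) not in known_edges:
            yield (x, y)

def verify(hypothesis):
    """Stand-in for an automated experiment or literature check.

    In a real system this would schedule a robot-run assay or query
    curated databases; here it just simulates one confirmed outcome.
    """
    return hypothesis == ("A", "C")

def discovery_loop(entities, known_edges, budget=20):
    """Closed loop: generate, verify, and integrate hypotheses."""
    for hypothesis in itertools.islice(generate_hypotheses(entities, known_edges), budget):
        if verify(hypothesis):
            known_edges.add(hypothesis)   # integrate verified knowledge
    return known_edges

print(discovery_loop(entities, known_edges))
```

The interesting engineering questions all live inside the pieces this sketch trivializes: how hypotheses are constrained by prior knowledge, how verification is scheduled and costed, and how inconsistent or twilight-zone evidence is reconciled.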
Advanced Intelligence

What is certain is that such a system would substantially reinforce the intellectual capabilities of humans in a manner that is entirely without precedent and that holds the potential to change fundamentally the way science is conducted. The first-ever defeat of a chess grand master by an AI system was followed by the emergence of a new style of chess known as advanced chess, in which a human and a computer work together as a team to take on similarly equipped competitors. This partnership may be considered a form of human-computer symbiosis in intelligent activities. Similarly, we can foresee that in the future sophisticated AI systems and human researchers will work together to make major scientific discoveries. Such an approach can be considered "advanced intelligence."

Advanced intelligence as applied to scientific discovery would go beyond existing combinations of AI and human experts. Just as most competitive biomedical research institutions are now equipped with high-throughput experimental systems, I believe that AI systems will become a fundamental part of the infrastructure of top-level research institutions in the future. This may involve a substantial level of crowd intelligence, utilizing the contributions of both qualified researchers and ordinary people, each for different tasks, thereby forming a collaborative form of intelligence that could be ably and efficiently orchestrated by AI systems. Drawing this idea out to its extreme, it may be possible to place AI systems at the center of a network of intelligent agents, comprising both other AI systems and humans, to coordinate large-scale intellectual activities. Whether this path would ultimately make our civilization more robust (by facilitating a series of major scientific discoveries) or more fragile (due to extensive and excessive dependence on AI systems) is yet to be seen. However, just as Thomas Newcomen's atmospheric engine was turned into the modern steam engine by James Watt to become the driving force of the industrial revolution, AI scientific discovery systems have the potential to drive a new revolution that leads to new frontiers of civilization.

References

Ferrucci, D.; Levas, A.; Bagchi, S.; Gondek, D.; and Mueller, E. 2013. Watson: Beyond Jeopardy! Artificial Intelligence 199–200 (June–July): 93–105. dx.doi.org/10.1016/j.artint.2012.06.009 Frey, L. J.; Lenert, L.; and Lopez-Campos, G. 2014. EHR Big Data Deep Phenotyping. Contribution of the IMIA Genomic Medicine Working Group. Yearbook of Medical Informatics 9: 206–211. dx.doi.org/10.15265/IY-2014-0006 Ghosh, S.; Matsuoka, Y.; Asai, Y.; Hsin, K. Y.; and Kitano, H. 2011. Software for Systems Biology: From Tools to Integrated Platforms. Nature Reviews Genetics 12(12): 821–832. dx.doi.org/10.1038/nrg3096 Gil, Y.; Greaves, M.; Hendler, J.; and Hirsh, H. 2014. Amplify Scientific Discovery with Artificial Intelligence. Science 346(6206): 171–172. dx.doi.org/10.1126/science.1259439 Gil, Y., and Hirsh, H. 2012. Discovery Informatics: AI Opportunities in Scientific Discovery. In Discovery Informatics: The Role of AI Research in Innovating Scientific Processes: Papers from the AAAI Fall Symposium, 1–6. Technical Report FS-12-03. Palo Alto, CA: AAAI Press. Gil, Y.; Ratnakar, V.; Deelman, E.; Mehta, G.; and Kim, J. 2007. Wings for Pegasus: Creating Large-Scale Scientific Applications Using Semantic Representations of Computational Workflows. In Proceedings of the 19th Innovative Applications of Artificial Intelligence (IAAI-07). Palo Alto, CA: AAAI Press. Hase, T.; Ghosh, S.; Yamanaka, R.; and Kitano, H. 2013. Harnessing Diversity Towards the Reconstructing of Large Scale Gene Regulatory Networks.
PLoS Computational Biology 9(11): e1003361. dx.doi.org/10.1371/journal.pcbi.1003361 Heidorn, P. B. 2008. Shedding Light on the Dark Data in the Long Tail of Science. Library Trends 57(2): 280–299. dx.doi.org/10.1353/lib.0.0036 Bengio, Y. 2009. Learning Deep Architecture for AI. Foundations and Trends in Machine Learning 2(1): 1–127. dx.doi.org/ 10.1561/2200000006 Hinton, G. 2011. A Better Way to Learn Features. Communications of the ACM 54(10). dx.doi.org/10.1145/2001269. 2001294 Bengio, Y.; Courville, A.; and Vincent, P. 2013. Representation Learning: A Review and New Prespectives. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(8): 1798–1828. dx.doi.org/10.1109/TPAMI.2013.50 Hsin, K. Y.; Ghosh, S.; and Kitano, H. 2013. Combining Machine Learning Systems and Multiple Docking Simulation Packages to Improve Docking Prediction Reliability for Network Pharmacology. PLoS One 8(12): e83922. dx.doi.org/10.1371/journal.pone .0083922 Caron, E.; Ghosh, S.; Matsuoka, Y.; Ashton-Beaucage, D.; Therrien, M.; Lemieux, S.; Perreault, C.; Roux, P.; and Kitano, H. 2010. A Comprehensive Map of the mTOR Signaling Network. Molecular Systems Biology 6, 453. dx.doi.org/ 10.1038/msb.2010.108 Che, Z.; Kale, D.; Li, W.; Bahadori, M. T.; and Liu, Y. 2015. Deep Computational Phenotyping. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: Association for Computing Machinery. dx.doi.org/10.1145/2783258.2783365 Cohen, P. 2014. Big Mechanism [Project Announcement]. Arlington, VA: Defense Advanced Research Projects Agency. EURORDIS. 2007. Survey of the Delay in Diagnosis for 8 Rare Diseases in Europe (EurordisCare2). Brussels, Belgium: EURODIS Rare Diseases Europe. Feigenbaum, E., and Feldman, J. 1963. Computers and Thought. New York: McGraw-Hill Book Company. 48 Ferrucci, D.; Brown, E.; Chu-Carroll, J.; Fan, J.; Gondek, D.; Kalyanpur, A; Lally, A.; Murdock, J. W.; Nyberg, E.; Prager, J.; Schlaefer, N.; Welty, C. 2010. Building Watson: An Overview of the DeepQA Project. AI Magazine 31(3): 59–79. AI MAGAZINE Hsu, F.-H. 2004. Behind Deep Blue: Buidling the Computer That Defeated the World Chess Champion. Princeton, NJ: Princeton University Press. Kennedy, J. F. 1961. Special Message to Congress on Urgent National Needs, 25 May 1961. Papers of John F. Kennedy. Presidential Papers. President’s Office Files. JFKPOF-034-030 John F. Kennedy Presidential Library, Boston, MA. Kennedy, J. F. 1962. Address at Rice University on the Nation’s Space Effort 12 September 1962. Accession Number USG:15 reel 29. John F. Kennedy Presidential Library, Boston, MA. Khatib, F.; DiMaio, F.; Foldit Contenders Group; Foldit Void Crushers Group; Cooper, S.; Kazmierczyk, M.; Gilski, M.; Krzywda, S.; Zabranska, H.; Pichova, I.; Thompson, J.; Popovi, Z.; Jaskolski, M.; Baker, D. 2011. Crystal Structure of a Monomeric Retroviral Protease Solved by Protein Folding Game Players. Natural Structural and Molecular Biology Articles 18(10): 1175–1177. dx.doi.org/10.1038/ nsmb.2119 King, R. D.; Rowland, J.; Oliver, S. G.; Young, M.; Aubrey, W.; Byrne, E.; Liakata, M.; Markham, M.; Pir, P.; Soldatova, L. N.; Sparkes, A.; Whelan, K. E.; Clare, A. 2009a. The Automation of Science. Science 324(5923): 85–89. dx.doi.org/10.1126/science.1165620 King, R. D.; Rowland, J.; Oliver, S. G.; Young, M.; Aubrey, W.; Byrne, E.; Liakata, M.; Markham, M.; Pir, P.; Soldatova, L. N.; Sparkes, A.; Whelan, K. E.; Clare, A. 2009b. Make Way for Robot Scientists. Science 325(5943), 945. dx.doi. 
org/10.1126/science. 325_945a King, R. D.; Whelan, K. E.; Jones, F. M.; Reiser, P. G.; Bryant, C. H.; Muggleton, S. H.; Kell, D. B.; Oliver, S. G. 2004. Functional Genomic Hypothesis Generation and Experimentation by a Robot Scientist. Nature 427(6971): 247–252. dx.doi.org/10. 1038/nature02236 Kitano, H. 1993. Challenges of Massive Parallelism. In Proceedings of the 13th International Joint Conference on Artificial Intelligence, 813–834. San Mateo, CA: Morgan Kaufmann Publishers. Kitano, H. 2002a. Computational Systems Biology. Nature 420(6912): 206–210. dx.doi. org/10.1038/nature01254 Kitano, H. 2002b. Systems Biology: A Brief Overview. Science 295(5560): 1662–1664. dx.doi.org/10.1126/science. 1069492 Kitano, H.; Asada, M.; Kuniyoshi, Y.; Noda, I.; Osawa, E.; and Matsubara, H. 1997. RoboCup: A Challenge Problem for AI. AI Magazine 18(1): 73–85. dx.doi.org/10.1145/267658. 267738 Kitano, H.; Ghosh, S.; and Matsuoka, Y. 2011. Social Engineering for Virtual ‘Big Science’ in Systems Biology. Nature Chemical Biology 7(6): 323–326. dx.doi.org/10.1038/nchembio.574 Korzybski, A. 1933. Science and Sanity: An Introduction to NonAristotelian Systems and General Semantics. Chicago: Institute of General Semantics. Langley, P., and Simon, H. 1987. Scientific Discovery: Computational Exploration of the Creative Processes. Cambridge, MA: The MIT Press. Lee, K. F. 1988. Automatic Speech Recognition: The Development of the SPHINX System. New York: Springer. Lenat, D., and Brown, J. 1984. Why AM and EURISKO Appear to Work. Artificial Intelligence 23(3): 269–294. dx.doi.org/10.1016/0004-3702(84)90016-X Li, C.; Liakata, M.; and Rebholz-Schuhmann, D. 2014. Biological Network Extraction from Scientific Literature: State of the Art and Challenges. Brief Bioinform 15(5): 856–877. dx.doi.org/10.1093/bib/bbt006 Lindsay, R.; Buchanan, B.; Feigenbaum, E.; and Lederberg, J. 1993. DENDRAL: A Case Study of the First Expert System for Scientific Hypothesis Formation. Artificial Intelligence 61(2): 209–261. dx.doi.org/10.1016/ 0004-3702(93)90068-M Matsuoka, Y.; Matsumae, H.; Katoh, M.; Eisfeld, A. J.; Neumann, G.; Hase, T.; Ghosh, S.; Shoemaker, J. E.; Lopes, T.; Watanabe, T.; Watanabe, S.; Fukuyama, S.; Kitano, H.; Kawaoka, Y. 2013. A Comprehensive Map of the Influenza: A Virus Replication Cycle. BMC Systems Biology 7: 97(2 October). dx.doi.org/10.1186/1752-0509-7-97 Michalewicz, M.; Poppe, Y.; Wee, T.; and Deng, Y. 2015. InfiniCortex: A Path To Reach Exascale Concurrent Supercomputing Across the Globe Utilising Trans-Continental Infiniband and Galaxy Of Supercomputers. Position Paper Presented at the Third Big Data and Extreme-Scale Computing Workshop (BDEC), Barcelona, Spain, 29–30 January. Oda, K., and Kitano, H. 2006. A Comprehensive Map of the Toll-Like Receptor Signaling Network. Molecular Systems Biology 2: 2006 0015. dx.doi.org/10.1038/msb4100057 Oda, K.; Matsuoka, Y.; Funahashi, A.,; and Kitano, H. 2005. A Comprehensive Pathway Map of Epidermal Growth Factor Receptor Signaling. Molecular Systems Biology 1 2005 0010. dx.doi.org/10.1038/msb4100014 Prinz, F.; Schlange, T.; and Asadullah, K. 2011. Believe It or Not: How Much Can We Rely on Published Data on Potential Drug Targets? Nature Reviews Drug Discovery 10(9): 712. dx.doi.org/10.1038/nrd3439-c1 Robinson, P. N. 2012. Deep Phenotyping for Precision Medicine. Human Mutation 33(5), 777–780. dx.doi.org/10.1002/ humu.22080 Shimada, Y.; Gulli, M. P.; and Peter, M. (2000). 
Nuclear Sequestration of the Exchange Factor Cdc24 by Far1 Regulates Cell Polarity During Yeast Mating. Nature Cell Biology 2(2): 117–124. dx.doi.org/10.1038/35000073 Silver, D.; Huang, A.; Maddison, C.; Guez, A.; Sifre, L.; van den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; Dieleman, S.; Grewe, D.; Nham, J.; Kalchbrenner, N.; Sutskever, I.; Lillicrap, T.; Leach, M.; Kavukcuoglu, K.; Graepel, T.; Hassabis, D. 2016. Mastering the Game of Go with Deep Neural Networks and Tree Search. Nature 529(7587): 484–489. dx.doi.org/10.1038/nature16961 Spangler, S.; Wilkins, A.; Bachman, B.; Nagarajan, M.; Dayaram, T.; Haas, P.; Regenbogen, S.; Pickering, C. R.; Corner, A.; Myers, J. N.; Stanoi, I.; Kato, L.; Lelescu, A.; Labire, J. J.; Parikh, N.; Lisewski, A. M.; Donehower, L.; Chen, Y.; Lichtarge, O. 2014. Automated Hypothesis Generation Based on Mining Scientific Literature. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: Association for Computing Machinery. dx.doi.org/10.1145/2623330.2623667 Turing, A. M. 1950. Computing Machinery and Intelligence. Mind 59(236): 433–460. dx.doi.org/10.1093/mind/LIX.236.433 Wicks, P.; Lowe, M.; Gabriel, S.; Sikirica, S.; Sasane, R.; and Arcona, S. 2015. Increasing Patient Participation in Drug Development. Nature Biotechnology 33(2): 134–135. dx.doi.org/10.1038/nbt.3145 Wicks, P.; Vaughan, T. E.; Massagli, M. P.; and Heywood, J. 2011. Accelerated Clinical Discovery Using Self-Reported Patient Data Collected Online and a Patient-Matching Algorithm. Nature Biotechnology 29(5): 411–414. dx.doi.org/10.1038/nbt.1837

Hiroaki Kitano is director of Sony Computer Science Laboratories, Inc., president of the Systems Biology Institute, a professor at the Okinawa Institute of Science and Technology, and a group director of the Laboratory for Disease Systems Modeling at the RIKEN Center for Integrative Medical Sciences. Kitano is a founder of RoboCup; he received the Computers and Thought Award in 1993 and the Nature Award for Creative Mentoring in Science in 2009. His current research focuses on systems biology, artificial intelligence for biomedical scientific discovery, and their applications.

Planning, Executing, and Evaluating the Winograd Schema Challenge

Leora Morgenstern, Ernest Davis, Charles L. Ortiz, Jr.

The Winograd Schema Challenge (WSC) was proposed by Hector Levesque in 2011 as an alternative to the Turing test. Chief among its features is a simple question format that can span many commonsense knowledge domains. Questions are chosen so that they do not require specialized knowledge or training and are easy for humans to answer. This article details our plans to run the WSC and evaluate the results.

The Winograd Schema Challenge (WSC) (Levesque, Davis, and Morgenstern 2012) was proposed by Hector Levesque in 2011 as an alternative to the Turing test. Turing (1950) had first introduced the notion of testing a computer system's intelligence by assessing whether it could fool a human judge into thinking that it was conversing with a human rather than a computer. Although intuitively appealing and arbitrarily flexible (in theory, a human can ask the computer system being tested wide-ranging questions about any subject desired), in practice the execution of the Turing test turns out to be highly susceptible to systems that few people would wish to call intelligent.
The Loebner Prize Competition (Christian 2011) is in particular associated with the development of chatterbots that are best viewed as successors to ELIZA (Weizenbaum 1966), the program that fooled people into thinking that they were talking to a human psychotherapist by cleverly turning a person's statements into questions of the sort a therapist would ask. The knowledge and inference that characterize conversations of substance (for example, discussing alternate metaphors in sonnets of Shakespeare), and which Turing presented as examples of the sorts of conversation that an intelligent system should be able to produce, are absent in these chatterbots. The focus is merely on engaging in surface-level conversation that can fool some humans who do not delve too deeply into a conversation, for at least a few minutes, into thinking that they are speaking to another person. The widely reported triumph of the chatterbot Eugene Goostman, which fooled 10 out of 30 judges into judging, after a five-minute conversation, that it was human (University of Reading 2014), was due precisely to the system's facility for this kind of shallow conversation.

Winograd Schemas

In contrast to the Loebner Prize Competition, the Winograd Schema Challenge is designed to test a system's ability to understand natural language and use commonsense knowledge. Winograd schemas (WSs) are best understood by first considering Winograd schema halves, which are sentences with at least one pronoun and two possible referents for that pronoun, along with a question that asks which of the two referents is correct. An example1 is the following:

The customer walked into the bank and stabbed one of the tellers. He was immediately taken to the emergency room.
Who was taken to the emergency room? The customer / the teller

The correct answer is the teller. We know this because of all the commonsense knowledge that we have about stabbings, injuries, and how they are treated. We know that if someone is stabbed, he is very likely to be seriously wounded, and that if someone is seriously wounded, he needs medical attention. We know, furthermore, that people with acute and serious injuries are frequently treated at emergency rooms. Moreover, there is no indication in the text that the customer has been injured, and therefore no apparent reason for him to be taken to the emergency room. We reason with much of this information when we determine that the referent of who in the second sentence of the example is the teller rather than the customer.

So far, we are just describing the problem of pronoun disambiguation. Winograd schemas, however, have a twist: they are constructed so that there is a special word (or short phrase) that can be substituted for one of the words (or a short set of words) in the sentence, causing the other candidate pronoun referent to be correct. For example, consider the above sentence with the words police station substituted for emergency room:

The customer walked into the bank and stabbed one of the tellers. He was immediately taken to the police station.
Who was taken to the police station? The customer / the teller

The correct answer now is the customer. To get the right answer, we use our knowledge of what frequently happens in crime scenarios, namely that the alleged perpetrator is arrested and taken to the police station for questioning and booking, together with our knowledge that stabbing someone is generally considered a crime. Since the text tells us that the customer did the stabbing, we conclude that it must be the customer, rather than the teller, who is taken to the police station.

The existence of the special word is one way to
For example, if a sentence with subject and object is followed by a phrase or sentence that starts with a pronoun, the subject is more likely to be the referent of the pronoun than the object. The test taker, however, who is given a Winograd schema half, knows not to rely on this heuristic because the existence of the special word or set of words negates that heuristic. For instance, in the example, who refers to the subject when the special set of words is police station but the object when the special set of words is emergency room. There are three additional restrictions that we place on Winograd schemas: First, humans should be able to disambiguate these questions easily. We are testing whether systems are as intelligent as humans, not more intelligent. Second, they should not obey selectional restrictions. For example, the following would be an invalid example of a Winograd schema: The women stopped taking the pills, because they were carcinogenic / pregnant. What were carcinogenic / pregnant? The women / the pills This example is invalid because one merely needs to know that women, but not pills, can be pregnant, and that pills, but not women, can be carcinogenic, in order to solve this pronoun disambiguation problem. While this fact can also be viewed as a type of commonsense knowledge, it is generally shallower than the sort of commonsense knowledge exemplified by the emergency room / police station example above, in which one needs to reason about several commonsense facts together. The latter is the sort of deeper commonsense knowledge that we believe is characteristic of human intelligence and that we would like the Winograd Schema Challenge to test. Third, they should be search-engine proof to the extent possible. Winograd schemas should be constructed so that it is unlikely that one could use statistical properties of corpora to solve these problems. Who was taken to the police station? The customer / the teller Executing and Evaluating the Winograd Schema Challenge The correct answer now is the customer. To get the right answer, we use our knowledge of what frequently happens in crime scenarios — that the alleged perpetrator is arrested and taken to the police station for questioning and booking — together with our knowledge that stabbing someone is generally considered a crime. Since the text tells us that the customer did the stabbing, we conclude that it must be the customer, rather than the teller, who is taken to the police station. The existence of the special word is one way to When the Winograd Schema Challenge was originally conceived and developed, details of the execution of the challenge were left unspecified. In May 2013, the participants at Commonsense-2013, the Eleventh Symposium on Logical Formalizations of Commonsense Reasoning, agreed that focusing on the Winograd Schema Challenge was a high priority for researchers in commonsense reasoning. In July 2014, Nuance Communications announced its sponsorship of the Winograd Schema Challenge Competition (WSCC), with cash prizes awarded for top computer SPRING 2016 51 Articles systems surpassing some threshold of performance on disambiguating pronouns in Winograd schemas. At the time this article was written, the first competition was scheduled to be held at IJCAI-2016 in July, 2016 in New York, New York, assuming there are systems that are entered into competition. 
Executing and Evaluating the Winograd Schema Challenge

When the Winograd Schema Challenge was originally conceived and developed, details of the execution of the challenge were left unspecified. In May 2013, the participants at Commonsense-2013, the Eleventh Symposium on Logical Formalizations of Commonsense Reasoning, agreed that focusing on the Winograd Schema Challenge was a high priority for researchers in commonsense reasoning. In July 2014, Nuance Communications announced its sponsorship of the Winograd Schema Challenge Competition (WSCC), with cash prizes awarded for top computer systems surpassing some threshold of performance on disambiguating pronouns in Winograd schemas. At the time this article was written, the first competition was scheduled to be held at IJCAI-2016 in July 2016 in New York, New York, assuming there are systems entered into the competition. Because doing well at the WSC is difficult, it is possible that no systems will be entered at that time; in this case, the first competition will be delayed until we have received notification of interested entrants. Subsequent competitions will be held annually, biennially, or at some other set interval of time to be determined.

During the last year, we have developed a set of rules for the competition that are intended to facilitate test corpus development and participation by serious entrants. While some parts will naturally change from one competition to the next (date and time, obviously, as well as hardware limitations), we expect the overall structure of the competition to remain the same. Exact details are given at the Winograd Schema Challenge Competition website;2 the general structure and requirements are discussed next.

The competition will consist of a maximum of two rounds: a qualifying round and a final round. There will be at least 60 questions in each round. Each set of questions will have been tested on at least three human adult annotators. At least 90 percent of the questions in the test set will have been answered correctly by all human annotators. The remaining questions in the test set (no more than 10 percent of the test set) will have been answered correctly by at least half of the human annotators. This will ensure that the questions in the test set are those for which pronoun disambiguation is easy. It is possible that no system will progress beyond the first level, in which case the second round will not be held. The threshold required to move from the first to the second level, or to achieve a prize, must be at least 90 percent, or no more than three percentage points below the interannotator agreement achieved on the test set, whichever is greater. (For example, if interannotator agreement on a test is 95 percent, the required system score is 92 percent.)
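As a small illustration of the threshold rule just stated (my own sketch, not code supplied by the organizers):

```python
def passing_threshold(interannotator_agreement: float) -> float:
    """Threshold to advance or win: at least 90 percent, or within three
    percentage points of interannotator agreement, whichever is greater."""
    return max(90.0, interannotator_agreement - 3.0)

print(passing_threshold(95.0))  # 92.0, matching the example in the text
print(passing_threshold(91.0))  # 90.0: the 90 percent floor applies
```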
Pronoun Disambiguation Problems in the Winograd Schema Challenge

The first round will consist of pronoun disambiguation problems (PDPs) that are taken directly or modified from examples found in literature, biographies, autobiographies, essays, news analyses, and news stories, or that have been constructed by the organizers of the competition. The second round will consist of halves of Winograd schemas; almost all of these will have been constructed by the competition organizers. Some examples of the sort of pronoun disambiguation problems that could appear in the first round follow.

Example PDP 1: Mrs. March gave the mother tea and gruel, while she dressed the little baby as tenderly as if it had been her own.
She dressed: Mrs. March / the mother
As if it had been: tea / gruel / baby

Example PDP 2: Tom handed over the blueprints he had grabbed and, while his companion spread them out on his knee, walked toward the yard.
His knee: Tom / companion

Example PDP 3: One chilly May evening the English tutor invited Marjorie and myself into her room.
Her room: the English tutor / Marjorie

Example PDP 4: Mariano fell with a crash and lay stunned on the ground. Castello instantly kneeled by his side and raised his head.
His head: Mariano / Castello

The following can be noted from these examples. (1) A PDP can be taken directly from text (example PDP 3 is taken from Vera Brittain's autobiography Testament of Youth) or may be modified (examples PDP 1, 2, and 4 are modified slightly from the novels Little Women, Tom Swift and His Airship, and The Pirate City: An Algerine Tale). (2) A pronoun disambiguation problem may consist of more than one sentence, as in example PDP 4. In practice, we will rarely use PDPs that contain more than three sentences. (3) There may be multiple pronouns, and therefore multiple ambiguities, in a sentence, as in example PDP 1. In practice, we will have only a limited number of cases of multiple PDPs based on a single sentence or set of sentences, since misinterpreting a single text could significantly lower one's score if it is the basis for multiple PDPs.

As in Winograd schemas, a substantial amount of commonsense knowledge appears to be needed to disambiguate the pronouns. For example, one way to reason that she in she dressed (example PDP 1) refers to Mrs. March and not the mother is to realize that the phrase "as if it had been her own" implies that it (the baby) is not actually her own; that is, she is not the mother and must, by process of elimination, be Mrs. March. Similarly, one way to understand that the English tutor is the correct referent of her in example PDP 3 is through one's knowledge of the way invitations work: X typically invites Y into X's domain, and not into Z's domain; in particular, X does not invite Y into Y's domain. Similar knowledge of etiquette comes into play in example PDP 2: one way to understand that the referent of his is Tom is through the knowledge that X typically spreads documents out over X's own person, and not Y's person. (Other knowledge that comes into play is the fact that a person does not have a lap while he is walking, and the structure of the sentence entails that Tom is the individual who walks to the yard.)

Why Have PDPs in the WSC Competition?

From the point of view of the computer system taking the test, there is no difference between Winograd schemas and pronoun disambiguation problems.3 In either case, the system must choose between two (or more) possible referents for a pronoun. Nevertheless, the move from a competition run solely on Winograd schemas to a competition whose first round runs solely on pronoun disambiguation problems requires some explanation.

The primary reason for having PDPs is entirely pragmatic. As originally conceived, the Winograd Schema Challenge was meant to be a one-time challenge. An example corpus of more than 100 Winograd schemas was developed and published on the web.1 Davis developed an additional 100 Winograd schemas to be used in the course of that one-time challenge. Since Nuance's decision to sponsor the Winograd Schema Challenge Competition, however, the competition is likely to be run at regular intervals, perhaps yearly. Creating Winograd schemas is difficult, requiring creativity and inspiration, and too burdensome to do on a yearly or biennial basis. By running the first round on PDPs, the likelihood of advancing to the second round without being able to answer correctly many of the Winograd schemas in the competition is minimized. Indeed, if a system can advance to the second round, we believe there is a good chance that it will successfully meet the Winograd Schema Challenge.

Once we had decided on using PDPs in the initial round, other advantages became apparent. First, pronoun disambiguation problems occur very frequently in natural language text in the wild. One finds examples in many genres, including fiction, science fiction, biographies, and essays. In contrast, Winograd schemas are fabricated natural language text and might be considered irrelevant to automated natural language processing in the real world.
It is desirable to show that systems are proficient at handling the general pronoun disambiguation problem, which is a superset of the Winograd Schema Challenge. This points toward a realworld task that a system excelling in this competition should be able to do. Second, a set of PDPs taken from the wild, and from many genres of writing, may touch on different aspects of commonsense knowledge than that which a single person or small group of people could come up with when creating Winograd schemas. At the same time it is important to keep in mind one of the original purposes of Winograd schemas — that the correct answer be dependent on commonsense knowledge rather than sentence structure and word order — and to choose carefully a set of PDPs that retain this property. In addition, strong preference will be given to PDPs that do not rely on selectional restriction or on syntactical characteristics of corpora, and which are of roughly the same complexity as Winograd schemas. Transparency The aim of this competition is to advance science; all results obtained must be reproducible, and communicable to the public. As such, any winning entry is encouraged to furnish to the organizers of the Winograd Schema Challenge Competition its source code and executable code, and to use open source databases or knowledge bases or make its databases and knowledge structures available for independent verification of results. If an organization cannot do this, other methods for assuring reproducibility of results will be considered, such as furnishing a detailed trace of execution. Details of such methods will be published on the Winograd Schema Challenge Competition website. Entries that do not satisfy these requirements, even if excelling at the competition, will be disqualified. An individual representing an organization’s entry must be present at the competition, and must bring a laptop on which the entry will run. The specifications of the laptop to be used are given at the Winograd Schema Challenge Competition website. It is assumed that the laptop will have a hard drive no larger than one terabyte, but researchers may negotiate this point and other details of laptop specifications with organizers. Reasonable requests will be considered. Some entries will need to use the Internet during the running of the test. This will be allowed but restricted. The room in which the competition will take place will have neither wireless nor cellular access to the Internet. Internet access will be provided through a highspeed wired cable modem or fiber optic service. Access to a highly restricted set of sites will be provided. Access to the Google search engine will be allowed. All access to the Internet will be monitored and recorded. If any entry that is eligible for a prize has accessed the Internet during the competition, it will be necessary to ver- ify that the system can achieve similar results at another undisclosed time. The laptop on which the potentially prize-winning system has run must be given to the WSCC organizers. They will then run the system on the test at some undisclosed time during a twoweek period following the competition. Following the system run, organizers will compare the results obtained with the results achieved during the competition, and check that they are reasonably close. 
If the system relies on statistical algorithms, the answers may not be identical, because what is retrieved through an Internet query will not be exactly the same; however, the differences should be relatively small. In the three weeks following the competition, researchers with winning or potentially winning entries will be expected to submit to WSCC organizers a paper explaining the algorithms, knowledge sources, and knowledge structures used. These papers will be posted on the commonsensereasoning.org website. Publication on the commonsensereasoning.org website does not preclude any other publication. Entries not submitting such a paper will be disqualified. Provisional results will be announced the day after the competition. Three weeks after the competition, final results will be announced.

AI Community's Potential Gain

Publishing papers on approaches to solving the Winograd Schema Challenge is required for those eligible for a prize and highly encouraged for everyone else. All papers submitted will be posted on the Winograd Schema Challenge Competition website; it is hoped that in addition they will be submitted and published in other venues. A central aim of the Winograd Schema Challenge is that it ought to serve as motivation for research in commonsense reasoning, and we are eager to see the many directions that this research will take. WSCC organizers will try to use the data obtained from running the competition to assess progress in automating commonsense reasoning by calculating the proportion of correct results in various subfields of commonsense reasoning.

The existing example corpus and test corpus of Winograd schemas have been developed with the goal of automating commonsense reasoning, and span many areas of common sense, including physical, spatial, and social reasoning, as well as commonsense knowledge about many common domains such as transportation, criminal acts, medical treatment, and household furnishings. PDPs will be chosen with this goal and with these areas of common sense in mind as well. Current plans are to annotate example PDPs and WSs with some of the commonsense areas that might prove useful in disambiguating the text. The WSCC organizers will choose an annotation scheme that is (partly) based on an existing taxonomy, such as that given by OpenCyc4 or DBPedia.5 Note that a PDP or WS might be annotated with several different commonsense domains. An entire test corpus, annotated in this way, may prove useful in assessing a system's proficiency in specific domains of commonsense reasoning. For example, a system might correctly answer 65 percent of all PDPs and WSs that involve spatial reasoning, but only 15 percent of all PDPs and WSs involving social reasoning. Assuming the sentences are of roughly the same complexity, this could indicate that the system is more proficient at spatial reasoning than at social reasoning.

The systems that excel in answering PDPs and WSs correctly should be capable of markedly improved natural language processing compared to current systems. For example, in translating from English to French, Google Translate often translates pronouns incorrectly, using the incorrect gender, presumably because it cannot properly determine pronoun references; the technology underlying a system that wins the WSCC could improve Google Translate's performance in this regard.
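To make the kind of assessment described above concrete, the following is a minimal sketch, in Python, of how a test run could be scored: overall accuracy, the qualification threshold (at least 90 percent, or three points below interannotator agreement, whichever is greater), and a per-domain breakdown based on the planned commonsense-domain annotations. The item format and domain labels are illustrative assumptions, not part of any official competition tooling.

# Illustrative scoring sketch for a WSCC test run (not the official software).
# Each item is assumed to carry the system's answer, the correct answer,
# and a hypothetical set of commonsense-domain annotations.
from collections import defaultdict

def required_threshold(interannotator_agreement):
    # At least 90 percent, or 3 points below interannotator agreement,
    # whichever is greater (e.g., 95 percent agreement -> 92 percent required).
    return max(0.90, interannotator_agreement - 0.03)

def score_run(items, interannotator_agreement):
    correct = sum(1 for it in items if it["system_answer"] == it["correct_answer"])
    overall = correct / len(items)
    # Per-domain accuracy; an item annotated with several domains counts toward each.
    per_domain = defaultdict(lambda: [0, 0])  # domain -> [correct, total]
    for it in items:
        hit = it["system_answer"] == it["correct_answer"]
        for domain in it["domains"]:
            per_domain[domain][1] += 1
            per_domain[domain][0] += int(hit)
    report = {d: c / t for d, (c, t) in per_domain.items()}
    return {
        "overall": overall,
        "qualifies": overall >= required_threshold(interannotator_agreement),
        "per_domain": report,
    }

# Example with two toy items:
items = [
    {"system_answer": "Mrs. March", "correct_answer": "Mrs. March", "domains": {"social"}},
    {"system_answer": "Castello", "correct_answer": "Mariano", "domains": {"spatial", "physical"}},
]
print(score_run(items, interannotator_agreement=0.95))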
More broadly, a system that contains the commonsense knowledge that facilitates correctly answering the many PDPs and WSs in competition should be capable of supporting a wide range of commonsense reasoning that 54 AI MAGAZINE would prove useful in many AI applications, including planning, diagnostics, story understanding, and narrative generation. The sooner a system wins the Winograd Schema Challenge Competition, the sooner we will be able to leverage the commonsense reasoning that such a system would support. Even before the competition is won, however, we look forward to AI research benefiting from the commonsense knowledge and reasoning abilities that researchers build into the systems that will participate in the challenge. Acknowledgements This article grew out of an invited talk by the first author at the Beyond Turing Workshop organized by Gary Marcus, Francesca Rossi, and Manuela Veloso at AAAI-2016; the ideas were further developed through conversations and email with the second and third authors after the conclusion of the workshop, and during a very productive panel session on the WSC at Commonsense-2015, held as part of the AAAI Spring Symposium Series. Thanks especially to Andrew Gordon, Jerry Hobbs, Ron Keesing, Pat Langley, Gary Marcus, and Bob Sloane for helpful discussions. Notes 1. See E. Davis’s web page, A Collection of Winograd Schemas, 2012: www.cs.nyu. edu/davise/papers/WS.html 2. www.commonsensereasoning.org/winograd. 3. Except that possibly there may be more than two choices in a PDP, which is disallowed in WSs by construction. So if a system notices three or more possibilities for an answer, it could know that it is dealing with a PDP. But it is a distinction without a difference; this knowledge does not seem to lead to any new approach for solution. 4. www.opencyc.org 5. wiki.dbpedia.org References Christian, B. 2011. Mind Versus Machine. The Atlantic, March. Levesque, H.; Davis, E.; and Morgenstern, L. 2012 The Winograd Schema Challenge. In Principles of Knowledge Representation and Reasoning: Proceedings of the Thirteenth International Conference (KR2012), 552–561. Palo Alto, CA: AAAI Press. Turing, A. 1950. Computing Machinery and Intelligence. Mind 59(236): 433–460. dx.doi.org/10.1093/mind/LIX.236.433 University of Reading. 2015. Turing Test Success Marks Milestone in Computing History. Press Release, June 8, 2014. Communications Office, University of Reading, Reading, UK (www.reading.ac.uk/news-andevents/releases/PR583836.aspx). Weizenbaum, J. 1966. ELIZA — A Computer Program for the Study of Natural Language Communication Between Man and Machine. Communications of the ACM 9(1): 36–45. dx.doi.org/10.1145/365153.365168 Leora Morgenstern is a technical fellow and senior scientist at Leidos Corporation. Her research focuses on developing innovative techniques in knowledge representation and reasoning, targeted toward deep understanding of large corpora in a wide variety of domains, including legal texts, biomedical research, and social media. She heads the Executive Committee of commonsensereasoning.org, which has run the biennial Commonsense Symposium series since 1991. She received a BA in mathematics from the City College of New York and a Ph.D. in computer science from Courant Institute of Mathematical Sciences, New York Universtiy. Ernest Davis is a professor of computer science at New York University. His research area is automated commonsense reasoning, particularly commonsense spatial and physical reasoning. 
He is the author of Representing and Acquiring Geographic Knowledge (1986), Representations of Commonsense Knowledge (1990), and Linear Algebra and Probability for Computer Science Applications (2012); and coeditor of Mathematics, Substance and Surmise: Views on the Meaning and Ontology of Mathematics (2015). Charles Ortiz is the director of the Nuance Natural Language and AI Laboratory. His research is in collaborative multiagent systems, knowledge representation and reasoning (causation, counterfactuals, and commonsense reasoning), and robotics (cognitive and team-based robotics). His previous positions include director of research in collaborative multiagent systems at the AI Center at SRI International, adjunct professor at the University of California, Berkeley, and postdoctoral research fellow at Harvard University. He received an S.B. in physics from the Massachusetts Institute of Technology and a Ph.D. in computer and information science from the University of Pennsylvania. Articles Why We Need a Physically Embodied Turing Test and What It Might Look Like Charles L. Ortiz, Jr. I The Turing test, as originally conceived, focused on language and reasoning; problems of perception and action were conspicuously absent. To serve as a benchmark for motivating and monitoring progress in AI research, this article proposes an extension to that original proposal that incorporates all four of these aspects of intelligence. Some initial suggestions are made regarding how best to structure such a test and how to measure progress. The proposed test also provides an opportunity to bring these four important areas of AI research back into sync after each has regrettably diverged into a fairly independent area of research of its own. F or Alan Turing, the problem of creating an intelligent machine was to be reduced to the problem of creating a thinking machine (Turing 1950). He observed, however, that such a goal was somewhat ill-defined: how was one to conclude whether or not a machine was thinking (like a human)? So Turing replaced the question with an operational notion of what it meant to think through his now famous Turing test. The details are well known to all of us in AI. One feature of the test worth emphasizing, however, is its direct focus on language and its use: in its most well known form, the human interrogator can communicate but not see the computer and the human subject participating in the test. Hence, in a sense, it has always been tacitly assumed that physical embodiment plays no role in the Turing test. Hence, if the Turing test is to represent the de facto test for intelligence, having a body is not a prerequisite for demonstrating intelligent behavior.1 The general acceptance of the Turing test as a sensible measure of achievement in the quest to make computers intelligent has naturally led to an emphasis on equating intelligence with cogitation and communication. But, of course, in AI this has only been part of the story: disembodied thought alone will not get one very far in the world. The enterprise to achieve AI has always equally concerned itself with the problems of perception and action. In the physical world, this means that an agent needs to be able to perform physical actions and understand the physical actions of others. Also of concern for the field of AI is the problem of how to Copyright © 2016, Association for the Advancement of Artificial Intelligence. All rights reserved. 
ISSN 0738-4602 SPRING 2016 55 Articles quantify progress and how to support incremental development; it is, by now, pretty much agreed upon that the Turing test represents a rather weak tool for measuring the level of demonstrable intelligence or thinking associated with a particular subject, be it human or artificial. The passing of the test by a machine would certainly justify one in announcing the arrival of human-level AI, but along the way, it can only provide a rather crude measure. To address this deficiency, variants of the Turing test have been proposed and are being pursued; one notable example is the Winograd Schema Challenge2 that supports incremental testing and development (Levesque, Davis, and Morgenstern 2012). The Winograd Schema Challenge does not, however, address the physical embodiment concerns that are the subject of this article. Nevertheless, any proposed alternative must bring with it a reasonable set of quantifiable measures of performance. So, what is it about the Turing test that makes it unsuitable for gauging progress in intelligent perception and action? 3 From the perspective of action, the Turing test can only be used to judge descriptions of actions that one could argue were sufficiently detailed to be, in principle, executable. Consider some simple everyday ascriptions of action: “Little Johnny tied his shoelace,” or “LeBron James just hit a layup.” If perception is taken completely out of the picture, a purely linguistic description of these types of actions is rather problematic (read: a royal pain): one would have to write down a set of rules or axioms that correctly captured the appropriate class of movement actions and how they were stitched together to produce a particular spatiotemporally bounded highlevel movement, in this case, bona fide instances of shoelace tying or basketball layups. A more sensible alternative might involve learning from many examples, along the lines demonstrated by Li and Li (2010). And for that, you need to be able to perceive. It’s hard for me to describe a shoe-tying to you if you have never seen one or could never see one.4 However, consider now the problem of judging the feasibility of certain actions without perception, such as reported by the statement, “the key will not fit in the lock.” Through a process of spatial reasoning, an agent can determine whether certain objects (such as a ball) might fit into certain other objects (such as a suitcase). However, this sort of commonsense reasoning could only help with our example during initial considerations: perhaps to conclude whether a particular key was a candidate for fitting into a particular lock given that it was of a particular type. After all, old antique keys, car keys, and house keys all look different. However, it would still be quite impossible to answer the question, “Will the key fit?” without being able to physically perceive the key and the keyhole, physically manipulating the key, trying to get it into the hole, and turning the key.5 It’s no surprise, then, that the challenges that these sorts of actions 56 AI MAGAZINE raise have received considerable attention in the robotics literature: Matt Mason at CMU categorizes them as paradigmatic examples of “funneling actions” in which other artifacts in the environment are used to guide an action during execution (Mason 2001). Note that from a purely linguistic standpoint, the details of such action types have never figured into the lexical semantics of the corresponding verb. 
From a commonsense reasoning perspective in AI, their formalization has not been attempted for the reasons already given.6 These observations raise the question of whether verbal behavior and reasoning are the major indicators of intelligence, as Descartes and others believed. The lessons learned from AI over the last 50 years should suggest that they do not. Equally challenging and important are problems of perception and action. Perhaps these two problems have historically not received as much attention due to a rather firmly held belief that what separates human from beast is reasoning and language: all animals can see and act, after all: one surely should not ascribe intelligence to a person simply because he or she can, for example, open a door successfully. However, any agent that can perform only one action — opening a door — is certainly not a very interesting creature, as neither is one that can utter only one particular sentence. It is, rather, the ability to choose and compose actions for a very broad variety of situations that distinguishes humans. In fact, humans process a rather impressive repertoire of motor skills that distinguish them from lower primates: highly dexterous, enabling actions as diverse as driving, playing the piano, dancing, playing football, and others. And certainly, from the very inception of AI, problems of planning and acting appeared center stage (McCarthy and Hayes 1969). Functional Individuation of Objects The preceding illustrations served to emphasize the difficulty in reasoning and talking about many actions without the ability to perceive them. However, our faculty of visual perception by itself, without the benefit of being able to interact with an object or reason about its behavior, runs up against its own difficulties when it attempts to recognize correctly many classes of objects. For example, recognizing something as simple as a hinge requires not only that one can perceive it as something that resembles those hinges seen in the past, but also that one can interact with it to conclude that it demonstrates the necessary physical behavior: that is, that it consist of two planes that can rotate around a common axis. Finally, one must also be able to reason about the object in situ. The latter requires that one can reason commonsensically to determine whether it is sufficiently rigid, can be attached to two other objects (such as a door and a Articles Figure 1. Collaboratively Setting Up a Tent. A major challenge is to coordinate and describe actions, such as “Hold the pole like this while I attach the rope.” wall), and is also constructed so that it can bear the weight of one or both of those objects. So this very simple example involving the functional individuation of an object requires, by necessity, the integration of perception, action, and commonsense reasoning. The challenge tasks described in the next section nicely highlight the need for such integrated processes. The Challenge This leads finally to the question of what would constitute a reasonable physically embodied Turing test that would satisfy the desiderata so far outlined: physical embodiment coupled with reasoning and communication, support for incremental development, and the existence of clear quantitative measures of progress. In my original description of this particular challenge, I attempted to parallel the original Turing test as much as possible. 
I imagined a human tester communicating with a partially unseen robot and an unseen human; the human would have access to a physically equivalent but teleoperated pair of robot manipulators. The tester would not be able to see the body of either, only the mechanical arms and video sensors. Significant differences in the appearance of motion between the two could be reduced through stabilizing software to smooth any jerky movements. The interrogator would interact with the human and robot subject through language, as in the Turing test, and would be able to ask questions or make commands that would lead to the appropriate physical actions. The tester would also be able to demonstrate actions. However, some of the participants of the workshop at which this idea was first presented7 observed that particular expertise involving tele-operation might render comparisons difficult. The participants of the workshop agreed that the focus should instead be on defining a set of progressively more challenging problem types. The remainder of this document follows that suggestion. This challenge will consist of two tracks: The construction track and the exploration track. The construction track’s focus will be on building predefined structures (such as a tent or modular furniture) given a combination of verbal instructions and diagrams or pictures. A collaborative subtrack will extend this to multiple individuals, a human agent and a robotic agent. The exploration track will be more improvisational in flavor and focus on experiments in building, modifying, and interacting with complex structures in terms of more abstract mental models, possibly acquired through experimentation itself. These struc- SPRING 2016 57 Articles Figure 2. The IkeaBots Developed at the Massachusetts Institute of Technology Can Collaborate on the Construction of Modular Furniture. tures can be static (for example, as in figure 3) or dynamic (as in figure 6). Communication through natural language will be an integral part of each track. One of the principal goals of this challenge is to demonstrate grounding of language both during execution of a task and after completion. For example, for both the exploration and the construction tracks, the agents must be able to accept initial instructions, describe and explain what they are doing, accept critique or guidance, and consider hypothetical changes. 8 The Construction Track The allowable variability of target structures in the construction track is expected to be less than in the exploration track. The construction task will involve building predefined structures that would be specified through a combination of natural language and pictures. Examples might include an object such as a tent (figure 1) or putting together Ikea-like furniture (figure 2). Often, ancillary information in the form of diagrams or snapshots plays an important role in instructions (see, for example, figure 4). During the task challenge definition phase, the degree to which this complex problem can be limited (or perhaps included as part of another challenge) will be investigated. Crowdsourced sites that contain such 58 AI MAGAZINE instructions might be useful to consult in this respect.9 The collaboration task requires that the artificial and human agents exchange information before and during execution to guide the construction task. 
A teammate might ask for help through statements such as, "Hold the tent pole like this while I tighten the rope"; the system must reason commonsensically about the consequences of the planned rope-tightening action for the requested task, as well as about how an utterance such as "Hold . . . like this . . ." should be linguistically interpreted and coordinated with the simultaneous visual interpretation. The rigidity of materials, the methods of attachment, and the structural function of elements (that is, that tent poles are meant to hold up the fabric of a tent) will be varied, as will the intended functionality of the finished product (for example, a tent should keep water out and also not fall apart when someone enters it). Eventually, time to completion could also be a metric; however, for now, these proposed tasks are of sufficient difficulty that the major concern should simply be success.

The description given here of the construction task places emphasis on robotic manipulation; however, there are nonmanipulation robotic tasks that could be incorporated into the challenge that also involve an integration of perception, reasoning, and action. Examples include finding a set of keys, counting the number of chairs in a room, and delivering a message to some person carrying a suitcase.10 An organization committee that will be selected for this challenge will investigate the proper mix of such tasks into the final challenge roadmap.

Table 1. Some Possible Levels of Progression for the Construction Track Challenge Tasks. Certain capabilities might best be first tested somewhat independently; for example, perception faculties might be tested for by having the agent watch a human perform the task and being able to narrate what it observes.
Construction by one agent — abilities demonstrated: basic physical, perceptual, and motor skills.
Collaboration — abilities demonstrated: monitoring the activity (perceiving progress, identifying obstacles) and contributing help.
Communication — abilities demonstrated: reference ("hold like <this>"), offering help, explaining, question answering ("why did you let go?"), and narrating the activity as necessary.

There are many robotic challenges involving manipulation and perception related to this challenge. However, a number of recent existence proofs provide some confidence that such a challenge can be initiated now. The final decisions on subchallenge definition will be made by the organizing committee. As the complexity of these tasks increases, one can imagine their real-world value in robot-assistance tasks as demanding as, say, repairing roads, housing construction, or setting up camp on Mars. The IkeaBot system (figure 2) developed at MIT is one such existence proof: it demonstrates the collaboration of teams of robots in assembling Ikea furniture, during which robots are able to ask automatically for help when needed (Knepper et al. 2013). Other work involving communication and human-robot collaboration coupled with sophisticated laboratory manipulation capabilities has been demonstrated at Carnegie Mellon University and represents another good starting point (Strabala et al. 2012). Research in computer vision has made impressive progress lately (Li and Li 2010, Le et al. 2012), enabling the learning and recognition of complex movements and feature-rich objects. It is hoped that this challenge will motivate extensions that factor functional considerations into any object-recognition process.
Finally, the organization committee hopes to be able to leverage robotic resources under other activities, such as the RoboCup Home Challenge,11 as much as possible.

The Exploration Track

If you've ever watched a child play with toys such as Lego blocks, you know that the child does not start with a predefined structure in mind. There is a strong element of improvisation and experimentation during a child's interactions: exploring possible structures, adapting mental models (such as that of a house or car), experimenting with sequences of attachment, modifying structures, and so on. Toys help a child groom the mind-body connection, serving as a sort of laboratory for exploring commonsense notions of space, objects, and physics. For the exploration track, I therefore propose focusing on the physical manipulation of children's toys, such as Lego blocks (figure 3).

Figure 3. An Abstract Structure of a House Built Using Lego Blocks.

The main difference between the two tracks is that the exploration track supports experimentation involving the modification of component structures, adjusting designs according to the resources available (the number of blocks, for example), and exploring states of stability during execution. These are all possible because of the simple modular components that agents would work with. The exploration track would also allow for testing the ability of intelligent agents to build a dynamic system and describe its operation in commonsense terms. Incremental progression of difficulty would be possible by choosing tasks to roughly reflect levels of child development. Table 2 summarizes possible levels of progression.

Table 2. A Sequence of Progressively More Sophisticated Skills to Guide the Definition of Subtask Challenges Within the Exploration Track.
1. Simple manipulation — create a row of blocks, then a wall.
2. Construction and abstraction — connect two walls, then build a "house"; size depends on the number of blocks available.
3. Modification — add integrated structures, such as a parking garage, to the house.
4. Narrative generation — "This piece is like a hinge that needs to be placed before the wall around it; otherwise it won't fit later" (said while installing the door of a house structure).
5. Explanation — "The tower fell because the base was too narrow."
6. Hypothetical reasoning — "What will happen if you remove this?"

The idea is to create scenarios with a pool of physical resources that could support manipulation, commonsense reasoning, abstraction of structures and objects, vision, and language (for description, explanation, hypothetical reasoning, and narrative). Figure 3 illustrates a static complex structure, while the object in figure 6 involves the interaction of many parts. In the latter case, success in the construction of the object also involves observing and demonstrating that the end functionality is the intended one. In the figure, there is a small crank at the bottom left that results in the turning of a long screw, which lifts metal balls up a column into another part of the assembly, in which the balls fall down ramps, turning various wheels and gates along the way. A description along the lines of the last sentence is an example of the sort of explanation that a robot should be able to provide, in which the abstract objects are functionally individuated, in the manner described earlier.

Figure 6. The Exploration Track Will Also Involve Dynamic Toys with Moving Parts and Some Interesting Aggregate Physical Behavior. The modularity afforded by toys makes this much easier than working with large, expensive systems. This picture is a good illustration of the need for a functional understanding of the elements of a structure. In the picture, the child can turn a crank at the bottom left — a piece that has functional significance — that turns a large red vertical screw, which then lifts metal balls up a shaft, after which they fall through a series of ramps, turning various gears along the way.
Figure 5 shows another assembly that demonstrates the creation of new objects (such as balls from clay, a continuous substance), operating a machine that creates small balls, fitting clay into syringes, and making lollipop shapes with swirls made from multiple color clays.12 Tasks involving explaining the operation of such a device, demonstrating its operation, having a particular behavior replicated, and answering questions about the processes involved are all beyond the abilities of current AI systems.

Figure 5. Manipulation and Object Formation with Nonrigid Materials.

Manipulation of Lego blocks and other small toy structures would require robotic manipulators capable of rather fine movements. Such technology exists in robotic surgical systems as well as in less costly components under development by a number of organizations.

Relation to Research in Commonsense Reasoning

The more ambitious exploration track emphasizes the development of systems that can experiment on their own, intervening in the physical operation of a system and modifying the elements and connections of the system to observe the consequences and, in the process, augment their own commonsense knowledge. Rather than having a teacher produce many examples, such self-motivated exploring agents would be able to create alternative scenarios and learn from them on their own. Currently this is all done by hand; for example, if one wants to encode the small bit of knowledge that captures the fact that not tightening the cap on a soda bottle will cause it to lose its carbonation, one would write down a suitable set of axioms. The problem, of course, is that there is so much of this sort of knowledge.

Figure 4. Instructions Often Require Pictures or Diagrams. The step-by-step instructions are for a Lego-like toy. Notice that certain pieces, such as the window or wheels, are unrecognizable as such unless they are placed in the correct context of the overall structure.

Research in cognitive science suggests the possibility of the existence of bodies of core commonsense knowledge (Tenenbaum 2015). The exploration track provides a setting for exploring these possibilities. Perhaps within such a laboratory paradigm, the role of traditional commonsense reasoning research would shift to developing general principles, such as models of causation or collaboration. AI systems would then instantiate such principles during self-directed experimentation. The proposed tests will provide an opportunity to bring four important areas of AI research (language, reasoning, perception, and action) back into sync after each has regrettably diverged into a fairly independent area of research.

Summary

This article was not about the blocks world, and it has not argued for the elimination of reasoning from intelligent systems in favor of a stronger perceptual component. This article argued that the Turing test is too weak an instrument for testing all aspects of intelligence and, inspired by the Turing test, proposed an alternative that was argued to be more suitable for motivating and monitoring progress in settings that demand an integrated deployment of perceptual, action, commonsense reasoning, and language faculties. The challenge described in this document differs from other robotic challenges in terms of its integrative aspects. Also unique here is the perspective on agent embodiment as leading to an agent-initiated form of experimentation (the world as a physical laboratory) that can trigger commonsense learning. The considerable span of time that has elapsed since Turing proposed his famous test should be sufficient for the field of AI to devise more comprehensive tests that stress the abilities of physically embodied intelligent systems to think as well as do.

Notes
1. One should resist the temptation here of equating intelligence with being smart in the human sense, as in having a high IQ. That has rarely been the case in AI, where we have usually been quite happy to try to replicate everyday human behavior. In the remainder of this article, I will use the term intelligence in this more restrictive, technical sense.
2. Winograd Challenge, 2015, commonsensereasoning.org/winograd.html.
3. I certainly would not deny that a program that passed the Turing test was intelligent. What I am suggesting is that it would not be intelligent in a broad enough set of areas for the many problems of interest to the field of AI. The Turing test was never meant as a necessary test of intelligence, only a sufficient one. The arguments that I am presenting, then, suggest that the Turing test also does not represent a sufficient condition for intelligence, only evidence for intelligence (Shieber 2004).
4. I take this point to be fairly uncontroversial in AI: a manual with a picture describing some action (such as setting up a tent) is often fairly useless without the pictures.
5. A similar observation was made in the context of the spatial manipulation of buttons (Davis 2011).
6. Put most simply, the best that the Turing test could test for is whether a subject would answer correctly to something like, "Suppose I had a key that looked like . . . and a lock that looked like . . . Would it fit?" How on earth is one to find something substantive to substitute (that is, to say) for the ellipses here that would have any relevant consequence for the desired conclusion in the actual physical case?
7. Beyond the Turing Test: AAAI-15 Workshop WS06. January 25, 2015, Austin, Texas.
8. One might be concerned that the inclusion of language is overly ambitious. However, without it one would be left with a set of challenge problems that could just as easily be sponsored by the robotics or computer vision communities alone. The inclusion of language makes this proposed challenge more appropriately part of the concerns of general AI.
9. See, for example, www.wikihow.com/Assemble-a-Tent.
10. I am grateful to an anonymous reviewer for bringing up this point.
11. www.robocupathome.org.
12. See www.youtube.com/watch?v=Cac7Nkki_X0.

References
Davis, E. 2011. Qualitative Spatial Reasoning in Interpreting Narrative. Keynote talk presented at the 2011 Conference on Spatial Information Theory, September 14, Belfast, Maine.
Knepper, R. A.; Layton, T.; Romanishin, J.; and Rus, D.
2013. IkeaBot: An Autonomous multiRobot Coordinated Furni- 62 AI MAGAZINE ture Assembly System. In Proceedings of the 2013 IEEE International Conference on Robotics and Automation (ICRA). Piscataway, NJ: Institute for Electrical and Electronics Engineers. dx.doi.org/10.1109/ICRA.2013.6630673 Le, Q. V.; Ranzato, M.; Monga, R.; Devin, M.; Chen, K.; Corrado, G. S.; Dean J.; and Ng, A. Y. 2012. Building High-Level Features Using Large Scale Unsupervised Learning. In Proceedings of the 29th International Conference on Machine Learning. Madison, WI: Omnipress. Levesque, H.; Davis, E.; and Morgenstern, L. 2012. The Winograd Schema Challenge. In Principles of Knowledge Representation and Reasoning: Proceedings of the Thirteenth International Conference (KR2012), 552–561. Palo Alto: AAAI Press. Li, F. F., and Li, L.-J. 2010. What, Where, and Who? 2010. Telling the Story of an Image by Activity Classification, Scene Recognition, and Object Categorization. In Computer Vision: Detection, Recognition, and Reconstruction, Studies in Computational Intelligence Volume 285. Berlin: Springer. Mason, M. T. 2001. Mechanics of Robotic Manipulation. Cambridge, MA: The MIT Press.. McCarthy, J., and Hayes, P. J. Some Philosophical Problems from the Standpoint of Artificial Intelligence. Machine Intelligence 4, 463–502. Edinburgh, UK: Edinburgh University Press. Shieber, S., ed. 2004. The Turing Test: Verbal Behavior as the Hallmark of Intelligence. Cambridge, MA: The MIT Press. Strabala, K.; Lee, M. K.; Dragan, A.; Forlizzi, J.; and Srinivasa, S. 2012. Learning the Communication of Intent Prior to Physical Collaboration. In Proceedings of the 21st IEEE International Symposium on Robot and Human Interactive Communication. Piscataway, NJ: Institute of Electrical and Electronics Engineers. dx.doi.org/10.1109/roman.2012.6343875 Tennenbaum, Josh. Cognitive Foundations for CommonsSense Knowledge Representation. Invited talk presented at the AAAI 2015 Spring Symposium on Knowledge Representation and Reasoning: Integrating Symbolic and Neural Approaches. Alexandria, VA, 23–25 March. Turing, A. M. 1950. Computing Machinery and Intelligence. Mind 59(236): 433–460. dx.doi.org/10.1093/mind/LIX.236 .433 Charles Ortiz is director of the Laboratory for Artificial Intelligence and Natural Language at the Nuance Communications. Prior to joining Nuance, he was the director of research in collaborative multiagent systems at the AI Center at SRI International. His research interests and contributions are in multiagent systems (collaborative dialog-structured assistants and logic-based BDI theories), knowledge representation and reasoning (causation, counterfactuals, and commonsense reasoning), and robotics (cognitive and team robotics). He is also involved in the organization of the Winograd Schema Challenge with Leora Morgenstern and others. He holds an S.B. in physics from the Massachusetts Institute of Technolgoy and a Ph.D. in computer and information science from the University of Pennsylvania. He was a postdoctoral research fellow at Harvard University and has taught courses at Harvard and the University of California, Berkeley (as an adjunct professor) and has also presented tutorials at many technical conferences such as IJCAI, AAAI, and AAMAS. Articles Measuring Machine Intelligence Through Visual Question Answering C. 
Lawrence Zitnick, Aishwarya Agrawal, Stanislaw Antol, Margaret Mitchell, Dhruv Batra, Devi Parikh I As machines have become more intelligent, there has been a renewed interest in methods for measuring their intelligence. A common approach is to propose tasks for which a human excels, but one that machines find difficult. However, an ideal task should also be easy to evaluate and not be easily gameable. We begin with a case study exploring the recently popular task of image captioning and its limitations as a task for measuring machine intelligence. An alternative and more promising task is visual question answering, which tests a machine’s ability to reason about language and vision. We describe a data set, unprecedented in size and created for the task, that contains more than 760,000 human-generated questions about images. Using around 10 million human-generated answers, researchers can easily evaluate the machines. H umans have an amazing ability to both understand and reason about our world through a variety of senses or modalities. A sentence such as “Mary quickly ran away from the growling bear” conjures both vivid visual and auditory interpretations. We picture Mary running in the opposite direction of a ferocious bear with the sound of the bear being enough to frighten anyone. While interpreting a sentence such as this is effortless to a human, designing intelligent machines with the same deep understanding is anything but. How would a machine know Mary is frightened? What is likely to happen to Mary if she doesn’t run? Even simple implications of the sentence, such as “Mary is likely outside” may be nontrivial to deduce. How can we determine whether a machine has achieved the same deep understanding of our world as a human? In our example sentence above, a human’s understanding is rooted in multiple modalities. Humans can visualize a scene depicting Mary running, they can imagine the sound of the bear, and even how the bear’s fur might feel when touched. Conversely, if shown a picture or even an auditory recording of a woman running from a bear, a human may similarly describe the scene. Perhaps machine intelligence could be tested in a similar manner? Can a machine use natural language to describe a picture similar to a human? Similarly, could a machine generate a scene given a written description? In fact these tasks have been a goal of artificial intelligence research since its inception. Marvin Minsky famously stated in 1966 (Crevier 1993) to one of his students, “Connect a television camera to a computer and get the machine Copyright © 2016, Association for the Advancement of Artificial Intelligence. All rights reserved. ISSN 0738-4602 SPRING 2016 63 Articles A man holding a beer bottle with two hands and looking at it. A man in a white t-shirt looks at his beer bottle. A man with black curly hair is looking at a beer. A man holds a bottle of beer examining the label. … A guy holding a beer bottle. A man holding a beer bottle. A man holding a beer. A man holds a bottle. Man holding a beer. Figure 1. Example Image Captions Written for an Image Sorted by Caption Length. to describe what it sees.” At the time, and even today, the full complexities of this task are still being discovered. Image Captioning Are tasks such as image captioning (Barnard and Forsyth 2001; Kulkarni et al. 2011; Mitchell et al. 2012; Farhadi et al. 2010; Hodosh, Young, and Hockenmaier 2013; Fang et al. 2015; Chen and Zitnick 2015; Donahue et al. 2015; Mao et al. 
2015; Kiros, Salakhutdinov, and Zemel 2015; Karpathy and Fei-Fei 2015; Vinyals et al. 2015) promising candidates for testing artificial intelligence? These tasks have advantages, such as being easy to describe and being capable of capturing the imagination of the public (Markoff 2014). Unfortunately, tasks such as image captioning have proven problematic as actual tests of intelligence. Most notably, the evaluation of image captions may be as difficult as the image captioning task itself (Elliott and Keller 2014; Vedantam, Zitnick, and Parikh 2015; Hodosh, Young, and Hockenmaier 2013; Kulkarni et al. 2011; Mitchell et al. 2012). It has been observed that captions judged to be good by human observers may actually contain significant variance even though they describe the same image (Vedantam, Zitnick, and Parikh 2015). For instance see figures 1. Many people would judge the longer, more detailed captions as better. However, the details described by the captions vary significantly, for example, two hands, white T-shirt, black curly hair, label, and others. How can we evaluate a caption if 64 AI MAGAZINE there is no consensus on what should be contained in a good caption? However, for shorter, less detailed captions that are commonly written by humans, a rough consensus is achieved: “A man holding a beer bottle.” This leads to the somewhat counterintuitive conclusion that captions humans like aren’t necessarily humanlike. The task of image captioning also suffers from another less obvious drawback. In many cases it might be too easy! Consider an example success from a recent paper on image captioning (Fang et al. 2015), figure 4. Upon first inspection this caption appears to have been generated from a deep understanding of the image. For instance, in figure 4 the machine must have detected a giraffe, grass, and a tree. It understood that the giraffe was standing, and the thing it was standing on was grass. It knows the tree and giraffe are next to each other, and others. Is this interpretation of the machine’s depth of understanding correct? When judging the results of an AI system, it is important to analyze not only its output but also the data used for its training. The results in figure 4 were obtained by training on the Microsoft common objects in context (MS COCO) data set (Lin et al. 2014). This data set contains five independent captions written by humans for more than 120,000 images (Chen et al. 2015). If we examine the image in figure 4 and the images in the training data set we can make an interesting observation. For many testing images, there exist a significant number of semantically similar training images, figure 4 (right). If two images share enough semantic similarity, it is Articles What color are her eyes? What is the mustache made of? How many slices of pizza are there? Is this a vegetarian pizza? Is this location good for a tan? What flag is being displayed? Does it appear to be rainy? Does this person have 20/20 vision? Figure 2. Example Images and Questions in the Visual Question-Answering Data Set. (visualqa.org). possible a single caption could describe them both. This observation leads to a surprisingly simple algorithm for generating captions (Devlin et al. 2015). Given a test image, collect a set of captions from images that are visually similar. From this set, select the caption with highest consensus (Vedantam, Zitnick, and Parikh 2015), that is, the caption most similar to the other captions in the set. 
In many cases the consensus caption is indeed a good caption. When judged by humans, 21.6 percent of these borrowed captions are judged to be equal to or better than those written by humans for the image specifically. Despite its simplicity, this approach is competitive with more advanced approaches that use recurrent neural networks (Chen and Zitnick 2015; Donahue et al. 2015; Mao et al. 2015; Kiros, Salakhutdinov, and Zemel 2015; Karpathy and Fei-Fei 2015; Vinyals et al. 2015) and other language models (Fang et al. 2015), which can achieve 27.3 percent when compared to human captions. Even methods using recurrent neural networks commonly produce captions that are identical to training captions, even though they're not explicitly trained to do so. If captions are generated by borrowing them from other images, these algorithms are clearly not demonstrating a deep understanding of language, semantics, and their visual interpretation. In comparison, it is quite rare for two humans to repeat a sentence.

One could make the case that the fault is not with the algorithms but with the data used for training. That is, the data set contains too many semantically similar images. However, even in randomly sampled images from the web, a photographer bias is found: humans capture similar images to each other. Many of our tastes or preferences are conventional.

Visual Question Answering

As we demonstrated using the task of image captioning, determining a multimodal task for measuring a machine's intelligence is challenging. The task must be easy to evaluate, yet hard to solve. That is, its evaluation shouldn't be as hard as the task itself, and it must not be solvable using shortcuts or cheats. To solve these two problems we propose the task of visual question answering (VQA) (Antol et al. 2015; Geman et al. 2015; Malinowski and Fritz 2014; Tu et al. 2014; Bigham et al. 2010; Gao et al. 2015).

The task of VQA requires a machine to answer a natural language question about an image, as shown in figure 2. Unlike the captioning task, evaluating answers to questions is relatively easy. The simplest approach is to pose the questions with multiple-choice answers, much like standardized tests administered to students. Since computers don't get tired of reading through long lists of answers, we can even increase the length of the answer list. Another more challenging option is to leave the answers open ended. Since most answers are single words such as yes, blue, or two, evaluating their correctness is straightforward.

Figure 2. Example Images and Questions in the Visual Question-Answering Data Set (visualqa.org). Sample questions include: What color are her eyes? What is the mustache made of? How many slices of pizza are there? Is this a vegetarian pizza? Is this location good for a tan? What flag is being displayed? Does it appear to be rainy? Does this person have 20/20 vision?

Is the visual question-answering task challenging? The task is inherently multimodal, since it requires knowledge of language and vision. Its complexity is further increased by the fact that many questions require commonsense knowledge to answer. For instance, if you ask, "Does the man have 20/20 vision?" you need the commonsense knowledge that having 20/20 vision implies you don't wear glasses. Going one step further, one might be concerned that commonsense knowledge is all that's needed to answer the questions. For example, if the question was "What color is the sheep?" our common sense would tell us the answer is white.
We may test the sufficiency of commonsense knowledge by asking subjects to answer questions without seeing the accompanying image. In this case, human subjects did indeed perform poorly (33 percent correct), indicating that common sense may be necessary but is not sufficient. Similarly, we may ask subjects to answer the question given only a caption describing the image. In this case the humans performed better (57 percent correct), but still not as accurately as those able to view the image (78 percent correct). This helps indicate that the VQA task requires more detailed information about an image than is typically provided in an image caption.

How do you gather diverse and interesting questions for hundreds of thousands of images? Amazon's Mechanical Turk provides a powerful platform for crowdsourcing tasks, but the design and prompts of the experiments must be carefully chosen. For instance, we ran trial experiments prompting the subjects to write questions that would be difficult for a toddler, alien, or smart robot to answer. Upon examination, we determined that questions written for a smart robot were the most interesting, given their increased diversity and difficulty. In comparison, the questions stumping a toddler were a bit too easy. We also gathered three questions per image and ensured diversity by displaying the previously written questions and stating, "Write a different question from those above that would stump a smart robot." In total, over 760,000 questions were gathered.1

The diversity of questions supplied by the subjects on Amazon's Mechanical Turk is impressive. In figure 3, we show the distribution of words that begin the questions. The majority of questions begin with What and Is, but other questions begin with How, Are, Does, and others. Clearly no one type of question dominates.

Figure 3. Distribution of Questions by Their First Four Words. The ordering of the words starts toward the center and radiates outwards. The arc length is proportional to the number of questions containing the word. White areas indicate words with contributions too small to show.

The answers to these questions have a varying diversity depending on the type of question. Since the answers may be ambiguous (for example, "What is the person looking at?"), we collected 10 answers per question. As shown in figure 5, many question types are simply answered yes or no. Other question types, such as those that start with "What is," have a greater variety of answers. An interesting comparison is to examine the distribution of answers when subjects were asked to answer the questions with and without looking at the image. As shown in figure 5 (bottom), there is a strong bias for many questions when subjects do not see the image. For instance, "What color" questions invoke red as an answer, and for questions that are answered by yes or no, yes is highly favored.

Figure 5. Distribution of Answers Per Question Type. Top: answers when subjects are given the image. Bottom: answers when subjects are not given the image.
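Because 10 human answers are collected for each question, a simple consensus rule makes the open-ended evaluation concrete: give full credit when the machine's answer agrees with several annotators and partial credit when it agrees with only one or two. The sketch below illustrates that idea; it is not the data set's official evaluation code, and the string normalization shown is deliberately minimal.

# Illustrative consensus-based scoring for open-ended answers (a sketch,
# not the official metric): answers that match several of the ten human
# answers earn full credit; rarer answers earn partial credit.
def normalize(answer):
    return answer.strip().lower()

def answer_score(machine_answer, human_answers):
    matches = sum(1 for a in human_answers if normalize(a) == normalize(machine_answer))
    return min(matches / 3.0, 1.0)  # full credit once three of the ten humans agree

def evaluate(predictions, annotations):
    # predictions: {question_id: answer}; annotations: {question_id: [10 human answers]}
    scores = [answer_score(predictions[qid], answers)
              for qid, answers in annotations.items()]
    return sum(scores) / len(scores)

print(answer_score("yes", ["yes"] * 8 + ["no"] * 2))  # 1.0
print(answer_score("2", ["two"] * 9 + ["2"]))         # 0.33...; naive matching misses "two"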
Finally, it is important to measure the difficulty of the questions. Some questions, such as "What color is the ball?" or "How many people are in the room?" may seem quite simple. In contrast, other questions, such as "Does this person expect company?" or "What government document is needed to partake in this activity?" may require quite advanced reasoning to answer. Unfortunately, the difficulty of a question is in many cases ambiguous. The question's difficulty is as much dependent on the person or machine answering the question as on the question itself. Each person or machine has different competencies.

In an attempt to gain insight into how challenging each question is to answer, we asked human subjects to guess how old a person would need to be to answer the question. It is unlikely most human subjects have adequate knowledge of human learning development to answer the question correctly. However, this does provide an effective proxy for question difficulty. That is, questions judged to be answerable by a 3–4 year old are easier than those judged answerable by a teenager. Note, we make no claims that questions judged answerable by a 3–4 year old will actually be answered correctly by toddlers. This would require additional experiments performed by the appropriate age groups. Since the task is ambiguous, we collected 10 responses for each question. In figure 6 we show several questions for which a majority of subjects picked the specified age range. Surprisingly, the perceived age needed to answer the questions is fairly well distributed across the different age ranges. As expected, the questions that were judged answerable by an adult (18+) generally need specialized knowledge, whereas those answerable by a toddler (3–4) are more generic.

Figure 6. Example Questions Judged to Be Answerable by Different Age Groups. The percentage of questions falling into each age group is shown in parentheses.
3–4 (15.3%): Is that a bird in the sky? What color is the shoe? How many zebras are there? Is there food on the table? Is this man wearing shoes?
5–8 (39.7%): How many pizzas are shown? What are the sheep eating? What color is his hair? What sport is being played? Name one ingredient in the skillet.
9–12 (28.4%): Where was this picture taken? What ceremony does the cake commemorate? Are these boats too tall to fit under the bridge? What is the name of the white shape under the batter? Is this at the stadium?
13–17 (11.2%): Is he likely to get mugged if he walked down a dark alleyway like this? Is this a vegetarian meal? What type of beverage is in the glass? Can you name the performer in the purple costume? Besides these humans, what other animals eat here?
18+ (5.5%): What type of architecture is this? Is this a Flemish bricklaying pattern? How many calories are in this pizza? What government document is needed to partake in this activity? What is the make and model of this vehicle?

Abstract Scenes

The visual question-answering task requires a variety of skills. The machine must be able to understand the image, interpret the question, and reason about the answer. Many researchers exploring AI may not be interested in exploring the low-level tasks involved with perception and computer vision. Many of the questions may even be impossible to solve given the current capabilities of state-of-the-art computer vision algorithms. For instance, the question "How many cellphones are in the image?" may not be answerable if the computer vision algorithms cannot accurately detect cellphones. In fact, even for state-of-the-art algorithms many objects are difficult to detect, especially small objects (Lin et al. 2014).
To enable multiple avenues for researching VQA, we introduce abstract scenes into the data set (Antol, Zitnick, and Parikh 2014; Zitnick and Parikh 2013; Zitnick, Parikh, and Vanderwende 2013; Zitnick, Vedantam, and Parikh 2015). Abstract scenes, or cartoon images, are created from sets of clip art (figure 7). The scenes are created by human subjects using a graphical user interface that allows them to arrange a wide variety of objects. For clip art depicting humans, their poses and expressions may also be changed. Using the interface, a wide variety of scenes can be created, including ordinary scenes, scary scenes, or funny scenes. Since the type of clip art and its properties are exactly known, the problem of recognizing objects and their attributes is greatly simplified. This provides researchers with an opportunity to study more directly the problems of question understanding and answering. Once computer vision algorithms catch up, perhaps some of the techniques developed for abstract scenes can be applied to real images. The abstract scenes may be useful for a variety of other tasks as well, such as learning commonsense knowledge (Zitnick, Parikh, and Vanderwende 2013; Antol, Zitnick, and Parikh 2014; Chen, Shrivastava, and Gupta 2013; Divvala, Farhadi, and Guestrin 2014; Vedantam et al. 2015).

Figure 7. Example Abstract Scenes and Their Questions in the Visual Question-Answering Data Set (visualqa.org). The questions for the example scenes include “How many glasses are on the table?” “What is the woman reaching for?” “Is this person expecting company?” “What is just under the tree?” “Do you think the boy on the ground has broken legs?” “Why is the boy on the right freaking out?” “Are the kids in the room the grandchildren of the adults?” and “What is on the bookshelf?”

Discussion

While visual question answering appears to be a promising approach to measuring machine intelligence for multimodal tasks, it may prove to have unforeseen shortcomings. We’ve explored several baseline algorithms that perform poorly when compared to human performance. As the data set is explored, it is possible that solutions may be found that don’t require true AI. However, using proper analysis we hope to update the data set continuously to reflect the current progress of the field. As certain question or image types become too easy to answer, we can add new questions and images. Other modalities may also be explored, such as audio and text-based stories (Fader, Zettlemoyer, and Etzioni 2013a, 2013b; Weston et al. 2015; Richardson, Burges, and Renshaw 2013).

In conclusion, we believe designing a multimodal challenge is essential for accelerating and measuring the progress of AI. Visual question answering offers one approach for designing such challenges that allows for easy evaluation while maintaining the difficulty of the task. As the field progresses, our tasks and challenges should be continuously reevaluated to ensure they are of appropriate difficulty given the state of research. Importantly, these tasks should be designed to push the frontiers of AI research and help ensure their solutions lead us toward systems that are truly AI complete.

Notes
1. visualqa.org.

References
Antol, S.; Agrawal, A.; Lu, J.; Mitchell, M.; Batra, D.; Zitnick, C. L.; and Parikh, D. 2015. VQA: Visual Question Answering. Unpublished paper deposited in The Computing Research Repository (CoRR) 1505.00468. Association for Computing Machinery.
Antol, S.; Zitnick, C. L.; and Parikh, D. 2014. Zero-Shot Learning via Visual Abstraction.
In Computer Vision-ECCV 2014: Proceedings of the 13th European Conference, Part IV. Lecture Notes in Computer Science Volume 8692. Berlin: Springer. dx.doi.org/10.1007/978-3-319-10593-2_27 Barnard, K., and Forsyth, D. 2001. Learning the Semantics of Words and Pictures. In Proceedings of the IEEE International Conference on Computer Vision (ICCV-01), 408–415. Los Articles Alamitos, CA: IEEE Computer Society. dx.doi.org/10.1109/ iccv.2001.937654 Conference on Computer Vision, Part IV. Lecture Notes in Computer Science Volume 6314. Berlin: Springer. Bigham, J.; Jayant, C.; Ji, H.; Little, G.; Miller, A.; Miller, R.; Miller, R.; Tatarowicz, A.; White, B.; White, S.; and Yeh, T. 2010. VizWiz: Nearly Real-Time Answers to Visual Questions. In Proceedings of the 23rd Annual ACM Symposium on User Interface Software and Technology. New York: Association for Computing Machinery. Fang, H.; Gupta, S.; Landola, F. N.; Srivastava, R.; Deng, L.; Doll, P.; Gao, J.; He, X.; Mitchell, M. Platt, J. C.; Zitnick, C. L.; and Zweig, G. 2015. From Captions to Visual Concepts and Back. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: Institute for Electrical and Electronics Engineers. dx.doi.org/ 10.1109/CVPR.2015.7298754 Chen, X., and Zitnick, C. L. 2015. Mind’s Eye: A Recurrent Visual Representation for Image Caption Generation. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: Institute for Electrical and Electronics Engineers. dx.doi.org/10.1109/CVPR.2015. 7298856 Chen, X.; Fang, H.; Lin, T. Y.; Vedantam, R.; Gupta, S.; Dollár, P.; and Zitnick, C. L. 2015. Microsoft COCO Captions: Data Collection and Evaluation Server. Unpublished paper deposited in The Computing Research Repository (CoRR) 1504.00325. Association for Computing Machinery. Chen, X.; Shrivastava, A.; and Gupta, A. 2013. NEIL: Extracting Visual Knowledge from Web Data. In Proceedings of the IEEE International Conference on Computer Vision, ICCV 2013. Piscataway, NJ: Institute for Electrical and Electronics Engineers. dx.doi.org/10.1109/iccv.2013.178 Crevier, D. 1993. AI: The Tumultuous History of the Search for Artificial Intelligence. New York: Basic Books, Inc. Devlin, J.; Gupta, S; Girshick, R.; Mitchell, M.; and Zitnick, C. L. 2015. Exploring Nearest Neighbor Approaches for Image Captioning. Unpublished paper deposited in The Computing Research Repository (CoRR) 1505.04467. Association for Computing Machinery. Divvala, S.; Farhadi, A.; and Guestrin, C. 2014. Learning Everything About Anything: Webly-Supervised Visual Concept Learning. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: Institute for Electrical and Electronics Engineers. dx.doi.org/10.1109/CVPR.2014.412 Donahue, J.; Hendricks, L. A.; Guadarrama, S.; Rohrbach, M.; Venugopalan, S.; Saenko, K.; and Darrell, T. 2015. LongTerm Recurrent Convolutional Networks for Visual Recognition and Description. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: Institute for Electrical and Electronics Engineers. dx.doi.org/10.1109/CVPR.2015.7298878 Elliott, D., and Keller, F. 2014. Comparing Automatic Evaluation Measures for Image Description. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. Stroudsburg PA: Association for Computational Linguistics. x.doi.org/10.3115/v1/p14-2074 Fader, A.; Zettlemoyer, L.; and Etzioni, O. 2013a. 
Open Question Answering over Curated and Extracted Knowledge Bases. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: Assocation for Computing Machinery. Fader, A.; Zettlemoyer, L.; and Etzioni, O. 2013b. Paraphrase-Driven Learning for Open Question Answering. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: Association for Computational Linguistics.Engineers. Farhadi, A.; Hejrati, M.; Sadeghi, M. A.; Young, P.; Rashtchian, C.; Hockenmaier, J.; and Forsyth, D. 2010. Every Picture Tells a Story: Generating Sentences from Images. In Computer Vision–ECCV 2010, Proceedings of the 11th European Gao, H.; Mao, J.; Zhou, J.; Huang, Z.; Wang, L.; and Xu, W. 2015. Are You Talking to a Machine? Dataset and Methods for Multilingual Image Question Answering. Unpublished paper deposited in The Computing Research Repository (CoRR) 1505.05612. Association for Computing Machinery. Geman, D.; Geman, S.; Hallonquist, N.; and Younes, L. 2015. A Visual Turing Test for Computer Vision Systems. Proceedings of the National Academy of Sciences 112(12): 3618–3623. dx.doi.org/10.1073/pnas.1422953112 Hodosh, M.; Young, P.; Hockenmaier, J. 2013. Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics. JAIR 47: 853–899. Karpathy, A., and Fei-Fei, L. 2015. Deep Visual-Semantic Alignments for Generating Image Descriptions. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: Institute for Electrical and Electronics Engineers. dx.doi.org/10.1109/CVPR.2015. 7298932 Kiros, R.; Salakhutdinov, R.; and Zemel, R. 2015. Unifying Visual-Semantic Embeddings with Multimodal Neural Language. Unpublished paper deposited in The Computing Research Repository (CoRR) 1411.2539. Association for Computing Machinery. Kulkarni, F.; Premraj, V.; Dhar, S.; Li, S.; Choi, Y.; Berg, A. C.; and Berg, T. L. 2011. Baby Talk: Understanding and Generating Simple Image Descriptions. In Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: Institute for Electrical and Electronics Engineers. dx.doi.org/10.1109/cvpr.2011.5995466 Lin, T. Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; and Zitnick, C. L. 2014. Microsoft COCO: Common Objects in Context. In Computer Vision– ECCV 2014: Proceedings of the 13th European Conference, Part V. Lecture Notes in Computer Science Volume 8693. Berlin: Springer. Malinowski, M., and Fritz, M. 2014. A Multi-World Approach to Question Answering about Real-World Scenes Based on Uncertain Input. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, 1682–1690. La Jolla, CA: Neural Information Processing Systems Foundation. Mao, J.; Xu, W.; Yang, Y.; Wang, J.; Huan, Z.; and Yuille, A. L. 2015. Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN). Unpublished paper deposited in arXiv. arXiv preprint arXiv:1412.6632. Ithaca, NY: Cornell University. Markoff, J. 2014. Researchers Announce Advance in ImageRecognition Software. New York Times, Science Section (November 17). Mitchell, M.; Han, X.; Dodge, J.; Mensch, A.; Goyal, A.; Berg, A.; Yamaguchi, K.; Berg, T.; Stratos, K.; Daumé, H. 2012. Midge: Generating Image Descriptions from Computer Vision Detections. 
In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics. Stroudsburg, PA: Association for Computational Linguistics.
Richardson, M.; Burges, C.; and Renshaw, E. 2013. MCTest: A Challenge Dataset for the Machine Comprehension of Text. In EMNLP 2013: Proceedings of the Empirical Methods in Natural Language Processing Conference. Stroudsburg, PA: Association for Computational Linguistics.
Tu, K.; Meng, M.; Lee, M. W.; Choe, T. E.; and Zhu, S. C. 2014. Joint Video and Text Parsing for Understanding Events and Answering Queries. IEEE MultiMedia 21(2): 42–70. dx.doi.org/10.1109/MMUL.2014.29
Vedantam, R.; Lin, X.; Batra, T.; Zitnick, C. L.; and Parikh, D. 2015. Learning Common Sense through Visual Abstraction. In Proceedings of the IEEE International Conference on Computer Vision, ICCV 2015. Piscataway, NJ: Institute for Electrical and Electronics Engineers.
Vedantam, R.; Zitnick, C. L.; and Parikh, D. 2015. CIDEr: Consensus-Based Image Description Evaluation. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: Institute for Electrical and Electronics Engineers. dx.doi.org/10.1109/CVPR.2015.7299087
Vinyals, O.; Toshev, A.; Bengio, S.; and Erhan, D. 2015. Show and Tell: A Neural Image Caption Generator. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: Institute for Electrical and Electronics Engineers. dx.doi.org/10.1109/CVPR.2015.7298935
Weston, J.; Bordes, A.; Chopra, S.; and Mikolov, T. 2015. Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks. Unpublished paper deposited in arXiv. arXiv preprint arXiv:1502.05698. Ithaca, NY: Cornell University.
Zitnick, C. L., and Parikh, D. 2013. Bringing Semantics into Focus Using Visual Abstraction. In Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: Institute for Electrical and Electronics Engineers. dx.doi.org/10.1109/CVPR.2013.387
Zitnick, C. L.; Parikh, D.; and Vanderwende, L. 2013. Zero-Shot Learning via Visual Abstraction. In Computer Vision–ECCV 2014: Proceedings of the 13th European Conference, Part IV. Lecture Notes in Computer Science Volume 8692. Berlin: Springer.
Zitnick, C. L.; Vedantam, R.; and Parikh, D. 2015. Adopting Abstract Images for Semantic Scene Understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence, Issue 99.

C. Lawrence Zitnick is interested in a broad range of topics related to visual recognition, language, and commonsense reasoning. He developed the PhotoDNA technology used by Microsoft, Facebook, Google, and various law enforcement agencies to combat illegal imagery on the web. He received the Ph.D. degree in robotics from Carnegie Mellon University in 2003. In 1996, he coinvented one of the first commercial portable depth cameras. Zitnick was a principal researcher in the Interactive Visual Media group at Microsoft Research, and an affiliate associate professor at the University of Washington, at the time of the writing of this article. He is now a research manager at Facebook AI Research.

Aishwarya Agrawal is a graduate student in the Bradley Department of Electrical and Computer Engineering at Virginia Polytechnic Institute and State University.
Her research interests lie at the intersection of machine learning, computer vision, and natural language processing. Stanislaw Antol is a Ph.D. student in the Computer Vision Lab at Virginia Polytechnic Institute and State University. His research area is computer vision — in particular, finding new ways for humans to communicate with vision algorithms. Margaret Mitchell is a researcher in Microsoft’s NLP Group. She works on grounded language generation, focusing on how to help computers communicate based on what they can process. She received her MA in computational linguistics from the University of Washington, and her Ph.D. from the University of Aberdeen. Dhruv Batra is an assistant professor at the Bradley Department of Electrical and Computer Engineering at Virginia Polytechnic Institute and State University, where he leads the VT Machine Learning and Perception group. He is a member of the Virginia Center for Autonomous Systems (VaCAS) and the VT Discovery Analytic Center (DAC). He received his M.S. and Ph.D. degrees from Carnegie Mellon University in 2007 and 2010, respectively. His research interests lie at the intersection of machine learning, computer vision, and AI. Devi Parikh is an assistant professor in the Bradley Department of Electrical and Computer Engineering at Virginia Polytechnic Institute and State University and an Allen Distinguished Investigator of Artificial Intelligence. She leads the Computer Vision Lab at VT, and is also a member of the Virginia Center for Autonomous Systems (VaCAS) and the VT Discovery Analytics Center (DAC). She received her M.S. and Ph.D. degrees from the Electrical and Computer Engineering Department at Carnegie Mellon University in 2007 and 2009, respectively. She received her B.S. in electrical and computer engineering from Rowan University in 2005. Her research interests include computer vision, pattern recognition, and AI in general, and visual recognition problems in particular. Articles Turing++ Questions: A Test for the Science of (Human) Intelligence Tomaso Poggio, Ethan Meyers I There is a widespread interest among scientists in understanding a specific and well defined form of intelligence, that is human intelligence. For this reason we propose a stronger version of the original Turing test. In particular, we describe here an open-ended set of Turing++ questions that we are developing at the Center for Brains, Minds, and Machines at MIT — that is questions about an image. For the Center for Brains, Minds, and Machines the main research goal is the science of intelligence rather than the engineering of intelligence — the hardware and software of the brain rather than just absolute performance in face identification. Our Turing++ questions reflect fully these research priorities. I t is becoming increasingly clear that there is an infinite number of definitions of intelligence. Machines that are intelligent in different narrow ways have been built since the 50s. We are entering now a golden age for the engineering of intelligence and the development of many different kinds of intelligent machines. At the same time there is a widespread interest among scientists in understanding a specific and well defined form of intelligence, that is human intelligence. For this reason we propose a stronger version of the original Turing test. In particular, we describe here an open-ended set of Turing++ questions that we are developing at the Center for Brains, Minds, and Machines (CBMM) at MIT — that is, questions about an image. 
Questions may range from what is there to who is there, what is this person doing, what is this girl thinking about this boy, and so on. The plural in questions is to emphasize that there are many different intelligent abilities in humans that have to be characterized, and possibly replicated in a machine, from basic visual recognition of objects, to the identification of faces, to gauging emotions, to social intelligence, to language, and much more. Recent advances in cognitive neuroscience have shown that even in the more limited domain of visual intelligence, answering these questions requires different competencies and abilities, often rather independent from each other, often corresponding to separate modules in the brain. The term Turing++ is to emphasize that our goal is understanding human intelligence at all of Marr’s levels — from the level of the computations to the level of the underlying circuits. Answers to the Turing++ questions should thus be given in terms of models that match human behavior and human physiology — the mind and the brain. These requirements are thus well beyond the original Turing test. A whole scientific field that we call the science of (human) intelligence is required to make progress in answering our Turing++ questions. It is connected to neuroscience and to the engineering of intelligence but also separate from both of them.

Definitions of Intelligence

We may call a person intelligent and even agree among ourselves. But what about a colony of ants and their complex behavior? Is this intelligence? Were the mechanical computers built by Turing to decode the encrypted messages of the German U-boats actually intelligent? Is Siri intelligent? The truth is that the question “What is intelligence?” is ill-posed, as there are many different answers: an infinite number of different kinds of intelligence. This is fine for engineers, who may be happy to build many different types of intelligent machines. The scientists among us may instead prefer to focus on a question that is well defined and can be posed in a scientific way: the question of human intelligence. In the rest of the paper we use the term intelligence to mean human intelligence.

Understanding Human Intelligence

Consider the problem of visual intelligence. Understanding such a complex system requires understanding it at different levels (in the Marr sense; see Poggio [1981, 2012]), from the computations to the underlying circuits. Thus we need to develop algorithms that provide answers of the type humans do. But we really need to achieve more than just simulating the brain’s output, more than what Turing asked. We need to understand what understanding an image by a human brain means. We need to understand the algorithms used by the brain, but we also need to understand the circuits that run these algorithms. This may also be useful if we want to be sure that our model is not just faking the output of a human brain by using a giant lookup table of what people usually do in similar situations, as hinted at the end of the movie Ex Machina. Understanding a computer means understanding the level of the software and the level of the hardware. Scientific understanding of human intelligence requires something similar — understanding of the mind as well as of the brain.
Using Behavior and Physiology as a Guide

To constrain our search for intelligent algorithms, we are focusing on creating computational models that match human behavior and neural physiology. There are several reasons why we are taking this approach. The first reason, as hinted above, is to avoid superficial solutions that mimic intelligent behavior under very limited circumstances, but that do not capture the true essence of the problem. Such superficial solutions have been a prominent approach to the traditional Turing test going back to the ELIZA program written in the 1960s (Weizenbaum 1966). While these approaches might occasionally fool humans, they do not address many of the fundamental issues and thus will fail to match many aspects of human behavior.

A second, related reason is that algorithms might appear to perform well when tested under limited circumstances, but when compared to the full range of human abilities they might not do nearly as well. For example, deep neural networks work very well on object-recognition tasks, but also fail in simple ways that would never be seen in human behavior (Szegedy et al. 2013). By directly comparing computer systems’ results to human behavioral results we should be able to assess whether a system that is displaying intelligent behavior is truly robust (Sinha et al. 2006).

A final reason is that studying primate physiology can give us guidance about how to approach the problem. For example, recognizing people based on their faces appears to occur in discrete face patches in the primate brain (see Freiwald and Tsao [2010], and the section below). By understanding the computational roles of these patches we aim to understand the algorithms that are used by primates to solve these tasks (Meyers et al. 2015).

Intelligence Is One Word but Many Problems

Recent advances in cognitive neuroscience have shown that different competencies and abilities are needed to solve visual tasks, and that they seem to correspond to separate modules in the brain. For instance, the apparently similar questions of object and face recognition (what is there versus who is there) involve rather distinct parts of the visual cortex (for example, the lateral occipital cortex versus a section of the fusiform gyrus). The word intelligence can be misleading in this context, like the word life was during the first half of the last century, when popular scientific journals routinely wrote about the problem of life, as if there were a single substratum of life waiting to be discovered to unveil the mystery completely. Of course, speaking today about the problem of life sounds amusing: biology is a science dealing with many different great problems, not just one. Thus we think that intelligence is one word but many problems, not one but many Nobel prizes. This is related to Marvin Minsky’s view of the problem of thinking, well captured by the slogan Society of Mind. In the same way, a real Turing test is a broad set of questions probing the main aspects of human thinking.

Because “intelligence” encompasses a large set of topics, we have chosen visual intelligence in human and nonhuman primates as a primary focus. Our approach at the Center for Brains, Minds, and Machines to visual intelligence includes connections to some developmental, spatial, linguistic, and social questions.

Figure 1. Street Fair. Courtesy of Boris Katz, CBMM, from the LabelMe database.
To further sharpen our focus, we are emphasizing measuring our progress using questions, described in more detail below, that might be viewed as extensions of the Turing test. We have dubbed these Turing++ questions. Computational models we develop will be capable of responding to queries about visual scenes and movies — who, what, why, where, how, with what motives, with what purpose, and with what expectations. Unlike a conventional engineering enterprise that tests only absolute (computational) performance, we will require that our models exhibit consistency with human performance/behavior, with human and primate physiology, and with human development. The term Turing++ refers to these additional levels of understanding that our models and explanations must satisfy.

The Turing++ Questions

Our choice of questions follows in part from our understanding of human intelligence grounded in the neuroscience of the brain. Each question roughly corresponds to a distinct neural module in the brain. We have begun defining an initial set of such problems/questions about visual intelligence, since vision is our entry point into the problem of intelligence. We call such questions Turing++ questions because they are inspired by the classical Turing test but go well beyond it. Traditional Turing tests permit counterfeiting and require matching only a narrowly defined level of human performance. Successfully answering Turing++ questions will require us not only to build systems that emulate human performance, but also to ensure that such systems are consistent with our data on human behavior, brains, neural systems, and development. An open-ended set of Turing++ questions can be effectively used to measure progress in studying the brain-based intelligence needed to understand images and video.

As an example consider the image shown in figure 1. A deep-learning network might locate faces and people. One could not interrogate such a network, however, with a list of Turing++ questions such as What is there? Who is there? What are they doing? How, in detail, are they performing actions? Are they friends or enemies or strangers? Why are they there? What will they do next? Have you seen anything like this before? We effortlessly recognize objects, agents, and events in this scene. We, but not a computer program, could recognize that this is a street market; several people are shopping; three people are conversing around a stroller; a woman is shopping for a shirt; although the market takes place on a street, clearly no cars are permitted to drive down it; we can distinguish between the pants that are for sale and the pants that people are wearing. We, but not a computer program, could generate a narrative about the scene. It’s a fairly warm, sunny day at a weekend market. The people surrounding the stroller are a mother and her parents. They are deciding where they would like to eat lunch.

Figure 2. Macaque Visual Cortex Patches Involved in Face Perception. The figure shows the face patches PL, ML, MF, AL, AF, and AM, with orientation-tuning panels for ML, AL, and AM. Courtesy of Winrich Freiwald, CBMM. Modified from Tsao, D. Y.; Moeller, S.; and Freiwald, W. A. 2008. Comparing Face Patch Systems in Macaques and Humans. Proceedings of the National Academy of Sciences of the United States of America 105(49): 19514–9.
We would assess the performance of a model built to answer questions like these by evaluating (1) how similarly to humans our neural models of the brain answer the questions, and (2) how well their implied physiology correlates with human and primate data obtained by using the same stimuli. Our Turing++ questions require more than a good imitation of human behavior; our computer models should also be humanlike at the level of the implied physiology and development. Thus the CBMM test of models uses Turing-like questions to check for humanlike performance/behavior, humanlike physiology, and humanlike development. Because we aim to understand the brain and the mind and to replicate human intelligence, the challenge intrinsic to the testing is not to achieve the best absolute performance, but performance that correlates strongly with human intelligence measured in terms of behavior and physiology. We will compare models and theories with fMRI and MEG recordings, and will use data from the latter to inform our models. Physiological recordings in human patients and monkeys will allow us to probe neural circuitry during some of the tests at the level of individual neurons. We will carry out some of the tests in babies to study the development of intelligence.

The series of tests is open ended. The initial ones, such as face identification, are tasks that computers are beginning to do and where we can begin to develop models and theories of how the brain performs the task. The later ones, such as generating stories explaining what may have been going on in the videos and answering questions about previous answers, are goals for the next few years of the center and beyond. The modeling and algorithm development will be guided by scientific concerns, incorporating constraints and findings from our work in cognitive development, human cognitive neuroscience, and systems neuroscience. These efforts likely would not produce the most effective AI programs today (measuring success against objectively correct performance); the core assumption behind this challenge is that by developing such programs and letting them learn and interact, we will get systems that are ultimately intelligent at the human level.

An Example of a Turing++ Question: Who Is There (Face Identification)

The Turing++ question that is most ripe, in the sense that it can be answered at all the required levels, is face identification. We have data about human performance in face identification from a field called the psychophysics of face recognition. We know which patches of visual cortex in humans are involved in face perception by using fMRI techniques. We can identify the homologue areas in the visual cortex of the macaque, where there is a similar network of interconnected patches, shown in figure 2. In the monkey it is possible to record from individual neurons in the various patches and characterize their properties. Neurons in patch ML are view- and identity-tuned, while neurons in AM are identity-specific but more view-invariant. Neurons in the intermediate patch AL tend to be mirror-symmetric: if they are tuned to a view, they are also likely to be tuned to the symmetric one. We begin to have models that perform face identification well and are consistent with the architecture and the properties of face patches (that is, we can make a correspondence between stages in the algorithm and properties of different face patches).
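The article does not prescribe a particular scoring procedure for this kind of comparison. Purely as an illustration, the sketch below shows one way a behavioral-consistency check could be set up, correlating per-stimulus human accuracy with a model’s per-stimulus accuracy on the same face images; the function and variable names are hypothetical assumptions, not CBMM’s protocol.

```python
# A minimal, illustrative sketch (not the CBMM protocol): correlate how hard
# each face-identification stimulus is for humans with how hard it is for a
# model, using the same stimuli for both. All names here are hypothetical.
import numpy as np

def behavioral_consistency(human_accuracy, model_accuracy):
    """Pearson correlation between per-stimulus human accuracy (fraction of
    subjects correct) and per-stimulus model accuracy; both inputs must have
    the same length and nonzero variance."""
    human = np.asarray(human_accuracy, dtype=float)
    model = np.asarray(model_accuracy, dtype=float)
    return float(np.corrcoef(human, model)[0, 1])

# Toy example: three faces humans find easy and two they find hard.
humans = [0.95, 0.90, 0.92, 0.40, 0.35]
model = [1.0, 1.0, 1.0, 0.0, 1.0]
print(behavioral_consistency(humans, model))
```

A comparison of the same general form could, in principle, be run against per-stimulus neural responses rather than behavioral accuracy, which is the kind of physiological consistency discussed next.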
The challenge is to have performance that correlates highly with human performance on the same data sets of face images and that predict the behavior of neurons in the face patches for the same stimuli. In September of 2015, CBMM organized the first Turing++ questions workshop, focused on face identification. The title of the workshop is A Turing++ Question: Who Is There? The workshop introduced databases and reviewed the states of existing models to answer the question who is there at the levels of performance and neural circuits. The Science of Intelligence For the Center for Brains, Minds, and Machines the main research goal is the science of intelligence rather than the engineering of intelligence — the hardware and software of the brain rather than just absolute performance in face identification. Our Turing++ questions reflect fully these research priorities. The emphasis on answers at the different levels of behaviour and neural circuits reflects the levels-of-understanding paradigm (Marr 2010). The argument is that a complex system — like a computer and like the brain/mind — must be understood at several different levels, such as hardware and algorithms/computations. Though Marr emphasizes that explanations at different levels are largely independent of each other, it has been argued (Poggio 2012) that it is now important to reemphasize the connections between levels, which was described in the original paper about levels of understanding (Marr and Poggio 1977). In that paper we argued that one ought to study the brain at different levels of organization, from the behavior of a whole organism to the signal flow, that is, the algorithms, to circuits and single cells. In particular, we expressed our belief that (1) insights gained on higher levels help to ask the right questions and to do experiments in the right way on lower levels and (2) it is necessary to study nervous systems at all levels simultaneously. Otherwise there are not enough constraints for a unique solution to the problem of human intelligence. References Freiwald, W. A., and Tsao, D. Y. 2010. Functional Compartmentalization and Viewpoint Generalization Within the Macaque Face-Processing System. Science 330: 845– 851. dx.doi.org/10.1126/science.1194908 Marr, D. 2010. Vision. Cambridge, MA: The MIT Press. dx.doi.org/10.7551/mitpress/ 9780262514620.001.0001 Marr, D., and Poggio, T. 1977. From Understanding Computation to Understanding Neural Circuitry. In Neuronal Mechanisms in Visual Perception, ed. E. Poppel, R. Held, and J. E. Dowling. Neurosciences Research Program Bulletin 15: 470–488. Meyers, E.; Borzello, M.; Freiwald, W.; Tsao, D. 2015. Intelligent Information Loss: The Coding of Facial Identity, Head Pose, and Non-Face Information in the Macaque Face Patch System. Journal of Neuroscience 35(18): 7069–81. dx.doi.org/10.1523/JNEUROSCI. 3086-14.2015 Poggio, T. 1981. Marr’s Computational Approach to Vision. Trends in Neurosciences 10(6): 258–262. dx.doi.org/10.1016/01662236(81)90081-3 Poggio, T. 2012. The Levels of Understanding Framework, Revised. Perception 41(9): 1017–1023. dx.doi.org/10.1068/p7299 94(11): 1948–1962. dx.doi.org/10.1109/ JPROC.2006.884093 Szegedy, C.; Zaremba, W.; Sutskever, I.; Bruna, J.; Erhan, D.; Goodfellow, I. J.; and Fergus, R. 2013. Intriguing Properties of Neural Networks. CoRR (Computing Research Repository), abs/1312.6199. Association for Computing Machinery. Weizenbaum, J. 1966. ELIZA—A Computer Program for the Study of Natural Language Communication Between Man and Machine. 
Communications of the ACM 9(1): 36–45. dx.doi.org/10.1145/365153.365168 Tomaso A. Poggio is the Eugene McDermott Professor in the Department of Brain and Cognitive Sciences at the Massachusetts Institute of Technology, and the director of the new National Science Foundation Center for Brains, Minds, and Machines at the Massachusetts Institute of Technology, of which MIT and Harvard are the main member Institutions. He is a member of both the Computer Science and Artificial Intelligence Laboratory and of the McGovern Brain Institute. He is an honorary member of the Neuroscience Research Program, a member of the American Academy of Arts and Sciences, a Founding Fellow of AAAI, and a founding member of the McGovern Institute for Brain Research. Among other honors he received the Laurea Honoris Causa from the University of Pavia for the Volta Bicentennial, the 2003 Gabor Award, the Okawa Prize 2009, the AAAS Fellowship, and the 2014 Swartz Prize for Theoretical and Computational Neuroscience. His research has always been interdisciplinary, between brains and computers. It is now focused on the mathematics of learning theory, the applications of learning techniques to computer vision, and especially on computational neuroscience of the visual cortex. Ethan Meyers is an assistant professor of statistics at Hampshire College. He received his BA from Oberlin College in computer science, and his Ph.D. in computational neuroscience from MIT. His research examines how information is coded in neural activity, with a particular emphasis on understanding the processing that occurs in high-level visual and cognitive brain regions. To address these questions, he develops computational tools that can analyze high-dimensional neural recordings. Reichardt, W., and Poggio, T. 1976. Visual Control of Orientation Behavior in the Fly: A Quantitative Analysis. Quarterly Review of Biophysics 9(3): 311–375. dx.doi.org/10. 1017/S0033583500002523 Sinha, P.; Balas, B.; Ostrovsky, Y.; and Russell, R. 2006. Face Recognition by Humans: 19 Results All Computer Vision Researchers Should Know About. Proceedings of the IEEE SPRING 2016 77 Articles I-athlon: Toward a Multidimensional Turing Test Sam S. Adams, Guruduth Banavar, Murray Campbell I While the Turing test is a wellknown method for evaluating machine intelligence, it has a number of drawbacks that make it problematic as a rigorous and practical test for assessing progress in general-purpose AI. For example, the Turing test is deception based, subjectively evaluated, and narrowly focused on language use. We suggest that a test would benefit from including the following requirements: focus on rational behavior, test several dimensions of intelligence, automate as much as possible, score as objectively as possible, and allow incremental progress to be measured. In this article we propose a methodology for designing a test that consists of a series of events, analogous to the Olympic Decathlon, which complies with these requirements. The approach, which we call the I-athlon, is intended ultimately to enable the community to evaluate progress toward machine intelligence in a practical and repeatable way. 78 AI MAGAZINE T he Turing test, as originally described (Turing 1950), has a number of drawbacks as a rigorous and practical means of assessing progress toward human-level intelligence. One major issue with the Turing test is the requirement for deception. 
The need to fool a human judge into believing that a computer is human seems to be peripheral, and even distracting, to the goal of creating human-level intelligence. While this issue can be sidestepped by modifying the test to reward rational intelligent behavior (rational Turing test) rather than humanlike intelligent behavior, there are additional drawbacks to the original Turing test, including its language focus, complex evaluation, subjective evaluation, and the difficulty in measuring incremental progress. Copyright © 2016, Association for the Advancement of Artificial Intelligence. All rights reserved. ISSN 0738-4602 Articles Figure 1. The Olympic Decathalon. Language focused: While language use is perhaps the most important dimension of intelligence, there are many other dimensions that are relevant to intelligence, for example, visual understanding, creativity, reasoning, planning, and others. Complex evaluation: The Turing test, if judged rigorously, is expected to require extensive human input to prepare, conduct, and evaluate.1 Subjective evaluation: Tests that can be objectively evaluated are more useful in a practical sense, requiring less testing to achieve a reliable result. Difficult to measure incremental progress: In an unrestricted conversation, it is difficult to know the relative importance of various kinds of successes and failures. This adds an additional layer of subjectivity in trying to judge the degree of intelligence. In this article we propose an approach to measuring progress toward intelligent systems through a set of tests chosen to avoid some of the drawbacks of the Turing test. In particular, the tests (1) reward rational behavior (as opposed to humanlike behavior); (2) exercise several dimensions of intelligence in various combinations; (3) limit the requirement for human input in test creation and scoring; (4) use objective scoring to the extent possible; (5) permit measuring of incremental progress; (6) make it difficult to engineer a narrow task-specific system; and (7) eliminate, as much as possible, the possibility of gaming the system, as in the deception scenarios for the classic Turing test. The proposed approach, called here the I-athlon, by analogy with the Olympic Decathlon2 (figure 1), is intended to provide a framework for constructing a set of tests that require a system to demonstrate a wide variety of intelligent behaviors. In the Olympics, 10 events test athletes across a wide variety of athletic abilities as well as learned skills. In addition, the Decathlon tests their stamina and focus as they move among the 10 events over the two days of the competition. In all events, decathletes compete against specialist athletes, so it is not uncommon for them to fail to win any particular event. It is their aggregate score that declares them the World’s Greatest Athlete. One of the values of this approach for the field of artificial intelligence is that it would be inclusive of specialist systems that might achieve high levels of proficiency, and be justly recognized for the achievement, while still encouraging generalist systems to compete on the same level playing field. Principles for Constructing a Set of Tests Given our desire for broad-based, automated, objectively scored tests that can measure incremental progress and compare disparate systems on a common ground, we propose several principles for the construction of I-athlon events: Events Should Focus on Testing Proficiency in a Small Number of Dimensions. 
Testing a single dimension at a time could fall prey to a switch system, where a number of narrow systems are loosely coupled through a switch that selects the appropriate system for the current event. While events should be mostly self-contained, it may make sense to use the results of one event as the input for another.

Events Should All Be Measured Against a Common, Simple Model of Proficiency. A common scoring model supports more direct comparisons and categorizations of systems. We propose a simple five-level rating system for use across all events. Levels one through four will represent levels of human proficiency based on baseline data gathered from crowdsourced human competitions. Level five will represent superhuman proficiency, an X-factor over human level four, so there is a clear, unambiguous measure of achievement above the human level. Levels one through four could be mapped to human age ranges or levels of proficiency, though some tests will not map to human development and proficiency but to domain expertise. It will be the responsibility of the developers of each event to map their scoring algorithms to these levels, and the overall I-athlon score for any competing system will be a standard formula applied to attainment of these levels.

Multiple Events. The overall goal of this effort is to create broadly intelligent systems rather than narrow savants. As in the Olympic Decathlon, the total score across events should be more important than the score in any one event. The relative value of proficiency-level achievement needs to recognize that all events are not equal in intelligence value. This might be difficult to agree on, and even the Olympic Decathlon scoring system has evolved over time to reflect advances in performance.3

Event Tests Should Be Automatically Generated Without Significant Human Intervention. One of the major drawbacks to the current Turing test is its requirement for extensive human involvement in performing and evaluating the test. This requirement for direct human involvement effectively rules out highly desirable approaches to developing solutions that operate much faster than humans can interact with effectively. Another challenge in designing a good replacement for the Turing test is eliminating, as much as possible, the potential for someone to game the system. At the very least this means that specific test instances must not be reused except for repeatability and validation. Automatic generation of repeatable high-quality tests is a significant research area on its own, and this approach allows for a more efficient division of labor across the AI research community. Some researchers may focus on defining or improving events, possibly in collaboration with other disciplines like psychology or philosophy. Some may focus on developing test generators and scoring systems. Others may develop systems to compete in existing I-athlon events themselves. Generators should be able to reproduce specific tests using the same pseudorandom seed value so tests can be replayed for head-to-head competition and to allow massively parallel search and simulation of the solution space.

Event Tests Should Be Automatically Scored Without Significant Human Intervention. Deception of human judges became the primary strategy for the classic Turing test instead of honest attempts at demonstrating true artificial intelligence. Human bias on the part of the panel of judges also made the results of each run of the Turing test highly unpredictable and even suspect. To the degree possible, scoring should be consistent and unambiguous, with clearly defined performance criteria aligning with standard proficiency-level scoring. These scoring constraints should also significantly influence test design and generation itself. To prevent tampering and other fakery, all test generators and scoring systems should run in a common secure cloud, and all tests and results should be immutably archived there for future validation.

The Scoring System Should Reward Proficiency over …

Dimensions of Intelligence

Human intelligence has many facets and comes in many varieties and combinations. Philosophers, psychologists, and cognitive and computer scientists have debated the definition of intelligence for centuries, and there are many different factorings of what we here call the “dimensions of intelligence.” Our goal in this article is not to declare a definitive set of dimensions or even claim complete coverage of the various aspects of human intelligence. We take up this terminology to enable us to identify aspects of intelligence that might be tested separately and in combinations for the purpose of evaluating the capabilities of AI systems compared to humans. The dimensions listed below are not all at the same level of abstraction; indeed, proficiency at some dimensions will require proficiency at several others. We fully expect there to be debate over which aspects of intelligence should be tested for separately or in concert with others. Our goal here is to define an approach that moves the AI research community in the positive direction of coordinated effort toward achieving human-level AI in computer systems. As stated earlier, we believe reaching this goal will require such a coordinated effort, and a key aspect of coordination is the ability to assess incremental progress toward the goal in a commonly accepted manner.

What follows is a brief description of what we consider good candidates for I-athlon events (figure 2).

Figure 2. Good Candidates for the I-athlon Events. The figure shows an icon for each candidate dimension and event, including theory of mind, image understanding, diagram understanding, reasoning, speech generation, natural language understanding, natural language generation, collaboration, competition, creativity, reasoning under uncertainty, common sense, emotion, language translation, video understanding, initiative, learning, embodiment, audio understanding, diagram generation, imagination, planning, and interaction.

Image Understanding — Identify both the content and context of a given image: the objects, their attributes and relationships to each other in the image, and the implications of scene background and object arrangement.

Diagram Understanding — Given a diagram, describe each of the elements and their relationships, and identify the intended purpose/message of the diagram (infographic, instructional, directional, design, and others).

Speech Generation — Given a graph of concepts describing a situation, deliver an appropriate verbal/auditory presentation of the situation.

Natural Language Generation — Given nonverbal information, provide a natural language description sufficient to identify the source information among alternatives.
Natural Language Understanding — Given a verbal description of a situation, select the image that best describes the situation. Vary the amount of visual distraction.

Collaboration — Given descriptions of a collection of agents with differing capabilities, describe how to achieve one or more goals within varying constraints such as time, energy consumption, and cost.

Competition — Given two teams of agents, their capabilities, and a zero-sum goal, describe both offensive and defensive strategies for each team for winning, initially based on historical performance but eventually in near real time.

Reasoning — Given a set of states, constraints, and rules, answer questions about inferred states and relationships. Explain the answers. Variations require use of different logics and combinations of them.

Reasoning Under Uncertainty — Given a set of probable states, constraints, and rules, answer questions about inferred states and relationships. Explain the answers.

Creativity — Given a goal and a set of assets, construct a solution. Vary by number and variety of assets, complexity of goals, and environmental constraints. Alternatively, provide a working solution and attempt to improve it. Explain your solution.

Video Understanding — Given a video sequence, describe its contents, context, and flow of activity. Identify objects and characters, and their degree of agency and theory of mind. Predict the next activity for characters. Identify the purpose of the video (educational, how-to, sporting event, storytelling, news, and others). Answer questions about the video and explain the answers.

Initiative — Given a set of agents with different capabilities, goals, and attitudes, organize and direct a collaborative effort to achieve a goal. The key here is utilizing theory of mind to build and maintain the team throughout the activity.

Learning — Given a collection of natural language documents, successfully answer a series of questions about the information expressed in the documents. Vary question complexity and corpus size for different levels. Similar tests can use nonverbal or mixed-media sources.

Planning — Given a situation in an initial state, describe a plan to achieve a desired end state. Vary the number and variety of elements, and the complexity of initial and end states, as well as the constraints to be obeyed in the solution (for example, a time limit).

Common Sense Physics — Given a situation and a proposed change to the situation, describe the reactions to the change and the final state. Vary the complexity of the situation and the number of changes and their order.

Language Translation — Given text/speech in one language, translate it to another language. Vary by simplicity of text, number of idioms used, slang, and dialect.

Interaction — Given a partial dialogue transcript between two or more agents, predict what the next interactions in the exchange will be. Alternatively, given an anonymous collection of statements and a description of multiple agents, assign the statements to each agent and order the dialogue in time.

Embodiment — Given an embodiment with a collection of sensors and effectors, and an environment surrounding that body, perform a given task in the environment.
Vary the number and sophistication of sensors, effectors, and tasks, the complexity of the environment, and the time allowed. An added bonus is given for adapting to sensors/effectors added or disabled during the test.

Audio Understanding — Given an audio sequence, describe the scene with any objects, actions, and implications. Vary the length and clarity, along with the complexity of audio sources in the scene.

Diagram Generation — Given a verbal description of a process, generate a series of diagrams describing the process. Alternatively, use video input.

Imagination — Given a set of objects and agents from a common domain along with their attributes and capabilities, construct and describe a plausible scenario. Score higher for richer, more complex interactions involving more agents and objects. Alternatively, provide two or more sets of objects and agents from different domains and construct a plausible scenario incorporating both sets. Score higher for more interaction across domains.

Approach for Designing I-athlon Events

Given the requirement for automatic test generation and scoring, we have explored applying the CAPTCHA (von Ahn et al. 2003) approach to the general design of I-athlon events, and the results are intriguing. CAPTCHA, which stands for “Completely Automated Public Turing test to tell Computers and Humans Apart,” was originally conceived as a means to validate human users of websites while restricting programmatic access by bots. By generating warped or otherwise obscured images of words or alphanumeric sequences, the test required the machine or human desiring to access the website to correctly declare the original sequence of characters that was used to generate the test image, a task that was far beyond the ability of current optical character recognition (OCR) programs or other known image-processing algorithms. Over time, an arms race of sorts has evolved, with systems learning to crack various CAPTCHA schemes, which in turn has driven the development of more difficult CAPTCHA images. The effectiveness or security of CAPTCHA-based human interaction proofs (HIPs) is not our interest here, but an explicit side effect of the evolution of CAPTCHA technology is that once an existing CAPTCHA-style test is passed by a system, an advance has been achieved in AI. We feel that by applying this approach to other dimensions of intelligence we can motivate and sustain continual progress in achieving human-level AI and beyond.

There are several keys to developing a good CAPTCHA-style test, many of which have to do with its application as a cryptographic security measure. For our purposes, however, we are only concerned with the generalization of the approach for automated test generation and scoring where both humans and machines can compete directly, not for any security applications. For the original CAPTCHA images consisting of warped and obscured text, the generation script was designed to create any number of testable images, and the level of obscuration was carefully matched to what was relatively easy for most humans while being nearly impossible for machines. This pattern can be followed to develop I-athlon event tests by keeping the test scenario the same each time but varying the amount of information provided or the amount of noise in that information for each level of proficiency. This approach could be adapted for many of the dimensions of intelligence described above.
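As a concrete illustration of this pattern, the sketch below shows how a CAPTCHA-style I-athlon generator might be parameterized by a proficiency level and a pseudorandom seed so that every test instance is reproducible and automatically scorable. The toy counting event and all names in it are our own illustrative assumptions; the article does not define a generator API.

```python
# Illustrative sketch only: a seeded, difficulty-parameterized generator and
# scorer for a toy counting event. The event definition and names are
# hypothetical; the I-athlon article does not prescribe any specific API.
import random

OBJECTS = ["cat", "dog", "ball", "chair", "tree"]

def generate_test(seed, level):
    """Return (scene, question, answer). Higher levels add more distractor
    objects, mimicking CAPTCHA-style control of difficulty; the same seed and
    level always reproduce the same test instance."""
    rng = random.Random(seed * 1000 + level)       # reproducible per seed/level
    target = rng.choice(OBJECTS)
    count = rng.randint(1, 2 + level)              # the answer to recover
    distractors = rng.choices([o for o in OBJECTS if o != target], k=3 * level)
    scene = rng.sample([target] * count + distractors, k=count + len(distractors))
    question = f"How many {target}s are in the scene?"
    return scene, question, count

def score(predicted, answer):
    """Objective, automatic scoring: 1 for an exact answer, otherwise 0."""
    return int(predicted == answer)

# Same seed and level -> identical test, so runs can be archived and replayed.
scene, question, answer = generate_test(seed=42, level=2)
print(question, scene)
print("score:", score(predicted=answer, answer=answer))
```

Because the same seed and level always reproduce the same instance, tests of this kind can be archived, replayed head to head, and independently validated.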
For I-athlon events, the generation algorithms must also be able to produce any number of distinct test scenarios, but at different levels of sophistication that will require different levels of intelligence to succeed, four levels for human achievement and a fifth for superhuman. It would also be important for the generation algorithms to produce identical tests based from a given seed value. This would allow for efficient documentation of the tests generated as well as provide for experimental repeatability by different researchers. We anticipate that both the definition of each event, the design of its standard test generator, and the scoring system and levels will be active areas of research and debate. We include in this article a brief outline for several events to demonstrate the idea. Since the goal of the I-athlon is continual coordinated progress toward the goals of AI, all this effort adds significantly to our understanding of intelligence as well as our ability to add intelligence to computer systems. To support automatic test generation and scoring for an event, the key is to construct the test so that a small number of variables can programmatically drive a large number of variant test cases that directly map to clear levels of intelligent human ability. Providing human baselines for these events can be obtained through crowdsourcing, incentivizing large numbers of humans to take the tests, probably through mobile apps. This raises the requirement for an I-athlon event to provide appropriate interfaces for both human and machine contestants. Examples Some examples include events that involve simple planning, video understanding, embodiment, and object identification. A Simple Planning Event For example, consider an I-athlon event for planning based on a blocks world. An entire genre of twodimensional physics-based mobile apps already generates puzzles of this type for humans.4 Size, shape, initial location, and quantity of blocks for each test Articles can be varied, along with the complexity of the environment (gravity, wind, earthquakes) of the goal state. For a blocks world test, the goal would likely be reaching a certain height or shape with the available blocks, with extra points given for using fewer blocks to reach the goal in fewer attempts. Providing a completed structure as a goal might be too easy, unless ability to manipulate blocks through some virtual device is also a part of the test. Automatic scoring could be based on the test environment reaching a state that passes the constraints of the goal, which could be straightforward programming for a blocks world but likely more challenging for other aspects of intelligence. The test interface could be a touchbased graphical interface for humans and a REST API for machines. A Video Understanding Event Given a set of individual video frames in random order, discover the original order by analyzing content and context. Vary the “chunk size” of ordered frames randomized to produce the test. Decimate the quality of the video by masking or adding noise. Scoring could be based on the fraction of frames correctly assembled in order within a time limit, or the total time to complete the task. An Embodiment Event Given a sensor/effector API to an embodied agent in a virtual environment, complete a task in the environment using the sensory/motor abilities of the agent. Vary the number and kinds of sensors and effectors. Vary the complexity of the task and the nature of the environment. 
Environments could be as limited as ChipWits5 or as open-ended as MineCraft.6 A more sophisticated event would include the potential identification and use of tools, or the ability to adapt to gaining or losing sensors and effectors during the test.

An Object-Identification Event

Given recent advances in applying DNNs to object recognition, one might think this event would not be interesting. But human visual intelligence allows us to recognize millions of distinct objects in many thousands of classes, and the breadth of this ability is important for general intelligence. This event would generate test images by mixing and overlaying partial images from a very large collection of sources. Scoring would be based on the number of correctly identified objects per image and the time required per image and per test.

Competition Framework and Ecosystem

Our goal of motivating coordinated effort toward the goals of AI requires not only a standard set of events, test generators, and scorers, but also an overall framework for public competition and comparison of results in an unbiased manner. Given the large number of successful industrywide competitions in different areas of computer science and engineering, we propose taking key aspects of each and combining them into a shared platform of ongoing I-athlon competitions. Sites like Graph 5007 provide an excellent model for test generation and common scoring. A common cloud-hosted platform for developing and running events and for archiving tests and results will be required, even if competitors run their systems on their own resources. A central location for running the competitions would help limit bias and would also provide wider publicity for successes. Having such a persistent platform, along with automated test generation and scoring, would support the concept of continuous competition, allowing new entrants at any time and providing an easy on-ramp to the AI research community. Continuous competitions can prequalify participants for head-to-head playoffs held concurrently with major AI conferences, similar to the RoboCup8 competitions. In addition to the professional and graduate-level research communities, such a framework could support competitions at the undergraduate and secondary school levels. Extensive programming and engineering communities have been created using this approach, with TopCoder9 and First Robotics10 as prime examples. These not only serve a valuable mentoring role in the development of skills, but also recruit high-potential students into the AI research effort. Incentives beyond eminence and skill building also have proven track records for motivating progress. The X-Prize11 approach has proven highly successful in focusing research attention, as have the DARPA Challenges12 for self-driving vehicles and legged robots. Presenting a unified, organized framework for progress in AI would go a long way toward attracting this kind of incentive funding. The division of labor made possible by the proposed approach could fit nicely within the research agendas of numerous universities at all levels, supporting common curriculum development in AI and supporting research programs targeted at different aspects of the I-athlon ecosystem.

Call to Action

We welcome feedback and collaboration from the broad research community to develop and administer a continuing series of I-athlon events according to the model proposed in this article.
Our ultimate goal is to motivate the AI research community to understand and develop research agendas that get to the core of general machine intelligence. As we know from the history of AI, this is such a complex problem, with so many yet-unknown dimensions, that the only way to make measurable progress is to develop rigorous, practical, yet flexible tests that require the use of multiple dimensions. The tests themselves can evolve as we understand the nature of intelligence. We look forward to making progress in the AI field through such an activity.

Notes
1. See, for example, the Kapor-Kurzweil bet: longbets.org/1/#terms.
2. www.olympic.org/athletics-decathlon-men.
3. www.decathlon2000.com/upload/file/pdf/scoringtables.pdf.
4. For example, en.m.wikipedia.org/wiki/The_Incredible_Machine_%28series%29, www.crayonphysics.com.
5. www.chipwits.com/.
6. minecraft.net.
7. www.graph500.org.
8. www.robocup.org.
9. www.topcoder.com.
10. www.usfirst.org.
11. www.xprize.org.
12. www.darpa.mil/about/history/archives.aspx.

References
Turing, A. 1950. Computing Machinery and Intelligence. Mind 59(236): 433–460. dx.doi.org/10.1093/mind/LIX.236.433
von Ahn, L.; Blum, M.; Hopper, N.; and Langford, J. 2003. CAPTCHA: Using Hard AI Problems for Security. In Proceedings of the Annual International Conference on the Theory and Applications of Cryptographic Techniques (EUROCRYPT-03). Carson City, NV: International Association for Cryptologic Research. dx.doi.org/10.1007/3-540-39200-9_18

Sam S. Adams ([email protected]) works for IBM Research and was appointed one of IBM's first distinguished engineers in 1996. His far-ranging contributions include founding IBM's first object technology practice, authoring IBM's XML technical strategy, originating the concept of service-oriented architecture, and pioneering work in self-configuring and autonomic systems, artificial general intelligence, end-user mashup programming, massively parallel many-core programming, petascale analytics, and data-centered systems. Adams is currently working on cloud-scale cognitive architectures for the Internet of Things, and has particular interests in artificial consciousness and autocognitive systems.

Guruduth Banavar, as vice president of cognitive computing at IBM Research, currently leads a global team of researchers creating the next generation of IBM's Watson systems — cognitive systems that learn, reason, and interact naturally with people to perform a variety of knowledge-based tasks. Previously, as the chief technical officer of IBM's Smarter Cities initiative, he designed and implemented big data and analytics-based systems to make cities more livable and sustainable.
Prior to that, he was the director of IBM Research in India, which he helped establish as a preeminent center for services research and mobile computing. He has published extensively, holds more than 25 patents, and his work has been featured in the New York Times, the Wall Street Journal, the Economist, and other international media. He received a Ph.D. from the University of Utah before joining IBM's Thomas J. Watson Research Center in 1995.

Murray Campbell is a principal research staff member at the IBM Thomas J. Watson Research Center in Yorktown Heights, NY. He was a member of the team that developed Deep Blue, the first computer to defeat the human world chess champion in a match. Campbell has conducted research in artificial intelligence and computer chess, with numerous publications and competitive victories, including eight computer chess championships. This culminated in the 1997 victory of the Deep Blue chess computer, for which he was awarded the Fredkin Prize and the Allen Newell Research Excellence Medal. He has a Ph.D. in computer science from Carnegie Mellon University, and is an ACM Distinguished Scientist and a Fellow of the Association for the Advancement of Artificial Intelligence. He currently manages the AI and Optimization Department at IBM Research.

Software Social Organisms: Implications for Measuring AI Progress
Kenneth D. Forbus

In this article I argue that achieving human-level AI is equivalent to learning how to create sufficiently smart software social organisms. This implies that no single test will be sufficient to measure progress. Instead, evaluations should be organized around showing increasing abilities to participate in our culture, as apprentices. This provides multiple dimensions within which progress can be measured, including how well different interaction modalities can be used, what range of domains can be tackled, and what human-normed levels of knowledge such systems are able to acquire, as well as others. I begin by motivating the idea of software social organisms, drawing on ideas from other areas of cognitive science, and provide an analysis of the substrate capabilities that are needed in social organisms in terms closer to what is needed for computational modeling. Finally, the implications for evaluation are discussed.

Today's AI systems can be remarkably effective. They can solve planning and scheduling problems that are beyond what unaided people can accomplish, sift through mountains of data (both structured and unstructured) to help us find answers, and robustly translate speech and handwriting into text. But these systems are carefully crafted for specific purposes, created and maintained by highly trained personnel who are experts in artificial intelligence and machine learning. There has been much less progress on building general-purpose AI systems, which could be trained and tasked to handle multiple jobs. Indeed, in my experience, today's general-purpose AI systems tend to skate a very narrow line between catatonia and attention deficit disorder. People and other mammals, by contrast, are not like that. Consider dogs. A dog can be taught to do tasks like shaking hands, herding sheep, guarding a perimeter, and helping a blind person maneuver through the world. Instructing dogs can be done by people who don't have privileged access to the internals of their minds. Dogs don't blue screen. What if AI systems were as robust, trainable, and taskable as dogs? That would be a revolution in artificial intelligence.
In my group's research on the companion cognitive architecture (Forbus et al. 2009), we are working toward such a revolution. Our approach is to try to build software social organisms. By that we mean four things. First, companions should be able to work with people using natural interaction modalities. Our focus so far has been on natural language (for example, learning by reading [Forbus et al. 2007; Barbella and Forbus 2011]) and sketch understanding (Forbus et al. 2011). Second, companions should be able to learn and adapt over extended periods of time. This includes formulating their own learning goals and pursuing them, in order to improve themselves. Third, companions should be able to maintain themselves. This does not mean a 24-hour, 7-day-a-week operation — even people need to sleep, to consolidate learning. But they should not need AI experts peering into their internal operations just to keep them going. Fourth, people should be able to relate to companions as collaborators, rather than tools. This requires companions to learn about the people that they are working with, and to build relationships with them that are effective over the long term. Just to be clear, our group is a long way from achieving these goals. And this way of looking at the problems is far from standard in AI today. Consider, for example, IBM's Watson. While extremely adept at factoid question answering, Watson would not be considered an organism by these criteria. It showed a groundbreaking ability to do broad natural language processing, albeit staying at a fairly shallow, syntactic level much of the time. But it did not formulate its own learning goals, nor did it maintain itself. It required a team of AI experts inspecting its internals constantly through development, adding and removing component algorithms and input texts by hand (Baker 2011). Other examples are cognitive architectures that started as models of skill learning, like ACT-R (Anderson and Lebiere 1998) or SOAR (Laird 2012). Such architectures have done an impressive job of modeling a variety of psychological phenomena, and have also been used successfully in multiple performance-oriented systems. However, using them typically involves generating by hand a model of a specific cognitive phenomenon, such as learning to solve algebraic equations. The model is typically expressed in the rule language of the architecture, although for some experiments simplified English is used to provide declarative knowledge that the system itself proceduralizes. The model is run multiple times to satisfy the conditions of the experiment, and then it is turned off. More ambitious uses (for example, as pilots in simulated training exercises [Laird et al. 1998], or as coaches/docents [Swartout et al. 2013]) work in narrow domains, for short periods of time, and with most of the models being generated by hand. Creating systems that live and learn over extended periods of time on their own is beyond the state of the art today. Recently, more people have started to work on aspects of this. Research on interactive task learning (Hinrichs and Forbus 2014, Kirk and Laird 2014) is directly concerned with the first two criteria above, and to some degree the third. Interactive task learning is a sweet spot in this research path.
But I think the importance of the shift from treating software as tools to treating it as collaborators should not be underestimated, for both scientific and practical reasons. The scientific reasons are explained below. As a practical matter, the problems humanity faces are growing more complex, while human cognitive capacities remain constant. Working together fluently in teams with systems that are enough like us to be trusted, and that have complementary strengths and weaknesses, could help us solve problems that are beyond our reach today. The companion cognitive architecture incorporates two other scientific hypotheses. The first is that analogical reasoning and learning, over structured, relational representations, are ubiquitous in human cognition (Gentner 2003). There is evidence that the comparison process defined by Gentner's structure-mapping theory (Gentner 1983) operates across a span of phenomena that includes high-level vision and auditory processing, inductive reasoning, problem solving, and conceptual change. The second hypothesis is that qualitative representations are central in human cognition. They provide a level of description that is appropriate for commonsense reasoning, grounding for professional knowledge of continuous systems (for example, for scientists, engineers, and analysts), and a bridge between perception and cognition (Forbus 2011). These two hypotheses are synergistic; for example, qualitative representations provide excellent grist for analogical learning and reasoning. These two specific hypotheses might be correct or might be wrong. But independent of them, I think the concept of software social organisms is crucial: it reframes what we mean by human-level AI, and does so in a way that suggests better measurements than we have been using. So let us unpack this idea further.

Why Software Social Organisms?

I claim that human-level AI is equivalent to sufficiently smart software social organisms. I start by motivating the construction of organisms, then argue that they need to be social organisms. A specification for the substrate capabilities that are needed to be a social organism is proposed, based on evidence from the cognitive science literature.

Why Build Organisms?

There are two main reasons for thinking about building AI systems in terms of constructing software organisms. The first is autonomy. We have our own goals to pursue, in addition to those provided by others. We take those external goals as suggestions, rather than as commands that we run as programs in our heads. This is a crucial difference between people and today's AI systems. Most AI systems today can't be said to have an inner life, a mix of internally and externally generated plans and goals, whose pursuit depends on their own estimation of what they should be doing. The ability to punt on an activity that is fruitless, and to come up with better things to do, is surely part of the robustness that mammals exhibit. There has been some promising work on metacognition that is starting to address these issues (Cox and Raja 2011), but the gap between human abilities and AI systems remains wide.1 Another aspect of autonomy is the separation of internal versus external representations. We do not have direct access to the internal representations of children or our collaborators. (Cognitive science would be radically simpler if we did.) Instead, we communicate through a range of modalities, including natural language, sketching, gesture, and physical demonstrations.
These work because the recipient is assumed to have enough smarts to figure them out. The imperfections of such communications are well known, that is, the joint construction of context in natural language dialogue involves a high fraction of exchanges that are diagnosing and repairing miscommunications. To be sure, there are strong relationships between internal and external representations: Vygotsky (1962), for example, argues that much of thought is inner speech, which is learned from external speech. But managing that relationship for itself is one of the jobs of an intelligent organism. The second reason for building organisms is adaptation. Organisms adapt. We learn incrementally and incidentally in everyday life constantly. We learn about the world, including learning on the job. We learn things about the people around us, both people we work and play with and people who are part of our culture that we have never interacted with and likely never will (for example, political figures, celebrities). We learn about ourselves as well: what we like and dislike, how to optimize our daily routines, what we are good at, bad at, and where we’d like to improve. We build up this knowledge over days, weeks, months, and years. We are remarkably good at this, adapting stably — very few people go off the rails into insanity. I know of no system that learns in a broad range of domains over even days without human supervision by people who understand its internals. That is radically different from people, who get by with feedback from the world and from other people who have no privileged access to their internals. Having autonomy and adaptability covers the second and third desiderata, and can be thought of as an elaboration of what is involved in achieving them. Communication through natural modalities is implied by both, thereby covering the first at least partly. But to complete the argument for the first, and to handle the fourth (collaborators), we need to consider why we want social organisms. Why Social Organisms? People are social animals. It has been proposed (for example, Tomasello [2001]) that, in evolutionary terms, being social provides a strong selection bias toward intelligence. Social animals have to track the relationships between themselves and others of their species. Being social requires figuring out who are your friends and allies, versus your competitors and enemies. Relationships need to be tracked over time, which involves observing how others are interacting to build and maintain models of their relationships. Sociality gives rise to friendship and helping, as well as to deceit and competition. These cognitive challenges seem to be strong drivers toward intelligence, as most social creatures tend to be more intelligent than those that are not, with dolphins, crows, and dogs being well-known examples. A second reason for focusing on social organisms is that much of what people learn is from interactions with other people and their culture (Vygotsky 1962). To be sure, we learn much about the basic properties of materials and objects through physical manipulation and other experiences in the world. But we can all think about things that we have never experienced. None reading this lived through the American Revolutionary War, for example, nor did they watch the Galápagos Islands form with their own eyes. Yet we all can have reasonably good models of these things. 
Moreover, even our knowledge of the physical world has substantial contributions from our culture: how we carve the mechanisms underlying events into processes is enshrined in natural language, as are aspects of how we carve visual scenes up into linguistic descriptions (for example, Coventry and Garrod [2004]). A number of AI researchers have proposed that stories are central to human intelligence (Schank 1996, Winston 2012). The attraction and power of stories is that they can leverage the same cognitive capacities that we use to understand others, and provide models that can be used to handle novel situations. Moral instruction, for example, often relies on stories. Other AI researchers have directly tackled how to build systems that can cooperate and collaborate with people (Allen et al. 2007; Grosz, Hunsberger, and Kraus 1999). These lines of research provide important ingredients for building social organisms, but much work remains to be done. Hence my claim that human-level AI systems will simply be sufficiently smart software social organisms. By sufficiently smart, I mean capable of learning to perform a broad range of tasks that people perform, with similar amounts of input data and instruction, arriving at the same or better levels of performance. Does it have to be social? If not, it could not discuss its plans, goals, or intentions, and could not learn from people using natural interaction modalities. Does it have to be an organism? If not, it will not be capable of maintaining itself, which is something that people plainly do.

Substrate Capabilities for Social Organisms

This equivalence makes understanding what is needed to create social organisms more urgent. To that end, here is a list of substrate capacities that I believe will be needed to create human-level social organisms. These are all graded dimensions, which means that incremental progress measures can be formulated and used as dimensions for evaluation.

(1) Autonomy. They will have their own needs, drives, and capabilities for acting and learning. What should those needs and drives be? That will vary, based on the niche that an organism is operating in. But if we are wise, we will include in their makeup the desire to be good moral actors, as determined by the culture they are part of, and the view that good relationships with humans are important to their own happiness.

(2) Operates in environments that support shared focus. That is, each participant has some information about what others can sense, and participants can make their focus of attention known to each other easily. People have many ways of drawing attention to people, places, or things, such as talking, pointing, gesturing, erecting signs, and winking. But even with disembodied software, there are opportunities for shared focus, for example, selection mechanisms commonly used in GUIs, as well as speech and text. Progress in creating virtual humans (for example, Bohus and Horvitz [2011] and Swartout et al. [2013]) is increasing the interactive bandwidth, as is progress in human-robot interaction (for example, Scheutz et al. [2013]).

(3) Natural language understanding and generation capabilities sufficient to express goals, plans, beliefs, desires, and hypotheticals. Without this capability, building a shared understanding of a situation and formulating joint plans becomes much more difficult.

(4) Ability to build models of the intentions of others.
This implies learning the types of goals they can have, and how available actions feed into those goals. It also requires models of needs and drives as the wellsprings of particular goals. This is the basis for modeling social relationships.

(5) Strong interest in interacting with other social organisms (for example, people), especially including helping and teaching. Teaching well requires building up models of what others know and tracking their progress. There is ample evidence that other animals learn by observation and imitation. The closest thing to teaching found in other animals so far is that, in some species, parents bring increasingly challenging prey to their young as they grow. By contrast, human children will happily help adults, given the opportunity (for example, Liszkowski, Carpenter, and Tomasello [2008]).

This list provides a road map for developing social organisms of varying degrees of complexity. Simpler environmental niches require less in terms of reference to shared focus, and diminished scope for beliefs, plans, and goals, thereby providing more tractable test beds for research. I view Allen's TRIPS system (Ferguson and Allen 1998), along with virtual humans research (Bohus and Horvitz 2011, Swartout et al. 2013), as examples of such test beds. As AI capabilities increase, so can the niches, until ultimately the worlds they operate in are coextensive with our own.

Implications for Measuring Progress

This model for human-level AI has several implications for measuring progress. First, it should be clear that no single test will work. No single test can measure adaptability and breadth. Single tests can be gamed, by systems that share few of the human characteristics above. Believability, which is what the Turing test is about, is particularly problematic since people tend to treat things as social beings (Reeves and Nass 2003). What should we do instead? I believe that the best approach is to evaluate AI systems by their ability to participate in our culture. This means having AI systems that are doing some form of work, with roles and responsibilities, interacting with people appropriately. While doing this, they need to adapt and learn, about their work, about others, and about themselves. And they need to do so without AI experts constantly fiddling with their internals. I believe the idea of apprenticeship is an extremely productive approach for framing such systems. Apprenticeship provides a natural trajectory for bringing people into a role. Apprentices start as students, with lots of book learning and interaction. There are explicit lessons and tests to gauge learning. But there is also performance, at first with simple subtasks. As an apprentice learns, their range of responsibilities is expanded to include joint work, where roles are negotiated. Finally, the apprentice graduates to autonomous operation within a community, performing well on its own, but also interacting with others at the same level. Apprentices do not have to be perfect: They can ask for help, and help others in turn. And in time, they start training their own apprentices. Apprenticeship can be used in a wide variety of settings. For example, we are using this approach in working with companions in a strategy game, where the game world provides a rich simulation and source of problems and decisions to make (Hinrichs and Forbus 2015). Robotics-oriented researchers might use assembly tasks, or flying survey or rescue drones in environments of ever-increasing complexity.
An example of a challenge area for evaluating AIs is science learning and teaching. The scientific method and its products are among the highest achievements of human culture. Ultimately, one job of AIs should be helping people learn science, in any domain and at any level. The Science Test Working Group2 has proposed the following trajectory as a way of incrementally measuring progress. First, evaluate the ability of AI systems to answer questions about science, using standardized human-normed tests, such as the New York Regents Science Tests, which are available for multiple years and multiple levels. Second, evaluate the ability of AI systems to learn new scientific concepts, by reading, watching videos, and interacting with people. Third, evaluate the ability of AI systems to communicate what they know about science across multiple domains and at multiple levels. We conjecture that this provides a scalable trajectory for evaluating AI systems, with the potential for incremental and increasing benefits for society as progress is made. This challenge illustrates how useful the apprenticeship approach can be for evaluation. The first phase is aimed at evaluating systems as students, ensuring that they know enough to contribute. The middle phase focuses on being able to contribute, albeit in a limited way. The final phase is focused on AIs becoming practitioners. Notice that in each phase there are multiple dimensions of scalability: number of domains, level of knowledge (for example, grade level), and modalities needed to communicate. (We return to the question of scalable evaluation dimensions more generally below.) Progress across these dimensions need not be uniform: some groups might focus entirely on maximizing domain coverage, while others might choose to stick with a single domain but start to focus early on tutoring within that domain. This provides a rich tapestry of graded challenges. Moreover, incremental progress will lead to systems that could improve education.

Scalable Evaluation Dimensions

A productive framework should provide a natural set of dimensions along which progress can be made and measured. Here are some suggestions implied by the software social organism approach.

Natural Interaction Modalities
Text, speech, sketching, vision, and mobility are all capabilities that can be evaluated. Text can be easier than speech, and sketching can be viewed as a simplified form of vision.

Initial Knowledge Endowment
How much of what a system knows is learned by the system itself, versus what it has to begin with? What the absolute minimum initial endowment might be is certainly a fascinating scientific question, but it is probably best answered by starting out with substantially more knowledge and learning how human-level capabilities can be reached. Understanding those pathways should better enable us to understand what minimal subsets can work. It is very seductive to start from scratch, and perhaps easier, if it could be made to work. But the past 50 years of research suggests that this is much harder than it seems: Look at the various "robot baby" projects that have tried that. Arguably, given that IBM's Watson used more than 900 million syntactic frames as part of its knowledge base, the 5 million facts encoded in ResearchCyc might well be considered a small starting endowment.

Level of Domain Knowledge and Skill
Prior work on learning apprentices (for example, Mitchell et al. [1994]) focused on systems that helped people perform better in particular domains.
They started with much of the domain knowledge that they would need, and learned more about how to operate in that domain. In qualitative reasoning, many systems have been built that incorporate expert-level models for particular domains (Forbus 2011). Breadth is now the challenge. Consider what fourth graders know about science (Clark 2015), and the kinds of social interactions they can have with people. AI systems are still far from that level of accomplishment, nor can they grow into expertise by building on their everyday knowledge, as people seem to do (Forbus and Gentner 1997). Range of Tasks the System Is Responsible For Most AI systems have focused on single tasks. Being able to accomplish multiple tasks with the same system has been one of the goals of research on cognitive architecture, and with interactive task learning, the focus is shifting to being able to instruct systems in new tasks, an important step toward building systems capable enough to be apprentices. Learning Abilities Software social organisms need to learn about their jobs, the organisms (people and machines) that they work with, and about themselves. While some problems may well require massive amounts of data and deep learning (for example, speech recognition [Graves, Mohamed, and Hinton 2013]), people are capable of learning many things with far fewer examples. Office assistants who required, for example, 10,000 examples of how to fill out a form before being able to do it themselves would not last long in any reasonable organization. There are many circumstances where children learn rapidly (for example, fast mapping in human word learning [Carey 2010]), and understanding when this can be done, and how to do it, is an important question. Summary I have argued that the goal of humanlevel AI can be equivalently expressed as creating sufficiently smart software social organisms. This equivalence is useful because the latter formulation makes strong suggestions about how such systems should be evaluated. No single test is enough, something which has become very apparent from the limitations of Turing’s test, which brought about the workshop that motivated the talk that this article was based on. More positively, it provides a framework for organizing a battery of tests, namely the apprenticeship trajectory. An apprentice is initially a student, learning from instructors through carefully designed exercises. Apprentices start working as assistants to a mentor, with increasing responsibility as they learn. Eventually they start working autonomously, communicating with others at their same level, and even taking on their own apprentices. If we can learn how to build AI systems with these capabilities, it would be revolutionary. I hope the substrate capabilities for social organisms proposed here will encourage others to undertake this kind of research. The fantasy of the Turing test, and many of its proposed replacements, is that a single simple test can be found for measuring progress toward humanlevel AI. Part of the attraction of this view is that the alternative is both difficult and expensive. Many tests, involving multiple capabilities and interactions over time with people, all require substantial investments in research, engineering, and evaluation. But given that we are tackling one of the deepest questions ever asked by humanity, that is, what is mind, this should not be too surprising. And I believe it will be an extraordinarily productive investment. 
Acknowledgements
I thank Dedre Gentner, Tom Hinrichs, Mike Tomasello, Herb Clark, and the Science Test Working Group for many helpful discussions and suggestions. This research is sponsored by the Socio-Cognitive Architectures and the Machine Learning, Reasoning, and Intelligence Programs of the Office of Naval Research and by the Computational and Machine Intelligence Program of the Air Force Office of Scientific Research.

Notes
1. Part of the gap, I believe, is the dearth of broad and rich representations in most AI systems, exacerbated by our failure as a field to embrace existing off-the-shelf resources such as ResearchCyc.
2. The Science Test Working Group includes Peter Clark, Barbara Grosz, Dragos Margineantu, Christian Lebiere, Chen Liang, Jim Spohrer, Melanie Swan, and myself. It is one of several groups working on tests that, collectively, should provide better ways of measuring progress in AI.

References
Allen, J.; Chambers, N.; Ferguson, G.; Galescu, L.; Jung, H.; Swift, M.; and Taysom, W. 2007. PLOW: A Collaborative Task Learning Agent. In Proceedings of the Twenty-Second AAAI Conference on Artificial Intelligence. Palo Alto, CA: AAAI Press.
Anderson, J. R., and Lebiere, C. 1998. The Atomic Components of Thought. Mahwah, NJ: Erlbaum.
Baker, S. 2011. Final Jeopardy! Man Versus Machine and the Quest to Know Everything. New York: Houghton Mifflin Harcourt.
Barbella, D., and Forbus, K. 2011. Analogical Dialogue Acts: Supporting Learning by Reading Analogies in Instructional Texts. In Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence, 1429–1435. Palo Alto, CA: AAAI Press.
Bohus, D., and Horvitz, E. 2011. Multiparty Turn Taking in Situated Dialog: Study, Lessons, and Directions. In Proceedings of the SIGDIAL 2011 Conference, The 12th Annual Meeting of the Special Interest Group on Discourse and Dialogue. Stroudsburg, PA: Association for Computational Linguistics.
Carey, S. 2010. Beyond Fast Mapping. Language Learning and Development 6(3): 184–205. dx.doi.org/10.1080/15475441.2010.484379
Clark, P. 2015. Elementary School Science and Math Tests as a Driver for AI: Take the Aristo Challenge! In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, 4019–4021. Palo Alto, CA: AAAI Press.
Coventry, K., and Garrod, S. 2004. Saying, Seeing, and Acting: The Psychological Semantics of Spatial Prepositions. London: Psychology Press.
Cox, M., and Raja, A. 2011. Metareasoning: Thinking about Thinking. Cambridge, MA: The MIT Press. dx.doi.org/10.7551/mitpress/9780262014809.001.0001
Ferguson, G., and Allen, J. 1998. TRIPS: An Intelligent Integrated Problem-Solving Assistant. In Proceedings of the Fifteenth National Conference on Artificial Intelligence and Tenth Innovative Applications of Artificial Intelligence Conference, 567–573. Menlo Park, CA: AAAI Press.
Forbus, K. 2011. Qualitative Modeling. Wiley Interdisciplinary Reviews: Cognitive Science 2(4) (July/August): 374–391. dx.doi.org/10.1002/wcs.115
Forbus, K., and Gentner, D. 1997. Qualitative Mental Models: Simulations or Memories? Paper presented at the Eleventh International Workshop on Qualitative Reasoning, Cortona, Italy, June 3–6.
Forbus, K.; Riesbeck, C.; Birnbaum, L.; Livingston, K.; Sharma, A.; and Ureel, L. 2007. Integrating Natural Language, Knowledge Representation and Reasoning, and Analogical Processing to Learn by Reading. In Proceedings of the Twenty-Second Conference on Artificial Intelligence. Palo Alto, CA: AAAI Press.
Forbus, K.
D.; Klenk, M.; and Hinrichs, T. 2009. Companion Cognitive Systems: Design Goals and Lessons Learned So Far. IEEE Intelligent Systems 24(4): 36–46. dx.doi.org/10.1109/MIS.2009.71 Forbus, K. D.; Usher, J.; Lovett, A.; Lockwood, K.; and Wetzel, J. 2011. Cogsketch: Sketch Understanding for Cognitive Science Research and for Education. Topics in Cognitive Science 3(4): 648–666. dx.doi.org/ 10.1111/j.1756-8765.2011.01149.x Gentner, D. 1983. Structure-Mapping: A Theoretical Framework for Analogy. Cognitive Science 72(2): 155–170. Gentner, D. 2003. Why We’re So Smart. In Language in Mind, ed. D. Gentner and S. Goldin-Meadow. Cambridge, MA: The MIT Press. dx.doi.org/10.1207/s15516709cog 0702_3 Graves, A.; Mohamed, A.; and Hinton, G. 2013. Speech Recognition with Deep Recurrent Neural Networks. In Proceedings of the 2013 IEEE International Conference on Acoustic Speech and Signal Processing. Piscataway, NJ: Institute for Electrical and Electronics Engineers. dx.doi.org/10.1109/ ICASSP.2013.6638947 Grosz, B.; Hunsberger, L.; and Kraus, S. 1999. Planning and Acting Together. AI Magazine 20(4): 23–34 Hinrichs, T., and Forbus, K. 2014. X Goes First: Teaching Simple Games through Multimodal Interaction. Advances in Cognitive Systems 3: 31–46. Hinrichs, T., and Forbus, K. 2015. Qualitative Models for Strategic Planning. Proceedings of the Third Annual Conference on Advances in Cognitive Systems, Atlanta, May. Kirk, J., and Laird, J. 2014 Interactive Task Learning for Simple Games. Advances in Cognitive Systems 3: 13–30. Laird, J. 2012. The SOAR Cognitive Architecture. Cambridge, MA: The MIT Press Laird, J.; Coulter, K.; Jones, R.; Kenney, P.; Koss, F.; Nielsen, P. 1998. Integrating Intelligent Computer Generated Forces in Dis- tributed Simulations: TacAir-SOAR in STOW-97. Paper presented at the Simulation Interoperability Workshop, 9–13 March, Orlando, FL. Lizkowski, U.; Carpenter, M.; and Tomasello, M. 2008. Twelve-Month-Olds Communicate Helpfully and Appropriately for Knowledgeable and Ignorant Partners. Cognition 108(3): 732–739. dx.doi.org/10.1016/ j.cognition.2008.06.013 Mitchell, T.; Caruana, R.; Freitag, D.; McDermott, J.; and Zabowski, D. 1994. Experience with a Personal Learning Assistant. Communications of the ACM 37(7): 80–91. dx.doi.org/10.1145/176789.176798 Reeves, B., and Nass, C. 2003. The Media Equation: How People Treat Computers, Television, and New Media Like Real People and Places. Palo Alto, CA: Center for the Study of Language and Information, Stanford University. Schank, R. 1996. Tell Me a Story: Narrative and Intelligence. Evanston, IL: Northwestern University Press. Scheutz, M.; Briggs, G.; Cantrell, R.; Krause, E.; Williams, T.; and Veale, R. 2013. Novel Mechanisms for Natural Human-Robot Interactions in the DIARC Architecture. In Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence. Palo Alto, CA: AAAI Press. Swartout, W.; Artstein, R.; Forbell, E.; Foutz, S.; Lane, H.; Lange, B.; Morie, J.; Noren, D.; Rizzo, A.; Traum, D. 2013. Virtual Humans for Learning. AI Magazine 34(4): 13–30. Tomasello, M. 2001. The Cultural Origins of Human Cognition. Cambridge, MA: Harvard University Press. Vygotsky, L. 1962. Thought and Language. Cambridge, MA: The MIT Press. dx.doi.org/ 10.1037/11193-000 Winston, P. 2012. The Right Way. Advances in Cognitive Systems 1: 23–36. Kenneth D. Forbus is the Walter P. Murphy Professor of Computer Science and Professor of Education at Northwestern University. 
His research interests include qualitative reasoning, analogy, spatial reasoning and learning, sketch understanding, natural language understanding, cognitive architecture, reasoning system design, intelligent educational software, and the use of AI in interactive entertainment. He is a fellow of AAAI, ACM, and the Cognitive Science Society.

Principles for Designing an AI Competition, or Why the Turing Test Fails as an Inducement Prize
Stuart M. Shieber

If the artificial intelligence research community is to have a challenge problem as an incentive for research, as many have called for, it behooves us to learn the principles of past successful inducement prize competitions. Those principles argue against the Turing test proper as an appropriate task, despite its appropriateness as a criterion (perhaps the only one) for attributing intelligence to a machine.

There has been a spate recently of calls for replacements for the Turing test. Gary Marcus in The New Yorker asks "What Comes After the Turing Test?" and wants "to update a sixty-four-year-old test for the modern era" (Marcus 2014). Moshe Vardi in his Communications of the ACM article "Would Turing Have Passed the Turing Test?" opines that "It's time to consider the Imitation Game as just a game" (Vardi 2014). The popular media recommend that we "Forget the Turing Test" and replace it with a "better way to measure intelligence" (Locke 2014). Behind the chorus of requests is an understanding that the test has served the field of artificial intelligence poorly as a challenge problem to guide research. This shouldn't be surprising: the test wasn't proposed by Turing to serve that purpose. Turing's Mind paper (1950), in which he defines what we now call the Turing test, concludes with a short discussion of research strategy toward machine intelligence. What he says is this:

We may hope that machines will eventually compete with men in all purely intellectual fields. But which are the best ones to start with? Even this is a difficult decision. . . . I do not know what the right answer is, but I think [various] approaches should be tried. (Turing [1950], page 460)

What he does not say is that we should be running Turing tests.
These calls to replace the pastime of Turing-test-like competitions are really pleas for a new inducement prize contest. Inducement prize contests are award programs established to induce people to solve a problem of importance by directly rewarding the solver, and the idea has a long history in other research fields — navigation, aviation, and autonomous vehicles, for instance. If we are to establish an inducement prize contest for artificial intelligence, it behooves us to learn from the experience of the previous centuries of such contests to design our contest in a way that is likely to have the intended effect. In this article, I adduce five principles that an inducement prize contest for AI should possess: occasionality of occurrence, flexibility of award, transparency of result, absoluteness of criteria, and reasonableness of goal. Any proposal for an alternative competition, moving “beyond the Turing test” in the language of the January 2015 Association for the Advancement of Artificial Intelligence workshop,1 ought to be evaluated according to these principles. The Turing test itself fails the reasonableness principle, and its implementations to date in various competitions have failed the absoluteness, occasionality, flexibility, and transparency principles, a clean sweep of inappropriateness for an AI inducement prize contest. Creative thinking will be needed to generate a contest design satisfying these principles. Inducement Prize Contests There is a long history of inducement prizes in a broad range of areas, including: navigation (the 1714 Longitude Prize), chemistry (the 1783 French Academy of Sciences prize for soda ash production), automotive transportation (the 1895 Great Chicago Auto 92 AI MAGAZINE Race), aviation (numerous early 20th century prizes culminating in the 1919 Orteig Prize for nonstop transatlantic flight; the 1959 Kremer Prize for human-powered flight), space exploration (the 1996 Ansari X Prize for reusable manned spacecraft), and autonomous vehicles (the 2004 DARPA Grand Challenge). Inducement prizes are typically offered on the not unreasonable assumption that they provide a highly financially leveraged method for achieving progress in the award area. Estimates of the leverage have ranged up to a factor of 50 (Schroeder 2004). There have been two types of competitions related to AI,2 though neither type serves well as an inducement prize contest. The first type of competition comprises regularly scheduled enactments of (or at least inspired by) the Turing test. The most well known is the Loebner Prize Competition, held annually, though other similar competitions have been held, such as the June 2014 Royal-Society-sponsored competition in London, whose organizers erroneously claimed that entrant Eugene Goostman had passed the Turing test (Shieber 2014a). Although Hugh Loebner billed his eponymous prize as a curative for the astonishing claim that “People had been discussing AI, but nobody was doing anything about it” (Lindquist 1991), his competition is not set up to provide appropriate incentives and has not engendered any progress in the area so far as I can tell (Shieber 1994). In the second type of competition research funders, especially U.S. government funders, like DARPA, NSF, and NIST, have funded regular (typically annual) “bakeoffs” among funded research groups working on particular applications — speech recognition, message understanding, question answering, and so forth. 
These competitions have been spectacularly successful at generating consistent incremental progress on the measured objectives, speech recognition error rate reduction, for instance. Such competitions are evidently effective at generating improvements on concrete engineering tasks over time. They have, however, had the perverse effect of reducing the diversity of approaches pursued and generally increasing risk aversion among research projects.

Principles

An inducement prize contest for AI has the potential to promote research on hard AI problems without the frailties of these previous competitions. We, the AI community, would like a competition to promote creativity, reward risk, and curtail incrementalism. This requires careful attention to the principles underlying the competition, and it behooves us to attend to history. We should look to previous successful inducement prize contests in other research fields in choosing a task and competition structure that obey the principles that made those competitions successful. These principles include the following:
(1) The competition should be occasional, occurring only when plausible entrants exist.
(2) The awarding process should be flexible, so awards follow the spirit of the competition rather than the letter of the rules.
(3) The results should be transparent, so that any award is given only for systems that are open and replicable in all aspects.
(4) The criteria for success should be based on absolute milestones, not relative progress.
(5) The milestones should be reasonable, that is, not so far beyond current capability that their achievement is inconceivable in any reasonable time.
The first three of these principles concern the rules of the contest, while the final two concern the task being posed. I discuss them seriatim, dispensing quickly with the rule-oriented principles to concentrate on the more substantive and crucial task-related ones.

Occasionality

The competition should be occasional, occurring only when plausible entrants exist. The frequency of testing entrants should be determined by the availability of plausible entrants, not by an artificially mandated schedule. Once one stipulates that a competition must be run, say, every year, one is stuck with the prospect of awarding a winner whether any qualitative progress has been made or not, essentially forcing a quantitative, incremental notion of progress that leads to the problems of incrementalism noted above. Successful inducement prize contests are structured so that actual tests of entrants occur only when an entrant has demonstrated a plausible chance of accomplishing the qualitative criterion. The current Kremer Prize (the 1988 Kremer International Marathon Competition) stipulates that it is run only when an entrant officially applies to make an attempt under observation by the committee. Even then, any successful attempt must be ratified by the committee based on extensive documentation provided by the entrant. Presumably to eliminate frivolous entries, entrants are subject to a nominal fee of £100, as well as the costs to the committee of observing the attempt (The Royal Aeronautical Society 1988). This principle is closely connected to the task-related principle of absoluteness, which will be discussed a little later.

Flexibility

The awarding process should be flexible, so awards follow the spirit of the competition rather than the letter of the rules. The goal of an inducement prize contest is to generate real qualitative progress.
Any statement of evaluative criteria is a means to that end, not the end in itself. It is therefore useful to include in the process flexibility in the criteria, to make sure that the spirit, and not the letter, of the law is followed. For instance, the DARPA Grand Challenge allowed for disqualifying entries "that cannot demonstrate intelligent autonomous behavior" (Schroeder [2004], p. 14). Such flexibility in determining when evaluation of an entrant is appropriate and successful allows useful wiggle room to drop frivolous attempts or gaming of the rules. For this reason, the 1714 Longitude Prize placed awarding of the prize in the hands of an illustrious committee chaired by Isaac Newton, Lucasian Professor of Mathematics. Similarly, the Kremer Prize places "interpretation of these Regulations and Conditions . . . with the Society's Council on the recommendation of the Organisers" (The Royal Aeronautical Society [1988], p. 6).

Transparency

The results should be transparent, so that any award is given only for systems that are open and replicable in all aspects. The goal of establishing an inducement prize in AI is to expand knowledge for the public good. We therefore ought to require entrants (not to mention awardees) to make available sufficient information to allow replication of their awarded event: open-source code and any required data, open access to all documentation. It may even be useful for any award to await an independent party replicating and verifying the award. There should be no award for secret knowledge. The downside of requiring openness is that potential participants may worry that their participation could poison the market for their technological breakthroughs, and therefore they would avoid participation. But to the extent that potential participants believe that there is a large market for their satisfying the award criteria, there is no reason to motivate them with the award in the first place.

Absoluteness

The criteria for success should be based on absolute milestones, not relative progress. Any competition should be based on absolute rather than relative criteria. The criterion for awarding the prize should be the satisfaction of specific milestones rather than mere improvement on some figure of merit. For example, the 1714 Longitude Act established three separate awards based on specific milestones:

That the first author or authors, discoverer or discoverers of any such method, his or their executors, administrators, or assigns, shall be entitled to, and have such reward as herein after is mentioned; that is to say, to a reward or sum of ten thousand pounds, if it determines the said longitude to one degree of a great circle, or sixty geographical miles; to fifteen thousand pounds if it determines the same to two thirds of that distance; and to twenty thousand pounds, if it determines the same to one half of that same distance. (British Parliament 1714)

Aviation and aeronautical prizes specify milestones
If a winner is awarded merely on the basis of having the best current performance on some quantitative metric, entrants will be motivated to incrementally outperform the previous best, leading to "hill climbing." This is exactly the behavior we see in funder bakeoffs. If the prevailing approach sits in some mode of the research search space with a local optimum, a strategy of trying qualitatively different approaches to find a region with a markedly better local optimum is unlikely to be rewarded with success the following year. Prospective entrants are thus given incentive to work on incremental quantitative progress, leading to reduced creativity and low risk. We see this phenomenon as well in the Loebner Competition; some two decades of events have used exactly the same techniques, essentially those of Weizenbaum's (1966) Eliza program. If, by contrast, a winner is awarded only upon hitting a milestone defined by a sufficiently large quantum of improvement, one that the organizers believe requires a qualitatively different approach to the problem, local optimization ceases to be a winning strategy, and examination of new approaches becomes more likely to be rewarded.

Reasonableness

The milestones should be reasonable, that is, not so far beyond current capability that their achievement is inconceivable in any reasonable time. Although an absolute criterion requiring qualitative advancement provides incentive away from incrementalism, it runs the risk of driving off participation if the criterion is too difficult. We see this in the qualitative part of the Loebner Prize Competition. The competition rules specify that (in addition to awarding the annual prize to whichever computer entrant performs best on the quantitative score) a gold medal would be awarded and the competition discontinued if an entrant passes a multimodal extension of the Turing test. The task is so far beyond current technology that it is safe to say that this prize has incentivized no one. Instead, the award criterion should be beyond the state of the art, but not so far that its achievement is inconceivable in any reasonable time. Here again, successful inducement prizes are revealing. The first Kremer Prize specified a human-powered flight over a figure-eight course of half a mile. It did not specify a transatlantic flight, as the Orteig Prize for powered flight did. Such a milestone would have been unreasonable. Frankly, it is the difficulty of designing a criterion that walks the fine line between a qualitative improvement unamenable to hill climbing and a reasonable goal in the foreseeable future that makes designing an inducement prize contest so tricky. Yet without finding a Goldilocks-satisfying test (not too hard, not too easy, but just right), it is not worth running a competition. The notion of reasonableness is well captured by the XPRIZE Foundation's target of "audacious but achievable" (The Economist 2015). The reasonableness requirement leads to a further consideration in choosing tasks where performance is measured on a quantitative scale. The task must have headroom. Consider again human-powered flight, measured against a metric of staying aloft over a prescribed course for a given distance. Before the invention of the airplane, human-powered flight distances would have been measured in feet, using technologies like jumping, poles, or springs.
True human-powered flight — at the level of flying animals like birds and bats — is measured in distances that are, for all practical purposes, unlimited when compared to that human performance. The task of human-powered flight thus has plenty of headroom. We can set a milestone of 50 feet or half a mile, far less than the ultimate goal of full flight, and still expect to require qualitative progress on human-powered flight. By comparison, consider the task of speech recognition as a test for intelligence. It has long been argued that speech recognition is an AI-complete task. Performance at human levels can require arbitrary knowledge and reasoning abilities. The apocryphal story about the sentence “It’s hard to wreck a nice beach” makes an important point: The speech signal underdetermines the correct transcription. Arbitrary knowledge and reasoning — real intelligence — may be required in the most subtle cases. It might be argued, then, that we could use speech transcription error rate in an inducement prize contest to promote breakthroughs in AI. The problem is that the speech recognition task has very little headroom. Although human-level performance may require intelligence, near-human-level performance does not. The difference in error rate between human speech recognition and computer speech recognition may be only a few percentage points. Using error rate is thus a fragile compass for directing research. Indeed, this requirement of reasonableness may be the hardest one to satisfy for challenges that incentivize research that leads to machine intelligence. Traditionally, incentive prize contests have aimed at breakthroughs in functionality, but intelligence short of human level is notoriously difficult to define in terms of functionality; it seems intrinsically intensional. Merely requiring a particular level of performance on a particular functionality falls afoul of what might be called Montaigne’s misconception. Michel de Montaigne in his arguing for the intelligence of animals notes the abilities of individual animals at various tasks: Take the swallows, when spring returns; we can see them ferreting through all the corners of our houses; from a thousand places they select one, finding it the most suitable place to make their nests: is that done without judgement or discernment? . . . Why does the spider make her web denser in one place and slacker in another, using this knot here and that knot there, if she cannot reflect, think, or reach conclusions? We are perfectly able to realize how superior they are to us in most of their works and how weak our artistic skills are when it comes to imitating them. Articles Our works are coarser, and yet we are aware of the faculties we use to construct them: our souls use all their powers when doing so. Why do we not consider that the same applies to animals? Why do we attribute to some sort of slavish natural inclination works that surpass all that we can do by nature or by art? (de Montaigne 1987 [1576], 19–20) Of course, an isolated ability does not intelligence make. It is the generality of cognitive performance that we attribute intelligence to. Montaigne gives each type of animal credit for the cognitive performances of all others. Swallows build, but they do not weave. Spiders weave, but they do not play chess. People, our one uncontroversial standard of intelligent being, do all of these. Turing understood this point in devising his test. 
He remarked that the functionality on which his test is based, verbal behavior, is “suitable for introducing almost any one of the fields of human endeavour that we wish to include.” (Turing [1950], p. 435) Any task based on an individual functionality that does not allow extrapolation to a sufficiently broad range of additional functionalities is not adequate as a basis for an inducement prize contest for AI, however useful the functionality happens to be. (That is not to say that such a task might not be appropriate for an inducement prize contest for its own sake.) There is tremendous variety in the functionalities on which particular computer programs surpass people, many of which require and demonstrate intelligence in humans. Chess programs play at the level of the most elite human chess players, players who rely on highly trained intelligence to obtain their performance. Neural networks recognize faces at human levels and far surpassing human speeds. Computers can recognize spoken words under noise conditions that humans find baffling. But like Montaigne’s animals, each program excels at only one kind of work. It is the generalizability of the Turing test task that results in its testing not only a particular functionality, but the flexibility we take to indicate intelligence. Furthermore, the intensional character of intelligence, that the functionality be provided “in the right way,” and not by mere memorization or brute computation, is also best tested by examining the flexibility of behavior of the subject under test. It is a tall order to find a task that allows us to generalize from performance on a single functionality to performance on a broad range of functionalities while, at the same time, being not so far beyond current capability that its achievement is inconceivable in any reasonable time. It may well be that there are no appropriate prize tasks in the intersection of audacious and achievable. Application of the Principles How do various proposals for tasks fare with respect to these principles? The three principles of flexibility, occasionality, and transparency are properties of the competition rules, not the competition task, so we can assume that an enlightened organizing body would establish them appropriately. But what of the task properties — absoluteness and reasonableness? For instance, would it be reasonable to use that most famous task for establishing intelligence in a machine, the Turing test, as the basis for an inducement prize contest for AI? The short answer is no. I am a big fan of the Turing test. I believe, and have argued in detail (Shieber 2007), that it works exceptionally well as a conceptual sufficient condition for attributing intelligence to a machine, which was, after all, its original purpose. However, just because it works as a thought experiment addressing that philosophical question does not mean that it is appropriate as a concrete task for a research competition. As an absolute criterion, the test as described by Turing is fine (though it has never been correctly put in place in any competition to date). But the Turing test is far too difficult to serve as the basis of a competition. It fails the reasonableness principle.3 Passing a full-blown Turing test is so far beyond the state of the art that it is as silly to establish that criterion in an inducement prize competition as it is to establish transatlantic human-powered flight. 
It should go without saying that watered-down versions of the Turing test based on purely relative performance among entrants are nonstarters. The AI XPRIZE rules have not yet been established, but the sample criteria that Chris Anderson has proposed (XPRIZE Foundation 2014) also fail our principles. The first part, presentation of a TED Talk on one of a set of one hundred predetermined topics, can be satisfied by a “memorizing machine” (Shieber 2014b) that has in its repertoire one hundred cached presentations. The second part, responding to some questions put to it on the topic of its presentation, is tantamount to a Turing test, and therefore fails the reasonableness criterion.4

What about special cases of the Turing test, in which the form of the queries presented to the subject under test is more limited than open-ended natural language communication, yet still requires knowledge and reasoning indicative of intelligence? The Winograd Schema Challenge (Levesque, Davis, and Morgenstern 2012) is one such proposal. The test involves determining pronoun reference in sentences of the sort first proposed by Winograd (1972, p. 33): “The city councilmen refused the demonstrators a permit because they feared violence.” Determining whether the referent of they is the city councilmen or the demonstrators requires not only a grasp of the syntax and semantics of the sentence but an understanding of and reasoning about the bureaucratic roles of governmental bodies and the social aims of activists. Presumably, human-level performance on Winograd schema queries requires human-level intelligence. The problem with the Winograd Schema Challenge may well be a lack of headroom. It might be the case that simple strategies could yield performance quite close to (but presumably not matching) human level. Such a state of affairs would make the Winograd Schema Challenge problematic as a guide for directing research toward machine intelligence.5

Are there better proposals? I hope so, though I fear there may not be any combination of task domain and award criterion that has the required properties. Intelligence may be a phenomenon about which we know sufficiently little that substantial but reasonable goals elude us for the moment. There is one plausible alternative, however. We might wait on establishing an AI inducement prize contest until such time as the passing of the Turing test itself seems audacious but achievable. That day might be quite some time away.

Acknowledgements
I am indebted to Barbara Grosz and Todd Zickler for helpful discussions on the subject of this article, as well as the participants in the AAAI “Beyond the Turing Test” workshop in January 2015 for their thoughtful comments.

Notes
1. www.aaai.org/Workshops/ws15workshops.php#ws06.
2. The XPRIZE Foundation, in cooperation with TED, announced on March 20, 2014, the intention to establish the AI XPRIZE presented by TED, described as “a modern-day Turing test to be awarded to the first A.I. to walk or roll out on stage and present a TED Talk so compelling that it commands a standing ovation from you, the audience” (XPRIZE Foundation 2014). The competition has yet to be finalized, however.
3. As an aside, it is unnecessary, and therefore counterproductive, to propose tasks that are strict supersets of the Turing test for a prize competition. For instance, tasks that extend the Turing test by requiring nontextual inputs to be handled as well — audition or vision, say — or nontextual behaviors to be generated — robotic manipulations of objects, for instance — complicate the task, making it even less reasonable than the Turing test itself already is.
4. Anderson proposes that the system answer only one or two questions, which may seem like a simplification of the task. But to the extent that it is, it can be criticized on the same grounds as other topic- and time-limited Turing tests (Shieber 2014b).
5. There are practical issues with the Winograd Schema Challenge as well. Generating appropriate challenge sentences is a specialized and labor-intensive process that may not provide the number of examples required for operating an incentive prize contest.

References
British Parliament. 1714. An Act for Providing a Publick Reward for Such Person or Persons as Shall Discover the Longitude at Sea. London: John Baskett, Printer to the Queens most Excellent Magesty and by the assigns of Thomas Newcomb, and Henry Hills, deceas’d. (cudl.lib.cam.ac.uk/view/MS-RGO-00014-00001/22).
de Montaigne, M. 1987 [1576]. An Apology for Raymond Sebond. Translated and edited with an introduction and notes by M. A. Screech. New York: Viking Penguin.
The Economist. 2015. The X-Files. The Economist. Science and Technology section, 6 May. (www.economist.com/news/science-and-technology/21651164-want-new-invention-organise-competition-and-offer-prize-x-files).
The Guardian. 2014. Computer Simulating 13-Year-Old Boy Becomes First to Pass Turing Test. The Guardian. Monday, June 9. (www.theguardian.com/technology/2014/jun/08/super-computer-simulates-13-year-old-boy-passes-turing-test).
Hayes, P., and Ford, K. 1995. Turing Test Considered Harmful. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence. San Francisco: Morgan Kaufmann Publishers.
Levesque, H. J.; Davis, E.; and Morgenstern, L. 2012. The Winograd Schema Challenge. In Proceedings of the 13th International Conference on Principles of Knowledge Representation and Reasoning, 552–561. Palo Alto, CA: AAAI Press.
Lindquist, C. 1991. Quest for Machines That Think. Computerworld.
Locke, S. 2014. Forget the Turing Test. This Is a Better Way to Measure Artificial Intelligence. Vox Technology, November 30. (www.vox.com/2014/11/30/7309879/turing-test).
Marcus, G. 2014. What Comes After the Turing Test? The New Yorker. June 9. (www.newyorker.com/tech/elements/what-comes-after-the-turing-test).
The Royal Aeronautical Society, Human Powered Flight Group. 1988. Human Powered Flight: Regulations and Conditions for the Kremer International Marathon Competition. Information Sheet, August 1988. London: The Royal Aeronautical Society. (aerosociety.com/Assets/Docs/About Us/HPAG/Rules/HP Kremer Marathon Rules.pdf).
Schroeder, A. 2004. The Application and Administration of Inducement Prizes in Technology. Technical Report IP-11-2004, Independence Institute, Golden, CO. (i2i.org/articles/IP_11_2004.pdf).
Shieber, S. M. 1994. Lessons from a Restricted Turing Test. Communications of the ACM 37(6): 70–78. dx.doi.org/10.1145/175208.175217
Shieber, S. M. 2007. The Turing Test as Interactive Proof. Noûs 41(4): 686–713. dx.doi.org/10.1111/j.1468-0068.2007.00636.x
Shieber, S. M. 2014a. No, the Turing Test Has Not Been Passed. The Occasional Pamphlet on Scholarly Communication. June 10. (blogs.harvard.edu/pamphlet/2014/06/10/no-the-turing-test-has-not-been-passed).
Shieber, S. M. 2014b. There Can Be No Turing-Test-Passing Memorizing Machines.
Philosophers’ Imprint 14(16): 1–13. (hdl.handle.net/2027/spo.3521354.0014.016). Turing, A. M. 1950. Computing Machinery and Intelligence. Mind 59(236): 433–460. dx.doi.org/10.1093/mind/LIX.236.433 Vardi, M. Y. 2014. Would Turing Have Passed the Turing Test? Communications of the ACM 57(9): 5. dx.doi.org/10.1145/ 2643596 Weizenbaum, J. 1966. ELIZA — A Computer Program for the Study of Natural Language Communication between Man and Machine. Communications of the ACM 9(1): 36–45. Winograd, T. 1972. Understanding Natural Language. Boston: Academic Press. XPRIZE Foundation. 2014. A.I. XPRIZE presented by TED. March 20. Los Angeles, CA: XPRIZE Foundation, Inc. (www.xprize.org/ ted). Stuart M. Shieber is the James O. Welch, Jr., and Virginia B. Welch professor of computer science in the School of Engineering and Applied Sciences at Harvard University. His research focuses on computational linguistics and natural language processing. He is a fellow of AAAI and ACM, the founding director of the Center for Research on Computation and Society, and a faculty codirector of the Berkman Center for Internet and Society. Articles WWTS (What Would Turing Say?) Douglas B. Lenat I Turing’s Imitation Game was a brilliant early proposed test of machine intelligence — one that is still compelling today, despite the fact that in the hindsight of all that we’ve learned in the intervening 65 years we can see the flaws in his original test. And our field needs a good “Is it AI yet?” test more than ever today, with so many of us spending our research time looking under the “shallow processing of big data” lamppost. If Turing were alive today, what sort of test might he propose? WTDS (What Turing Did/Didn’t Say) If you are reading these words, surely you are already familiar with the Imitation Game proposed by Alan Turing (1950). Or are you? Turing was heavily influenced by the World War II “game” of allied and axis pilots and ground stations each trying to fool the enemy into thinking they were friendlies. So his imagined test for AI involved an interrogator being told that he or she was about to interview a man and woman over a teletype, both of whom would be pretending to be the woman; the task was to guess which one was lying. If a machine could fool interrogators as often as a typical man, then one would have to conclude that that machine, as programmed, was as intelligent as a person (well, as intelligent as men.)1 As Judy Genova (1994) puts it, Turing’s originally proposed game involves not a question of species, but one of gender.2 The current version, where the interrogator is told he or she needs to distinguish a person from a machine, is (1) much more difficult to get a program to pass, and (2) almost all the added difficulties are largely irrelevant to intelligence! And it’s possible to muddy the waters even more by some programs appearing to do well at it due to various tricks, such as having the interviewee program claim to be a 13-year-old Ukrainian who doesn’t speak English well (University of Reading 2014), and hence having all its wrong or bizarre responses excused due to cultural, age, or language issues. Going into more detail here about why the current version of the Turing test is inadequate and distracting would be a Copyright © 2016, Association for the Advancement of Artificial Intelligence. All rights reserved. ISSN 0738-4602 SPRING 2016 97 Articles digression from my main point, so I’ve included that discussion as a sidebar to this article. 
Here, let it suffice for me to point out that one improvement would be simply to go back to his originally proposed test, or some variant of it. I’m imagining here a game similar to the TV program To Tell the Truth. Panelists (the interrogators) are told that they are talking to three people who will all be claiming that some fact is true about them (for example, they treat sick whales; they ate their brother’s bug collection; and others) and that two of the people are lying and one is telling the truth; their job is to ask questions to pick out the truth teller. In my imagined game, the interrogator is told he or she will be interviewing three people online, all claiming X, and her or his task is to pick out the one truth teller. Then we measure whether our supposed AI fools the interrogator at least as often as the human “liars” are able to. Averaged over lots of interrogators, lots of claims, and lots of liars, this might be an improvement over today’s Turing test. Does that go far enough? It still smacks of a challenge one might craft for a magician. I can imagine programs doing well at that task through tricks, but then clearly (through subsequent failed attempts to apply them) revealing themselves not to be generally intelligent after all. So let’s rethink the test from the top down. WTMS (What Turing Might Say) So what might Turing say today, if he were alive to propose a new test for machine intelligence? He was able to state the original test in one paragraph; he might first try to find an equally terse and compelling modern version. Mathematics revolutionized physics in the late nineteenth and early twentieth centuries, and “softer” sciences like psychology and sociology and AI have been yearning not to be left behind. That type of physics envy has all too often led to premature formalization, holding back progress in AI at least as much as helping it. To quote economist Robert Heilbroner, “Mathematics has given economics rigor, but alas, also mortis.” I don’t quite have enough presumption to claim that Turing would come up with the same test that I’m about to discuss, but I do believe that he’d recoil a bit at some of the tricks-based chatbots crafted in his name, and think twice before tossing off a new glib two-sentence-long test for AI. My test, like his original Imitation Game, is one for recognizing AI when it’s here. Instead of focusing on one computer program being examined for intelligence, what matters is that human beings synergizing with the AI exhibit what from our 2016 point of view would be superhuman intelligence. The way to test for that, in turn, will be to look for the many and dramatic impacts that state of affairs 98 AI MAGAZINE would have on us, on our personal and professional lives, and on the way that various aspects of society and economy work. Some of the following are no doubt wrong, and will seem naïve and even humorous 65 years from now, but I’d be genuinely surprised3 if real AI — from now on let’s just call that RAI — didn’t engender most of the following. PDA Almost everyone has a cradle-to-grave general personal assistant application that builds up an integrated model of the person’s preferences, abilities, interests, modes of learning, idiosyncratic use of terms and expressions, experiences (to analogize to), goals, plans, beliefs. Siri and Cortana are indicators of how much demand there is for such PDAs. 
The real test for this having “arrived” will be not just its universal adoption but metalevel phenomena including legislation surrounding privacy and access by law enforcement; and the rise of standards and applications using those standards that broker communication between multiple individuals’ PDAs; and marketing directed at the PDAs that will be making most of the mundane purchasing decisions in their ratava’s (the inverse of “avatar”) life. Education The popularity of massive open online courses (MOOCs) and the Khan Academy are early indicators of how much demand there is even for non-AI-based education courseware. When AI is here, we will see widespread individualized (using — and feeding back to — one’s PDA) education to the point where in effect everyone is home schooled, “schools” continuing to exist in some form to meet the infrastructure, extracurricular, and social needs of the students. A return to what appears to be the monitorial system, where much of the student’s time is spent emulating not so much a sponge (trying to absorb concepts and skills, as is true today) as emulating a teacher, a tutor, since — I think we’ve all experienced this — we often really understand something only after we’ve had to teach or explain it to someone else. In this case, the human (let’s refer to her or him as the tutor) will be tutoring one or more tutees who will likely be AIs, not other human beings. Those “tutee” AIs will be constantly assessing the tutor and deciding what mistakes to make, what confusions to have, what apparent learning (and forgetting) to exhibit, based on what will best serve that tutor pedagogically, what will be motivated by situations in that person’s real life (teaching you new things in situations where they would be useful and timely for you to know), based on the AI reasoning about what will be fun and entertaining to the person, and similar concerns that in effect blur the boundaries of what education is, compared with today. Articles Health Care The previous two impacts ripple over to this — your PDA watching out for you and helping you become a more accurately and more fully informed consumer of health-care products and services, calling attention to things in ways and at times that will make a difference in your life. From the other direction, though, RAI will enable much more individualized diagnosis and treatment; for an early step along that line, see DARPA’s Big Mechanism project, which has just begun, whose goal is to use AI to read and integrate large amounts of cancer research literature, which (coupled with patient-specific information) will enable plausible hypotheses to be formed about the pathways that your cancer is taking to grow and metastasize, and plausible treatments that might only be effective or even safe for you and a tiny sliver of other individuals. RAI (coupled with robotics only slightly more advanced than the current state of the art) will also revolutionize elderly care, given almost limitless patience, ability to recognize what their “patient”/companion is and isn’t doing (for example, exercise-wise), and so on. This will later spread to nursing care for wider populations of patients. I fear that extending this all the way to child and infant care will be one of the last applications of AI in health care due to the public’s and the media’s intolerance of error in that activity. Economy This is currently based on atoms (goods), services involving atoms, and information treated as a commodity. 
The creation and curation of knowledge is, by contrast, done for free — given away in return for your exposure to online advertising and as a gateway to other products and services. I believe that RAI will change that, profoundly, and that people will not hesitate to be charged some tiny amount (a penny, let’s say) for each useful alert, useful answer, useful suggestion. That in turn will fuel a knowledge economy in which contributors of knowledge are compensated in micropayment shares of that penny. Once this engine is jump-started, widespread vocation and avocation as knowledge contributors will become the norm. Some individuals will want and will receive the other sort of credit (citation credit) in addition or instead of monetary credit, possibly pseudonymously. Moreover, as we increase our trust in our PDA (above), it will be delegated increasing decision-making and spending authority; the old practice of items being sent to individuals “on approval” will return and human attention being paid to shopping may be relegated to hobby status, much as papermaking or home gardening today. Advertising will have to evolve or die, once consumers are better educated and increasingly the buying decisions are being made by their PDAs anyway. And ever-improving translation and (not using AI particularly) three-dimensional printing tech- nologies will make the consumer’s uncorrected physical location almost as unimportant as his or her uncorrected vision is today. The flip side of the impact of AI on the economy is that a very small fraction of the population will be needed to grow the world’s food and produce the world’s goods, as robots reliably amplify the ability of a relatively few people to meet that worldwide demand. This will lead to something that many critics will no doubt label universal socialism in their then vastly greater free time. Democracy and Government RAI will probably have a dramatic effect in this area, pummeling the status quo of these institutions from multiple directions: for example, more effective education will result in a voting public better able to perform critical thinking and to detect and correct for attempts at manipulation and at revising history. Lawmakers and the public will be able to generate populations of plausible scenarios that enable them to better assess alternative proposed policies and courses of action. Fraud and malfeasance will become more and more difficult to carry out, with multiple independent AI watchdogs always awake and alert. Government functions currently drowning in red tape, due to attempts to be frugal through standardization, may be catalyzed or even automated by RAI, which can afford to — which will inevitably — know and treat everyone as an individual. Our Personal Experience By this I mean to include various sorts of phenomena that will go from unheard of to ubiquitous once RAI arrives. These include the following. Weak Telepathy You formulate an intent, and have barely started to make a gesture to act on it, when the AI understands what you have in mind and why, and completes that action (or a better one that accomplishes your actual goal) for you; think of an old married couple finishing each other’s sentences, raised to the nth power. This isn’t of course real telepathy — hence the word weak — but functionally is almost indistinguishable from it. 
Weak Immortality Your PDA’s cradle-to-grave model of you is good enough that, even after your death, it can continue to interact with loved ones, friends, business associates, carry on conversations, carry our assigned tasks, and others; eventually this will be almost as though you never died (well, to everyone except you, of course, hence the word weak). Weak “Cloning” The quotation marks refer to the science-fiction type of duplication of you instantly as you are now, able to be in several places at once, attending to several things at once, with your one “real” biological consciousness and (through VR) awareness flitting to SPRING 2016 99 Articles The Current Turing Test Is Hard in Ways Both Unintended and Irrelevant At AAAI 2006, I went through this at length (Lenat 2008), but the gist is that Turing’s game had a human interrogator talking through a teletype with a man and a woman, both pretending that they were the woman. The experimenter measures what percentage of the time the average interrogator is wrong — identifies the wrong interviewee as being the woman. Turing’s proposed test, then, was to see if a computer could be programmed to fool the interrogator (who was still told that they were talking to a human man and a human woman!) into guessing incorrectly about which interrogatee was the woman at least as often as men were able to fool the interrogator. One could argue then that such a computer, as programmed, was intelligent. Well, at least as intelligent the typical human male.5 Why is the revised genderneutral version harder to pass and less reflective of human intelligence? If the interrogator is told that the task is to distinguish a computer from a person, then they can draw on his or her array of facts, experiences, visual and aural and olfactory and tactile capabilities, current events and history, expectations about how accurately and completely the average person remembers Shakespeare, and so on, to ask things they never would have asked under Turing’s original test, when they thought they were trying to distinguish a human man from a human woman through a teletype. Our vast storehouse of common sense also makes it more difficult to pass the “neutered” Turing test than the original version. Every time we see or hear a sentence with a pronoun, or an ambiguous word, we draw on 100 AI MAGAZINE that reservoir to decode what the author or speaker encoded into that shorthand. Most of the examples I’ve used in my talks and articles for the last 40 years (such as disambiguating the word pen in “the box is in the pen” versus “the pen is in the box”) have been borrowed and reborrowed from Bar-Hillel, Chomsky, Schank, Winograd, Woods, and — surprisingly often and effectively — from Burns and Allen. Almost all of these disambiguatings are gender neutral — men perform them about as well as women perform them — hence they simply wouldn’t come up or figure into the original Turing test, only the modern, neutered one. The previous two paragraphs listed various ways in which the gender-neutral Turing test is made vastly more difficult because of human beings’ gender-independent general knowledge and reasoning capabilities. The next few paragraphs list a few ways in which the gender-neutral Turing test is made more difficult because of gender-independent human foibles and limitations. Human beings exhibit dozens of translogical behaviors: illogical but predictable wrong decisions that most people make, incorrect but predictable wrong answers to queries. 
Since they are so predictable, an interrogator in today’s “neutered” Turing test could use these to separate human from nonhuman interrogatees, since that’s what they are told their task is. As I said in 2008 (Lenat 2008): “Some of these are very obvious and heavy-handed, hence uninteresting, but still work a surprising fraction of the time — ‘work’ meaning, here, to enable the interrogator instantly to unmask many of the programs entered into a Turing test competition as programs and not human beings: slow and errorful typing; 7 +/– 2 short-term memory size; forgetting (for example, what day of the week was April 7, 1996? What day of the week was yesterday?); wrong answers to math problems (some wrong answers being more ‘human’ than others: 93 – 25 = 78 is more understandable than if the program pretends to get a wrong answer of 0 or –9998 for that subtraction problem. [Brown and van Lehn 1980]). … Asked to decide which is more likely, ‘Fred S. just got lung cancer.’ or ‘Fred S. smokes and just got lung cancer,’ most people say the latter. People worry more about dying in a hijacked flight than the drive to the airport. They see the ‘face’ on Mars. They hold onto a losing stock too long because of ego. If a choice is presented in terms of rewards, they opt for a different alternative than if it’s presented in terms of risks. They are swayed by ads.” When faced with a difficult decision, human beings often select the alternative of inaction — if it is available to them — rather than action. One example of this is the startling statistic that in those European countries that ask driver’s license applicants to “check this box to opt in” to organ donation, there is only a 15 percent enrollment, whereas in neighboring, culturally similar countries where the form says “check this box to opt out” there is an 85 percent organ donor enrollment. That is, 85 percent don’t check the box no matter what it says! This isn’t because this decision is beneath their notice, quite the contrary: they care very deeply about the issue, but they are ambivalent, and thus their reaction is to make the choice that doesn’t require them to do anything, not even check a box on a piece of paper. Another, even more tragic, example of this “omission bias” (Ritov and Baron 1990) involves American parents’ widespread reluctance to have their children vaccinated. For more examples of these sorts of irrational yet predictable human behaviors, see, for example, Tversky and Kahneman (1983). As an exercise, imagine that an extraterrestrial lands in Austin, Texas, and wants to find out how Microsoft Word works, the program I am currently running as I type these words. The alien carefully measures the cooling fan air outflow rate and temperature, and the disk-seeking sounds that my computer makes as I type these words, and then spends 65 years trying to mimic those air-heatings and clicking noises so precisely that no one can distinguish them from the sounds my Dell PC is making right now. Absurd! Pathetic! But isn’t that in effect what the “neutered” Turing test proponents are requiring we do, requiring that our program do if it is to be adjudged to pass their test? Are we really so self-enthralled that we think it’s wise to spend our precious collective AI research time getting programs to mimic the latency delays, error rates, limited short-term memory size, omission bias, and others, of human beings? 
Those aren’t likely to be intimately tied up with intelligence, but rather just unfortunate artifacts of the platform on which human intelligence runs. They are about as relevant to intelligence as my Dell PC’s cooling fan and disk noises are to understanding how Microsoft Word works. Articles whichever of your simulated selves needs you the most at that moment. Arbitrarily Augmented Reality This includes real-time correction for what is being said around and to you, so almost no one ever mishears or misunderstands any more. It includes superimposing useful details onto what you see, so you have the equivalent of X-ray and telescopic vision, and the sort of “important objects glow” effects seen in video games, paths of glowing particles to guide you, reformulation of objects you’d prefer to see differently (but with physical boundaries and edges preserved for safety). Better-Than-Life Games and Entertainment This is of course potentially dangerous and addictive, and — like many of the above predicted indicators — may herald very serious brand new problems, not just solutions to old ones.4 I’ll close here, on that cautionary note. My purpose is not to provide answers, or even make predictions (though I seem to have done that), but rather to stimulate discussion about how we’ll know when RAI has arrived: not through some Turing test Mark II but because the world will change almost overnight if or when superhuman aliens arrive — and real AI making its appearance is likely to be the one and only time that happens. Acknowledgments The following individuals provided comments and suggestions that have helped make this article more accurate and on-point, but they should not be held accountable for any remaining inaccuracies, omissions, commissions, or inflammations: Paul Cohen, Ed Feigenbaum, Elaine Kant, Elaine Rich, and Mary Shepherd. Notes 1. Creepily, many people today in effect play this game online every day: men trying to “crash” women-only chats and forums, pedophiles pretending to be 10 year olds, MMO players lying about their gender or age, and others. 2. There remains some ambiguity (given his dialogue examples) about what Turing was proposing. But there is no ambiguity in the fact that the gender-neutral version is how the world came to recall what Turing wrote, by the time of the 1956 Dartmouth AI Summer Project, and ever since. 3. Alan Kay says that the best way to predict the future is to invent it. In that sense, these “predictions” could be recast as challenge problems for AI, a point of view consonant with Feigenbaum (2003) and Cohen (2006). 4. For example, while most of us will use AI to help us see multiple sides of an issue, to see reality more accurately and completely, AI could also be used for the opposite purpose, to filter out parts of the world that disagree with how we want to believe it to be. 5. He then gives some dialogue examples that make his intent somewhat ambiguous, but after that he returns to his main point about the computer pretending to be a man; and then discusses various possible objections to a computer ever being considered intelligent. References Brown, J. S., and VanLehn, K. 1980. Repair Theory: A Generative Theory of Bugs in Procedural Skills. Cognitive Science 4(4): 379–426. Cohen, P. 2006. If Not Turing’s Test, Then What? AI Magazine 26(4): 61–67. Feigenbaum, E. A. 2003. Some Challenges and Grand Challenges for Computational Intelligence. Journal of the Association for Computing Machinery 50(1): 32–40. Genova, J. 1994. 
Turing’s Sexual Guessing Game. Journal of Social Epistemology 8(4): 313–326. Lenat, D. B. 2008. The Voice of the Turtle: Whatever Happened to AI? AI Magazine 29(2): 11–22. Ritov, I., and Baron, J. 1990. Reluctance to Vaccinate: Omission Bias and Ambiguity. Journal of Behavioral Decision Making 3(4): 263–277. Turing, A. M. 1950. Computing Machinery and Intelligence. Mind 59(236): 433–460. Tversky, A., and Kahneman, D. 1983. Extensional Versus Intuitive Reasoning: The Conjunction Fallacy in Probability Judgment. Psychological Review 90(4): 293–315. University of Reading. 2014. Turing Test Success Marks Milestone in Computing History. Press Release, June 8, 2014. Communications Office, University of Reading, Reading, UK (www.reading.ac.uk/news-and-events/releases/ PR583836.aspx). Doug Lenat, a prolific author and pioneer in artificial intelligence, focuses on applying large amounts of structured knowledge to information management tasks. As the head of Cycorp, Lenat leads groundbreaking research in software technologies, including the formalization of common sense, the semantic integration of — and efficient inference over — massive information sources, the use of explicit contexts to represent and reason with inconsistent knowledge, and the use of existing structured knowledge to guide and strengthen the results of automated information extraction from unstructured sources. He has worked in diverse parts of AI — natural language understanding and generation, automatic program synthesis, expert systems, machine learning, and so on — for more than 40 years now. His 1976 Stanford Ph.D. dissertation, AM, demonstrated that creative discoveries in mathematics could be produced by a computer program (a theorem proposer, rather than a theorem prover) guided by a corpus of hundreds of heuristic rules for deciding which experiments to perform and judging “interestingness” of their outcomes. That work earned him the IJCAI Computers and Thought Award and sparked a renaissance in machine-learning research. Lenat was on the computer science faculties at Carnegie Mellon University and Stanford, was one of the founders of Teknowledge, and was in the first batch of AAAI Fellows. He worked with Bill Gates and Nathan Myhrvold to launch Microsoft Research Labs, and to this day he remains the only person to have served on the technical advisory boards of both Apple and Microsoft. He is on the technical advisory board of TTI Vanguard, and his interest and experience in national security has led him to regularly consult for several U.S. agencies and the White House. SPRING 2016 101 Competition Reports Summary Report of the First International Competition on Computational Models of Argumentation Matthias Thimm, Serena Villata, Federico Cerutti, Nir Oren, Hannes Strass, Mauro Vallati I We review the First International Competition on Computational Models of Argumentation (ICCMA’15). The competition evaluated submitted solvers’ performance on four different computational tasks related to solving abstract argumentation frameworks. Each task evaluated solvers in ways that pushed the edge of existing performance by introducing new challenges. Despite being the first competition in this area, the high number of competitors entered, and differences in results, suggest that the competition will help shape the landscape of ongoing developments in argumentation theory solvers. 
102 AI MAGAZINE C omputational models of argumentation are an active research discipline within artificial intelligence that has grown since the beginning of the 1990s (Dung 1995). While still a young field when compared to areas such as SAT solving and logic programming, the argumentation community is very active, with a conference series (COMMA, which began in 2006) and a variety of workshops and special issues of journals. Argumentation has also worked its way into a variety of applications. For example, Williams et al. (2015) described how argumentation techniques are used for recommending cancer treatments, while Toniolo et al. (2015) detail how argumentation-based techniques can support critical thinking and collaborative scientific inquiry or intelligence analysis. Many of the problems that argumentation deals with are computationally difficult, and applications utilizing argumentation therefore require efficient solvers. To encourage this line of research, we organised the First International Competition on Computational Models of Argumentation (ICCMA), with the intention of assessing and promoting state-of-the-art solvers for abstract argumentation problems, and to identify families of challenging benchmarks for such solvers. Copyright © 2016, Association for the Advancement of Artificial Intelligence. All rights reserved. ISSN 0738-4602 Competition Reports The objective of ICCMA’15 is to allow researchers to compare the performance of different solvers systematically on common benchmarks and rules. Moreover, as witnessed by competitions in other AI disciplines such as planning and SAT solving, we see ICCMA as a new pillar of the community, which provides information and insights on the current state of the art and highlights future challenges and developments. This report summarizes the first ICCMA held in 2015 (ICCMA’15). In this competition, solvers were invited to address standard decision and enumeration problems of abstract argumentation frameworks (Dunne and Wooldridge 2009). Solvers’ performance is evaluated based on their time taken to provide a correct solution for a problem; incorrect results were discarded. More information about the competition, including complete results and benchmarks, can be found on the ICCMA website.1 Tracks In abstract argumentation (Dung 1995), a directed graph (A, R) is used as knowledge representation formalism, where the set of nodes A are identified with the arguments under consideration and R represents a conflict-relation between arguments, that is, aRb for a, b ∈ A if a is a counterargument for b. The framework is abstract because the content of the arguments is left unspecified. They could, for example, consist of a chain of logical deductions from logic programming with defeasible rules (Simari 1992); a proof for a theorem in classical logic (Besnard and Hunter 2007); or an informal presumptive reason in favour of some conclusion (Walton, Reed, and Macagno 2008). The notion of conflict then depends on the chosen formalization. Irrespective of the precise formalization used, one can identify a subset of arguments that can be collectively accepted given interargument conflicts. Such a subset is referred to as an extension, and (Dung 1995) defined four commonly used argumentation semantics — namely the complete (CO), preferred (PR), grounded (GR), and stable (ST) semantics — each of which defines an extension differently. 
More precisely, a complete extension is a set of arguments that do not attack each other, that defends each of its members,2 and that contains every argument it defends; a preferred extension is a maximal (with regard to set inclusion) complete extension; the grounded extension is the minimal (with regard to set inclusion) complete extension; and a stable extension is a complete extension such that each argument not in the extension is attacked by at least one argument within the extension. The competition was organized around four computational tasks of abstract argumentation: (1) Given an abstract argumentation framework, determine some extension (SE). (2) Given an abstract argumentation framework, determine all extensions (EE). (3) Given an abstract argumentation framework and some argument, decide whether the given argument is contained in some extension (DC). (4) Given an abstract argumentation framework and some argument, decide whether the given argument is contained in all extensions (DS). Combining these four different tasks with the four semantics discussed above yields a total of 16 tracks that constituted ICCMA’15. Each submitted solver was free to support any number of these tracks.
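To make these definitions and tasks concrete, here is a minimal brute-force sketch in Python. It is an illustration only, not ICCMA code: the function names and the toy framework are assumptions of this sketch, and enumerating the power set is feasible only for very small graphs, whereas actual entrants used SAT, ASP, or CSP encodings or tailor-made algorithms (see Participants below).

```python
from itertools import combinations

def powerset(args):
    # All subsets of the argument set; brute force is viable only for tiny frameworks.
    s = list(args)
    return [frozenset(c) for r in range(len(s) + 1) for c in combinations(s, r)]

def conflict_free(S, attacks):
    # No argument in S attacks another argument in S.
    return not any((a, b) in attacks for a in S for b in S)

def defends(S, a, args, attacks):
    # S defends a iff every attacker of a is attacked by some member of S (see note 2).
    return all(any((c, b) in attacks for c in S) for b in args if (b, a) in attacks)

def is_complete(S, args, attacks):
    # A complete extension is conflict-free and contains exactly the arguments it defends.
    if not conflict_free(S, attacks):
        return False
    return {a for a in args if defends(S, a, args, attacks)} == set(S)

def extensions(args, attacks, semantics):
    complete = [S for S in powerset(args) if is_complete(S, args, attacks)]
    if semantics == "CO":
        return complete
    if semantics == "GR":   # the unique subset-minimal complete extension
        return [min(complete, key=len)]
    if semantics == "PR":   # subset-maximal complete extensions
        return [S for S in complete if not any(S < T for T in complete)]
    if semantics == "ST":   # conflict-free sets attacking every argument outside the set
        return [S for S in powerset(args) if conflict_free(S, attacks)
                and all(any((a, b) in attacks for a in S) for b in set(args) - S)]
    raise ValueError("unknown semantics: " + semantics)

# The four ICCMA'15 tasks, for a semantics sigma in {"CO", "PR", "GR", "ST"}:
def SE(args, attacks, sigma):        # determine some extension (None if there is none)
    exts = extensions(args, attacks, sigma)
    return exts[0] if exts else None

def EE(args, attacks, sigma):        # determine all extensions
    return extensions(args, attacks, sigma)

def DC(args, attacks, sigma, a):     # credulous acceptance: a is in some extension
    return any(a in S for S in extensions(args, attacks, sigma))

def DS(args, attacks, sigma, a):     # skeptical acceptance: a is in all extensions
    return all(a in S for S in extensions(args, attacks, sigma))

if __name__ == "__main__":
    # Toy framework: a and b attack each other, and b attacks c.
    args = {"a", "b", "c"}
    attacks = {("a", "b"), ("b", "a"), ("b", "c")}
    for sigma in ("CO", "GR", "PR", "ST"):
        print(sigma, [sorted(S) for S in EE(args, attacks, sigma)])
    print("DC-PR c:", DC(args, attacks, "PR", "c"))   # True: c is in the preferred extension {a, c}
    print("DS-PR c:", DS(args, attacks, "PR", "c"))   # False: c is not in the preferred extension {b}
```

On this toy framework the complete extensions are the empty set, {b}, and {a, c}; the grounded extension is the empty set; {b} and {a, c} are both the preferred and the stable extensions; and c is therefore credulously, but not skeptically, accepted under the preferred semantics.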
Participants
The competition received 18 solvers from research groups in Austria, China, Cyprus, Finland, France, Germany, Italy, Romania, and the UK, of which 8 were submitted to all tracks. The solvers used a variety of approaches and programming languages to solve the competition tasks. In particular, 5 solvers were based on transformations of argumentation problems to SAT, 3 on transformations to ASP, 2 on CSP, and 8 were built on tailor-made algorithms. Seven solvers were implemented in C/C++, 4 in Java, 2 used shell scripts for translations to other formalisms, and the remaining solvers were implemented in Haskell, Lisp, Prolog, Python, and Go. All participants were required to submit the source code of their solver, which was made freely available after the competition, to foster independent evaluation and exploitation in research or real-world scenarios, and to allow for further refinements. Submitted solvers were required to support the probo (Cerutti et al. 2014)3 command-line interface, which was specifically designed for running and comparing solvers within ICCMA.

Performance Evaluation
Each solver was evaluated over N different argumentation graph instances within each track (N = 192 for SE and EE, and 576 for DC and DS). Instances were generated with the intention of being challenging — one group of instances was generated so as to contain a large grounded extension and few extensions in the other semantics. This group’s graphs were large (1224 to 9473 arguments), and challenged solvers that scaled poorly (that is, those that used combinatorial approaches for computing extensions). A second group of instances was smaller (141 to 400 arguments), but had a rich structure of stable, preferred, and complete extensions (up to 159 complete extensions for the largest graphs) and thus provided combinatorial challenges for solvers relying on simple search-based algorithms. A final group contained medium-sized graphs (185 to 996 arguments) and featured many strongly connected components with many extensions. This group was particularly challenging for solvers not able to decompose the graph into smaller components.

Each solver was given 10 minutes to solve an instance. For each correctly and timely solved instance, the solver received one point, and a ranking for each track was obtained based on the points scored on all its instances. Ties were broken by considering total run time on all instances. Additionally, a global ranking of the solvers across all tracks was generated by computing the Borda count of all solvers in all tracks.

Results and Concluding Remarks
The obtained rankings for all 16 tracks can be found on the competition website.4 The global ranking identified the following top three solvers: (1) CoQuiAAS, (2) ArgSemSAT, and (3) LabSATSolver. Another solver, Cegartix, participated in only three tracks (SE-PR, EE-PR, DS-PR), but came top in all of these. It is interesting to note that these four solvers are based on SAT-solving techniques. Additionally, an answer set programming–based solver (ASPARTIX-D) came first in the four tracks related to the stable semantics; there is a strong relationship between these semantics and the answer set semantics, which probably explains its strength in these tracks. Information on the solvers and their authors can also be found on the home page of the competition. Given the success of the competition, a second iteration will take place in 2017 with an extended number of tracks.

Notes
1. argumentationcompetition.org.
2. S ⊆ A defends a if, for every attacker b of a (that is, every b with bRa), there is some c ∈ S with cRb; in other words, all attackers of a are counterattacked by S.
3. See also F. Cerutti, N. Oren, H. Strass, M. Thimm, and M. Vallati. 2015: The First International Competition on Computational Models of Argumentation (ICCMA15): Supplementary notes on probo (argumentationcompetition.org/2015/iccma15notes_v3.pdf).
4. argumentationcompetition.org/2015/results.html.

References
Besnard, P., and Hunter, A. 2007. Elements of Argumentation. Cambridge, MA: The MIT Press.
Cerutti, F.; Oren, N.; Strass, H.; Thimm, M.; and Vallati, M. 2014. A Benchmark Framework for a Computational Argumentation Competition. In Proceedings of the 5th International Conference on Computational Models of Argument, 459–460. Amsterdam: IOS Press.
Dung, P. M. 1995. On the Acceptability of Arguments and Its Fundamental Role in Nonmonotonic Reasoning, Logic Programming, and n-Person Games. Artificial Intelligence 77(2): 321–357. dx.doi.org/10.1016/0004-3702(94)00041-X
Dunne, P. E., and Wooldridge, M. 2009. Complexity of Abstract Argumentation. In Argumentation in AI, ed. I. Rahwan and G. Simari, chapter 5, 85–104. Berlin: Springer-Verlag. dx.doi.org/10.1007/978-0-387-98197-0_5
Simari, G. 1992. A Mathematical Treatment of Defeasible Reasoning and Its Implementation. Artificial Intelligence 53(2–3): 125–157. dx.doi.org/10.1016/0004-3702(92)90069-A
Toniolo, A.; Norman, T. J.; Etuk, A.; Cerutti, F.; Ouyang, R. W.; Srivastava, M.; Oren, N.; Dropps, T.; Allen, J. A.; and Sullivan, P. 2015. Agent Support to Reasoning with Different Types of Evidence in Intelligence Analysis. In Proceedings of the 14th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2015), 781–789. Richland, SC: International Foundation for Autonomous Agents and Multiagent Systems.
Walton, D. N.; Reed, C.; and Macagno, F. 2008. Argumentation Schemes. New York: Cambridge University Press. dx.doi.org/10.1017/CBO9780511802034
Williams, M.; Liu, Z. W.; Hunter, A.; and Macbeth, F. 2015. An Updated Systematic Review of Lung Chemo-Radiotherapy Using a New Evidence Aggregation Method. Lung Cancer (Amsterdam, Netherlands) 87(3): 290–5.
dx.doi.org/10.1016/j.lungcan.2014.12.004 Matthias Thimm is a senior lecturer at the Universität Koblenz-Landau, Germany. His main research interests are in knowledge representation and reasoning, particularly on aspects of uncertainty and inconsistency. Serena Villata is a researcher at CNRS, France. Her main research interests are in knowledge representation and reasoning, particularly in argumentation theory, normative systems, and the semantic web. Federico Cerutti is a lecturer at Cardiff University, UK. His main research interests are in knowledge representation and reasoning, and in computational models of trust. Nir Oren is a senior lecturer at the University of Aberdeen, UK. His research interests lie in the area of agreement technologies, with specific interests in argumentation, normative reasoning, and trust and reputation systems. Hannes Strass is a postdoctoral researcher at Leipzig University, Germany. His main research interest is in logicbased knowledge representation and reasoning. Mauro Vallati is a research fellow at the PARK research group of the University of Huddersfield, United Kingdom. His main research interest is in AI planning. He was coorganiser of the 2014 edition of the International Planning Competition (IPC). Reports A Report on the Ninth International Web Rule Symposium Adrian Paschke I The annual International Web Rule Symposium (RuleML) is an international conference on research, applications, languages, and standards for rule technologies. RuleML is a leading conference to build bridges between academe and industry in the field of rules and its applications, especially as part of the semantic technology stack. It is devoted to rule-based programming and rulebased systems including production rules systems, logic programming rule engines, and business rule engines/business rule management systems; semantic web rule languages and rule standards; rule-based event-processing languages (EPLs) and technologies; and research on inference rules, transformation rules, decision rules, production rules, and ECA rules. The Ninth International Web Rule Symposium (RuleML 2015) was held in Berlin, Germany, August 2–5. This report summarizes the events of that conference. T he Ninth International Web Rule Symposium (RuleML 2015) was held in Berlin, Germany, from August 2–5. The symposium was organized by Adrian Paschke (general chair), Fariba Sadri (program cochair), Nick Bassiliades (program cochair), and Georg Gottlob program cochair). A total number of 94 papers were submitted from which 22 full papers, 1 short paper, 2 keynote papers, 3 track papers, 4 tutorial papers, 6 industry papers, 6 challenge papers, 3 competition papers, 5 Ph.D. papers and 3 poster papers were selected. The papers were presented in multiple tracks on complex event processing, existential rules and Datalog+/–, industry applications, legal rules and reasoning, and rule learning. Following the precedent set in earlier years, RuleML also hosted the Fifth RuleML Doctoral Consortium and the Ninth International Rule Challenge as well as the RuleML Competition, which this year was dedicated to rule-based recommender systems on the web of data. A highlight of this year’s event was the industry track, which introduced six papers describing research work in innovative companies. 
New this year was also the joint RuleML / reasoning web tutorial day on the first day of the symposium, with four tutorials — TPTP World by Geoff Sutcliffe, PSOA RuleML by Harold Boley, Rulelog by Benjamin Grosof, and OASIS LegalRuleML by Tara Athan.

This year’s symposium featured three invited keynote talks. Michael Genesereth of Stanford University, USA, presented the Herbrand Manifesto. Thom Fruehwirth of the University of Ulm, Germany, presented an overview of constraint-handling rules, while Avigdor Gal of the Technion – Israel Institute of Technology, presented a framework for mining the rules that guide event creation. Very special this year was the great collocation of subevents and colocated events. A total number of 138 registered participants attended the main RuleML 2015 symposium and affiliated subevents, including the colocated Conference on Web Reasoning and Rule Systems (RR 2015), the Reasoning Web Summer School (RW 2015), and the Workshop on Formal Ontologies meet Industry (FOMI). Additionally, the Conference on Automated Deduction (CADE 2015) celebrated its 25th meeting with more than 200 participants. This “Berlin on Rules” colocation provided great opportunity for the rule-based community to meet with the automated deduction community at one of the several joint social events, including the joint reception at the Botanic Garden on Monday, August 3, the joint keynote by Michael Genesereth, the poster session on Tuesday, August 4, and the joint conference dinner at the Fischerhuette restaurant at Lake Schlachtensee on Wednesday, August 5. The welcome address at the reception was given by Ute Finckh-Krämer (Berlin, SPD, member of the German Parliament) followed by Wolfgang Bibel (University of Darmstadt) who was the invited speaker.
The additional rich social program, with a bus sightseeing tour to east, west, and downtown Berlin on Saturday, August 1, a boat sightseeing tour from Lake Wannsee to the Reichstag on Sunday, August 2, the CADE exhibitions on Wednesday, and plenty of visits to the various beer gardens, made it a memorable stay in the capital of Germany for the participants.

The RuleML 2015 Best Paper Award was given to Thomas Lukasiewicz, Maria Vanina Martinez, Livia Predoiu, and Gerardo I. Simari for their paper Existential Rules and Bayesian Networks for Probabilistic Ontological Data Exchange. The Ninth International Rule Challenge Award went to Jean-François Baget, Alain Gutierrez, Michel Leclère, Marie-Laure Mugnier, Swan Rocher, and Clément Sipieter for their paper Datalog+, RuleML, and OWL 2: Formats and Translations for Existential Rules. The winners of the RuleML 2015 Competition Award were Marta Vomlelova, Michal Kopecky, and Peter Vojtas for their paper Transformation and Aggregation Preprocessing for Top-k Recommendation GAP Rules Induction.

As in previous years, RuleML 2015 was also a place for presentations and face-to-face meetings about rule technology standardization, which this year covered OASIS LegalRuleML, RuleML 1.02 (Consumer+Deliberation+Reaction), OMG API4KB, OMG SBVR, ISO Common Logic, ISO PSL, and TPTP.

We would like to thank our sponsors, whose contributions allowed us to cover the costs of student participants and invited keynote speakers. We would also like to thank all the people who have contributed to the success of this year's special RuleML 2015 and colocated events, including the organization chairs, PC members, authors, speakers, and participants. The next RuleML symposium will be held at Stony Brook University in New York, USA, from July 5–8, 2016 (2016.ruleml.org).

Adrian Paschke is a professor and head of the Corporate Semantic Web chair (AG-CSW) at the Institute of Computer Science, Department of Mathematics and Computer Science, Freie Universitaet Berlin (FUB). He is also director of the Data Analytics Center (DANA) at Fraunhofer FOKUS and director of RuleML Inc. in Canada.

Reports

Fifteenth International Conference on Artificial Intelligence and Law (ICAIL 2015)

Katie Atkinson, Jack G. Conrad, Anne Gardner, Ted Sichelman

The 15th International Conference on AI and Law (ICAIL 2015) was held in San Diego, California, USA, June 8–12, 2015, at the University of San Diego, at the Kroc Institute, under the auspices of the International Association for Artificial Intelligence and Law (IAAIL), an organization devoted to promoting research and development in the field of AI and law with members throughout the world. The conference is held in cooperation with the Association for the Advancement of Artificial Intelligence (AAAI) and with ACM SIGAI (the Special Interest Group on Artificial Intelligence of the Association for Computing Machinery).

The 15th International Conference on AI and Law (ICAIL 2015) was held in San Diego, California, on June 8–12, 2015, and broke all prior attendance records. The conference has been held every two years since 1987, alternating between North America and (usually) Europe. The program for ICAIL 2015 included three days of plenary sessions and two days of workshops, tutorials, and related events. Attendance reached a total of 179 participants from 23 countries. Of the total, 95 were registered for the full conference and 84 for one or two days.
The work reported at the ICAIL conferences has always had two thrusts: using law as a rich domain for AI research, and using AI techniques to develop legal applications. That duality continued this year, with an increased emphasis on the applications side. Workshop topics included (1) discovery of electronically stored information, (2) law and big data, (3) automated semantic analysis of legal texts, and (4) evidence in the law. There were also two sessions for which attorneys could obtain Continuing Legal Education credit, one on AI techniques for intellectual property analytics and the other on trends in legal search and software.

The program also contained events intended to reach out to a variety of communities and audiences. There was a multilingual workshop for AI and law researchers from non-English-speaking countries, and a successful doctoral consortium was held to welcome and encourage student researchers. Two well-attended tutorials were offered for those new to the field: an introduction to AI and law, and an examination of legal ontologies.

The talks given by the invited speakers of the conference each had a different focal point: Jan Becker (Robert Bosch LLC) reported on progress in self-driving vehicles and how these vehicles obey traffic rules; Jack Conrad (Thomson Reuters), in his IAAIL Presidential Address, reflected upon past developments within AI and law and commented on current and upcoming challenges facing researchers in the field and the means to address them; Jerry Kaplan (Stanford University) explored the attribution of rights and responsibilities to AI systems under the law; Michael Luck (King's College London) discussed electronic contracts in agent-based systems and the emergence of norms within these systems.

For this 15th edition of ICAIL, 58 contributions were submitted. Of these submissions, 15 were accepted as full papers (10 pages) and 15 were accepted as research abstracts (5 pages). Four additional submissions were accepted as abstracts of system demonstrations, and these systems were showcased in a lively demo session.

In addition to the long-standing award for the best student paper, three new awards were presented at ICAIL 2015. The awards and their winners follow. The Donald Berman best student paper prize was awarded to Sjoerd Timmer (Utrecht University), for A Structure-Guided Approach to Capturing Bayesian Reasoning about Legal Evidence in Argumentation. The paper was coauthored by John-Jules Ch. Meyer, Henry Prakken, Silja Renooij, and Bart Verheij.
The Peter Jackson best innovative application paper prize was awarded to Erik Hemberg (Massachusetts Institute of Technology), Jacob Rosen (Massachusetts Institute of Technology), Geoff Warner (MITRE Corporation), Sanith Wijesinghe (MITRE Corporation), and Una-May O'Reilly (Massachusetts Institute of Technology), for their paper Tax Non-Compliance Detection Using Co-Evolution of Tax Evasion Risk and Audit Likelihood. The Carole Hafner best paper prize, memorializing an ICAIL founder who passed away in 2015, was awarded to Floris Bex (Utrecht University), for An Integrated Theory of Causal Stories and Evidential Arguments. Finally, the award for the best doctoral consortium student paper was presented to Jyothi Vinjumur (University of Maryland), for Methodology for Constructing Test Collections using Collaborative Annotation.

The conference was held at the University of San Diego, at the Joan B. Kroc Institute for Peace and Justice. Conference sponsors were the International Association for Artificial Intelligence and Law, Thomson Reuters, the University of San Diego Center for IP Law & Markets, Davis Polk & Wardwell LLP, TrademarkNow, and Legal Robot. The conference was held in cooperation with both AAAI and ACM SIGAI. Conference officials were Katie Atkinson (program chair), Ted Sichelman (conference chair), and Anne Gardner (secretary/treasurer). Further information about the conference is available at icail2015.org. The proceedings were published by the Association for Computing Machinery and are available in the ACM Digital Library.

Katie Atkinson is a professor and head of the Department of Computer Science at the University of Liverpool. She gained her Ph.D. in computer science from the University of Liverpool, and her research interests concern computational models of argument, with a particular focus on how these can be applied in the legal domain.

Jack G. Conrad is a lead research scientist with the Thomson Reuters Corporate Research and Development group. He applies his expertise in information retrieval, natural language processing, data mining, and machine learning to meet the technology needs of the company's businesses, including coverage of the legal domain, to develop capabilities for products such as WestlawNext.

Anne Gardner is an independent scholar with a longstanding interest in artificial intelligence and law. Her law degree and her Ph.D. in computer science are both from Stanford University.

Ted Sichelman is a professor of law at the University of San Diego. He teaches and writes in the areas of intellectual property, law and entrepreneurship, empirical legal studies, law and economics, computational legal studies, and tax law.

AAAI News

Spring News from the Association for the Advancement of Artificial Intelligence

AAAI Announces New Senior Member!

AAAI congratulates Wheeler Ruml (University of New Hampshire, USA) on his election to AAAI Senior Member status. This honor was announced at the recent AAAI-16 Conference in Phoenix. Senior Member status is designed to recognize AAAI members who have achieved significant accomplishments within the field of artificial intelligence. To be eligible for nomination for Senior Member, candidates must have been members of AAAI for at least five consecutive years and have been active in the professional arena for at least ten years.

Congratulations to the 2016 AAAI Award Winners!
Tom Dietterich, AAAI President, Manuela Veloso, AAAI Past President and Awards Committee Chair, and Rao Kambhampati, AAAI President-Elect, presented the AAAI Awards in February at AAAI-16 in Phoenix.

AAAI Classic Paper Award

The 2016 AAAI Classic Paper Award was given to the authors of the two papers deemed most influential from the Fifteenth National Conference on Artificial Intelligence, held in 1998 in Madison, Wisconsin, USA. The 2016 recipients of the AAAI Classic Paper Award were:

The Interactive Museum Tour-Guide Robot (Wolfram Burgard, Armin B. Cremers, Dieter Fox, Dirk Hähnel, Gerhard Lakemeyer, Dirk Schulz, Walter Steiner, and Sebastian Thrun)

Boosting Combinatorial Search through Randomization (Carla P. Gomes, Bart Selman, and Henry Kautz)

Burgard and colleagues were honored for significant contributions to probabilistic robot navigation and its integration with high-level planning methods, while Gomes, Selman, and Kautz were recognized for their significant contributions to the area of automated reasoning and constraint solving through the introduction of randomization and restarts into complete solvers. Wolfram Burgard and Carla Gomes presented invited talks during the conference in recognition of this honor. For more information about nominations for AAAI 2017 awards, please contact Carol Hamilton at [email protected].

AAAI-16 Outstanding Paper Awards

This year, AAAI's Conference on Artificial Intelligence honored the following two papers, which exemplify high standards in technical contribution and exposition by regular and student authors.

AAAI-16 Outstanding Paper Award: Bidirectional Search That Is Guaranteed to Meet in the Middle (Robert C. Holte, Ariel Felner, Guni Sharon, Nathan R. Sturtevant)

AAAI-16 Outstanding Student Paper Award: Toward a Taxonomy and Computational Models of Abnormalities in Images (Babak Saleh, Ahmed Elgammal, Jacob Feldman, Ali Farhadi)

IAAI-16 Innovative Application Awards

Each year the AAAI Conference on Innovative Applications selects the recipients of the IAAI Innovative Application Award. These case study papers must describe deployed applications, with measurable benefits, that include some aspect of AI technology. The application needs to have been in production use by its final end users long enough that the experience in use can be meaningfully collected and reported. The 2016 winners were as follows:

Deploying PAWS: Field Optimization of the Protection Assistant for Wildlife Security (Fei Fang, Thanh H. Nguyen, Rob Pickles, Wai Y. Lam, Gopalasamy R. Clements, Bo An, Amandeep Singh, Milind Tambe, Andrew Lemieux)

Ontology Re-Engineering: A Case Study from the Automotive Industry (Nestor Rychtyckyj, Baskaran Sankaranarayanan, P Sreenivasa Kumar, Deepak Khemani, Venkatesh Raman)

Deploying nEmesis: Preventing Foodborne Illness by Data Mining Social Media (Adam Sadilek, Henry Kautz, Lauren DiPrete, Brian Labus, Eric Portman, Jack Teitel, Vincent Silenzio)

Special Computing Community Consortium (CCC) Blue Sky Awards

AAAI-16, in cooperation with the CRA Computing Community Consortium (CCC), honored three papers in the Senior Member track that presented ideas and visions that can stimulate the research community to pursue new directions, such as new problems, new application domains, or new methodologies. The recipients of the 2016 Blue Sky Idea travel awards, sponsored by the CCC, were as follows:

Indefinite Scalability for Living Computation (David H. Ackley)

Embedding Ethical Principles in Collective Decision Support Systems (Joshua Greene, Francesca Rossi, John Tasioulas, Kristen Brent Venable, Brian Williams)

Five Dimensions of Reasoning in the Wild (Don Perlis)

2016 AI Video Competition Winners

The tenth annual AI video competition was held during AAAI-16, and several winning videos were honored during the awards presentation. Videos were nominated for awards in six categories, and winners received a "Shakey" award during a special award ceremony at the conference. Our thanks go to Sabine Hauert and Charles Isbell for all their work on this event. The winners were as follows:

Best Video: Machine Learning Techniques for Reorchestrating the European Anthem (François Pachet, Pierre Roy, Mathieu Ramona, Marco Marchini, Gaetan Hadjeres, Emmanuel Deruty, Benoit Carré, Fiammetta Ghedini)

Best Robot Video: A Sea of Robots (Anders Lyhne Christensen, Miguel Duarte, Vasco Costa, Tiago Rodrigues, Jorge Gomes, Fernando Silva, Sancho Oliveira)

Best Student Video: Deep Neural Networks Are Easily Fooled (Anh Nguyen, Jason Yosinski, Jeff Clune)

Most Entertaining Video: Finding Linda — A Search and Rescue Mission by SWARMIX (Mahdi Asadpour, Gianni A. Di Caro, Simon Egli, Eduardo Feo-Flushing, Danka Csilla, Dario Floreano, Luca M. Gambardella, Yannick Gasser, Linda Gerencsér, Anna Gergely, Domenico Giustiniano, Gregoire Heitz, Karin A. Hummel, Barbara Kerekes, Ádám Miklósi, Attila David Molnar, Bernhard Plattner, Maja Varga, Gábor Vásárhelyi, Jean-Christophe Zufferey)

Best Application of AI: Save the Wildlife, Save the Planet: Protection Assistant for Wildlife Security (Fei Fang, Debarun Kar, Dana Thomas, Nicole Sintov, Milind Tambe)

People's Choice: AI for Liveable Cities (Zhengxiang Pan, Han Yu, Chunyan Miao, Cyril Leung, Daniel Wei Quan Ng, Kian Khang Ong, Bo Huang, and Yaming Zhang)

AAAI gratefully acknowledges the Bristol Robotics Laboratory for help with the manufacturing of the awards. Congratulations to all the winners!

Congratulations to the 2016 AAAI Fellows!

Each year a small number of fellows are recognized for their unusual distinction in the profession and for their sustained contributions to the field for a decade or more. An official dinner and ceremony were held in their honor during AAAI-16 in Phoenix, Arizona.

Giuseppe De Giacomo (University of Rome La Sapienza, Italy): For significant contributions to the field of knowledge representation and reasoning, and applications to data integration, ontologies, planning, and process synthesis and verification.

Daniel D. Lee (University of Pennsylvania, USA): For significant contributions to machine learning and robotics, including algorithms for perception, planning, and motor control.

Bing Liu (University of Illinois at Chicago, USA): For significant contributions to data mining and development of widely used sentiment analysis, opinion spam detection, and Web mining algorithms.

Maja J. Mataric (University of Southern California, USA): For significant contributions to the advancement of multirobot coordination, learning in human-robot systems, and socially assistive robotics.

Eric Poe Xing (Carnegie Mellon University, USA): For significant contributions to statistical machine learning, its theoretical analysis, new algorithms for learning probabilistic models, and applications of these to important problems in biology, social network analysis, natural language processing, and beyond; and to the development of new architectures, system platforms, and theory for distributed machine learning programs on large-scale applications.

Zhi-Hua Zhou (Nanjing University, China): For significant contributions to ensemble methods and learning from multi-labeled and partially labeled data.

The 2016 AAAI Distinguished Service Award

The 2016 AAAI Distinguished Service Award recognizes one individual for extraordinary service to the AI community. The AAAI Awards Committee is pleased to announce that this year's recipient is Maria Gini (University of Minnesota). Professor Gini is being recognized for her outstanding contributions to the field of artificial intelligence through sustained service leading AI societies, journals, and conferences; mentoring colleagues; and working to increase participation of women in AI and computing.

Maria Gini is a professor in the Department of Computer Science and Engineering at the University of Minnesota. She studies decision making for autonomous agents in a variety of applications and contexts, ranging from distributed methods for allocation of tasks to robots, to methods for robots to explore an unknown environment, teamwork for search and rescue, and navigation in dense crowds. She is a Fellow of the Association for the Advancement of Artificial Intelligence. She is coeditor in chief of Robotics and Autonomous Systems, and is on the editorial board of numerous journals, including Artificial Intelligence and Autonomous Agents and Multi-Agent Systems.

AAAI/EAAI 2016 Outstanding Educator Award

The inaugural AAAI/EAAI Outstanding Educator Award was established in 2016 to recognize a person (or group of people) who has made major contributions to AI education that provide long-lasting benefits to the AI community. Examples might include innovating teaching methods, providing service to the AI education community, generating pedagogical resources, designing curricula, and educating students outside of higher education venues (or the general public) about AI. AAAI is pleased to announce the first corecipients of this award, Peter Norvig (Google) and Stuart Russell (University of California, Berkeley), who are being honored for their definitive text, "Artificial Intelligence: A Modern Approach," which systematized the field of artificial intelligence and inspired a new generation of scientists and engineers throughout the world, as well as for their individual contributions to education in artificial intelligence. This award is jointly sponsored by AAAI and the Symposium on Educational Advances in Artificial Intelligence.

Peter Norvig and Stuart Russell accept the AAAI/EAAI Outstanding Educator Award.

AAAI-16 Student Abstract Awards

Two awards were presented to participants in the AAAI-16 Student Abstract Program, including the Best Student 3-Minute Presentation and the Best Student Poster. Fifteen finalists in the Best Student 3-Minute Presentation category presented one-minute oral spotlight presentations during the second day of the technical conference, followed that evening by their poster presentations. Votes for both awards were cast by senior program committee members and students. The winners were as follows:

Best Student 3-Minute Presentation: Towards Structural Tractability in Hedonic Games (Dominik Peters)

Honorable Mention, Student 3-Minute Presentation: Epitomic Image Super-Resolution (Yingzhen Yang, Zhangyang Wang, Zhaowen Wang, Shiyu Chang, Ding Liu, Honghui Shi, and Thomas S. Huang)

Best Student Poster: Efficient Collaborative Crowdsourcing (Zhengxiang Pan, Han Yu, Chunyan Miao, and Cyril Leung)

AAAI President Tom Dietterich delivers his Presidential Address, Steps Toward Robust Artificial Intelligence, on Sunday, February 14, at AAAI-16.

Please Join Us for ICWSM-16 in Cologne, Germany!

The Tenth International AAAI Conference on Web and Social Media will be held May 17–20 at Maternushaus and GESIS - Leibniz Institute for the Social Sciences in Cologne, Germany. This interdisciplinary conference is a forum for researchers in computer science and social science to come together to share knowledge, discuss ideas, exchange information, and learn about cutting-edge research in diverse fields with the common theme of online social media. This overall theme includes research in new perspectives in social theories, as well as computational algorithms for analyzing social media. ICWSM is a singularly fitting venue for research that blends social science and computational approaches to answer important and challenging questions about human social behavior through social media while advancing computational tools for vast and unstructured data. ICWSM-16 will include a lively program of technical talks and posters, invited presentations, and keynote talks by Lise Getoor (University of California, Santa Cruz) and Amr Goldberg (Stanford Graduate School of Business).

Workshops and Tutorials: The ICWSM Workshop program will continue in 2016, and the Tutorial Program will return. Both will be held on the first day of the conference, May 17. For complete details about the workshop program, please see www.icwsm.org/2016/program/workshop.

Registration Is Now Open! Registration information is available at the ICWSM-16 website (www.icwsm.org/2016/attending/registration). The early registration deadline is March 25, and the late registration deadline is April 15. For full details about the conference program, please visit the ICWSM-16 website (icwsm.org) or write to [email protected].

2016 Fall Symposium Series, November 17–19: Mark Your Calendars!

The 2016 AAAI Fall Symposium Series will be held Thursday through Saturday, November 17–19, at the Westin Arlington Gateway in Arlington, Virginia, adjacent to Washington, DC. Proposals are due April 8, and accepted symposia will be announced in late April. Submissions will be due July 29, 2016. For more information, please see the 2016 Fall Symposium Series website (www.aaai.org/Symposia/Fall/fss16.php).

Join Us in New Orleans for AAAI-17

The Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17) and the Twenty-Ninth Conference on Innovative Applications of Artificial Intelligence (IAAI-17) will be held in New Orleans, Louisiana, USA, during the mid-January to mid-February timeframe. Final dates will be available by March 31, 2016. The technical conference will continue its 3.5-day schedule, either preceded or followed by the workshop and tutorial programs. AAAI-17 will arrive in New Orleans just prior to Mardi Gras, and festivities will already be underway. Enjoy legendary jazz music, the French Quarter filled with lively clubs and restaurants, world-class museums, and signature architecture. New Orleans' multicultural and diverse communities will make your choices and experience in the Big Easy unique. The 2017 Call for Papers will be available soon at www.aaai.org/Conferences/AAAI/aaai17.php. Please join us in 2017 in NOLA for a memorable AAAI!

AAAI President-Elect and Executive Council Elections

Please watch your mailboxes for an announcement of the 2016 AAAI Election. The link to the electronic version of the annual AAAI Ballot will be mailed to all regular individual AAAI members in the spring. The membership will vote for a new President-Elect (a two-year term, followed by two years as President and two additional years as Past President), as well as for four new councilors, who will each serve three-year terms. The online voting system is expected to close on June 10. Please note that the ballot will be available via the online system only. If you have not provided AAAI with an up-to-date email address, please do so immediately by writing to [email protected].

Robert S. Engelmore Memorial Lecture Award

The Robert S. Engelmore Memorial Lecture Award was established in 2003 to honor Dr. Robert S. Engelmore's extraordinary service to AAAI, AI Magazine, and the AI applications community, and his contributions to applied AI. The annual keynote lecture is presented at the Innovative Applications of Artificial Intelligence Conference. Topics encompass Engelmore's wide interests in AI, and each lecture is linked to a subsequent article published upon approval by AI Magazine. The lecturer, and therefore the author of the magazine article, is chosen jointly by the IAAI Program Committee and the Editor of AI Magazine.

AAAI congratulates the 2016 recipient of this award, Reid G. Smith (i2k Connect), who was honored for pioneering research contributions and high-impact applications in knowledge management and for extensive contributions to AAAI, including educating and inspiring the broader community about AI through AITopics. Smith presented his award lecture, "A Quarter Century of AI Applications: What We Knew Then versus What We Know Now," at the Innovative Applications of Artificial Intelligence Conference in Phoenix.

Reid G. Smith is cofounder and chief executive officer of i2k Connect, an AI technology company that transforms unstructured documents into structured data enriched with subject matter expert knowledge. Formerly, he was vice president of research and knowledge management at Schlumberger, enterprise content management director at Marathon Oil, and senior vice president at Medstory, a vertical search company purchased by Microsoft. He holds a Ph.D. in electrical engineering from Stanford University and is a Fellow of AAAI. He has served as AAAI councilor, AAAI-88 program cochair, IAAI-91 program chair, and program committee member for IAAI from its inception in 1989. He is coeditor of AITopics.

Join Us in Austin for HCOMP-16

The Fourth AAAI Conference on Human Computation and Crowdsourcing will be held October 30–November 3, 2016, at the AT&T Executive Education and Conference Center on the University of Texas at Austin campus. HCOMP-16 will be colocated with EMNLP 2016, the 2016 Conference on Empirical Methods in Natural Language Processing. HCOMP is the premier venue for disseminating the latest research findings on crowdsourcing and human computation.
While artificial intelligence (AI) and human-computer interaction (HCI) represent traditional mainstays of the conference, HCOMP believes strongly in inviting, fostering, and promoting broad, interdisciplinary research. The field is particularly notable for the diversity of disciplines it draws upon, and contributes to, ranging from human-centered qualitative studies and HCI design, to computer science and artificial intelligence, economics and the social sciences, all the way to law and policy. We promote the exchange of scientific advances in human computation and crowdsourcing not only among researchers but also among engineers and practitioners, to encourage dialogue across disciplines and communities of practice. Submissions are due May 17, 2016. For more information, please visit humancomputation.com, or write to [email protected].

AAAI Member News

Nick Jennings Receives New Year Honor

Nick Jennings, professor of computer science at the University of Southampton, has been made Companion of the Order of the Bath in the Queen's New Year Honours List for his services to computer science and national security science. Jennings, who is head of Electronics and Computer Science (ECS) at the University, has been recognized for his pioneering contributions to the fields of artificial intelligence, autonomous systems, and agent-based computing. He is the UK's only Regius Professor in Computer Science, a prestigious title awarded to the University by HM The Queen to mark her Diamond Jubilee. Jennings just completed a six-year term of office as the Chief Scientific Advisor to the UK Government in the area of National Security. Jennings is also a successful entrepreneur and is chief scientific officer for Aerogility, a 20-person start-up that develops advanced software solutions for the aerospace and defense sectors.

Bill Clancey Named NAI Fellow

William "Bill" Clancey, a senior research scientist with the Florida Institute for Human and Machine Cognition (IHMC), was named a Fellow of the National Academy of Inventors (NAI). The Tampa-based Academy named a total of 168 Fellows this week, bringing the total number of Fellows to 582. This is the fourth year that Fellows have been named. Clancey is best known for developing a work practice modeling and simulation system called Brahms, a tool for comprehensive design of work systems, relating people and automation. Using the Brahms modeling system, scientists study the flow of information and communications in real-world work settings, and the effect of automated systems. One important practical application is the coordination among air traffic controllers, pilots, and automated systems during flights. The NAI Fellows will be inducted on April 15, 2016, as part of the Fifth Annual Conference of the National Academy of Inventors at the United States Patent and Trademark Office (USPTO) in Alexandria, Virginia.

Ken Ford Named AAAS Fellow

The American Association for the Advancement of Science (AAAS) has elected Ken Ford, director and chief executive officer of the Florida Institute for Human and Machine Cognition (IHMC), as a Fellow. Ford is one of 347 scientists who have been named Fellows this year. The AAAS Council elects people "whose efforts on behalf of the advancement of science or its applications are scientifically or socially distinguished." Ford was selected "for founding and directing the IHMC, for his scientific contributions to artificial intelligence and human-centered computing, and for service to many federal agencies." IHMC, which Ford founded in 1990, is known for its groundbreaking research in the field of artificial intelligence. Ford has served on the National Science Board, chaired the NASA Advisory Council, and served on the U.S. Air Force Science Advisory Board and the Defense Science Board. Ford received an official certificate and a gold and blue rosette pin on Saturday, February 13, at the AAAS Fellows Forum during the 2016 Annual Meeting in Washington, D.C.

AAAI congratulates all three of these AAAI Fellows for their honors!

Nick Bostrom of Oxford University addresses AAAI-16 in his talk "What We Should Think about Regarding the Future of Machine Intelligence."

As part of a series of events addressing ethical issues and AI, AAAI-16 held a debate on AI's Impact on Labor Markets, with participants (left to right) Erik Brynjolfsson (MIT), Moshe Vardi (Rice University), Nick Bostrom (Oxford University), and Oren Etzioni (Allen Institute for AI). The panel was moderated by Toby Walsh (Data61) (far left).

Marvin Minsky, 1927–2016

AAAI is deeply saddened to note the death of Marvin Minsky on 25 January 2016, at the age of 88. One of the founders of the discipline of artificial intelligence, Minsky was a professor emeritus at the Massachusetts Institute of Technology, which he joined in 1958. With John McCarthy, Minsky cofounded the MIT Artificial Intelligence Laboratory. Minsky was also a founding member of the MIT Media Lab, a founder of Logo Computer Systems and Thinking Machines Corporation, and AAAI's third President. Minsky's research spanned an enormous range of fields, including mathematics, computer science, the theory of computation, neural networks, artificial intelligence, robotics, commonsense reasoning, natural language processing, and psychology. An accomplished musician, Minsky had boundless energy and creativity. He was not only a scientific pioneer and leader, but also a mentor and teacher to many. Minsky's impact on many leaders in our community was documented in the AI Magazine article "In Honor of Marvin Minsky's Contributions on his 80th Birthday" (Winter 2007), written by Danny Hillis, John McCarthy, Tom M. Mitchell, Erik T. Mueller, Doug Riecken, Aaron Sloman, and Patrick Henry Winston. His prolific writing career also included many articles within our pages. Minsky's passing is a tragic landmark for the many AI scientists he influenced personally, for the many more that he inspired intellectually, as well as for the history of the discipline. AI Magazine will celebrate his many contributions in a future issue.

AAAI Executive Council Meeting Minutes

The AAAI Executive Council met via teleconference on September 25, 2015.

Attending: Tom Dietterich, Sonia Chernova, Vincent Conitzer, Boi Faltings, Ashok Goel, Carla Gomes, Eduard Hovy, Julia Hirschberg, Charles Isbell, Rao Kambhampati, Sven Koenig, David Leake, Henry Lieberman, Diane Litman, Jen Neville, Francesca Rossi, Ted Senator, Steve Smith, Manuela Veloso, Kiri Wagstaff, Shlomo Zilberstein, Carol Hamilton.

Not Attending: Sylvie Thiebaux, Toby Walsh, Brian Williams.

Tom Dietterich convened the meeting at 6:05 AM PDT and welcomed the newly elected councilors. The new councilors gave brief statements about their interests and goals while serving on the Executive Council. The retiring councilors were also given an opportunity to offer their advice about future priorities for AAAI.
All agreed that outreach to other research communities, international outreach, and encouraging diversity should be paramount. They urged the Council to concentrate on the direction of the field and AAAI, and not get bogged down in the mechanics. Dietterich thanked the retiring councilors for their contributions to the Council and to AAAI, and encouraged them to stay involved. Standing Committee Reports Awards/Fellows/Nominating: Manuela Veloso reviewed the current nomination process for Fellows, Senior Members, Distinguished Service, and Classic Paper. She asked the Council to encourage their colleagues to nominate people for all of these honors. She noted that it is fine to ask a Fellow to nominate you, and in the case of Senior Members, self-nomination with accompanying references is the normal process. She noted that there is a new award this year, the Outstanding Educator Award, which will be cosponsored by AAAI and EAAI. Sylvie Thiebaux volunteered to serve on the selection committee, representing the Council. Veloso reported that the Nominating Committee completed its selections for the recent ballot earlier in the summer, and welcomed the new councilors, thanking them for their thoughtful ballot statements. Conference: Shlomo Zilberstein reported that Michael Wellman and Dale Schuurmans, AAAI-16 program cochairs, are doing a great job with AAAI-16 conference. Another record number of submissions was received on September 15, which indicates that the timing of the conference is continuing to work well for the community. While the conference is very prestigious, many do not realize that AAAI is more than the conference, so raising the visibility of the Association and what it does is important. Sandip Sen is spearheading sponsor recruitment, and has been working with Carol Hamilton on the annual AI Journal proposal for support of student activities, as well as other corporate sponsorships. Zilberstein noted that he is in the process of recruiting chairs for 2017. The committee discussed an overview of venues circulated earlier, and decided that New Orleans was their top choice, with Albuquerque next in line. Hamilton will enter into negotiations with New Orleans. Finally, Zilberstein noted that the Council should establish a written policy regarding its decision to make conference attendance for young families more accessible. The Council discussed the importance of continuing the initiatives started in 2015, such as outreach, ethics panels, and demos of new competitions. Zilberstein will follow up with the 2016 program chairs to be sure these issues are being addressed. Conference Outreach: Henry Lieberman noted that the conference outreach program continues to offer sister conferences publicity opportunities through various AAAI outlets. However, more conferences need to be encouraged to take advantage of this. The committee recently added the 10th IEEE International Conference on Semantic Computing to the list. Ethics: Francesca Rossi reported that the Ethics Committee has been formed, and noted the recent open letter on Autonomous Weapons signed by many members of the AI community. The Committee has not yet formed a proposal for a code of ethics, but has started gathering samples from other organizations. Finance: Ted Senator reported that the Association investment portfolio is now over $9,000,000.00, and that programs have continued to operate on budget for 2015. Due to the larger surplus from AAAI-15, it is likely that a smaller operating deficit will be realized for 2015. 
Senator noted that there will be another meeting of the Council in November to approve the 2016 budget. (Ed. note: This meeting was rescheduled to February 2016 due to extenuating circumstances.)

International: Tom Dietterich reported on behalf of Toby Walsh that AAAI recently sponsored a "Lunch with an AAAI Fellow" event at IJCAI, and this program proved to be very successful. He would like to see more of this type of outreach in the future. Sven Koenig noted that AAAI could have a stronger presence via town hall meetings, a booth, or panels and talks at IJCAI or other events. This is a strong possibility in 2016 because of the North American location of IJCAI. This would raise the visibility of AAAI and encourage the international community to join. All agreed that AAAI presidents should pursue a presence at IJCAI on a permanent basis.

AAAI President Tom Dietterich (far left) thanks members of the AAAI-16 Conference Committee for all their efforts in making the conference a great success, including AAAI-16 program cochairs Michael Wellman (second from left) and Dale Schuurmans (fourth from left).

Arizona State University's robot learning class demonstrated its autonomous robot arm at AAAI-16.

Policy/Government Relations: The newly formed Policy/Government Relations Committee will be conducting a survey, and will convene its first meeting after that survey is complete.

Membership: Ed Hovy reported that reduced membership fees for weak-currency countries have been instituted, and reminded the Council to review this program every year. The program will be up for renewal in three years. At that time, the Council should decide if membership fees need to be adjusted. However, an annual review will help avoid any long-term problems. He also reminded the Council that the other component of this program — developing strong local representatives in international locales — is equally important, and should be finalized in 2016. AAAI is continuing to offer free memberships to attendees of cooperating conferences who are new to AAAI. Several conferences have taken advantage of this opportunity. Hovy noted that the most important challenge facing AAAI is the retention of members through the development of programs that serve the needs of its members. Tom Dietterich thanked Hovy for his service as chair of the Membership Committee, and noted that he is in the process of revamping the committee assignments in light of the recent Council transition.

Publications: David Leake reviewed all of the publishing activities of the Association during the past nine months, including AI Magazine, several proceedings, and workshop and symposium technical reports. He noted that Ashok Goel has agreed to join AI Magazine as associate editor, and welcomed Goel to the group. Goel thanked Leake and Tom Dietterich for this opportunity, and said he looked forward to contributing to the magazine.

Funding Requests

CRA-W: Tom Dietterich reported that AAAI had received a request from CRA-W for $15,000.00 in support of their efforts. He proposed that AAAI contribute $5,000.00, primarily in support of outreach and diversity for women in research. During the following discussion, it was noted that CRA-W has had difficulty of late securing funding from traditional sources, and would like to attract 100–200 more women at their conference, which supports women in their first three years of graduate study.
The Council suggested that CRA-W be encouraged to colocate their conference with AAAI, but also would like to see an increased effort to establish an AAAI subgroup for women. Julia Hirschberg moved to support this request at the $5,000.00 level, and Steve Smith seconded the motion. The motion passed. Charles Isbell will follow up with CRA-W to offer complimentary one-year memberships to award recipients.

AI Topics: Tom Dietterich noted that Bruce Buchanan and Reid Smith would like to continue AI Topics, but fund it via membership revenue rather than an NSF grant, as it has been for the past several years. While this may be possible, it was decided that a clearer proposal needs to be developed to establish what the exact dollar amount would be per member. The Council also noted that the AI Alert mailing list should be opt-out rather than opt-in. An article in AI Magazine would help raise the visibility of the site, which has far better curation than Wikipedia.

Focus Group Report

Tom Dietterich reported that he ran a focus group during AAAI to explore new directions for AAAI. The group helped develop a questionnaire reflecting their discussions, and this was subsequently circulated to the AAAI membership. The top-ranked item by the membership was the pursuit of education initiatives, including summer schools for college freshmen, development of course materials, and continuing education opportunities for AAAI members. As a result, Dietterich is seeking to establish a Committee on Education and asking its members to study and implement one of the ideas detailed in the survey. He is seeking volunteers from the Council. Kiri Wagstaff volunteered to help with this committee, and moved to approve the formation of this committee as an ad hoc committee of the Executive Council. Francesca Rossi seconded the motion, and it passed unanimously. In addition, the Council moved to appoint Wagstaff as chair of the committee, which also passed with no opposing votes.

Media Committee

Tom Dietterich would like to create an ad hoc media committee, which would advise on and create social media opportunities for members, including a blog and other features. Advice on appropriate members of the committee is needed. This committee would be a subcommittee of the Publications Committee, and might also oversee the AI Topics website. Dietterich will work on finding an appropriate chair for the committee. Ted Senator moved to create the ad hoc committee, which was seconded by Steve Smith. The motion passed.

Tom Dietterich thanked everyone for their participation, and the meeting was adjourned at 8:01 AM PDT.

AAAI Conferences Calendar

This page includes forthcoming AAAI-sponsored conferences, conferences presented by AAAI affiliates, and conferences held in cooperation with AAAI. AI Magazine also maintains a calendar listing that includes nonaffiliated conferences at www.aaai.org/Magazine/calendar.php.

AAAI Sponsored Conferences

The Tenth International AAAI Conference on Web and Social Media. ICWSM-16 will be held May 17–20, 2016, in Cologne, Germany. URL: www.icwsm.org/2016

Twelfth AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment. AIIDE-16 will be held in October in the San Francisco Bay Area. URL: aiide.org

Fourth AAAI Conference on Human Computation and Crowdsourcing. HCOMP-16 will be held October 30–November 3 in Austin, Texas, USA. URL: humancomputation.com

AAAI Fall Symposium. The AAAI Fall Symposium Series will be held November 17–19 in Arlington, Virginia, adjacent to Washington, DC, USA. URL: www.aaai.org/Symposia/Fall/fss16.php

Thirty-First AAAI Conference on Artificial Intelligence. AAAI-17 will be held in January–February in New Orleans, Louisiana, USA. URL: www.aaai.org/aaai17

Twenty-Ninth Innovative Applications of Artificial Intelligence Conference. IAAI-17 will be held in January–February in New Orleans, Louisiana, USA. URL: www.aaai.org/iaai17

Conferences Held by AAAI Affiliates

15th International Conference on Principles of Knowledge Representation and Reasoning (KR 2016). KR 2016 will be held April 25–29, 2016, in Cape Town, South Africa. URL: kr2016.cs.uct.ac.za

Twenty-Ninth International Florida AI Research Society Conference. FLAIRS-2016 will be held May 16–18, 2016, in Key Largo, Florida, USA. URL: www.flairs-29.info

The 26th International Conference on Automated Planning and Scheduling. ICAPS-16 will be held June 12–17, 2016, in London, UK. URL: icaps16.icaps-conference.org

25th International Joint Conference on Artificial Intelligence. IJCAI-16 will be held July 9–15, 2016, in New York, New York, USA. URL: ijcai-16.org

Conferences Held in Cooperation with AAAI

18th International Conference on Enterprise Information Systems. ICEIS 2016 will be held April 27–30, 2016, in Rome, Italy. URL: www.iceis.org

14th International Conference on Practical Applications of Agents and Multi-Agent Systems. PAAMS-2016 will be held June 1–3, 2016, in Sevilla, Spain. URL: paams.net

9th Conference on Artificial General Intelligence. AGI-16 will be held July 16–19, 2016, in New York, New York, USA. URL: agi-conf.org/2016

Twenty-Ninth International Conference on Industrial, Engineering, and Other Applications of Applied Intelligent Systems. IEA/AIE-2016 will be held August 2–4, 2016, in Morioka, Japan. URL: www.ieaaie2016.org

Visit AAAI on Facebook!

We invite all interested individuals to check out the Facebook site by searching for AAAI. We welcome your feedback at [email protected].