Trying to Understand Misunderstanding: How Robust Can Spoken Natural Language Dialogue Systems Be?
Ronnie W. Smith
East Carolina University

Sponsors
• National Science Foundation
• Duke University
• East Carolina University
• DARPA
• BBN

Collaborators
1987-1994: Dr. Alan Biermann, Dr. Ruth Day, Dr. Robert Rodman, Richard Hipp, Barry Koster, Dania Egedi, Robin Gambill, Curry Guinn
1994-2000: Dr. Steve Gordon, Robert Hoggard, Shannon Pollard, Chris Shaffer, Greg Keim, Jeremy Mauget, Devender Singh, Jason Carrow, Brian Manning

Quick Outline
• Initial motivation and results
• Careful studies
• Not so careful studies
• What I’d really like to do!

Initial Research Focus
Develop a model of dialog processing that:
• Uses problem-solving to achieve target goal
• Exploits contextual knowledge from the discourse and user
• Allows mixed-initiative interaction (computer or user controlled)
• Allows voice interaction
• Operates in real time to permit experimentation

Experimental Evaluation
Issues:
• Is the computational model practical?
• Would users’ dialog behavior change as they gained experience and were given control of the dialog?

Hypotheses
As users gained experience and were given the initiative, they
• Would complete the task faster
• Would speak fewer utterances
• Would speak longer utterances
• Would spend more time thinking before producing a response
Side effect: analyzable data

Data Collection: The Circuit Fix-It Shop
• Constructed at Duke University
• Assisted users in repairing an electronic circuit
• 125 word vocabulary, speaker-dependent connected speech
• Varying initiative fixed per interaction
• 141 dialogs, 2840 user utterances
• 114 or 84% of the dialogs were successfully completed
• 50% utterance recognition rate
• 81.5% correct interpretation rate

Sample Interaction – Part 1
C: This is the Circuit Fix-It Shop. How may I help you?
U: This RS111 circuit is not working.
C: I am familiar with that circuit.
The LED is supposed to be displaying alternately flashing one and seven.
U: The LED is flashing seven.
C: Did you mean to say that the LED is flashing?
U: The LED is displaying a flashing seven.
C: Is the one on the LED displaying?
U: No.
C: Is anything else on the LED displaying other than the NOT** flashing seven?
U: The seven is flashing.
C: Is anything else on the LED displaying other than the flashing seven?
U: No.
C: What is the switch at when the LED is displaying only a flashing seven?
U: It is up.
C: What is the knob at when the LED is displaying only a flashing seven?
U: The knob is at ten.

Sample Interaction – Part 2
C: Is there a wire between connector one eight and connector four one?
U: Where is connector four one?
C: Below the white knob with the orange circle is a pink box, called transistors. Now, in the upper right of this pink box, called transistors is connector four one.
U: The wire is not there.
C: Did you mean to say that there IS** a wire between connectors one eight and four one?
U: There is no wire.
C: Add a wire between connector one eight and connector four one.

Initial Miscommunication Handling Techniques
Error-correcting Parsing (Hipp)
• Convert input to “nearest” grammatical utterance
• “Nearest” is determined by a cost matrix for insertions, deletions, and substitutions of words
• Costs are not all the same (e.g., “a” vs. “not”)
Tell the user what went wrong
• Only tell user what computer’s interpretation was
• Only when misrecognition caused contradictory interpretation (but required for only 48% of these)

What to Do Next?
Get a better speech recognizer! Well--
• Better is not the same as perfect!
• Better => stretch its limits anyway
• There will probably always be ungrammatical spoken inputs.
• There will always be mismatched speaker/hearer background knowledge.

What to Do Next?
Investigate strategies for the prevention, detection, and repair of miscommunication in natural language dialog
• Detailed analysis of existing dialogs
• Development and evaluation of strategies for handling miscommunication

Effects of Variable Initiative on Linguistic Behavior in Human-Computer Spoken Natural Language Dialog
Smith and Gordon (Computational Linguistics, March 1997)
• Based on Circuit Fix-It Shop Data
• Based on classifying utterances according to task phase
  – Introduction: establish task purpose
  – Assessment: establish current system behavior
  – Diagnosis: establish cause for errant behavior
  – Repair: establish completion of correction
  – Test: establish correctness of behavior

Result 1: Relative Number of Utterances

Subdialog Type    Computer-Controlled     User-Controlled
                  Average    Percent      Average    Percent
Introduction        2.9        5.2%         2.6       11.6%
Assessment         15.4       27.4%         7.9       35.1%
Diagnosis          11.8       21.0%         6.7       29.8%
Repair              2.9        5.2%         0.3        1.3%
Test               23.2       41.2%         5.0       22.2%

Conclusion: Experienced users tend not to discuss details they can handle themselves.

Result 2: Frequency of User Subdialog Transitions

Subdialog Type    Computer Controlled    User Controlled
Introduction             0.0%                 0.0%
Assessment              11.4%                19.4%
Diagnosis                0.0%                 6.8%
Repair                   0.0%                62.5%
Test                    23.7%                92.8%

Conclusion: Computer initiates most subdialogs except when experienced users are completing the task.
Result 3: Predictability of Subdialog Transitions
Idealized Transition Model
[Figure: idealized transition model over the subdialog states I, A, D, R, T, F]

Result 3: Predictability of Subdialog Transitions
Empirical Transition Model
[Figure: empirical transition model over the states I, A, D, R, T, F, with each transition labeled by its computer-controlled % and user-controlled % frequency]
Percentage “normal” dialogs
• Computer-controlled: 64%
• User-controlled: 33%

Study Conclusions
Computer controlled dialogs--
• Have an orderly pattern of computer-initiated subdialogs
• Have terse user responses
• Are not amenable to user-correction during miscommunication
User controlled dialogs--
• Are less orderly
• Contain more user-initiated subdialogs
• Indicate user willingness to exploit growing expertise

Analysis of Strategies for Selective Utterance Verification
• Smith (ANLP, 1997; IJHCS, 1998)
• Motivation: miscommunication due to speech recognition errors
Spoken: I want to fix this circuit
Recognized: power a six a circuit
Spoken: there is no wire on connector one zero four
Recognized: stays no wire I connector one zero four

Verification Subdialogs
Computer: This is the circuit fix-it shop. How may I help you?
Spoken: I want to fix a circuit.
Recognized: power a six a circuit.
Computer: Did you mean to say there is a power circuit?
WHEN TO USE THIS??

Goal: Selective Verification
• Initiate a verification subdialog only when it is believed to be needed.
• Criteria for need: sufficiently unsure you’ve fully understood AND the need to fully understand is sufficiently great.
• Terminology
  – Under-verification: the system generates an incorrect meaning that is not verified
  – Over-verification: a correct meaning is verified
• Ideal: minimize under-verifications while keeping over-verifications to a minimum as well

Measurements of Uncertainty
• Parse Cost: sum of costs incurred by the error-correcting parser in transforming input to a grammatical utterance
• Expectation Cost: how expected was the response given the dialog context

Measuring Utterance Importance
• Unexplored
• Domain-dependent?
• Fixed-threshold (depends on risk due to miscommunication)

Basic Results

Strategy            Under-verification Rate    Over-verification Rate
Parse Cost Only              2.6%                      19.2%
Expectation Only             1.4%                      31.8%
Combination                  4.7%                       7.6%

Generalization: This result holds under most threshold values.

Study Conclusions
• Key parameters for selecting strategy and threshold value: expectation strength, word recognition rate, and penalty for under-verifications
• Results depend on the level of system initiative
• Results assume all verification subdialogs succeed
• Limits to verification subdialogs => other strategies are needed.

General Problem
• “Knowing” when you don’t know.
OR
• “Suspecting” when you don’t know.
BUT
• If you don’t know, how do you talk about it?
THUS
• System must provide some information on its understanding of context

Domain-Level Techniques
• Detection of dead ends (suspicious error path checks out OK)
C: The LED is supposed to be displaying alternately flashing one and seven.
U: LED displaying seven (misrecognized and understood as “LED displaying flashing seven”)
C: Is anything else on the LED on?
U: No.
C: Is there a wire between connectors 18 and 41?
U: Yes
C: Are you sure that the LED is displaying only a flashing seven?

Domain-Level Techniques
• Detection of impossibilities (description not possible given “known” conditions)
C: What is the LED displaying?
U: The LED is displaying a one on the left hand side.
C: Is anything else on the LED on?
U: No. (NOTE: misstatement; the seven is also displaying)
C: Is the one on the LED flashing?
U: Yes.
C: Are you sure that the one on the LED is flashing?
U: Yes.
C: Are you sure that the seven on the LED is off?
U: The LED is displaying alternately flashing one and seven.

Domain-Level Techniques
In general, these are detections of logical inconsistencies, but are there general dialogue strategies for discussing them?
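
The selective verification criterion and the under-/over-verification terminology from the earlier slides can be made concrete with a small sketch. Everything numeric here is an assumption for illustration: the threshold values, the rule that the combined strategy verifies only when both costs are high, and the choice to report rates as a fraction of all utterances are not taken from the slides, and the names (`Utterance`, `parse_only`, `verification_rates`, and so on) are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple


@dataclass(frozen=True)
class Utterance:
    parse_cost: float        # cost charged by the error-correcting parser
    expectation_cost: float  # how unexpected the meaning is in this context
    correct: bool            # True if the system's interpretation was right


# Hypothetical thresholds; the slides give no numeric values.
PARSE_THRESHOLD = 2.0
EXPECTATION_THRESHOLD = 1.0


def parse_only(u: Utterance) -> bool:
    """Verify when the parser had to work hard to repair the input."""
    return u.parse_cost > PARSE_THRESHOLD


def expectation_only(u: Utterance) -> bool:
    """Verify when the meaning is unexpected in the dialog context."""
    return u.expectation_cost > EXPECTATION_THRESHOLD


def combination(u: Utterance) -> bool:
    """Assumed combination rule: verify only when both signals are high."""
    return parse_only(u) and expectation_only(u)


def verification_rates(
    utterances: Iterable[Utterance],
    should_verify: Callable[[Utterance], bool],
) -> Tuple[float, float]:
    """Return (under-verification rate, over-verification rate).

    Under-verification: an incorrect meaning that is not verified.
    Over-verification: a correct meaning that is verified.
    Both are reported as a fraction of all utterances (an assumption).
    """
    items = list(utterances)
    if not items:
        raise ValueError("need at least one utterance")
    under = sum(1 for u in items if not u.correct and not should_verify(u))
    over = sum(1 for u in items if u.correct and should_verify(u))
    return under / len(items), over / len(items)
```

Running `verification_rates` over a labeled dialog corpus once per strategy and per threshold setting gives the kind of trade-off the Basic Results table reports, with each strategy trading under-verifications against over-verifications.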
Grounding: establishing common context of understanding
• Beyond context free acknowledgments
U: The LED is displaying a flashing seven. (misunderstood)
C: Is the one on the LED displaying?
U: No.
C: Is anything else on the LED displaying other than the NOT** flashing seven?
U: The seven is flashing.
C: Is anything else on the LED displaying other than the flashing seven?
U: No.
• How much grounding is enough?

The Initiative Factor
• Explicit semantic grounding leverages human intelligence
• But only if we allow it!
C: Whether there is a wire between connectors 111 and 120 needs to be observed.
U: There is a wire between connectors 111 and 120 (misrecognized).
C: There is not supposed to be a wire between connectors 111 and 1.
U: (repeats and is understood)

The Initiative Factor
• Too much computer control?
C: What is the voltage between connectors 121 and 120? (computer suspects problem in power circuit)
U: I believe there’s a problem with the LED circuit.
C: What is the voltage between connectors 121 and 120?
Suppose instead the response is:
U: Voltage is eight (misrecognized as control knob at six).
Then computer response is okay.

Final Thoughts/Summary
• Design systems to leverage human intelligence
  – The ability to follow step-by-step instruction?
  – Varying levels of system initiative
    • VERY challenging when user expertise evolves.
    • Menus vs. keyboard shortcuts????
  – Explicit semantic grounding
    • Verification subdialogs, etc.
    • What’s the right amount?
• We still need carefully designed studies with real systems!!!
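
As a supplement to the error-correcting parsing idea listed under Initial Miscommunication Handling Techniques, the following is a minimal sketch of a "nearest grammatical utterance" search using word-level insertions, deletions, and substitutions with unequal costs. It is not Hipp's parser: a real parser works against a grammar, whereas this sketch assumes a small enumerated list of grammatical candidates, and the per-word costs and function names (`word_cost`, `parse_cost`, `nearest_utterance`) are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical costs: function words are cheap to insert or delete,
# meaning-changing words such as "not" are expensive.
DEFAULT_COST = 1.0
WORD_COST: Dict[str, float] = {"a": 0.2, "the": 0.2, "not": 3.0, "no": 3.0}


def word_cost(word: str) -> float:
    """Cost of inserting or deleting a single word."""
    return WORD_COST.get(word, DEFAULT_COST)


def substitution_cost(heard: str, wanted: str) -> float:
    """Free for identical words, otherwise the costlier of the two words."""
    if heard == wanted:
        return 0.0
    return max(word_cost(heard), word_cost(wanted))


def parse_cost(heard: Sequence[str], candidate: Sequence[str]) -> float:
    """Weighted word-level edit distance from heard words to a candidate."""
    rows, cols = len(heard) + 1, len(candidate) + 1
    dist = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):  # delete every heard word
        dist[i][0] = dist[i - 1][0] + word_cost(heard[i - 1])
    for j in range(1, cols):  # insert every candidate word
        dist[0][j] = dist[0][j - 1] + word_cost(candidate[j - 1])
    for i in range(1, rows):
        for j in range(1, cols):
            dist[i][j] = min(
                dist[i - 1][j] + word_cost(heard[i - 1]),      # deletion
                dist[i][j - 1] + word_cost(candidate[j - 1]),  # insertion
                dist[i - 1][j - 1]
                + substitution_cost(heard[i - 1], candidate[j - 1]),
            )
    return dist[-1][-1]


def nearest_utterance(heard: str, candidates: List[str]) -> Tuple[str, float]:
    """Return the cheapest candidate for the heard text and its cost."""
    if not candidates:
        raise ValueError("need at least one candidate utterance")
    words = heard.lower().split()
    cost, best = min(
        (parse_cost(words, c.lower().split()), c) for c in candidates
    )
    return best, cost
```

With the misrecognized input from the slides, "stays no wire I connector one zero four", and the two candidates "there is no wire on connector one zero four" and "there is a wire on connector one zero four", the high cost attached to "no" keeps the search from flipping the meaning: the negated candidate is the cheaper repair.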