User Simulation for Spoken Dialogue Systems

Diane Litman
Computer Science Department & Learning Research and Development Center, University of Pittsburgh
(Currently Leverhulme Visiting Professor, University of Edinburgh)

Joint work with Hua Ai, Intelligent Systems Program, University of Pittsburgh

Motivation: Empirical Research Requires Dialogue Corpora
• User simulation is less expensive and more efficient than human users, and yields more (and better?) data
• How realistic?
  – Power of evaluation measures: discriminative ability [AAAI WS, 2006]
  – Impact of the source corpus: subjects vs. real users [SIGDial, 2007]
  – Human assessment: validation of evaluation [ACL, 2008]
• How useful? (task dependent)
  – Dialogue system evaluation: more realistic models via knowledge consistency [Interspeech, 2007]
  – Dialogue strategy learning: utility of realistic vs. exploratory models for reinforcement learning [NAACL, 2007]

Outline
• User simulation models: previous work, our initial models
• Are more realistic models always "better"?
• Developing more realistic models via knowledge consistency
• Summary and current work

User Simulation Models
• Simulate user dialogue behaviors in simple (or at least not too complicated) ways
• How to simulate: various strategies (random, statistical, analytical)
• What to simulate: model dialogue behaviors at different levels (acoustic, lexical, semantic/intentional)

Previous Work
• Most models simulate at the intentional level and are statistically trained from human user corpora
• Bigram models
  – P(next user action | previous system action)
  – Only accept the expected dialogue acts [Eckert et al., 1997] [Levin et al., 2000]
• Goal-based models
  – Hard-coded, fixed goal structures [Scheffler, 2002]
  – P(next user action | previous system action, user goal) [Pietquin, 2004]
  – Goal- and agenda-based models [Schatzmann et al., 2007]
• Models that exploit user state commonalities
  – Linear combinations of shared features [Georgila et al., 2005]
  – Clustering [Rieser et al., 2006]
• Improving speech recognition and understanding components
  – Word-level simulation [Chung, 2004]

Our Domain: Tutoring
• ITSpoke: Intelligent Tutoring Spoken Dialogue System
  – Back-end is the Why2-Atlas system [VanLehn et al., 2002]
  – Sphinx2 speech recognition and Cepstral text-to-speech
• The system initiates a tutoring conversation with the student to correct misconceptions and to elicit explanations
• Student answers are labeled correct or incorrect

ITSpoke Corpora

  Corpus    | Student Population | System Voice | Number of Dialogues
  f03       | 2003               | Synthesized  | 100
  s05 (syn) | 2005               | Synthesized  | 136
  s05 (pre) | 2005               | Pre-recorded | 135

• Two different student groups in f03 and s05
• The systems have minor variations (e.g., voice, slightly different language models)

Our Simulation Approach
• Simulate on the word level
  – We use the answers from the real student answer sets as candidate answers for simulated students
• First step: basic simulation models
  – A random model, which gives random answers
  – A probabilistic model, which answers each question with the same correctness rate as our real students

The Random Model
• A unigram model: randomly pick a student answer from all utterances, neglecting the tutor question
• Example dialogue:
  ITSpoke: The best law of motion to use is Newton's third law. Do you recall what it says?
  Student: Down.
  ITSpoke: Newton's third law says...
  ...
  ITSpoke: Do you recall what Newton's third law says?
  Student: More.

The ProbCorrect Model
• A bigram model
  – P(Student Answer | Tutor Question)
  – Gives correct/incorrect answers with the same probability as the real students
• Example dialogue:
  ITSpoke: The best law of motion to use is Newton's third law. Do you recall what it says?
  Student: Yes, for every action, there is an equal and opposite reaction.
  ITSpoke: This is correct!
  ...
  ITSpoke: Do you recall what Newton's third law says?
  Student: No.

Outline
• User simulation models: previous work, our initial models
• Are more realistic models always "better"? Task: dialogue strategy learning
• Developing more realistic models via knowledge consistency
• Summary and current work

Learning Task
• ITSpoke can only respond to student (in)correctness, but student (un)certainty is also believed to be relevant
• Goal: learn how to manipulate the strength of tutor feedback in order to maximize student certainty

Corpus
• Part of the s05 data (with annotation): 26 human subjects, 130 dialogues
• Automatically logged: correctness (c, ic) and percent incorrectness (ic%); Kappa (automatic/manual) = 0.79
• Human annotated: certainty (cert, ncert); Kappa (two annotators) = 0.68

Sample Coded Dialogue
  ITSpoke: Which law of motion would you use?
  Student: Newton's second law. [ic, ic%=100, ncert]
  ITSpoke: Well... The best law to use is Newton's third law. Do you recall what it says?
  Student: For every action there is an equal and opposite reaction. [c, ic%=50, ncert]

Markov Decision Processes (MDPs) and Reinforcement Learning
• What is the best action for an agent to take in any state to maximize reward?
• MDP representation: states, actions, transition probabilities, reward
• Learned policy: the optimal action to take in each state

MDPs in Spoken Dialogue
• The MDP can be created offline: training data (dialogues between the system and a user simulator or human users) → MDP → policy
• Interactions with human users work online

Our MDP Action Choices
• Tutor feedback
  – Strong Feedback (SF): "This is great!"
  – Weak Feedback (WF): "Well...", which does not comment on correctness
• The strength of the tutor's feedback is strongly related to the percentage of student certainty (chi-square, p < 0.01)

Our MDP States and Rewards
• State features are derived from the certainty and correctness annotations
• Reward is based on the percentage of certain student utterances during the dialogue

Our MDP Configuration
• States
  – Representation 1: c + ic%
  – Representation 2: c + ic% + cert
• Actions: Strong Feedback, Weak Feedback
• Reward: +100 (high certainty), -100 (low certainty)

Our Reinforcement Learning Goal
• Learn an optimal policy using simulated dialogue corpora
• Example learned policy
  – Give Strong Feedback when the current student answer is incorrect and the percentage of incorrect answers is greater than 50%
  – Otherwise give Weak Feedback
• Research question: what is the impact of using different simulation models?
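The MDP setup above can be sketched in miniature. The following is a minimal, self-contained Python sketch, not the authors' implementation: the SF/WF actions, the (correctness, ic% > 50%) state encoding, and the ±100 reward follow the slides, while the simulated student's response probabilities and the every-visit Monte Carlo learning rule are invented stand-ins for illustration.

```python
import random
from collections import defaultdict

random.seed(0)

ACTIONS = ["SF", "WF"]  # Strong Feedback, Weak Feedback

def simulated_student(action):
    """Toy probabilistic simulated student (all numbers are invented):
    Strong Feedback slightly raises the chance of a correct, certain answer."""
    p_correct = 0.7 if action == "SF" else 0.6
    p_certain = 0.6 if action == "SF" else 0.45
    return random.random() < p_correct, random.random() < p_certain

def run_episode(q, eps=0.2, length=10):
    """One simulated dialogue. State = (last answer correctness, ic% > 50%).
    Returns the visited (state, action) pairs and the +/-100 final reward."""
    state, history = ("c", False), []
    n_incorrect = n_certain = 0
    for turn in range(1, length + 1):
        if random.random() > eps:                       # exploit current Q
            action = max(ACTIONS, key=lambda a: q[state, a])
        else:                                           # explore
            action = random.choice(ACTIONS)
        correct, certain = simulated_student(action)
        n_incorrect += (not correct)
        n_certain += certain
        history.append((state, action))
        state = ("c" if correct else "ic", n_incorrect / turn > 0.5)
    # Reward as on the slides: +100 for a mostly certain dialogue, else -100
    return history, (100 if n_certain / length >= 0.5 else -100)

# Every-visit Monte Carlo control: average the final reward into Q for
# every (state, action) pair visited during the episode.
q, counts = defaultdict(float), defaultdict(int)
for _ in range(5000):
    history, reward = run_episode(q)
    for state, action in history:
        counts[state, action] += 1
        q[state, action] += (reward - q[state, action]) / counts[state, action]

policy = {s: max(ACTIONS, key=lambda a: q[s, a]) for s in {sa[0] for sa in q}}
print(policy)  # learned feedback choice for each (correctness, ic%>50%) state
```

The learned policy has the same shape as the example on the slide (a feedback choice per state); with a different simulated student, the same loop would explore different regions of the state space, which is exactly the contrast studied next.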
Probabilistic Simulation Model
• Captures realistic student behavior in a probabilistic way
• For each question, answer types are sampled with the frequencies observed in the real corpus, conditioned on the feedback type:

  Answer type | Strong Feedback | Weak Feedback
  c+cert      | 5               | 4
  c+ncert     | 1               | 4
  ic+cert     | 2               | 1
  ic+ncert    | 3               | 3

Total Random Simulation Model
• Explores all possible dialogue states
• Ignores what the current question is and what feedback is given
• Randomly picks one utterance from the candidate answer set

Restricted Random Model
• A compromise between exploration of the dialogue state space and the realism of the generated user behaviors
• For each question, the four answer types are sampled uniformly under either feedback type:

  Answer type | Strong Feedback | Weak Feedback
  c+cert      | 1               | 1
  c+ncert     | 1               | 1
  ic+cert     | 1               | 1
  ic+ncert    | 1               | 1

Methodology
• Each simulation model (Probabilistic, Total Random, Restricted Random) interacts with the old system to generate a training corpus of 40,000 dialogues (Corpus1-3)
• Each training corpus is used to build an MDP and learn a policy (Policy1-3), yielding three new systems (Sys1-3)
• The Probabilistic model then interacts with each new system to generate 500 test dialogues
• For each configuration, we run the simulation models until the learned policies no longer change
• Evaluation measure: the number of test dialogues that would be assigned reward +100 using the old median split; baseline = 250

Evaluation Results

  Simulation Model  | State Rep. 1 | State Rep. 2
  Probabilistic     | 222          | 217
  Total Random      | 192          | 211
  Restricted Random | 390          | 368

• Restricted Random significantly outperforms the other two models, and its learned policies significantly outperform the baseline
• NB: results are similar with other reward functions and evaluation metrics

Discussion
• We suspect that the performance of the Probabilistic Model is harmed by data sparsity in the real corpus
  – In State Representation 1, 25.8% of the possible states do not occur in the real corpus
  – Of the most frequent states in State Representation 1: 70.1% are seen frequently in the Probabilistic training corpus, 76.3% in the Restricted Random corpus, and 65.2% in the Total Random corpus

In Sum
• When using simulation models for MDP policy training, the hypothesis is confirmed: when training from a sparse data set, it may be better to use a Restricted Random model than a more realistic Probabilistic model or a more exploratory Total Random model
• Next steps: test the learned policies with human subjects to validate the learning process, and consider the cases where we do need a realistic simulation model

Outline
• User simulation models: previous work, our initial models
• Are more realistic models always "better"?
• Developing more realistic models via knowledge consistency
• Summary and current work

A New Model & A New Measure
• From goal consistency to knowledge consistency: a student's knowledge during a tutoring session is consistent; if the student answers a question correctly, the student is more likely to answer a similar question correctly later → a new simulation model
• Knowledge consistency can be measured using learning curves: if a simulated student behaves similarly to a real student, we should see a similar learning curve in the simulated data → a new evaluation measure

The Cluster Model
• Models student learning: P(Student Answer | Cluster of Tutor Question, last Student Correctness)
• Example dialogue:
  ITSpoke: The best law of motion to use is Newton's third law. Do you recall what it says?
  Student: Yes, for every action, there is an equal reaction.
  ITSpoke: This is almost right... there is an equal and opposite reaction.
  ...
  ITSpoke: Do you recall what Newton's third law says?
  Student: Yes, for every action, there is an equal and opposite reaction.

Knowledge Component Representation
• Knowledge components are the "concepts" discussed by the tutor; the choice of grain size is determined by the instructional objectives of the designers
• A domain expert manually clustered the 210 tutor questions into 20 knowledge components (f03 data), e.g., 3rdLaw, acceleration

Sample Coded Dialogue
  ITSpoke: Do you recall what Newton's third law says? [3rdLaw]
  Student: No. [incorrect]
  ITSpoke: Newton's third law says... If you hit the wall harder, is the force of your fist acting on the wall greater or less? [3rdLaw]
  Student: Greater. [correct]

Evaluation: Learning Curves
• Learning effect: the student performs better after practicing more; we can visualize it by plotting an exponentially decreasing learning curve [PSLC, http://learnlab.web.cmu.edu/mhci_2005/documentation/design2d.html]
• Example: among all the students, 36.5% made at least one error at their 2nd opportunity to practice (error rate = 0.365)
• Standard way to plot the learning curve: first compute separate learning curves for each knowledge component, then average them to get an overall learning curve
• We only see smooth learning curves among high learners
  – High/low learners: median split based on normalized learning gain
  – Fitted learning curve: ErrorRate = 0.409 * (ith opportunity)^(-1.50)

Experiments
• Simulation models
  – The ProbCorrect model: P(A | Q), where A = student answer and Q = tutor question
  – The Cluster model: P(A | KC, C), where KC = knowledge component and C = correctness of the student's answer to the last previous question requiring the same KC
• Evaluation measures: compare simulated user dialogues to human user dialogues using automatic measures
  – New measure, based on knowledge consistency: R-squared, i.e., how well the simulated learning curve fits the learning curve observed in the real student data
  – Prior measures: high-level dialogue features [Schatzmann et al., 2005]
• Both simulation models interact with the system, generating 500 dialogues per model

Prior Evaluation Measures

  Schatzmann et al.                                                | Our measures                            | Abbreviation
  Dialogue length (number of turns)                                | Number of student/tutor turns           | Sturn, Tturn
  Turn length (number of actions per turn)                         | Total words per student/tutor turn      | Swordrate, Twordrate
  Participant activity (ratio of system/user actions per dialogue) | Ratio of system/user words per dialogue | wordRatio
  Learning feature                                                 | % of correct answers                    | CRate

Results: Prior Measures
[Figure: bar chart comparing the real students, the ProbCorrect model, and the Cluster model on Tturn, Twordrate, Sturn, Swordrate, wordRatio, and cRate]
• Neither model differs significantly from the real students on any of the prior evaluation measures
• Thus, both models can simulate realistic high-level dialogue behaviors

Results: New Measures

  Model       | R^2   | Adjusted R^2
  ProbCorrect | 0.252 | 0.102
  Cluster     | 0.564 | 0.477

• The Cluster model outperforms the ProbCorrect simulation model with respect to learning curves
• The model ranking is also validated by human judges [Ai and Litman, 2008]

In Sum
• Recall the goal: simulate consistent user behaviors based on user knowledge consistency rather than fixed user goals
• Knowledge-consistent models outperform the probabilistic model when measured by knowledge consistency measures, and do not differ on high-level dialogue measures
• A similar approach should be applicable to other temporal user processes (e.g., forgetting)

Conclusions: The Big Picture
• User simulation: how realistic?
  – Power of evaluation measures: discriminative ability [AAAI WS, 2006]
  – Impact of the source corpus: subjects vs. real users [SIGDial, 2007]
  – Human assessment: validation of evaluation [ACL, 2008]
• How useful? (task dependent)
  – Dialogue system evaluation: more realistic models via knowledge consistency [Interspeech, 2007]
  – Dialogue strategy learning: utility of realistic vs. exploratory models for reinforcement learning [NAACL, 2007]

Other ITSpoke Research
• Affect detection and adaptation in dialogue systems
  – Annotated ITSpoke corpus now available: https://learnlab.web.cmu.edu/datashop/index.jsp
• Reinforcement learning
• Using NLP and psycholinguistics to predict learning: cohesion, alignment/convergence, semantics

More details: http://www.cs.pitt.edu/~litman/itspoke.html

Questions? Thank You!
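As a closing sketch of the learning-curve evaluation described in the talk (fitting an ErrorRate = a * i^b curve and scoring a simulated curve against the real one with R^2): the code below is a minimal illustration, not the study's implementation, and every error-rate number in it is invented; only the 0.409 first-opportunity value and the power-law form echo the slides.

```python
import math

# Hypothetical per-opportunity error rates (fraction of students wrong at
# their i-th opportunity to practice a knowledge component); the real study
# averaged per-KC curves -- these numbers are invented for illustration.
real_curve = [0.409, 0.365, 0.210, 0.160, 0.120]
sim_curve  = [0.420, 0.300, 0.250, 0.150, 0.140]

def fit_power_law(curve):
    """Least-squares fit of ErrorRate = a * i**b in log-log space,
    mirroring the ErrorRate = 0.409 * i^(-1.50) form on the slides."""
    xs = [math.log(i + 1) for i in range(len(curve))]
    ys = [math.log(e) for e in curve]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return math.exp(my - b * mx), b          # a, b

def r_squared(observed, predicted):
    """Coefficient of determination: how well one curve predicts the other."""
    mean = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean) ** 2 for o in observed)
    return 1 - ss_res / ss_tot

a, b = fit_power_law(real_curve)
print(f"fitted real curve: ErrorRate = {a:.3f} * i^({b:.2f})")
print(f"R^2 of simulated vs. real curve: {r_squared(real_curve, sim_curve):.3f}")
```

A simulated student whose errors track the real students' practice effect produces a curve close to the real one and hence a high R^2, which is the intuition behind using knowledge consistency as an evaluation measure.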