User Simulation for Spoken
Dialogue Systems
Diane Litman
Computer Science Department &
Learning Research and Development Center
University of Pittsburgh
(Currently Leverhulme Visiting Professor, University of Edinburgh)
Joint work with Hua Ai
Intelligent Systems Program,
University of Pittsburgh
Motivation: Empirical Research requires Dialogue Corpora
User Simulation
• Less expensive, more efficient, and more (and better?) data compared to humans
• How realistic?
  – Power of evaluation measures: discriminative ability [AAAI WS, 2006]
  – Impact of the source corpus: subjects vs. real users [SIGDial, 2007]
  – Human assessment: validation of evaluation [ACL, 2008]
• How useful? Task dependent
  – Dialogue System Evaluation: more realistic models via knowledge consistency [Interspeech, 2007]
  – Dialogue Strategy Learning: utility of realistic vs. exploratory models for reinforcement learning [NAACL, 2007]
Outline
• User Simulation Models
– Previous work
– Our initial models
• Are more realistic models always “better”?
• Developing more realistic models via knowledge
consistency
• Summary and Current Work
User Simulation Models
• Simulate user dialogue behaviors in simple (or at least not too complicated) ways
• How to simulate
– Various strategies: random, statistical, analytical
• What to simulate
– Model dialogue behaviors on different levels:
acoustic, lexical, semantic / intentional
Previous Work
• Most models simulate on the intentional level, and are
statistically trained from human user corpora
• Bigram Models
– P(next user action | previous system action)
– Only accept the expected dialogue acts
[Eckert et al., 1997]
[Levin et al., 2000]
• Goal-Based Models
– Hard-coded fixed goal structures [Scheffler, 2002]
– P(next user action | previous system action, user goal)
[Pietquin, 2004]
– Goal and agenda-based models [Schatzmann et al., 2007]
Previous Work (continued)
• Models that exploit user state commonalities
– Linear combinations of shared features [Georgila et al., 2005]
– Clustering [Rieser et al., 2006]
• Improve speech recognizer and understanding
components
– Word-level simulation [Chung, 2004]
Our Domain: Tutoring
• ITSpoke: Intelligent Tutoring Spoken Dialogue System
– Back-end is Why2-Atlas system [VanLehn et al., 2002]
– Sphinx2 speech recognition and Cepstral text-to-speech
• The system initiates a tutoring conversation with the
student to correct misconceptions and to elicit
explanations
• Student answers: correct, incorrect
ITSpoke Corpora

Corpus | Student Population | System Voice | Number of Dialogues
f03    | 2003               | Synthesized  | 100
syn    | 2005 (s05)         | Synthesized  | 136
pre    | 2005 (s05)         | Pre-recorded | 135

• Two different student groups in f03 and s05
• Systems have minor variations (e.g., voice, slightly different language models)
Our Simulation Approach
• Simulate on the word level
– We use the answers from the real student
answer sets as candidate answers for
simulated students
• First step – basic simulation models
– A random model
• Gives random answers
– A probabilistic model
• Answers a question with the same correctness rate
as our real students
The Random Model
• A unigram model
– Randomly picks a student answer from all utterances, ignoring the tutor question
• Example dialogue
  ITSpoke: The best law of motion to use is Newton’s third law. Do you recall what it says?
  Student: Down.
  ITSpoke: Newton’s third law says…
  …
  ITSpoke: Do you recall what Newton’s third law says?
  Student: More.
The ProbCorrect Model
• A bigram model
– P(Student Answer | Tutor Question)
– Gives correct/incorrect answers with the same probability as the real students
• Example dialogue
  ITSpoke: The best law of motion to use is Newton’s third law. Do you recall what it says?
  Student: Yes, for every action, there is an equal and opposite reaction.
  ITSpoke: This is correct!
  …
  ITSpoke: Do you recall what Newton’s third law says?
  Student: No.
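To make the two basic models concrete, here is a minimal sketch (not the actual ITSpoke code) of how they could be implemented, assuming each tutor question has a candidate answer set harvested from the real student corpus; the class names and data layout are illustrative.

    import random

    class RandomModel:
        """Unigram model: pick any real student utterance, ignoring the tutor question."""
        def __init__(self, all_answers):
            self.all_answers = list(all_answers)   # every student utterance in the corpus

        def respond(self, tutor_question):
            return random.choice(self.all_answers)

    class ProbCorrectModel:
        """Bigram model P(student answer | tutor question): answers each question
        correctly with the same rate as the real students."""
        def __init__(self, answer_sets):
            # answer_sets: {question: (correct_answers, incorrect_answers, p_correct)}
            self.answer_sets = answer_sets

        def respond(self, tutor_question):
            correct, incorrect, p_correct = self.answer_sets[tutor_question]
            pool = correct if random.random() < p_correct else incorrect
            return random.choice(pool)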
Outline
• User Simulation Models
– Previous work
– Our initial models
• Are more realistic models always “better”?
– Task: Dialogue Strategy Learning
• Developing more realistic models via knowledge
consistency
• Summary and Current Work
Learning Task
• ITSpoke can only respond to student
(in)correctness, but student (un)certainty is
also believed to be relevant
• Goal: Learn how to manipulate the
strength of tutor feedback, in order to
maximize student certainty
Corpus
• Part of S05 data (with annotation)
– 26 human subjects, 130 dialogues
• Automatically logged
– Correctness (c, ic); percent incorrectness (ic%)
– Kappa (automatic/manual) = 0.79
• Human annotated
– certainty (cert, ncert)
– Kappa (two annotators) = 0.68
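The kappa values above are standard chance-corrected agreement figures; as a reference point only, here is a minimal sketch of Cohen's kappa for two annotators' parallel label sequences (this is not the tooling used in the study):

    from collections import Counter

    def cohens_kappa(labels_a, labels_b):
        """Chance-corrected agreement between two annotators' parallel label sequences."""
        n = len(labels_a)
        observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
        freq_a, freq_b = Counter(labels_a), Counter(labels_b)
        expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
        return (observed - expected) / (1 - expected)

    # e.g. cohens_kappa(cert_labels_1, cert_labels_2) gave 0.68 on the certainty annotation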
Sample Coded Dialogue
ITSpoke: Which law of motion would you use?
Student: Newton’s second law. [ic, ic%=100, ncert]
ITSpoke: Well… The best law to use is Newton’s third law. Do you recall what it says?
Student: For every action there is an equal and opposite reaction. [c, ic%=50, ncert]
Markov Decision Processes (MDPs)
and Reinforcement Learning
• What is the best action for an agent to take at
any state to maximize reward?
• MDP Representation
– States, Actions, Transition Probabilities
– Reward
• Learned Policy
– Optimal action to take for each state
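In standard (textbook) notation, which the slides do not spell out, the MDP is a tuple of states, actions, transition probabilities, and rewards, and the learned policy is the one that is greedy with respect to the optimal value function:

    V^{*}(s) = \max_{a \in A}\Big[R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\,V^{*}(s')\Big]
    \qquad
    \pi^{*}(s) = \operatorname*{arg\,max}_{a \in A}\Big[R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\,V^{*}(s')\Big]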
MDPs in Spoken Dialogue
• The MDP can be created offline: Training data → MDP → Policy
• Interactions take place online: Dialogue System ↔ User Simulator / Human User
Our MDP Action Choices
• Tutor feedback
– Strong Feedback (SF)
• “This is great!”
– Weak Feedback (WF)
• “Well…”, doesn’t comment on the correctness
• Strength of tutor’s feedback is strongly
related to the percentage of student
certainty (chi-square, p<0.01)
Our MDP States and Rewards
• State features are derived from Certainty
and Correctness Annotations
• Reward is based on the percentage of
Certain student utterances during the
dialogue
Our MDP Configuration
• States
– Representation 1: c + ic%
– Representation 2: c + ic% + cert
• Actions
– Strong Feedback, Weak Feedback
• Reward
– +100 (high certainty), -100 (low certainty)
Our Reinforcement Learning Goal
• Learn an optimal policy using simulated dialogue
corpora
• Example Learned Policy
– Give Strong Feedback when the current student answer
is Incorrect and the percentage of Incorrect answers is
greater than 50%
– Otherwise give Weak Feedback
• Research Question: what is the impact of using
different simulation models?
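A minimal sketch of policy learning under the configuration above (states built from correctness, binned percent incorrectness, and optionally certainty; actions SF/WF; reward +100/-100 at the end of the dialogue). The MDP estimation and tabular value iteration shown here are illustrative assumptions, not the exact toolkit used for ITSpoke.

    from collections import defaultdict

    ACTIONS = ["SF", "WF"]     # Strong / Weak Feedback
    GAMMA = 0.9                # illustrative discount factor

    def estimate_mdp(dialogues):
        """dialogues: list of (steps, final_reward) where steps is a list of
        (state, action, next_state) and final_reward is +100 or -100."""
        trans = defaultdict(lambda: defaultdict(int))   # (s, a) -> {s': count}
        reward = defaultdict(float)                     # (s, a) -> summed final reward
        for steps, final_reward in dialogues:
            for s, a, s_next in steps:
                trans[(s, a)][s_next] += 1
            last_s, last_a, _ = steps[-1]
            reward[(last_s, last_a)] += final_reward    # credit reward to the last move
        return trans, reward

    def q_value(s, a, trans, reward, V):
        nexts = trans.get((s, a), {})
        total = sum(nexts.values())
        if total == 0:
            return 0.0
        avg_r = reward.get((s, a), 0.0) / total
        return avg_r + GAMMA * sum(c / total * V.get(s2, 0.0) for s2, c in nexts.items())

    def learn_policy(dialogues, n_iters=200):
        """Tabular value iteration; returns {state: best feedback action}."""
        trans, reward = estimate_mdp(dialogues)
        states = {s for (s, _a) in trans}
        V = {s: 0.0 for s in states}
        for _ in range(n_iters):
            for s in states:
                V[s] = max(q_value(s, a, trans, reward, V) for a in ACTIONS)
        return {s: max(ACTIONS, key=lambda a: q_value(s, a, trans, reward, V)) for s in states}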
Probabilistic Simulation Model
• Capture realistic student behavior in a
probabilistic way
[Figure: for each question, the simulated student samples an answer type (c+cert, c+ncert, ic+cert, ic+ncert) with the frequencies observed in the real corpus, conditioned on whether the tutor gave Strong or Weak Feedback]
Total Random Simulation Model
• Explore all possible dialogue states
• Ignores what the current question is or
what feedback is given
• Randomly picks one utterance from the
candidate answer set
Restricted Random Model
• A compromise between exploring the dialogue state space and the realism of the generated user behaviors
[Figure: for each question, every answer type (c+cert, c+ncert, ic+cert, ic+ncert) is equally likely, under both Strong and Weak Feedback]
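The three models differ only in how they choose an answer for each tutor question; the sketch below contrasts them (the data structures are illustrative, and the real models return actual student utterances from the candidate answer sets):

    import random

    ANSWER_TYPES = ["c+cert", "c+ncert", "ic+cert", "ic+ncert"]

    def probabilistic(question, feedback, observed_counts):
        """Realistic: sample an answer type with the frequencies seen in the real
        corpus for this question/feedback context."""
        counts = observed_counts[(question, feedback)]      # e.g. {"c+cert": 4, ...}
        types, weights = zip(*counts.items())
        return random.choices(types, weights=weights, k=1)[0]

    def total_random(all_utterances):
        """Exploratory: ignore the question and the feedback; pick any utterance."""
        return random.choice(all_utterances)

    def restricted_random(question, answers_by_type):
        """Compromise: answer the current question, but make every answer type
        equally likely regardless of the feedback given."""
        answer_type = random.choice(ANSWER_TYPES)
        return random.choice(answers_by_type[(question, answer_type)])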
Methodology
[Diagram: each simulation model (Probabilistic, Total Random, Restricted Random) interacts with the old system to generate a 40,000-dialogue training corpus (Corpus1, Corpus2, Corpus3); MDP learning on each corpus produces Policy1, Policy2, Policy3, which are deployed in Sys1, Sys2, Sys3; each new system is then run for 500 test dialogues with the Probabilistic model]
Methodology (continued)
• For each configuration, we run the simulation models until the learned policies no longer change
• Evaluation measure
– number of dialogues that would be assigned
reward +100 using the old median split
– Baseline = 250
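The evaluation measure amounts to counting test dialogues that land above the median certainty value computed on the original human corpus; a one-function sketch (names are illustrative):

    def count_high_reward_dialogues(test_percent_certain, old_median):
        """Number of test dialogues that would receive reward +100, i.e. whose
        percent-certain value is above the old median split.  With 500 test
        dialogues per policy, chance performance (the baseline) is 250."""
        return sum(1 for pct in test_percent_certain if pct > old_median)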
Evaluation Results

Simulation Model  | State Rep. 1 | State Rep. 2
Probabilistic     | 222          | 217
Total Random      | 192          | 211
Restricted Random | 390          | 368

• Restricted Random significantly outperforms the other two models
• Only the Restricted Random policies (390, 368) significantly outperform the baseline of 250
• NB: Results are similar with other reward functions and evaluation metrics
Discussion
• We suspect that the performance of the Probabilistic Model is hurt by data sparsity in the real corpus
– In State Representation 1, 25.8% of the possible
states do not exist in the real corpus
– Of the most frequent states in State Representation 1:
  • 70.1% are seen frequently in the Probabilistic training corpus
  • 76.3% are seen frequently in the Restricted Random corpus
  • 65.2% are seen frequently in the Total Random corpus
In Sum
• When using simulation models for MDP
policy training
– Hypothesis confirmed: when trained from a sparse
data set, it may be better to use a Restricted Random
Model than a more realistic Probabilistic Model or a
more exploratory Total Random Model
• Next Step:
– Test the learned policies with human subjects to
validate the learning process
– What about the cases where we do need a realistic simulation model?
Outline
• User Simulation Models
– Previous work
– Our initial models
• Are more realistic models always “better”?
• Developing more realistic models via knowledge
consistency
• Summary and Current Work
A New Model & A New Measure
• Knowledge consistency (rather than goal consistency)
• A new simulation model: a student’s knowledge during a tutoring session is consistent; if the student answers a question correctly, the student is more likely to answer a similar question correctly later
• A new evaluation measure: knowledge consistency can be measured using learning curves; if a simulated student behaves similarly to a real student, we should see a similar learning curve in the simulated data
The Cluster Model
• Model student learning
  – P(Student Answer | Cluster of Tutor Question, last Student Correctness)
• Example dialogue
  ITSpoke: The best law of motion to use is Newton’s third law. Do you recall what it says?
  Student: Yes, for every action, there is an equal reaction.
  ITSpoke: This is almost right… there is an equal and opposite reaction.
  …
  ITSpoke: Do you recall what Newton’s third law says?
  Student: Yes, for every action, there is an equal and opposite reaction.
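A minimal sketch of the Cluster model’s conditioning, assuming the real students’ answers are pooled by knowledge component and by the correctness of the last answer on that component (class and field names are illustrative):

    import random

    class ClusterModel:
        """P(student answer | knowledge component of the question,
        correctness of the last answer on the same component)."""
        def __init__(self, answer_pools, question_to_kc):
            self.pools = answer_pools        # {(kc, last_correct): [(answer, is_correct), ...]}
            self.q_to_kc = question_to_kc    # {tutor question: knowledge component}
            self.last_correct = {}           # the simulated student's per-KC memory

        def respond(self, tutor_question):
            kc = self.q_to_kc[tutor_question]
            prev = self.last_correct.get(kc)              # None before the first attempt
            answer, is_correct = random.choice(self.pools[(kc, prev)])
            self.last_correct[kc] = is_correct            # knowledge stays consistent
            return answer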
Knowledge Component
Representation
• Knowledge component – “concepts” discussed by
the tutor
• The choice of grain size is determined by the
instructional objectives of the designers
• A domain expert manually clustered the 210 tutor
questions into 20 knowledge components (f03 data)
– E.g., 3rdLaw, acceleration, etc.
Sample Coded Dialogue
ITSpoke: Do you recall what Newton’s third law says? [3rdLaw]
Student: No. [incorrect]
ITSpoke: Newton’s third law says … If you hit the wall harder, is the force of your fist acting on the wall greater or less? [3rdLaw]
Student: Greater. [correct]
Evaluation: Learning Curves (1)
• Learning effect – the student performs
better after practicing more
• We can visualize the learning effect by
plotting an exponentially decreasing
learning curve
[PSLC,
http://learnlab.web.cmu.edu/mhci_2005/documentation/design2d.html]
Learning Curves (2)
[Learning curve plot: error rate vs. opportunity to practice]
• Among all the students, 36.5% made at least one error at their 2nd opportunity to practice (the 0.365 point on the curve)
Learning Curves (3)
• Standard way to plot the learning curve
– First compute a separate learning curve for each knowledge component, then average them to get an overall learning curve
• We only see smooth learning curves among high
learners
– High/Low Learners: median split based on normalized
learning gain
– Learning Curve: mathematical representation: ErrorRate = 0.409 * (ith Opportunity)^(-1.50)
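A sketch of how a curve of this form can be fit (least squares on the logs of opportunity number and error rate); the helper names are illustrative, and the 0.409 / -1.50 values are simply the fit reported above for the high learners:

    import math

    def fit_power_law(opportunities, error_rates):
        """Fit ErrorRate = A * opportunity**b by linear regression in log-log space
        (error rates must be > 0)."""
        xs = [math.log(o) for o in opportunities]
        ys = [math.log(e) for e in error_rates]
        n = len(xs)
        mean_x, mean_y = sum(xs) / n, sum(ys) / n
        b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
        A = math.exp(mean_y - b * mean_x)
        return A, b          # the slide's curve corresponds to A = 0.409, b = -1.50

    def predicted_error_rate(opportunity, A=0.409, b=-1.50):
        return A * opportunity ** b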
Experiments (1)
• Simulation Models
  – ProbCorrect Model P(A | Q)
    • A: Student Answer
    • Q: Tutor Question
  – Cluster Model P(A | KC, C)
    • A: Student Answer
    • KC: Knowledge Component
    • C: Correctness of the student’s answer to the previous question requiring the same KC
Experiments (2)
• Evaluation Measures: Compare simulated user
dialogues to human user dialogues using automatic
measures
• New Measure: User Processing based on
Knowledge Consistency
– R-squared: how well the simulated learning curve fits the learning curve observed in the real student data
• Prior Measures: High-level Dialogue Features
[Schatzmann et al., 2005]
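A sketch of the new measure: compute the learning-curve points from the simulated dialogues and score them against the curve observed (or fitted) on the real student data with R-squared, optionally adjusted for the two fitted parameters; the exact regression setup here is an assumption.

    def r_squared(real_curve, simulated_curve):
        """Coefficient of determination of the simulated learning-curve points
        against the real students' curve (both are lists of error rates per
        opportunity)."""
        mean_real = sum(real_curve) / len(real_curve)
        ss_res = sum((r - s) ** 2 for r, s in zip(real_curve, simulated_curve))
        ss_tot = sum((r - mean_real) ** 2 for r in real_curve)
        return 1.0 - ss_res / ss_tot

    def adjusted_r_squared(r2, n_points, n_params=2):
        """Penalize for the number of fitted parameters (A and b above)."""
        return 1.0 - (1.0 - r2) * (n_points - 1) / (n_points - n_params - 1)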
Prior Evaluation Measures

Schatzmann et al. | Our measures | Abbreviation
High-level Dialogue Features:
Dialogue Length (number of turns) | Number of student/tutor turns | Sturn, Tturn
Turn Length (number of actions per turn) | Total words per student/tutor turn | Swordrate, Twordrate
Participant Activity (ratio of system/user actions per dialogue) | Ratio of system/user words per dialogue | wordRatio
Learning Feature | % of Correct Answers | cRate
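A sketch of how the high-level features in the table could be computed from a logged dialogue, represented here as a list of (speaker, utterance) turns; the cRate feature would additionally need per-answer correctness labels. The representation is an assumption for illustration.

    def dialogue_features(dialogue):
        """dialogue: list of (speaker, utterance) with speaker in {"student", "tutor"}."""
        s_turns = [u for spk, u in dialogue if spk == "student"]
        t_turns = [u for spk, u in dialogue if spk == "tutor"]
        s_words = sum(len(u.split()) for u in s_turns)
        t_words = sum(len(u.split()) for u in t_turns)
        return {
            "Sturn": len(s_turns),                        # number of student turns
            "Tturn": len(t_turns),                        # number of tutor turns
            "Swordrate": s_words / max(len(s_turns), 1),  # words per student turn
            "Twordrate": t_words / max(len(t_turns), 1),  # words per tutor turn
            "wordRatio": t_words / max(s_words, 1),       # system/user words per dialogue
        }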
Experiments (3)
• Simulation Models
– The ProbCorrect Model P(A | Q)
– The Cluster Model P(A | KC, c)
• Evaluation Measures
– Previously proposed Evaluation Measures
– Knowledge Consistency Measures
• Both of the simulation models interact with the
system, generating 500 dialogues for each
model
Results: Prior Measures
[Bar chart: real students vs. the ProbCorrect and Cluster (man_Cluster) models on Tturn, Twordrate, Sturn, Swordrate, wordRatio, and cRate]
• Neither model differs significantly from the real students on any of the original evaluation measures
• Thus, both models can simulate realistic high-level
dialogue behaviors
Results: New Measures

Model       | ProbCorrect | Cluster
R²          | 0.252       | 0.564
adjusted R² | 0.102       | 0.477

• The Cluster model outperforms the ProbCorrect simulation model with respect to learning curves
• Model ranking also validated by human judges [Ai and Litman, 2008]
In Sum
• Recall goal: simulate consistent user
behaviors based on user knowledge
consistency rather than fixed user goals
• Knowledge consistent models outperform
the probabilistic model when measured by
knowledge consistency measures
– Do not differ on high-level dialogue measures
– A similar approach should be applicable to other temporal user processes (e.g., forgetting)
Conclusions: The Big Picture
User Simulation
• How realistic?
  – Power of evaluation measures: discriminative ability [AAAI WS, 2006]
  – Impact of the source corpus: subjects vs. real users [SIGDial, 2007]
  – Human assessment: validation of evaluation [ACL, 2008]
• How useful? Task dependent
  – Dialogue System Evaluation: more realistic models via knowledge consistency [Interspeech, 2007]
  – Dialogue Strategy Learning: utility of realistic vs. exploratory models for reinforcement learning [NAACL, 2007]
Other ITSpoke Research
• Affect detection and adaptation in dialogue systems
– Annotated ITSpoke Corpus now available!
– https://learnlab.web.cmu.edu/datashop/index.jsp
• Reinforcement Learning
• Using NLP and psycholinguistics to predict learning
– Cohesion, alignment/convergence, semantics
More details:
http://www.cs.pitt.edu/~litman/itspoke.html
Questions?
Thank You!