Emotion and Speech
Techniques, models and results
Facts, fiction and opinions
Past, present and future
Acted, spontaneous, recollected
In Asia, Europe and America, and the Middle East
HUMAINE Workshop on Signals and Signs (WP4), Santorini, September 2004
1
Overview
- A short introduction to speech science … and speech analysis tools
- Speech and emotion: models, problems ... and results
- A review of open issues
- Deliverables within the HUMAINE framework
2
Part 1:
Speech science in a nutshell
3
A short introduction to SPEECH:
- Most of those present here are familiar with various aspects of signal processing
- For the benefit of those who aren't acquainted with the speech signal in particular:
  - We'll start with an overview of speech production models and analysis techniques
  - The rest of you can sleep for a few minutes
4
The speech signal
- A 1-D signal
  - Does that make it a simple one? NO…
  - There are many analysis techniques
  - As with many types of systems, parametric models are a very useful one here…
- A simple and very useful speech production model: the source/filter model
  (in case you're worried, we'll see that this is directly related to emotions also)
5
The source/filter model
- Components:
  - The lungs (create air pressure)
  - Two elements that turn this into a "raw" signal – the source:
    - The vocal folds (periodic signals)
    - Constrictions that make the airflow turbulent (noise)
  - The vocal tract – the filter:
    - Partly immobile: upper jaw, teeth
    - Partly mobile: soft palate, tongue, lips, lower jaw – also called "articulators"
    - Its influence on the raw signal can be modeled very well with a low-order (~10) digital filter (see the sketch below)
6
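To make the "low-order digital filter" idea concrete, here is a minimal sketch (mine, not part of the original slides) of the classic autocorrelation linear-prediction method; the frame length, order and windowing are illustrative assumptions.

    import numpy as np
    from scipy.linalg import solve_toeplitz
    from scipy.signal import lfilter

    def source_filter_split(frame, order=10):
        """Fit an all-pole 'vocal tract' filter to one ~20 ms frame by linear
        prediction (autocorrelation method) and return the prediction residual,
        which roughly plays the role of the 'source' in the source/filter model."""
        windowed = frame * np.hamming(len(frame))              # taper the frame
        r = np.correlate(windowed, windowed, mode="full")[len(windowed) - 1:]
        a = solve_toeplitz((r[:order], r[:order]), r[1:order + 1])
        inverse_filter = np.concatenate(([1.0], -a))           # A(z) = 1 - sum a_k z^-k
        residual = lfilter(inverse_filter, [1.0], windowed)    # rough source estimate
        return inverse_filter, residual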
The net result:
- A complex signal that changes its properties constantly:
  - Sometimes periodic
  - Sometimes colored noise
  - Approximately stationary over time windows of ~20 milliseconds
- And of course – it contains a great deal of information:
  - Text – linguistic information
  - Other stuff – paralinguistic information:
    - Speaker identity
    - Gender
    - Socioeconomic background
    - Stress, accent
    - Emotional state
    - Etc. …
7
How is this information coded?
- Textual information: mainly in the filter and the way it changes its properties over time
  - Filter "snapshots" are called segments
- Paralinguistic information: mainly in the source parameters
  - Lung pressure determines the intensity
  - Vocal fold periodicity determines the instantaneous frequency, or "pitch"
  - The configuration of the glottis determines the overall spectral tilt – "voice quality"
8
Prosody:
- Prosody is another name for part of the paralinguistic information, composed of:
  - Intonation – the way in which pitch changes over time
  - Intensity – changes in intensity over time
    - Problem: some segments are inherently weaker than others
  - Rhythm – segment durations vs. time
- Prosody does not include voice quality, but voice quality is also part of the paralinguistic information
9
To summarize:
- Speech science is at a mature stage
- The source/filter model is very useful in understanding speech production
- Many applications (speech recognition, speaker verification, emotion recognition, etc.) require extraction of the model parameters from the speech signal (an inverse problem)
- This is the domain of: speech analysis techniques
10
Part 2:
Speech analysis and classification
11
The large picture: speech analysis in the HUMAINE framework
- Speech analysis is just one component in the context of speech and emotion:
  [diagram: theory of emotion, real data, training data, speech analysis engine, high-level application]
- Its overall objectives:
  - Calculate raw speech parameters
  - Extract features salient to emotional content
  - Discard irrelevant features
  - Use them to characterize and maybe classify emotional speech
Signals to Signs – the process
[diagram of the knowledge-discovery pipeline: files and databases → data cleaning and integration → data warehouse → selection and transformation → data representation → data mining → patterns → evaluation and presentation → knowledge]
13
S2S (SOS…?) – the tools
- A combination of techniques that belong to different disciplines:
  - Data warehouse technologies (data storage, information retrieval, query answering, etc.)
  - Data preprocessing and handling
  - Data modeling / visualization
  - Machine learning (statistical data analysis, pattern recognition, information retrieval, etc.)
14
The objective of speech analysis techniques
1. To extract the raw model parameters from the speech signal
   - Interfering factors:
     - Reality never exactly fits the model
     - Background noise
     - Speaker overlap
2. To extract features
3. To interpret them in meaningful ways (pattern recognition)
   - Really hard!
15
It remains that …
- Useful models and techniques exist for extracting the various information types from the speech signal
- Yet … many applications such as speech recognition, speaker identification, speech synthesis, etc., are far from being perfected
- … So what about emotion?
16
For the moment – let's focus on the small picture
- The consensus is that emotions are coded in:
  - Prosody
  - Voice quality
  - And sometimes in the textual information
- Let's discuss the purely technical aspects of evaluating all of these …
17
Extracting features from the speech signal
- Stage 1 – extracting raw features:
  - Pitch
  - Intensity
  - Voice quality
  - Pauses
  - Segmental information – phones and their duration
  - Text
- (by the way … who extracts them – man, machine or both?)
18
Pitch
- Pitch: the instantaneous frequency
  - Sounds deceptively simple to find – but it isn't!
  - Lots of research has been devoted to pitch detection
- Composed of two sub-problems:
  - For a given signal – is there periodicity at all?
  - If so – what's the fundamental frequency?
- Complicating factors:
  - Speaker-related factors – hoarseness, diplophony, etc.
  - Background-related factors – noise, overlapping speakers, filters (as in telephony)
- In the context of emotions:
  - Small errors are acceptable
  - Large errors (octave jumps, false positives) are catastrophic
(a toy detector is sketched below)
19
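As an illustration of the two sub-problems (a voicing decision, then F0 estimation), here is a toy autocorrelation pitch detector; the thresholds and frequency range are arbitrary assumptions, and real detectors (e.g. the one in PRAAT) are far more robust.

    import numpy as np

    def detect_pitch(frame, sr, fmin=75.0, fmax=400.0, voicing_threshold=0.3):
        """Sub-problem 1: is the frame periodic at all?  Sub-problem 2: if so,
        what is the fundamental frequency?  Returns F0 in Hz, or None if unvoiced."""
        frame = frame - np.mean(frame)
        ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
        if ac[0] <= 0:
            return None                          # silence
        ac = ac / ac[0]                          # normalized autocorrelation
        lag_min, lag_max = int(sr / fmax), int(sr / fmin)
        lag = lag_min + np.argmax(ac[lag_min:lag_max])
        if ac[lag] < voicing_threshold:
            return None                          # no convincing periodicity
        return sr / lag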
An example:
[figure: a raw pitch contour in PRAAT, with detection errors marked]
20
Intensity
- Appears to be even simpler than pitch! Intensity is quite easy to measure …
- Yet it is the feature most influenced by unrelated factors!
- Aside from the speaker, intensity is heavily affected by:
  - Distance from the microphone
  - Gain settings in the recording equipment
    - Clipping
    - AGC
  - Background noise
  - Recording environment
- Without normalization, intensity is almost useless! (a sketch of one normalization follows below)
21
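A minimal sketch of frame-wise intensity plus a simple per-recording normalization (z-scoring the dB contour); the frame length, floor constant and normalization choice are my assumptions, not a standard recipe.

    import numpy as np

    def intensity_db(signal, sr, frame_ms=20):
        """Frame-wise RMS intensity in dB (arbitrary reference)."""
        hop = int(sr * frame_ms / 1000)
        frames = [signal[i:i + hop] for i in range(0, len(signal) - hop + 1, hop)]
        rms = np.array([np.sqrt(np.mean(np.square(f)) + 1e-12) for f in frames])
        return 20.0 * np.log10(rms)

    def normalize_intensity(db_contour):
        """Remove the recording-dependent offset and scale (microphone distance,
        gain settings) by z-scoring the contour within one recording."""
        return (db_contour - db_contour.mean()) / (db_contour.std() + 1e-12)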
Voice quality
- Several measures are used to quantify it:
  - Local irregularity in pitch and intensity
  - Ratio between harmonic components and noise components
  - Distribution of energy in the spectrum
- Affected by a multitude of factors other than emotions
- Some standardized measures are often used in clinical applications
- A large factor in emotional speech! (two of the irregularity measures are sketched below)
22
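For concreteness, here are the two simplest "local" irregularity measures (jitter for pitch, shimmer for amplitude), computed from already-extracted cycle periods and peak amplitudes; clinical tools such as PRAAT or MDVP define several variants, so treat these formulas as one common convention rather than the definition.

    import numpy as np

    def local_jitter(periods):
        """Mean absolute difference between consecutive pitch periods,
        relative to the mean period."""
        periods = np.asarray(periods, dtype=float)
        return np.mean(np.abs(np.diff(periods))) / np.mean(periods)

    def local_shimmer(peak_amplitudes):
        """The same measure applied to cycle peak amplitudes."""
        amps = np.asarray(peak_amplitudes, dtype=float)
        return np.mean(np.abs(np.diff(amps))) / np.mean(amps)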
Segments
- There are different ways of defining precisely what these are
- Automatic segmentation is difficult, though not as difficult as speech recognition
- Even the segment boundaries can give important timing information, related to rhythm – an important component of prosody
23
Text
- Is this "raw" data or not? Is it data … at all?
  - Some studies on emotion specifically eliminated this factor (filtered speech, uniform texts)
  - Other studies are interested mainly in text
- If we want to deal with text, we must keep in mind: automated speech recognition is HARD!
  - Especially with strong background noise
  - Especially when strong emotions are present, modifying the speakers' normal voices and mannerisms
  - Especially when dealing with multiple speakers
24
Some complicating factors in raw feature extraction:
- Background noise
- Speaker overlap
- Speaker variability
- Variability in recording equipment
25
In the general context of speech analysis …
- The raw features we discussed are not specific only to the study of emotion
- Yet issues related to calculating them reliably crop up again and again in emotion-related studies
- Some standard and reliable tools would be very helpful
26
Two opposing approaches to computing raw features:
- Ideal: assume we have perfect algorithms for extracting all this information
  - If we don't – help out manually
  - This can be carried out only over small databases
  - Useful in purely theoretical studies
- Real life: acknowledge we only have imperfect, error-prone algorithms
  - Find how to deal automatically with imperfect data
  - Very important for large databases
27
Next – what do we do with it all?
- Reminder: we have large amounts of raw data
- Now we have to make some meaning from it
28
Feature extraction …
- Stage 2 – data reduction:
  - Take a sea of numbers
  - Reduce it to a small number of meaningful measures
  - Prove they're meaningful
- An interesting way to look at it: separating the "signal" (e.g. emotion) from the "noise" (anything else)
29
An example of "Noise":
- Here pitch and intensity have totally unemotional (but important) roles:
[figure from Deller et al.]
30
Examples of high-level features
- Pitch fitting:
  - stylization
  - MoMel
  - parametric modeling
- Statistics
31
32
An example:
[figure: the raw pitch contour in PRAAT, with detection errors marked]
33
Patching it up a bit:
[figure: the patched pitch contour, 0–500 Hz, 0–3.4 s]
34
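The "patching" of a raw pitch track can be approximated crudely in code; this sketch (my own, not the procedure used for the figure) median-filters the voiced frames and discards any remaining near-octave jumps.

    import numpy as np
    from scipy.signal import medfilt

    def patch_pitch_contour(f0, max_jump_ratio=1.8):
        """Crude cleanup of a raw pitch track (NaN = unvoiced frame):
        smooth away isolated spikes, then drop frames that still jump
        by close to an octave relative to the previous voiced frame."""
        f0 = np.asarray(f0, dtype=float)
        patched = f0.copy()
        voiced = np.where(~np.isnan(f0))[0]
        if len(voiced) < 5:
            return patched
        # median filter over the voiced frames only (a simplification)
        patched[voiced] = medfilt(f0[voiced], kernel_size=5)
        for i, j in zip(voiced[:-1], voiced[1:]):
            ratio = patched[j] / patched[i]
            if ratio > max_jump_ratio or ratio < 1.0 / max_jump_ratio:
                patched[j] = np.nan                # treat as a detection error
        return patched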
One way to extract the essential information:
[figure: pitch stylization – IPO method, 0–500 Hz, 0–3.4 s]
Another way to extract the essential information:
[figure: MoMel]
35
Yet another way to extract the essential information:
[figure: MoMel]
36
Some observations:
- Different parameterizations give:
  - different curves
  - different features
- Yet: perceptually, they are all very similar
37
Questions:
- We can ask: what is the minimal or most representative information needed to capture the pitch contour?
- More importantly, though: what aspects of the pitch contour are most relevant to emotion?
38
Several answers appear in the literature:
- Statistical features taken from the raw contour:
  - Mean, variance, max, min, range, etc.
- Features taken from parameterized contours:
  - Slopes, "main" peaks and dips, etc.
(a sketch of the first family follows below)
39
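A sketch of the first family of answers: global statistics of the raw contour plus a crude overall slope from a straight-line fit. The feature names and the NaN-for-unvoiced convention are my assumptions.

    import numpy as np

    def pitch_contour_features(times, f0):
        """Coarse, whole-utterance pitch features: global statistics and the
        slope of a least-squares line fitted to the voiced frames."""
        times, f0 = np.asarray(times, dtype=float), np.asarray(f0, dtype=float)
        voiced = ~np.isnan(f0)
        t, f = times[voiced], f0[voiced]
        slope, _intercept = np.polyfit(t, f, deg=1)
        return {
            "mean": f.mean(), "std": f.std(),
            "min": f.min(), "max": f.max(), "range": f.max() - f.min(),
            "slope": slope,
        }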
There's not much time to go into:
- Intensity contours
- Spectra
- Duration
But the problems are very similar
40
The importance of time frames
- We have several measures that vary over time
- Over what time frame should we consider them?
- The meaning we attribute to speech parameters depends on the time frame over which they're considered:
  - Fixed-length windows
  - Phones
  - Words
  - "Intonation units"
  - "Tunes"
41
Which time frame is best?
- Fixed time frames of several seconds – simple to implement, but naïve
  - Very arbitrary
- Words
  - Need a recognizer to be marked
  - Probably the shortest meaningful frame
- "Intonation units"
  - Nobody knows exactly what they are (one "idea" per unit?)
  - Hard to measure
  - Correlate best with coherent stretches of speech
- "Tunes" – from one pause to the next (a sketch follows below)
  - Feasible to implement
  - Correlate to some extent with coherent stretches of speech
42
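Since "tunes" are described as feasible to implement, here is a rough pause-based splitter over a frame-wise intensity contour; the silence threshold and minimum pause length are arbitrary assumptions.

    import numpy as np

    def split_into_tunes(db_contour, frame_ms=20, silence_db=-40.0, min_pause_ms=200):
        """Cut a recording into 'tunes': stretches of speech separated by pauses,
        where a pause is a sufficiently long run of low-intensity frames.
        Returns (start_frame, end_frame) pairs."""
        min_pause = max(1, int(min_pause_ms / frame_ms))
        loud = np.asarray(db_contour) >= silence_db
        tunes, start, pause_run = [], None, min_pause
        for i, is_loud in enumerate(loud):
            if is_loud:
                if start is None:
                    start = i                      # a new tune begins
                pause_run = 0
            else:
                pause_run += 1
                if start is not None and pause_run >= min_pause:
                    tunes.append((start, i - pause_run + 1))
                    start = None                   # the tune was closed by a pause
        if start is not None:
            tunes.append((start, len(loud)))
        return tunes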
Why is this such an important decision?
- It might help us interpret our data correctly!
43
Therefore … the problem of feature extraction:
- Is NOT a general one
- We want features that are specifically relevant to emotional content …
- But before we get to that we have:
44
The Data Mining part
- Stage 3: to extract knowledge = previously unknown information (rules, constraints, regularities, patterns, etc.) from the features database
45
What are we mining?
- We look for patterns that either describe the stored data or infer from it (predictions)
  - Discrimination and comparison of features of different classes
  - Summarization and characterization (of the class of data that interests us)
[example charts: a feature value before vs. after a gamble for four speakers (Eran 25→20, Rafi 20→10, Haim 25→18, Yuval 20→15), over the features slope, pause, accent 1, accent 2 and duration]
46
Types of Analysis
- Association analysis – rules of the form X => Y (DB tuples that satisfy X are likely to satisfy Y), where X and Y are attribute–value pairs or sets of values
- Classification and class prediction – find a set of functions to describe and distinguish data classes/concepts, which can be used to predict the class of unlabeled data
- Cluster analysis (unsupervised clustering) – analyze the data when there are no class labels, to deal with new types of data and help group similar events together
47
Association Rules
- We search for interesting relationships among items in the data: A => B
- Interestingness measures:
  - Support = (# tuples that contain both A and B) / (# tuples) = P(A and B) – measures usefulness
  - Confidence = (# tuples that contain both A and B) / (# tuples that contain A) = P(B | A) – measures certainty
(a toy computation follows below)
48
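A toy computation of the two measures over a handful of tuples; the attribute names and values are made up purely for illustration.

    def support_and_confidence(tuples, antecedent, consequent):
        """Support = P(A and B); confidence = P(B | A), counted over the tuples.
        Each tuple is a set of 'attribute=value' items."""
        n = len(tuples)
        n_a = sum(1 for t in tuples if antecedent <= t)
        n_ab = sum(1 for t in tuples if (antecedent | consequent) <= t)
        support = n_ab / n if n else 0.0
        confidence = n_ab / n_a if n_a else 0.0
        return support, confidence

    # hypothetical usage: does a high pitch range imply high arousal?
    data = [{"range=high", "arousal=high"},
            {"range=high", "arousal=low"},
            {"range=low", "arousal=low"}]
    print(support_and_confidence(data, {"range=high"}, {"arousal=high"}))  # (0.333..., 0.5)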
Classification
- A two-step process:
  1. Use data tuples with known labels to construct a model
  2. Use the learned model to classify (assign labels to) new data
- Data is divided into two groups: training data and test data
  - Test data is used to estimate the predictive accuracy of the learned model
- Since the class label of each training sample is known, this is Supervised Learning
(a minimal sketch follows below)
49
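A minimal sketch of the two steps with scikit-learn; the feature matrix, labels, split ratio and the choice of an SVM are illustrative assumptions rather than a recommendation.

    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC
    from sklearn.metrics import accuracy_score

    def train_and_evaluate(features, labels):
        """Step 1: fit a model on the labeled training tuples.
        Step 2: estimate predictive accuracy on the held-out test tuples."""
        X_train, X_test, y_train, y_test = train_test_split(
            features, labels, test_size=0.25, random_state=0)
        model = SVC(kernel="rbf").fit(X_train, y_train)
        return model, accuracy_score(y_test, model.predict(X_test))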
Assets
- No need to know the rules in advance
- Some rules are not easily formulated as mathematical or logical expressions
- Similar to one of the ways humans learn
- Could be more robust to noise and incomplete data
- May require a lot of samples
- Learning depends on existing data only!
50

Dangers:
- The model might not be able to learn
- There might not be enough data
- Over-fitting the model to the training data
Algorithms:
- Machine learning (statistical learning)
- Expert systems
- Computational neuroscience
51
Prediction
- Classification predicts categorical labels
- Prediction models a continuous-valued function
- It is usually used to predict the value, or a range of values, of an attribute of a given sample
  - Regression
  - Neural networks
(a small regression sketch follows below)
52
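A small sketch of prediction by regression; the continuous target (a hypothetical rated activation level) and the linear model are assumptions for illustration only.

    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_absolute_error
    from sklearn.model_selection import train_test_split

    def fit_activation_predictor(features, activation_ratings):
        """Prediction rather than classification: model a continuous attribute
        (here, a hypothetical activation rating) from the speech features."""
        X_tr, X_te, y_tr, y_te = train_test_split(
            features, activation_ratings, test_size=0.25, random_state=0)
        model = LinearRegression().fit(X_tr, y_tr)
        return model, mean_absolute_error(y_te, model.predict(X_te))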
Clustering
- Constructing models for assigning class labels to data that is unlabeled
- Unsupervised learning
- Clustering is an ill-defined task
- Once clusters are discovered, the clustering model can be used for predicting labels of new data
- Alternatively, the clusters can be used as labels to train a supervised classification algorithm (sketched below)
53
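A sketch of that last idea under obvious assumptions (k-means with a guessed number of clusters, a nearest-neighbour classifier); it only illustrates the mechanics of reusing cluster indices as labels.

    from sklearn.cluster import KMeans
    from sklearn.neighbors import KNeighborsClassifier

    def cluster_then_classify(unlabeled_features, n_clusters=4):
        """Discover clusters in unlabeled data, then reuse the cluster indices
        as pseudo-labels to train a supervised classifier for new samples."""
        clusterer = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
        pseudo_labels = clusterer.fit_predict(unlabeled_features)
        classifier = KNeighborsClassifier(n_neighbors=5)
        classifier.fit(unlabeled_features, pseudo_labels)
        return clusterer, classifier     # classifier.predict() labels new data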
So how does this technical mumbo jumbo tie into –
54
Part 3:
Speech and emotion
55
Speech and emotion
- Emotion can affect speech in many ways:
  - Consciously
  - Unconsciously
  - Through the autonomic nervous system
- Examples:
  - Textual content is usually consciously chosen, except maybe sudden interjections, which may stem from sudden or strong emotions
  - Many speech patterns related to emotions are strongly ingrained – therefore, though they can be controlled by the speaker, most often they are not, unless the speaker tries to modify them consciously
  - Certain speech characteristics are affected by the degree of arousal, and are therefore nearly impossible to inhibit (e.g. vocal tremor due to grief)
56
Speech analysis: the big picture – again
- Speech analysis is just one component in the context of speech and emotion:
  [diagram: databases, real data, speech analysis, application]
57
Is this just another way to spread the blame?
- Us speech analysis guys are just poor little engineers
- The methods we can supply can be no better than the theory and the data that drive them
- … and unfortunately, the jury is still out on both of those points … or not?
  - Ask the WP3 and WP5 people – they're here somewhere …
- Actually, one of the difficulties HUMAINE is intended to ease is that researchers in the field often find themselves having to address all of the above! (guilty)
58
The most fundamental problem:
- What are the features that signify emotion? To paraphrase – what signals are signs of emotion?
59
The most common solutions:
- Calculate as many as you can think of
- Intuition
- Theory-based answers
- Data-driven answers
- Ha! Once more – it's not our fault!
60
What seems to be the most plausible approach …
- The data-driven approach
- Requiring:
  - Emotional speech databases ("corpora")
  - Perceptual evaluation of these databases
  - This is then correlated with speech features
- Which takes us back to a previous square
61
So tell us already – how does emotion influence speech?
- … It seems that the answer depends on how you look for it
- As hinted before, the answer cannot really be separated from:
  - The theories of emotion
  - The databases we have of emotional speech
    - Who the subjects are
    - How emotion was elicited
62
63
A short digression …
- Will all the speech clinicians in the audience please stand up?
- Hmm … we don't seem to have that many
- Let's look at what one of them has to say
64
Emotions in the speech clinic
- Some speakers have speech/voice problems that modify their "signal", thus misleading the listener
- VOICE – people with vocal instability (high jitter/shimmer/tremor) are clinically perceived as nervous (although the problems reflect irregularity in the vocal folds)
  - Breathy voice (in women) is sometimes perceived as "sexy" (while it actually reflects incomplete adduction of the vocal folds)
  - A higher excitation level leads to vocal instability (high jitter/shimmer/tremor)
65
Clinical examples:
- STUTTERING – listeners judge people who stutter as nervous, tense, and less confident (identification of stuttering depends on pause duration within the "repetition units", and on the rate of repetitions)
- CLUTTERING – listeners judge people who clutter as nervous and less intelligent
So, though this is a WP4 meeting …
- It's impossible to avoid talking about WP3 (theory of emotion) and WP5 (databases) issues
- The signs we're looking for can never be separated from the questions:
  - Signs of what (emotions)?
  - Signs in what (data)?
- May God and Phillipe Gelin forgive me …
66
A not-so-old example: (Murray and Arnott, 1993)
- Very qualitative
- Presupposes dealing with primary emotions
67
BUT …
- If you expect more recent results to give more detailed descriptive outlines … then you're wrong
- The data-driven approaches use a large number of features, and let the computer sort them out (see the sketch below)
  - 32 significant features found by ASSESS, from the initial 375 used
  - 5 emotions, acted
  - 55% recognition
68
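To show what "letting the computer sort them out" can look like mechanically, here is a feature-selection sketch; it uses a simple ANOVA F-score criterion, which is not the procedure ASSESS itself used, and the matrices and k=32 are purely illustrative.

    from sklearn.feature_selection import SelectKBest, f_classif

    def select_salient_features(features, emotion_labels, k=32):
        """Keep the k features whose ANOVA F-score against the emotion labels
        is highest, and report which of the original columns survived."""
        selector = SelectKBest(score_func=f_classif, k=k)
        reduced = selector.fit_transform(features, emotion_labels)
        return reduced, selector.get_support(indices=True)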
Some remarks:
- Some features are indicative, even though we probably don't use them perceptually
  - e.g. pitch mean: usually this is raised with higher activation
  - But we don't have to know the speaker's neutral mean to perceive heightened activation
  - My guess: voice quality is what we perceive in such cases
- How "simple" can characterization of emotions become?
  - How many features do we listen for?
  - Can this be verified?
69
Time intervals
- This issue becomes more and more important as we go towards "natural" data
- Emotion production: how long do emotions last?
  - Full-blown emotions are usually short (but not always! Look at Peguy in the LIMSI interview database)
  - Moods, or pervasive emotions, are subtle but long-lasting
- Emotion analysis: over what span of speech are they easiest to detect?
70
From the analysis viewpoint:
- Current efforts seem to be focusing on methods that aim to use time spans that have some inherent meaning:
  - Acoustically (ASSESS – Cowie et al.)
  - Linguistically (Batliner et al.)
- We mentioned that prosody carries:
  - emotional information (our "signal")
  - other information ("noise"): phrasing, various types of prominence
- BUT …
71
Why I like intonation units
- Spontaneous speech is organized differently from written language
  - "Sentences" and "paragraphs" don't really exist there
  - Prosodic markers help replace various written markers
- Phrasing is a loose phrase for … "intonation units"
  - Theoretical linguists love to discuss what they are
  - An exact definition is as hard to find as it is to parse spontaneous speech
- Maybe emotion is not an "orthogonal" bit of information on top of these (the signal+noise model)
- If emotion modifies these, it would be very useful if we could identify the prosodic markers we use and the ways we modify them when we're emotional
- Problem: engineers don't like ill-defined concepts!
  - But emotion is one of them too, isn't it?
72
Just to provoke some thought:
- From a paper on animation (think of it – these guys have to integrate speech and image to make them fit naturally):
  "… speech consists of a sequence of intonation phrases. Each intonation phrase is realized with fluid, continuous articulation and a single point of maximum emphasis. Boundaries between successive phrases are associated with perceived disjuncture and are marked in English with cues such as pitch movements … Gestures are performed in units that coincide with these intonation phrases, and points of prominence in gestures also coincide with the emphasis in the concurrent speech …"
  [Stone et al., SIGGRAPH 2004]
73
We haven't even discussed WP3 issues …
- What are the scales/categories?
  - Possibility 1: emotional labeling
  - Possibility 2: psychological scales (such as valence/activation – e.g. Feeltrace)
- QUESTION: which is more directly related to speech features?
- Hopefully we'll hammer out a tentative answer by Tuesday …
74
Part 4:
Current results
75
Evaluating results
- Results often demonstrate how elusive the solution is …
- Consider a similar problem: speech recognition
  - To evaluate results: make recordings, submit them to an algorithm, measure the recognition rate!
- Emotion recognition results are far more difficult to quantify
  - Heavily dependent on induction techniques and labeling methods
76
Several popular contexts:
- Acted prototypical emotions
- Call center data
  - Real
  - WoZ type
- Media (radio, TV) based data
- Narrative speech (event recollection)
- Synthesized speech (Montero, Gobl)
- Most of these methods can be placed on the spectrum between:
  - Acted, full-blown bursts of stereotypical emotions
  - Fully natural mixtures of mood, affect and bursts of difficult-to-label emotions, recorded in noisy environments
77
Call centers
- A real-life scenario (with commercial interests …)!
- Sparse emotional content:
  - Controlled (usually)
  - Negative (usually)
- Lends itself easily to WOZ scenarios
78
Ang et al., 2002
- Standardized call-center data from 3 different sources
- Uninvolved users, true HMI interaction
- Detects neutral/annoyance/frustration
- Mostly automatic extraction, with some additional human labeling
- Defines human "accuracy" as 75%
  - But this is actually the percentage of human consensus
  - Machine accuracy is comparable
- A possible measure: maybe "accuracy" is where users wanted human intervention
79
Batliner et al.
- Professional acting, amateur acting, WOZ scenario
  - the latter with uninvolved users, true HMI interaction
- Detects trouble in communication
  - Much thought was given to this definition!
- Combines prosodic features with others:
  - POS labels
  - Syntactic boundaries
- Overall, shows a typical result: the closer we get to "real" scenarios, the more difficult the problem becomes!
  - Up to 95% on acted speech
  - Up to 79% on read speech
  - Up to 73% on WOZ data
80
Devillers et al.
- Real call-center data
  - Human–human interaction, involved users
  - Contains also fear (of losing money!)
- Human accuracy of 75% is reported
  - Is this, as in Ang, the degree of human agreement?
- Use a small number of intonation features
  - Treat pauses and filled pauses separately
- Some results:
  - Different behavior between clients and agents, males and females
  - Was classification attempted also?
81
Games and simulators
- These provide an extremely interesting setting
- Participants can often be found to experience real emotions
- The experimenter can sometimes control these to a certain extent
  - Such as driving conditions or additional tasks in a driving simulator
82
Fernandez & Picard (2000)
- Subjects did math problems while driving a simulator
  - This was supposed to induce stress
- Spectral features were used
  - No prosody at all!
- Advanced classifiers were applied
- Results were inconsistent across users, raising a familiar question: is it the classifier, or is it the data?
83
Kehrein (2002)
- 2 subjects in 2 separate rooms:
  - One had instructions
  - One had a set of Lego building blocks
  - The first had to explain to the other what to construct
- A wide range of "natural" emotions was reported
  - His thesis is in German …
- No classification was attempted
84
Acted speech
- Widely used
- An ever-recurring question: does it reflect the way emotions are expressed in spontaneous speech?
85
McGilloway et al.
- ASSESS used for feature extraction
- Speech read by non-professionals
- Emotion-evoking texts
- Categories: sadness, happiness, fear, anger, neutral
- Up to 55% recognition
86
Recalled emotions
- Subjects are asked to recall emotional episodes and describe them
- Data is composed of long narratives
- It isn't clear whether subjects actually re-experience these emotions or just recount them as "observers"
- Can contain good instances of low-key pervasive emotions
87
Ron and Amir
- Ongoing work …
88
Part 5:
Open issues
89
Robust raw feature extraction
- Pitch and VAD (voice activity detection)
- Intensity (normalization)
- Vocal quality
- Duration – is this still an open problem?
90
Determination of time intervals
- This might have to be addressed on a theoretical vs. practical level:
  - Phones?
  - Words?
  - Tunes?
  - Intonation units?
  - Fixed-length intervals?
91
Feature extraction
- Which features are most relevant to emotion?
- How do we separate noise (speaker mannerisms, culture, language, etc.) from the signals of emotion?
92
Part 6:
HUMAINE Deliverables
93
Tangible results we are expected to deliver:
- Tools
- Exemplars
94
Tools:
- Something along the lines of: solutions to parts of the problem that people can actually download and use right off
95
Exemplars:
- These should cover a wide scope:
  - Concepts
  - Methodologies
  - Knowledge pools – tutorials, reviews, etc.
  - Complete solutions to "reduced" problems
  - Test-bed systems
  - Designs for future systems/applications
96
Tools – suggestions:
- Useful feature extractors:
  - Robust pitch detection and smoothing methods
- Public-domain segment/speech recognizers
- Synthesis engines or parts thereof
  - E.g. emotional prosody generators
- Classifying engines
97
Exemplars – suggestions:
- Knowledge bases …
  - A taxonomy of speech features
    - Papers (especially short ones) say what we used
    - What about why? And what we didn't use?
    - What about what we wished we had?
- Test-bed systems …
  - A working modular SAL (credit to Marc Schroeder)
    - Embodies analysis, classification, synthesis, emotion induction/data collection … like a breeder nuclear reactor!
    - Parts of it already exist
    - Human parts can be replaced by automated ones as they develop
98
Exemplars – suggestions (cont.):
- More focused systems –
  - Call center systems
    - Deal with sparse emotional content – emotions vary over a relatively small range
  - Standardized (provocative?) data
    - Exemplifying difficulties on different levels: feature extraction, emotion classification
    - Maybe in conjunction with WP5
- Integration
  - Demonstrations of how different modalities can complement/enhance each other
99
How do we get useful info from WP3 and WP5?
- Categories
- Scales
- Models (pervasive, burst, etc.)
100
What is it realistic to expect?
- Useful info from other workgroups
  - WP3:
    - Models of emotional behavior in different contexts
    - Definite scales and categories for measuring it
  - WP5:
    - Databases embodying the above
    - Data which exemplifies the scale from clearly identifiable … to … difficult to identify
101
What is it realistic to expect?
- Exemplars that show:
  - Some of the problems that are easier to solve
  - The many problems that are difficult to solve
  - Directions for useful further research
  - How not to repeat previous errors
102
Some personal thoughts
- Oversimplification is a common pitfall to be avoided
- Looking at real data, one finds that emotion is often:
  - Difficult to describe in simple terms
  - Jumping between modalities (text might be considered a separate modality)
  - Extremely dependent on context, character, settings, personality
- A task so complex for humans cannot be easy for machines!
103
Summary
- Speech is a major channel for signaling emotional information
  - And lots of other information too
- HUMAINE will not solve all the issues involved
  - We should focus on those that can benefit most from the expertise and collaboration of its members
- Examining multiple modalities can prove extremely interesting
104