Emotion and Speech
Techniques, models and results
Facts, fiction and opinions
Past, present and future
Acted, spontaneous, recollected
In Asia, Europe and America, and the Middle East
HUMAINE Workshop on Signals and Signs (WP4), Santorini, September 2004
1
Overview
- A short introduction to speech science … and speech analysis tools
- Speech and emotion: models, problems ... and results
- A review of open issues
- Deliverables within the HUMAINE framework
2
Part 1:
Speech science in a nutshell
3
A short introduction to SPEECH:
- Most of those present here are familiar with various aspects of signal processing
- For the benefit of those who aren't acquainted with the speech signal in particular:
  - We'll start with an overview of speech production models and analysis techniques
  - The rest of you can sleep for a few minutes
4
The speech signal
- A 1-D signal
  - Does that make it a simple one? NO…
  - There are many analysis techniques
  - As with many types of systems, parametric models are a very useful one here…
- A simple and very useful speech production model: the source/filter model
  (in case you're worried, we'll see that this is directly related to emotions also)
5
The source/filter model
- Components:
  - The lungs (create air pressure)
  - Two elements that turn this into a "raw" signal – the source:
    - The vocal folds (periodic signals)
    - Constrictions that make the airflow turbulent (noise)
  - The vocal tract – the filter:
    - Partly immobile: upper jaw, teeth
    - Partly mobile: soft palate, tongue, lips, lower jaw – also called "articulators"
    - Its influence on the raw signal can be modeled very well with a low-order (~10) digital filter (see the sketch below)
6
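To make the "low-order digital filter" idea concrete, here is a minimal sketch (mine, not part of the original slides) of the classic autocorrelation linear-prediction method; the frame length, order and windowing are illustrative assumptions.

    import numpy as np
    from scipy.linalg import solve_toeplitz
    from scipy.signal import lfilter

    def source_filter_split(frame, order=10):
        """Fit an all-pole 'vocal tract' filter to one ~20 ms frame by linear
        prediction (autocorrelation method) and return the prediction residual,
        which roughly plays the role of the 'source' in the source/filter model."""
        windowed = frame * np.hamming(len(frame))              # taper the frame
        r = np.correlate(windowed, windowed, mode="full")[len(windowed) - 1:]
        a = solve_toeplitz((r[:order], r[:order]), r[1:order + 1])
        inverse_filter = np.concatenate(([1.0], -a))           # A(z) = 1 - sum a_k z^-k
        residual = lfilter(inverse_filter, [1.0], windowed)    # rough source estimate
        return inverse_filter, residual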
The net result:
- A complex signal that changes its properties constantly:
  - Sometimes periodic
  - Sometimes colored noise
  - Approximately stationary over time windows of ~20 milliseconds
- And of course – it contains a great deal of information:
  - Text – linguistic information
  - Other stuff – paralinguistic information:
    - Speaker identity
    - Gender
    - Socioeconomic background
    - Stress, accent
    - Emotional state
    - Etc. …
7
How is this information coded?
- Textual information: mainly in the filter and the way it changes its properties over time
  - Filter "snapshots" are called segments
- Paralinguistic information: mainly in the source parameters
  - Lung pressure determines the intensity
  - Vocal fold periodicity determines the instantaneous frequency, or "pitch"
  - The configuration of the glottis determines the overall spectral tilt – "voice quality"
8
Prosody:
- Prosody is another name for part of the paralinguistic information, composed of:
  - Intonation – the way in which pitch changes over time
  - Intensity – changes in intensity over time
    - Problem: some segments are inherently weaker than others
  - Rhythm – segment durations vs. time
- Prosody does not include voice quality, but voice quality is also part of the paralinguistic information
9
To summarize:
- Speech science is at a mature stage
- The source/filter model is very useful in understanding speech production
- Many applications (speech recognition, speaker verification, emotion recognition, etc.) require extraction of the model parameters from the speech signal (an inverse problem)
- This is the domain of: speech analysis techniques
10
Part 2:
Speech analysis and classification
11
The large picture: speech analysis in the HUMAINE framework
- Speech analysis is just one component in the context of speech and emotion:
  [diagram: theory of emotion, real data, training data, speech analysis engine, high-level application]
- Its overall objectives:
  - Calculate raw speech parameters
  - Extract features salient to emotional content
  - Discard irrelevant features
  - Use them to characterize and maybe classify emotional speech
Signals to Signs – the process
[diagram of the knowledge-discovery pipeline: files and databases → data cleaning and integration → data warehouse → selection and transformation → data representation → data mining → patterns → evaluation and presentation → knowledge]
13
S2S (SOS…?) – the tools
- A combination of techniques that belong to different disciplines:
  - Data warehouse technologies (data storage, information retrieval, query answering, etc.)
  - Data preprocessing and handling
  - Data modeling / visualization
  - Machine learning (statistical data analysis, pattern recognition, information retrieval, etc.)
14
The objective of speech analysis techniques
1. To extract the raw model parameters from the speech signal
   - Interfering factors:
     - Reality never exactly fits the model
     - Background noise
     - Speaker overlap
2. To extract features
3. To interpret them in meaningful ways (pattern recognition)
   - Really hard!
15
It remains that …
- Useful models and techniques exist for extracting the various information types from the speech signal
- Yet … many applications such as speech recognition, speaker identification, speech synthesis, etc., are far from being perfected
- … So what about emotion?
16
For the moment – let's focus on the small picture
- The consensus is that emotions are coded in:
  - Prosody
  - Voice quality
  - And sometimes in the textual information
- Let's discuss the purely technical aspects of evaluating all of these …
17
Extracting features from the speech signal
- Stage 1 – extracting raw features:
  - Pitch
  - Intensity
  - Voice quality
  - Pauses
  - Segmental information – phones and their duration
  - Text
- (by the way … who extracts them – man, machine or both?)
18
Pitch
- Pitch: the instantaneous frequency
  - Sounds deceptively simple to find – but it isn't!
  - Lots of research has been devoted to pitch detection
- Composed of two sub-problems:
  - For a given signal – is there periodicity at all?
  - If so – what's the fundamental frequency?
- Complicating factors:
  - Speaker-related factors – hoarseness, diplophony, etc.
  - Background-related factors – noise, overlapping speakers, filters (as in telephony)
- In the context of emotions:
  - Small errors are acceptable
  - Large errors (octave jumps, false positives) are catastrophic
(a toy detector is sketched below)
19
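As an illustration of the two sub-problems (a voicing decision, then F0 estimation), here is a toy autocorrelation pitch detector; the thresholds and frequency range are arbitrary assumptions, and real detectors (e.g. the one in PRAAT) are far more robust.

    import numpy as np

    def detect_pitch(frame, sr, fmin=75.0, fmax=400.0, voicing_threshold=0.3):
        """Sub-problem 1: is the frame periodic at all?  Sub-problem 2: if so,
        what is the fundamental frequency?  Returns F0 in Hz, or None if unvoiced."""
        frame = frame - np.mean(frame)
        ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
        if ac[0] <= 0:
            return None                          # silence
        ac = ac / ac[0]                          # normalized autocorrelation
        lag_min, lag_max = int(sr / fmax), int(sr / fmin)
        lag = lag_min + np.argmax(ac[lag_min:lag_max])
        if ac[lag] < voicing_threshold:
            return None                          # no convincing periodicity
        return sr / lag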
An example:
[figure: a raw pitch contour in PRAAT, with detection errors marked]
20
Intensity
- Appears to be even simpler than pitch! Intensity is quite easy to measure …
- Yet it is the feature most influenced by unrelated factors!
- Aside from the speaker, intensity is heavily affected by:
  - Distance from the microphone
  - Gain settings in the recording equipment
    - Clipping
    - AGC
  - Background noise
  - Recording environment
- Without normalization, intensity is almost useless! (a sketch of one normalization follows below)
21
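A minimal sketch of frame-wise intensity plus a simple per-recording normalization (z-scoring the dB contour); the frame length, floor constant and normalization choice are my assumptions, not a standard recipe.

    import numpy as np

    def intensity_db(signal, sr, frame_ms=20):
        """Frame-wise RMS intensity in dB (arbitrary reference)."""
        hop = int(sr * frame_ms / 1000)
        frames = [signal[i:i + hop] for i in range(0, len(signal) - hop + 1, hop)]
        rms = np.array([np.sqrt(np.mean(np.square(f)) + 1e-12) for f in frames])
        return 20.0 * np.log10(rms)

    def normalize_intensity(db_contour):
        """Remove the recording-dependent offset and scale (microphone distance,
        gain settings) by z-scoring the contour within one recording."""
        return (db_contour - db_contour.mean()) / (db_contour.std() + 1e-12)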
Voice quality
- Several measures are used to quantify it:
  - Local irregularity in pitch and intensity
  - Ratio between harmonic components and noise components
  - Distribution of energy in the spectrum
- Affected by a multitude of factors other than emotions
- Some standardized measures are often used in clinical applications
- A large factor in emotional speech! (two of the irregularity measures are sketched below)
22
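For concreteness, here are the two simplest "local" irregularity measures (jitter for pitch, shimmer for amplitude), computed from already-extracted cycle periods and peak amplitudes; clinical tools such as PRAAT or MDVP define several variants, so treat these formulas as one common convention rather than the definition.

    import numpy as np

    def local_jitter(periods):
        """Mean absolute difference between consecutive pitch periods,
        relative to the mean period."""
        periods = np.asarray(periods, dtype=float)
        return np.mean(np.abs(np.diff(periods))) / np.mean(periods)

    def local_shimmer(peak_amplitudes):
        """The same measure applied to cycle peak amplitudes."""
        amps = np.asarray(peak_amplitudes, dtype=float)
        return np.mean(np.abs(np.diff(amps))) / np.mean(amps)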
Segments
- There are different ways of defining precisely what these are
- Automatic segmentation is difficult, though not as difficult as speech recognition
- Even the segment boundaries can give important timing information, related to rhythm – an important component of prosody
23
Text
- Is this "raw" data or not? Is it data … at all?
  - Some studies on emotion specifically eliminated this factor (filtered speech, uniform texts)
  - Other studies are interested mainly in text
- If we want to deal with text, we must keep in mind: automated speech recognition is HARD!
  - Especially with strong background noise
  - Especially when strong emotions are present, modifying the speakers' normal voices and mannerisms
  - Especially when dealing with multiple speakers
24
Some complicating factors in raw feature extraction:
- Background noise
- Speaker overlap
- Speaker variability
- Variability in recording equipment
25
In the general context of speech analysis …
- The raw features we discussed are not specific only to the study of emotion
- Yet issues related to calculating them reliably crop up again and again in emotion-related studies
- Some standard and reliable tools would be very helpful
26
Two opposing approaches to computing raw features:
- Ideal: assume we have perfect algorithms for extracting all this information
  - If we don't – help out manually
  - This can be carried out only over small databases
  - Useful in purely theoretical studies
- Real life: acknowledge we only have imperfect, error-prone algorithms
  - Find how to deal automatically with imperfect data
  - Very important for large databases
27
Next – what do we do with it all?
- Reminder: we have large amounts of raw data
- Now we have to make some meaning from it
28
Feature extraction …
- Stage 2 – data reduction:
  - Take a sea of numbers
  - Reduce it to a small number of meaningful measures
  - Prove they're meaningful
- An interesting way to look at it: separating the "signal" (e.g. emotion) from the "noise" (anything else)
29
An example of "Noise":
- Here pitch and intensity have totally unemotional (but important) roles:
[figure from Deller et al.]
30
Examples of high-level features
- Pitch fitting:
  - stylization
  - MoMel
  - parametric modeling
- Statistics
31
32
An example:
[figure: the raw pitch contour in PRAAT, with detection errors marked]
33
Patching it up a bit:
[figure: the patched pitch contour, 0–500 Hz, 0–3.4 s]
34
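The "patching" of a raw pitch track can be approximated crudely in code; this sketch (my own, not the procedure used for the figure) median-filters the voiced frames and discards any remaining near-octave jumps.

    import numpy as np
    from scipy.signal import medfilt

    def patch_pitch_contour(f0, max_jump_ratio=1.8):
        """Crude cleanup of a raw pitch track (NaN = unvoiced frame):
        smooth away isolated spikes, then drop frames that still jump
        by close to an octave relative to the previous voiced frame."""
        f0 = np.asarray(f0, dtype=float)
        patched = f0.copy()
        voiced = np.where(~np.isnan(f0))[0]
        if len(voiced) < 5:
            return patched
        # median filter over the voiced frames only (a simplification)
        patched[voiced] = medfilt(f0[voiced], kernel_size=5)
        for i, j in zip(voiced[:-1], voiced[1:]):
            ratio = patched[j] / patched[i]
            if ratio > max_jump_ratio or ratio < 1.0 / max_jump_ratio:
                patched[j] = np.nan                # treat as a detection error
        return patched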
One way to extract the essential information:
[figure: pitch stylization – IPO method, 0–500 Hz, 0–3.4 s]
Another way to extract the essential information:
[figure: MoMel]
35
Yet another way to extract the essential information:
[figure: MoMel]
36
Some observations:
- Different parameterizations give:
  - different curves
  - different features
- Yet: perceptually, they are all very similar
37
Questions:
- We can ask: what is the minimal or most representative information needed to capture the pitch contour?
- More importantly, though: what aspects of the pitch contour are most relevant to emotion?
38
Several answers appear in the literature:
- Statistical features taken from the raw contour:
  - Mean, variance, max, min, range, etc.
- Features taken from parameterized contours:
  - Slopes, "main" peaks and dips, etc.
(a sketch of the first family follows below)
39
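A sketch of the first family of answers: global statistics of the raw contour plus a crude overall slope from a straight-line fit. The feature names and the NaN-for-unvoiced convention are my assumptions.

    import numpy as np

    def pitch_contour_features(times, f0):
        """Coarse, whole-utterance pitch features: global statistics and the
        slope of a least-squares line fitted to the voiced frames."""
        times, f0 = np.asarray(times, dtype=float), np.asarray(f0, dtype=float)
        voiced = ~np.isnan(f0)
        t, f = times[voiced], f0[voiced]
        slope, _intercept = np.polyfit(t, f, deg=1)
        return {
            "mean": f.mean(), "std": f.std(),
            "min": f.min(), "max": f.max(), "range": f.max() - f.min(),
            "slope": slope,
        }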
There's not much time to go into:
- Intensity contours
- Spectra
- Duration
But the problems are very similar
40
The importance of time frames
- We have several measures that vary over time
- Over what time frame should we consider them?
- The meaning we attribute to speech parameters depends on the time frame over which they're considered:
  - Fixed-length windows
  - Phones
  - Words
  - "Intonation units"
  - "Tunes"
41
Which time frame is best?
- Fixed time frames of several seconds – simple to implement, but naïve
  - Very arbitrary
- Words
  - Need a recognizer to be marked
  - Probably the shortest meaningful frame
- "Intonation units"
  - Nobody knows exactly what they are (one "idea" per unit?)
  - Hard to measure
  - Correlate best with coherent stretches of speech
- "Tunes" – from one pause to the next (a sketch follows below)
  - Feasible to implement
  - Correlate to some extent with coherent stretches of speech
42
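Since "tunes" are described as feasible to implement, here is a rough pause-based splitter over a frame-wise intensity contour; the silence threshold and minimum pause length are arbitrary assumptions.

    import numpy as np

    def split_into_tunes(db_contour, frame_ms=20, silence_db=-40.0, min_pause_ms=200):
        """Cut a recording into 'tunes': stretches of speech separated by pauses,
        where a pause is a sufficiently long run of low-intensity frames.
        Returns (start_frame, end_frame) pairs."""
        min_pause = max(1, int(min_pause_ms / frame_ms))
        loud = np.asarray(db_contour) >= silence_db
        tunes, start, pause_run = [], None, min_pause
        for i, is_loud in enumerate(loud):
            if is_loud:
                if start is None:
                    start = i                      # a new tune begins
                pause_run = 0
            else:
                pause_run += 1
                if start is not None and pause_run >= min_pause:
                    tunes.append((start, i - pause_run + 1))
                    start = None                   # the tune was closed by a pause
        if start is not None:
            tunes.append((start, len(loud)))
        return tunes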
Why is this such an important decision?
- It might help us interpret our data correctly!
43
Therefore … the problem of feature extraction:
- Is NOT a general one
- We want features that are specifically relevant to emotional content …
- But before we get to that we have:
44
The Data Mining part
- Stage 3: to extract knowledge = previously unknown information (rules, constraints, regularities, patterns, etc.) from the features database
45
What are we mining?
- We look for patterns that either describe the stored data or infer from it (predictions)
  - Discrimination and comparison of features of different classes
  - Summarization and characterization (of the class of data that interests us)
[example charts: a feature value before vs. after a gamble for four speakers (Eran 25→20, Rafi 20→10, Haim 25→18, Yuval 20→15), over the features slope, pause, accent 1, accent 2 and duration]
46
Types of Analysis
- Association analysis – rules of the form X => Y (DB tuples that satisfy X are likely to satisfy Y), where X and Y are attribute–value pairs or sets of values
- Classification and class prediction – find a set of functions to describe and distinguish data classes/concepts, which can be used to predict the class of unlabeled data
- Cluster analysis (unsupervised clustering) – analyze the data when there are no class labels, to deal with new types of data and help group similar events together
47
Association Rules
- We search for interesting relationships among items in the data: A => B
- Interestingness measures:
  - Support = (# tuples that contain both A and B) / (# tuples) = P(A and B) – measures usefulness
  - Confidence = (# tuples that contain both A and B) / (# tuples that contain A) = P(B | A) – measures certainty
(a toy computation follows below)
48
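A toy computation of the two measures over a handful of tuples; the attribute names and values are made up purely for illustration.

    def support_and_confidence(tuples, antecedent, consequent):
        """Support = P(A and B); confidence = P(B | A), counted over the tuples.
        Each tuple is a set of 'attribute=value' items."""
        n = len(tuples)
        n_a = sum(1 for t in tuples if antecedent <= t)
        n_ab = sum(1 for t in tuples if (antecedent | consequent) <= t)
        support = n_ab / n if n else 0.0
        confidence = n_ab / n_a if n_a else 0.0
        return support, confidence

    # hypothetical usage: does a high pitch range imply high arousal?
    data = [{"range=high", "arousal=high"},
            {"range=high", "arousal=low"},
            {"range=low", "arousal=low"}]
    print(support_and_confidence(data, {"range=high"}, {"arousal=high"}))  # (0.333..., 0.5)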
Classification
- A two-step process:
  1. Use data tuples with known labels to construct a model
  2. Use the learned model to classify (assign labels to) new data
- Data is divided into two groups: training data and test data
  - Test data is used to estimate the predictive accuracy of the learned model
- Since the class label of each training sample is known, this is Supervised Learning
(a minimal sketch follows below)
49
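A minimal sketch of the two steps with scikit-learn; the feature matrix, labels, split ratio and the choice of an SVM are illustrative assumptions rather than a recommendation.

    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC
    from sklearn.metrics import accuracy_score

    def train_and_evaluate(features, labels):
        """Step 1: fit a model on the labeled training tuples.
        Step 2: estimate predictive accuracy on the held-out test tuples."""
        X_train, X_test, y_train, y_test = train_test_split(
            features, labels, test_size=0.25, random_state=0)
        model = SVC(kernel="rbf").fit(X_train, y_train)
        return model, accuracy_score(y_test, model.predict(X_test))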
Assets
- No need to know the rules in advance
- Some rules are not easily formulated as mathematical or logical expressions
- Similar to one of the ways humans learn
- Could be more robust to noise and incomplete data
- May require a lot of samples
- Learning depends on existing data only!
50

Dangers:
- The model might not be able to learn
- There might not be enough data
- Over-fitting the model to the training data
Algorithms:
- Machine learning (statistical learning)
- Expert systems
- Computational neuroscience
51
Prediction
- Classification predicts categorical labels
- Prediction models a continuous-valued function
- It is usually used to predict the value, or a range of values, of an attribute of a given sample
  - Regression
  - Neural networks
(a small regression sketch follows below)
52
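A small sketch of prediction by regression; the continuous target (a hypothetical rated activation level) and the linear model are assumptions for illustration only.

    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_absolute_error
    from sklearn.model_selection import train_test_split

    def fit_activation_predictor(features, activation_ratings):
        """Prediction rather than classification: model a continuous attribute
        (here, a hypothetical activation rating) from the speech features."""
        X_tr, X_te, y_tr, y_te = train_test_split(
            features, activation_ratings, test_size=0.25, random_state=0)
        model = LinearRegression().fit(X_tr, y_tr)
        return model, mean_absolute_error(y_te, model.predict(X_te))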
Clustering
- Constructing models for assigning class labels to data that is unlabeled
- Unsupervised learning
- Clustering is an ill-defined task
- Once clusters are discovered, the clustering model can be used for predicting labels of new data
- Alternatively, the clusters can be used as labels to train a supervised classification algorithm (sketched below)
53
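A sketch of that last idea under obvious assumptions (k-means with a guessed number of clusters, a nearest-neighbour classifier); it only illustrates the mechanics of reusing cluster indices as labels.

    from sklearn.cluster import KMeans
    from sklearn.neighbors import KNeighborsClassifier

    def cluster_then_classify(unlabeled_features, n_clusters=4):
        """Discover clusters in unlabeled data, then reuse the cluster indices
        as pseudo-labels to train a supervised classifier for new samples."""
        clusterer = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
        pseudo_labels = clusterer.fit_predict(unlabeled_features)
        classifier = KNeighborsClassifier(n_neighbors=5)
        classifier.fit(unlabeled_features, pseudo_labels)
        return clusterer, classifier     # classifier.predict() labels new data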
So how does this technical mumbo jumbo tie into –
54
Part 3:
Speech and emotion
55
Speech and emotion
- Emotion can affect speech in many ways:
  - Consciously
  - Unconsciously
  - Through the autonomic nervous system
- Examples:
  - Textual content is usually consciously chosen, except maybe sudden interjections, which may stem from sudden or strong emotions
  - Many speech patterns related to emotions are strongly ingrained – therefore, though they can be controlled by the speaker, most often they are not, unless the speaker tries to modify them consciously
  - Certain speech characteristics are affected by the degree of arousal, and are therefore nearly impossible to inhibit (e.g. vocal tremor due to grief)
56
Speech analysis: the big picture – again
- Speech analysis is just one component in the context of speech and emotion:
  [diagram: databases, real data, speech analysis, application]
57
Is this just another way to spread the blame?
- Us speech analysis guys are just poor little engineers
- The methods we can supply can be no better than the theory and the data that drive them
- … and unfortunately, the jury is still out on both of those points … or not?
  - Ask the WP3 and WP5 people – they're here somewhere …
- Actually, one of the difficulties HUMAINE is intended to ease is that researchers in the field often find themselves having to address all of the above! (guilty)
58
The most fundamental problem:
- What are the features that signify emotion? To paraphrase – what signals are signs of emotion?
59
The most common solutions:
- Calculate as many as you can think of
- Intuition
- Theory-based answers
- Data-driven answers
- Ha! Once more – it's not our fault!
60
What seems to be the most plausible approach …
- The data-driven approach
- Requiring:
  - Emotional speech databases ("corpora")
  - Perceptual evaluation of these databases
  - This is then correlated with speech features
- Which takes us back to a previous square
61
So tell us already – how does emotion influence speech?
- … It seems that the answer depends on how you look for it
- As hinted before, the answer cannot really be separated from:
  - The theories of emotion
  - The databases we have of emotional speech
    - Who the subjects are
    - How emotion was elicited
62
63
A short digression …
- Will all the speech clinicians in the audience please stand up?
- Hmm … we don't seem to have that many
- Let's look at what one of them has to say
64
Emotions in the speech clinic
- Some speakers have speech/voice problems that modify their "signal", thus misleading the listener
- VOICE – people with vocal instability (high jitter/shimmer/tremor) are clinically perceived as nervous (although the problems reflect irregularity in the vocal folds)
  - Breathy voice (in women) is sometimes perceived as "sexy" (while it actually reflects incomplete adduction of the vocal folds)
  - A higher excitation level leads to vocal instability (high jitter/shimmer/tremor)
65
Clinical examples:
- STUTTERING – listeners judge people who stutter as nervous, tense, and less confident (identification of stuttering depends on pause duration within the "repetition units", and on the rate of repetitions)
- CLUTTERING – listeners judge people who clutter as nervous and less intelligent
So, though this is a WP4 meeting …
- It's impossible to avoid talking about WP3 (theory of emotion) and WP5 (databases) issues
- The signs we're looking for can never be separated from the questions:
  - Signs of what (emotions)?
  - Signs in what (data)?
- May God and Phillipe Gelin forgive me …
66
A not-so-old example: (Murray and Arnott, 1993)
- Very qualitative
- Presupposes dealing with primary emotions
67
BUT …
- If you expect more recent results to give more detailed descriptive outlines … then you're wrong
- The data-driven approaches use a large number of features, and let the computer sort them out (see the sketch below)
  - 32 significant features found by ASSESS, from the initial 375 used
  - 5 emotions, acted
  - 55% recognition
68
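To show what "letting the computer sort them out" can look like mechanically, here is a feature-selection sketch; it uses a simple ANOVA F-score criterion, which is not the procedure ASSESS itself used, and the matrices and k=32 are purely illustrative.

    from sklearn.feature_selection import SelectKBest, f_classif

    def select_salient_features(features, emotion_labels, k=32):
        """Keep the k features whose ANOVA F-score against the emotion labels
        is highest, and report which of the original columns survived."""
        selector = SelectKBest(score_func=f_classif, k=k)
        reduced = selector.fit_transform(features, emotion_labels)
        return reduced, selector.get_support(indices=True)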
Some remarks:
- Some features are indicative, even though we probably don't use them perceptually
  - e.g. pitch mean: usually this is raised with higher activation
  - But we don't have to know the speaker's neutral mean to perceive heightened activation
  - My guess: voice quality is what we perceive in such cases
- How "simple" can characterization of emotions become?
  - How many features do we listen for?
  - Can this be verified?
69
Time intervals
- This issue becomes more and more important as we go towards "natural" data
- Emotion production: how long do emotions last?
  - Full-blown emotions are usually short (but not always! Look at Peguy in the LIMSI interview database)
  - Moods, or pervasive emotions, are subtle but long-lasting
- Emotion analysis: over what span of speech are they easiest to detect?
70
From the analysis viewpoint:
- Current efforts seem to be focusing on methods that aim to use time spans that have some inherent meaning:
  - Acoustically (ASSESS – Cowie et al.)
  - Linguistically (Batliner et al.)
- We mentioned that prosody carries:
  - emotional information (our "signal")
  - other information ("noise"): phrasing, various types of prominence
- BUT …
71
Why I like intonation units
- Spontaneous speech is organized differently from written language
  - "Sentences" and "paragraphs" don't really exist there
  - Prosodic markers help replace various written markers
- Phrasing is a loose phrase for … "intonation units"
  - Theoretical linguists love to discuss what they are
  - An exact definition is as hard to find as it is to parse spontaneous speech
- Maybe emotion is not an "orthogonal" bit of information on top of these (the signal+noise model)
- If emotion modifies these, it would be very useful if we could identify the prosodic markers we use and the ways we modify them when we're emotional
- Problem: engineers don't like ill-defined concepts!
  - But emotion is one of them too, isn't it?
72
Just to provoke some thought:
- From a paper on animation (think of it – these guys have to integrate speech and image to make them fit naturally):
  "… speech consists of a sequence of intonation phrases. Each intonation phrase is realized with fluid, continuous articulation and a single point of maximum emphasis. Boundaries between successive phrases are associated with perceived disjuncture and are marked in English with cues such as pitch movements … Gestures are performed in units that coincide with these intonation phrases, and points of prominence in gestures also coincide with the emphasis in the concurrent speech …"
  [Stone et al., SIGGRAPH 2004]
73
We haven't even discussed WP3 issues …
- What are the scales/categories?
  - Possibility 1: emotional labeling
  - Possibility 2: psychological scales (such as valence/activation – e.g. Feeltrace)
- QUESTION: which is more directly related to speech features?
- Hopefully we'll hammer out a tentative answer by Tuesday …
74
Part 4:
Current results
75
Evaluating results
- Results often demonstrate how elusive the solution is …
- Consider a similar problem: speech recognition
  - To evaluate results: make recordings, submit them to an algorithm, measure the recognition rate!
- Emotion recognition results are far more difficult to quantify
  - Heavily dependent on induction techniques and labeling methods
76
Several popular contexts:
- Acted prototypical emotions
- Call center data
  - Real
  - WoZ type
- Media (radio, TV) based data
- Narrative speech (event recollection)
- Synthesized speech (Montero, Gobl)
- Most of these methods can be placed on the spectrum between:
  - Acted, full-blown bursts of stereotypical emotions
  - Fully natural mixtures of mood, affect and bursts of difficult-to-label emotions, recorded in noisy environments
77
Call centers
- A real-life scenario (with commercial interests …)!
- Sparse emotional content:
  - Controlled (usually)
  - Negative (usually)
- Lends itself easily to WOZ scenarios
78
Ang et al., 2002
- Standardized call-center data from 3 different sources
- Uninvolved users, true HMI interaction
- Detects neutral/annoyance/frustration
- Mostly automatic extraction, with some additional human labeling
- Defines human "accuracy" as 75%
  - But this is actually the percentage of human consensus
  - Machine accuracy is comparable
- A possible measure: maybe "accuracy" is where users wanted human intervention
79
Batliner et al.
- Professional acting, amateur acting, WOZ scenario
  - the latter with uninvolved users, true HMI interaction
- Detects trouble in communication
  - Much thought was given to this definition!
- Combines prosodic features with others:
  - POS labels
  - Syntactic boundaries
- Overall, shows a typical result: the closer we get to "real" scenarios, the more difficult the problem becomes!
  - Up to 95% on acted speech
  - Up to 79% on read speech
  - Up to 73% on WOZ data
80
Devillers et al.
- Real call-center data
  - Human–human interaction, involved users
  - Contains also fear (of losing money!)
- Human accuracy of 75% is reported
  - Is this, as in Ang, the degree of human agreement?
- Use a small number of intonation features
  - Treat pauses and filled pauses separately
- Some results:
  - Different behavior between clients and agents, males and females
  - Was classification attempted also?
81
Games and simulators
- These provide an extremely interesting setting
- Participants can often be found to experience real emotions
- The experimenter can sometimes control these to a certain extent
  - Such as driving conditions or additional tasks in a driving simulator
82
Fernandez & Picard (2000)
- Subjects did math problems while driving a simulator
  - This was supposed to induce stress
- Spectral features were used
  - No prosody at all!
- Advanced classifiers were applied
- Results were inconsistent across users, raising a familiar question: is it the classifier, or is it the data?
83
Kehrein (2002)
- 2 subjects in 2 separate rooms:
  - One had instructions
  - One had a set of Lego building blocks
  - The first had to explain to the other what to construct
- A wide range of "natural" emotions was reported
  - His thesis is in German …
- No classification was attempted
84
Acted speech
- Widely used
- An ever-recurring question: does it reflect the way emotions are expressed in spontaneous speech?
85
McGilloway et al.
- ASSESS used for feature extraction
- Speech read by non-professionals
- Emotion-evoking texts
- Categories: sadness, happiness, fear, anger, neutral
- Up to 55% recognition
86
Recalled emotions
- Subjects are asked to recall emotional episodes and describe them
- Data is composed of long narratives
- It isn't clear whether subjects actually re-experience these emotions or just recount them as "observers"
- Can contain good instances of low-key pervasive emotions
87
Ron and Amir
- Ongoing work …
88
Part 5:
Open issues
89
Robust raw feature extraction
- Pitch and VAD (voice activity detection)
- Intensity (normalization)
- Vocal quality
- Duration – is this still an open problem?
90
Determination of time intervals
- This might have to be addressed on a theoretical vs. practical level:
  - Phones?
  - Words?
  - Tunes?
  - Intonation units?
  - Fixed-length intervals?
91
Feature extraction
- Which features are most relevant to emotion?
- How do we separate noise (speaker mannerisms, culture, language, etc.) from the signals of emotion?
92
Part 6:
HUMAINE Deliverables
93
Tangible results we are expected to deliver:
- Tools
- Exemplars
94
Tools:
- Something along the lines of: solutions to parts of the problem that people can actually download and use right off
95
Exemplars:
- These should cover a wide scope:
  - Concepts
  - Methodologies
  - Knowledge pools – tutorials, reviews, etc.
  - Complete solutions to "reduced" problems
  - Test-bed systems
  - Designs for future systems/applications
96
Tools – suggestions:
- Useful feature extractors:
  - Robust pitch detection and smoothing methods
- Public-domain segment/speech recognizers
- Synthesis engines or parts thereof
  - E.g. emotional prosody generators
- Classifying engines
97
Exemplars – suggestions:
- Knowledge bases …
  - A taxonomy of speech features
    - Papers (especially short ones) say what we used
    - What about why? And what we didn't use?
    - What about what we wished we had?
- Test-bed systems …
  - A working modular SAL (credit to Marc Schroeder)
    - Embodies analysis, classification, synthesis, emotion induction/data collection … like a breeder nuclear reactor!
    - Parts of it already exist
    - Human parts can be replaced by automated ones as they develop
98
Exemplars – suggestions (cont.):
- More focused systems –
  - Call center systems
    - Deal with sparse emotional content – emotions vary over a relatively small range
  - Standardized (provocative?) data
    - Exemplifying difficulties on different levels: feature extraction, emotion classification
    - Maybe in conjunction with WP5
- Integration
  - Demonstrations of how different modalities can complement/enhance each other
99
How do we get useful info from WP3 and WP5?
- Categories
- Scales
- Models (pervasive, burst, etc.)
100
What is it realistic to expect?
- Useful info from other workgroups
  - WP3:
    - Models of emotional behavior in different contexts
    - Definite scales and categories for measuring it
  - WP5:
    - Databases embodying the above
    - Data which exemplifies the scale from clearly identifiable … to … difficult to identify
101
What is it realistic to expect?
- Exemplars that show:
  - Some of the problems that are easier to solve
  - The many problems that are difficult to solve
  - Directions for useful further research
  - How not to repeat previous errors
102
Some personal thoughts
- Oversimplification is a common pitfall to be avoided
- Looking at real data, one finds that emotion is often:
  - Difficult to describe in simple terms
  - Jumping between modalities (text might be considered a separate modality)
  - Extremely dependent on context, character, settings, personality
- A task so complex for humans cannot be easy for machines!
103
Summary
- Speech is a major channel for signaling emotional information
  - And lots of other information too
- HUMAINE will not solve all the issues involved
  - We should focus on those that can benefit most from the expertise and collaboration of its members
- Examining multiple modalities can prove extremely interesting
104