Multimodal Information Analysis
for Emotion Recognition
(Tele Health Care Application)
Malika Meghjani
Gregory Dudek and Frank P. Ferrie
Content
1. Our Goal
2. Motivation
3. Proposed Approach
4. Results
5. Conclusion
Our Goal
• Automatic emotion recognition using audio-visual
information analysis.
• Create video summaries by automatically labeling
the emotions in a video sequence.
Motivation
• Map the emotional states of the patient to nursing interventions.
• Evaluate the role of nursing interventions in improving the patient's health.
Proposed Approach
Visual stream: Visual Feature Extraction → Visual-based Emotion Classification
Audio stream: Audio Feature Extraction → Audio-based Emotion Classification
Fusion: both classifiers' probability outputs feed Data Fusion (decision level) → Recognized Emotional State
Visual Analysis
1. Face Detection (Viola-Jones face detector using Haar wavelets and a boosting algorithm)
2. Feature Extraction (Gabor filters: 5 spatial frequencies and 4 orientations, 20 filter responses for each frame)
3. Feature Selection (select the most discriminative features across all the emotional classes)
4. SVM Classification (classification with probability estimates; see the sketch below)
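A minimal sketch of steps 1 and 4, assuming OpenCV's bundled Haar cascade (a Viola-Jones detector) and scikit-learn's SVC; the 48×48 crop size and the random training arrays are placeholders, not values from the slides.

```python
import cv2
import numpy as np
from sklearn.svm import SVC

# Step 1: Viola-Jones face detection with OpenCV's bundled Haar cascade.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_face(gray_frame):
    """Return the first detected face, cropped and resized, or None."""
    faces = cascade.detectMultiScale(gray_frame, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = faces[0]
    return cv2.resize(gray_frame[y:y + h, x:x + w], (48, 48))

# Step 4: SVM classification with probability estimates (Platt scaling).
X_train = np.random.rand(120, 960)        # placeholder feature vectors
y_train = np.random.randint(0, 6, 120)    # placeholder emotion labels
clf = SVC(kernel="rbf", probability=True).fit(X_train, y_train)
p = clf.predict_proba(X_train[:1])        # per-class probability estimates
```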
Visual Feature Extraction
[Figure: each face crop is convolved with a bank of frequency-domain Gabor filters (5 frequencies × 4 orientations), and the responses pass through feature selection to automatic emotion classification]
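A sketch of the filter bank itself (5 spatial frequencies × 4 orientations = 20 responses per frame), assuming OpenCV; the wavelengths, kernel size, and remaining Gabor parameters are illustrative, not the authors' settings.

```python
import cv2
import numpy as np

def gabor_features(face):
    """Stack 20 Gabor filter responses for one grayscale face crop."""
    responses = []
    for wavelength in (4, 6, 8, 12, 16):           # 5 spatial frequencies
        for theta in np.arange(4) * np.pi / 4:     # 4 orientations
            kernel = cv2.getGaborKernel(ksize=(9, 9), sigma=2.0, theta=theta,
                                        lambd=wavelength, gamma=0.5)
            responses.append(cv2.filter2D(face, cv2.CV_32F, kernel))
    return np.stack(responses)                     # shape (20, H, W)
```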
Audio Analysis
1. Audio Pre-Processing (remove leading and trailing edges)
2. Feature Extraction (statistics of pitch and intensity contours, and Mel-frequency cepstral coefficients; see the sketch below)
3. Feature Normalization (remove inter-speaker variability)
4. SVM Classification (classification with probability estimates)
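A minimal sketch of steps 2 and 3, assuming librosa; keeping the mean and standard deviation of each contour, and z-scoring per speaker, are illustrative choices rather than the authors' exact recipe.

```python
import numpy as np
import librosa

def audio_features(y, sr):
    """Statistics of pitch and intensity contours plus MFCCs for one segment."""
    pitch = librosa.yin(y, fmin=65, fmax=400, sr=sr)      # pitch contour (Hz)
    intensity = librosa.feature.rms(y=y)[0]               # intensity contour
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)    # cepstral features
    feats = [pitch.mean(), pitch.std(), intensity.mean(), intensity.std()]
    feats += mfcc.mean(axis=1).tolist() + mfcc.std(axis=1).tolist()
    return np.array(feats)

def normalize_per_speaker(X):
    """Z-score one speaker's feature matrix to remove inter-speaker variability."""
    return (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)
```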
Audio Feature Extraction
[Figure: audio signal → audio feature extraction (speech rate, pitch, intensity, spectrum analysis via Mel-frequency cepstral coefficients, the short-term power spectrum of sound) → SVM classification → automatic emotion classification]
Feature Selection
• The feature selection method is similar to the SVM classification (a wrapper method).
• It generates a separating plane by minimizing the weighted sum of distances of misclassified data points to two parallel bounding planes.
• It suppresses as many components of the normal to the separating plane as possible while still giving consistent classification results.
[Figure: average error versus count of features selected; illustration of the bounding planes]
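The slides describe suppressing components of the separating plane's normal; an L1-penalized linear SVM induces the same kind of sparsity and is sketched here with scikit-learn as an analogous method, not the authors' exact solver. The data arrays and the C value are placeholders.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel

X = np.random.rand(120, 960)        # placeholder Gabor feature vectors
y = np.random.randint(0, 6, 120)    # placeholder emotion labels

# The L1 penalty drives many components of the separating plane's normal to zero.
sparse_svm = LinearSVC(C=0.05, penalty="l1", dual=False, max_iter=5000).fit(X, y)
selector = SelectFromModel(sparse_svm, prefit=True)
X_selected = selector.transform(X)  # keeps only features with nonzero weight
print(X_selected.shape[1], "features selected")
```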
Data Fusion
1. Decision Level:
• Obtain a probability estimate for each emotional class from the SVM margins.
• Multiply the probability estimates from the two modalities and renormalize to give the final decision-level emotion classification.
2. Feature Level:
• Concatenate the audio and visual features and repeat the feature selection and SVM classification process (both schemes are sketched below).
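Both fusion schemes reduce to a few lines; a minimal sketch with placeholder inputs.

```python
import numpy as np

def decision_level_fusion(p_audio, p_visual):
    """Multiply per-class probability estimates and renormalize."""
    p = np.asarray(p_audio) * np.asarray(p_visual)
    return p / p.sum()

def feature_level_fusion(f_audio, f_visual):
    """Concatenate audio and visual features before selection and classification."""
    return np.concatenate([f_audio, f_visual])

print(decision_level_fusion([0.1, 0.2, 0.7], [0.2, 0.2, 0.6]))  # sums to 1
```

Multiplying the per-class estimates treats the two modalities as conditionally independent given the emotion; renormalizing restores a valid probability distribution.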
Database and Training
Database:
1. Posed visual-only database
2. Posed audio-visual database
Training:
• Audio segmentation based on the minimum window required for feature extraction.
• Corresponding visual key frame extraction within the segmented window.
• Image-based training and audio-segment-based training (a segmentation sketch follows).
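A sketch of this segmentation, assuming a fixed minimum analysis window; the one-second window and the centre-frame key frame choice are assumptions, not stated in the slides.

```python
import numpy as np

def segment_audio(y, sr, window_s=1.0):
    """Split an audio signal into minimum-length windows for feature extraction."""
    hop = int(window_s * sr)
    return [y[i:i + hop] for i in range(0, len(y) - hop + 1, hop)]

def key_frame_indices(n_segments, fps, window_s=1.0):
    """Visual key frame taken at the centre of each segmented audio window."""
    return [int((k + 0.5) * window_s * fps) for k in range(n_segments)]
```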
Experimental Results
• Posed visual-only database (CKDB): 120 training examples, 20 subjects, 5 emotional states + Neutral; 75% recognition rate with leave-one-subject-out cross validation.
• Posed audio-visual database (EDB): 270 training examples, 9 subjects, 6 emotional states; 82% recognition rate with decision-level fusion, 76% with feature-level fusion.
[Figure: time series plot of recognized emotions (Surprise, Sad, Angry, Disgust, Happy)]
[Figure: 75% leave-one-subject-out cross-validation results (Cohn-Kanade database, posed visual-only)]
[Figure: feature-level fusion results (eNTERFACE 2005, posed audio-visual database)]
[Figure: decision-level fusion results (eNTERFACE 2005, posed audio-visual database)]
[Figure: confusion matrix]
Demo
Conclusion
• Combining the two modalities (audio and visual) improves the overall recognition rate by 11% with decision-level fusion and by 6% with feature-level fusion.
• Emotions where vision wins: Disgust, Happy, and Surprise.
• Emotions where audio wins: Anger and Sadness.
• Fear was equally well recognized by the two modalities.
• Automated multimodal emotion recognition is clearly effective.
Things to do…
• Inference based on temporal relation
between instantaneous classifications.
• Tests on a natural audio-visual database (ongoing).