Article can be accessed online at http://www.publishingindia.com
Real-time Speech Emotion Recognition Using
Support Vector Machine
P. Vijayalakshmi*, A. Anny Leema**
Abstract
In this paper we present an approach for real-time emotion recognition from speech using Support Vector Machine (SVM) as a classification technique. Automatic Speech Emotion Recognition (ASER) is an upcoming research area in the field of Human Computer Interaction Intelligence (HCII). Human emotions can be detected from speech signals by extracting some of the acoustic and prosodic features like pitch, Mel Frequency Cepstral Coefficients (MFCC) and Mel Energy Spectrum Dynamic Coefficients (MEDC). Here the SVM classifier is used to classify the emotions as anger, fear, neutral, sad, disgust, happy and boredom. UGA and LDC datasets are used for offline analysis of emotions using LIBSVM kernel functions. With this analysis the machine is trained and designed for detecting emotions in real-time speech.
Keywords: Support Vector Machine, Speech Signal, Experimentation, Emotion Analysis
Introduction
Human Computer Interaction Intelligence is an emerging field which aims to improve the interaction between users and computers by making computers more responsive to the user's needs. Emotion recognition is a recent research topic in the field of HCI which aims to achieve a more natural interaction between machines and humans. Today's HCI systems can identify who is speaking and what he/she is speaking. If computers are given the ability to detect human emotions, they can also know how he/she is speaking and can respond accurately and naturally, as humans do.
Automatic emotion recognition can be done in two ways: from speech signals and from image gestures. Since speech is considered the most powerful tool for communication, dealing with speech emotions is one of the most difficult tasks. In recent years, much research has been done to recognise emotions from human speech. Many methods and classification algorithms like Neural Networks (NN), K-Nearest Neighbour (K-NN), Support Vector Machine (SVM) and Hidden Markov Model (HMM) have been developed to classify human emotions based on training datasets (Ayadi, Kamel & Karray, 2011).
However, many applications used today require affective information from real-time data, so the consideration of real-time constraints is also important for emotion recognition from speech. Good results can be obtained by using standard classifier algorithms. Some applications where real-time emotion recognition can be used are call centres, robots, gaming, and smartphones where songs can be played based on human emotion.
In this approach, basic features of speech signals like pitch, MFCC and MEDC are extracted from both offline and real-time speech, and they are classified into different emotional classes by using the SVM classifier. We use SVM since it has better classification performance than other classifiers (Kulkarni & Gadhe, 2011). Support Vector Machine is a supervised learning algorithm which addresses the general problem of learning to discriminate between positive and negative members of given n-dimensional vectors. SVM can be used for both classification and regression purposes. The classification can be done linearly or non-linearly. Here the kernel functions of SVM are used to recognise emotions with more accuracy.

* Master of Computer Applications, B.S. Abdur Rahman University, Chennai, Tamil Nadu, India. Email: [email protected]
** B.S. Abdur Rahman University, Chennai, Tamil Nadu, India. Email: [email protected]
This paper is organized as follows: the second section discusses related work, the third focuses on the implementation of the system, the next section discusses the experimentation and results obtained, and the final section summarizes the main conclusions.

Related Work

Under this point the focus is on the literature that is available for speech and emotions.

Speech

Speech is the primary means of communication between humans. It is a complex signal containing information about message, speaker, language, emotional state and so on. Speech is described in terms of the production and perception of the sounds used in vocal language. Speech production defines how spoken words are processed and how their phonetics are formulated, whereas speech perception refers to the process by which humans interpret and understand the sounds present in human language.

Emotions

Emotions are defined as changes in physical and psychological feeling that influence the behaviour and thought of humans. Emotion is associated with temperament, personality, mood, motivation, energy etc. Emotions can be categorised by grouping them along dimensions.

Dimensional Model of Emotions

Dimensional models usually conceptualize emotions by defining them in 2 or 3 dimensions. All dimensional models incorporate arousal and valence. In 1980, James Russell introduced the circumplex model of affect (Fig. 1). He mapped emotions onto two dimensional axes, with pleasure/displeasure on the horizontal axis and high/low arousal on the vertical axis.

Fig. 1: Dimensional Model of Emotions
Existing System

We survey existing emotional classification techniques and the methods used for feature extraction and selection. As discussed in research by Lin & Wei (2005), speech prosodic and acoustic features are extracted in the time domain and the frequency domain. Here, in order to reduce complexity, we choose only basic features like pitch, energy and formants.

Commonly, many classifiers are used for emotion recognition, such as Gaussian Mixture Models (GMM), Hidden Markov Models (HMM), Support Vector Machines (SVM) and K-Nearest Neighbour (K-NN). As specified by Kulkarni & Gadhe (2011), since the performance of SVM is better, it is taken in our project and the emotional states are classified in real time.

Some of the important research concerns of emotion recognition systems are discussed below:

• Emotion recognition systems are influenced by speaker and language dependent information, but these systems have to be speaker and language independent (Koolagudi & Rao, 2010).
• There is no standard speech corpus to recognise emotions. Some speech corpora are recorded using experienced artists whereas others use inexperienced ones, and some corpora have limited emotional utterances, that is, only 2-3 emotional states; they do not contain different classes of emotions (Ververidis & Kotropoulos, 2006).
• Feature extraction is one of the most important factors in emotion recognition systems. Identification and selection of particular features from a list of features is a complex task. Along with the features, computational models and techniques must also be identified which could make the performance better (Ayadi et al., 2011).
• An emotion recognition system should recognise emotions in real time. It has to be robust, and noise should be eliminated from the speech utterances, which is a complex task.

Since these are some of the concerns, proper measures have been taken to avoid these problems and to make the system more accurate.
System Implementation
Dealing with a speaker's emotions is one of the main challenges in speech technology. The design of an emotion recognition system basically contains different modules, which are depicted in Fig. 2. They are speech acquisition, feature extraction from signals, training the machine with feature sets, and classifying emotions through SVM.
The input to the system will be given as a .wav file from any emotional database. Here we use the LDC and UGA databases, which contain emotional speech utterances with different emotional states. From the given input, the speech features like pitch, MFCC and MEDC values are extracted and they are fed into the LIBSVM classifier along with their class labels. The machine is trained in such a way as to classify all possible emotions. Once the system is trained well with acted speakers, whenever real-time speech is given as an input the SVM classifier will automatically predict the emotional states by referring to the training sets. If not, the machine will again be trained iteratively until it gives accurate results.
Fig. 2: Emotion Recognition System (audio processing and segmentation → speech feature extractor → SVM trainer, using the trained speech database with emotional labels → SVM classifier → emotional states for real-time speech)
Speech Acquisition
The first module of the emotion recognition system involves preprocessing of the given speech utterance. Since the input is given as a continuous speech signal, it has to be segmented into individual frames for classification purposes. This step generally involves the digitization of the given input speech.
Feature Extraction
Feature extraction is the process by which measurements of the given input can be taken to differentiate among emotional classes. In previous works, several features have been extracted from the input speech signals, such as pitch, energy, formants, DTW etc. In order to reduce the complexity of our system and also to predict emotions in real time, we extract the features pitch, MFCC (Ma, Huang & Li, 2005) and MEDC from utterances. Many algorithms and techniques are used to extract these features.
Pitch Features
Pitch is the fundamental frequency F0 of a speech signal. It generally represents the vibration produced by the vocal cords during sound production. Usually pitch signals have two characteristics, pitch frequency and glottal air velocity. The pitch frequency is given by the number of harmonics present in the spectrum. The estimation of pitch is one of the complex tasks. Here we use the YIN pitch estimation algorithm to detect pitch values.
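To make the step concrete, here is a minimal sketch of frame-level pitch extraction with the YIN algorithm, assuming librosa is available and the input is a 16 kHz mono .wav file; the frequency range, frame length and the summary statistics returned are illustrative choices, not values fixed by the paper.

```python
import librosa
import numpy as np

def extract_pitch(wav_path, sr=16000):
    """Estimate a frame-level F0 contour with the YIN algorithm."""
    y, sr = librosa.load(wav_path, sr=sr)           # mono signal at 16 kHz
    f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr,   # typical speech F0 range
                     frame_length=1024)
    # Utterance-level pitch features: simple statistics over the contour
    return np.array([f0.mean(), f0.std(), f0.min(), f0.max()])
```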
MFCC
Mel Frequency Cepstral Coefficients are the most widely used feature in speech recognition. They were introduced by Davis and Mermelstein. The main purpose of the MFCC processor is to mimic the behaviour of the human ears. The main steps of MFCC are identified (Khalifa et al., 2004) and the block diagram for MFCC is shown in Fig. 3.

Fig. 3: MFCC and MEDC Feature Extraction

Usually the speech input is recorded with a sample rate of 16000 Hz through a microphone. The steps for calculating MFCC are described below:

• Framing: In this step, the continuous input speech is segmented into N-sample frames. The first frame consists of N samples, the second frame consists of M samples after N, the third frame contains 2M, and so on. Here we frame the signal with a time length of 20-40 ms; therefore a frame of a 16 kHz signal will have 0.025 * 16000 = 400 samples.

• Windowing: This step is used to window each individual frame in order to remove the start and end discontinuities. The Hamming window is mostly used to remove distortion.

• Fast Fourier Transform: Here the FFT algorithm is used for converting the N samples from the time domain to the frequency domain. It is used for the evaluation of the frequency spectrum of speech. In this project we use the Cooley-Tukey algorithm of FFT:

$$X_k = \sum_{n=0}^{N-1} x_n \, e^{-i 2\pi k n / N}, \qquad k = 0, \ldots, N-1.$$

• Mel Filter Bank: This step maps each frequency from the frequency spectrum to the Mel scale. The Mel filter bank will usually consist of overlapping triangular filters whose cut-off frequencies are determined by the center frequencies of the two adjacent filters. The Mel filters are graphically shown in Fig. 4.

Fig. 4: Mel Filter Bank with Overlapping Filters

• Cepstrum: Here the obtained Mel spectrum is converted back to the time domain with the help of the DCT algorithm.

The result we obtain is the Mel Frequency Cepstral Coefficient vector values.
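As an illustration of the steps above, a minimal MFCC sketch assuming a 16 kHz mono input and librosa for the Mel filter bank and DCT stages; the frame size, hop size and number of coefficients are example values rather than ones mandated by the paper.

```python
import numpy as np
import librosa

def extract_mfcc(wav_path, sr=16000, n_mfcc=13):
    """Compute MFCCs following the framing/windowing/FFT/Mel/DCT chain of Fig. 3."""
    y, _ = librosa.load(wav_path, sr=sr)

    # Framing + Hamming windowing + FFT, shown manually for a single 25 ms frame
    frame_len = int(0.025 * sr)                     # 0.025 * 16000 = 400 samples
    frame = y[:frame_len] * np.hamming(frame_len)
    magnitude = np.abs(np.fft.rfft(frame))          # frequency-domain magnitudes
    # (librosa repeats this per frame internally; `magnitude` is only illustrative)

    # Mel filter bank, log energies and DCT over the whole utterance
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=512, hop_length=160, window="hamming")
    return mfcc.mean(axis=1)                        # one utterance-level vector
```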
MEDC

Mel Energy Spectrum Dynamic Coefficient is another important spectral feature obtained from speech utterances (Khalifa et al., 2004). The process of the MEDC technique is the same as for MFCC (Fig. 3). Here the magnitude spectrum of each speech utterance is calculated with the FFT algorithm and the values are placed equally on Mel frequency scales. The mean log energies of the output obtained from the filters are then calculated:

$$E_n(i), \quad i = 1, 2, \ldots, N$$

The obtained values represent the Mel Energy Spectrum Dynamic Coefficient vector values.
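A rough sketch of the MEDC computation as described above, that is, the mean log Mel filter-bank energies with one value per filter. It assumes librosa; the FFT size and number of Mel filters are example values, and any delta (dynamic) terms are omitted since the paper does not spell them out.

```python
import numpy as np
import librosa

def extract_medc(wav_path, sr=16000, n_mels=24):
    """Mean log Mel filter-bank energies E_n(i), i = 1..N (one value per filter)."""
    y, _ = librosa.load(wav_path, sr=sr)
    mel_energies = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=512,
                                                  hop_length=160, n_mels=n_mels)
    log_energies = np.log(mel_energies + 1e-10)     # avoid log(0) on silent frames
    return log_energies.mean(axis=1)                # mean over frames, per filter
```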
SVM Training

During the training phase of the emotion recognition system, the extracted feature values are labelled into different emotional classes. Here the feature values are stored in a database with their corresponding class labels. The SVM kernel functions are used for the training process.

SVM is a very simple and efficient classifying algorithm which is used for pattern recognition (Fig. 5) (Pao et al., 2006). Support Vector Machines were introduced by Vladimir Vapnik in 1995. The main aim of this algorithm is to obtain a function f(x) which is used to determine hyperplanes, or boundaries. These hyperplanes are used to separate different classes of data input points.

Fig. 5: SVM Classifier

LIBSVM is the most widely used tool for SVM classification and regression purposes. During the training phase, feature values of speech signals obtained from emotional database speech utterances are fed into LIBSVM along with their class labels. An SVM model is developed for each emotional state by using their feature values. Here more than 80 utterances of speech from the LDC datasets are used for developing the SVM training model.
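A minimal training-and-prediction sketch for the scheme described above. It uses scikit-learn's SVC, which wraps LIBSVM, rather than the LIBSVM command-line tools; the feature helpers are the sketches shown earlier, and the file lists, labels and parameter values are purely illustrative.

```python
import numpy as np
from sklearn.svm import SVC

def utterance_features(wav_path):
    """Concatenate pitch, MFCC and MEDC features for one utterance."""
    return np.concatenate([extract_pitch(wav_path),
                           extract_mfcc(wav_path),
                           extract_medc(wav_path)])

def train_emotion_svm(wav_paths, emotion_labels):
    """Train an RBF-kernel SVM on labelled emotional utterances."""
    X = np.vstack([utterance_features(p) for p in wav_paths])
    y = np.array(emotion_labels)                    # e.g. "anger", "happy", "sad", ...
    clf = SVC(kernel="rbf", C=9.0, gamma="scale")   # C = 9 follows the paper's cost value
    clf.fit(X, y)
    return clf

def predict_emotion(clf, wav_path):
    """Classify the emotional state of a new (e.g. real-time) utterance."""
    return clf.predict(utterance_features(wav_path).reshape(1, -1))[0]
```

In a real-time setting the same predict_emotion call is simply applied to each utterance captured from the microphone.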
SVM Classification for Real Time Speech

Though SVM follows a linear classification, it can also perform non-linear mapping of data points by using its kernel functions (Hsu, Chang & Lin, 2010). The SVM has three kernel functions, namely:

1. Polynomial kernel
2. Linear kernel
3. Radial Basis kernel

In the training phase we use the Radial Basis kernel function because it restricts the training data to lie within the specified boundaries, as depicted in Fig. 6. The RBF kernel has less numerical difficulties when compared with other kernel functions.
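For reference, the Radial Basis Function kernel referred to here is conventionally defined as

$$K(x_i, x_j) = \exp\!\left(-\gamma \,\lVert x_i - x_j \rVert^2\right),$$

where $x_i$ and $x_j$ are feature vectors and $\gamma$ is the kernel width parameter which, together with the cost value c, controls how tightly the decision boundary fits the training data.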
Fig. 6: SVM Kernel Functions. (a) Radial Basis Function (b) RBF mapping

Once the training model has been developed with more training datasets, it is easy to predict the emotional states in real time. The real-time input speech is taken through a microphone since noise distortion will be less. Speech features are extracted from the signals and, with the help of LIBSVM, these feature values are compared with the predicted values of the SVM training models and the emotions are classified automatically.

Experimentation and Result

The performance of speech emotion recognition depends upon many factors such as the quality of speech that is obtained without noise, the features extracted from the speech signals and the classification method used. The spectrogram specifying the frequency for each frame is given in Fig. 7.

Fig. 7: Spectrogram of Each Frame with Corresponding Frequency

Emotional Database

Here in this project we make use of the UGA and LDC datasets (www.ldc.upenn.edu). The UGA dataset has emotions acted by students of the University of Georgia and contains nearly 800 samples. In the LDC dataset each speech utterance is recorded for one to two seconds with 60 ms segments. For each emotional class there are about 25 utterances spoken and acted by speakers.

Emotion Recognition with LIBSVM

During the training phase of the emotion recognition system, we train the classifier with the emotions Anger, Happy, Sad, Neutral, Disgust and Surprised. The LIBSVM is trained with MFCC and MEDC feature vectors using RBF kernel functions. Gender independent files are used during experimentation. The result obtained by SVM is given in Table 1 as a confusion matrix which represents the classification of each class.
Table 1: Confusion Matrix of SVM for Training Models with RBF Kernel Function, c = 9

Emotion Label    Emotion Recognition (%)
                 Happy   Sad    Fear   Surprise   Neutral   Anger
Happy            100     0      0      0          0         0
Sad              0       100    0      0          0         0
Fear             0       4.7    95.3   0          0         0
Surprise         2.9     0      18.6   78.5       0         0
Neutral          0       0      0      0          100       0
Anger            0       0      13.6   0          0         86.4
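Confusion matrices such as Table 1 can be produced directly from held-out predictions; a small sketch assuming scikit-learn and the training/prediction helpers sketched earlier, with hypothetical test file and label lists:

```python
from sklearn.metrics import confusion_matrix

EMOTIONS = ["happy", "sad", "fear", "surprise", "neutral", "anger"]

def recognition_matrix(clf, test_wavs, test_labels):
    """Row-normalised confusion matrix (%) in the style of Table 1."""
    preds = [predict_emotion(clf, p) for p in test_wavs]
    cm = confusion_matrix(test_labels, preds, labels=EMOTIONS)
    return 100.0 * cm / cm.sum(axis=1, keepdims=True)
```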
During real-time recognition, gender independent speech and gender dependent speech are taken from speakers and their feature values are fed into LIBSVM. By changing the cost value c = 9 and d = 10.05 of the RBF kernel functions (Gönen & Alpaydın, 2011), we could predict the classification of emotional states. Their average values are specified in Table 2.
Table 2: Recognition (%) for Speaker Independent and Speaker Dependent Utterances

Emotions    Speaker Independent   Speaker Dependent
Happy       56.2                  79.6
Sad         59.8                  82.5
Fear        53.6                  69.2
Surprise    55.2                  68.9
Neutral     60.3                  85.1
Anger       51.3                  74.4
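The paper reports fixed RBF parameter values (c = 9, d = 10.05); one common way of arriving at such values, though not a procedure the authors state, is a small cross-validated grid search over the cost and kernel parameters, sketched here with scikit-learn:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def tune_rbf_svm(X, y):
    """Cross-validated search over RBF cost and width; returns the best classifier."""
    param_grid = {"C": [1, 3, 9, 27],            # cost values (9 is the paper's choice)
                  "gamma": [0.001, 0.01, 0.1]}   # kernel width values
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
    search.fit(X, y)
    return search.best_estimator_
```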
Conclusion and Future Works
Today, Speech Emotion Recognition has become one of the most important research areas. It plays an important role in Human Computer Interaction. Recognising emotions in real time is a complex task. The type and number of emotional classes, feature selection and the classification algorithm are the important factors of this system. Here we have used SVM for obtaining higher classification accuracy. MFCC and MEDC feature values are extracted from speech utterances and emotional states are classified using the SVM RBF kernel function. For the training set with acted speakers we obtained a recognition rate of 81% with the LDC datasets. During real-time recognition of speech the accuracy is nearly 60%, which is obtained by changing the RBF cost value c = 9 and degree d = 10.05. Since emotion recognition is done in real time, noise distortion makes the system less accurate.

In our future work, we will experiment with the system by combining some other feature values with MFCC and MEDC in order to make the performance of the system better. This emotion recognition system is implemented in a voice mail application which queues up calls based on emotions.
References
Ayadi, M. E., Kamel, M. S. & Karray, F. (2011). Survey on
speech emotion recognition: Features, classification
schemes, and databases. Pattern Recognition, 44(3),
572-587.
Emotional Prosody Speech and Transcripts from the Linguistic Data Consortium. (2002). Retrieved from http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2002S28
Hsu, C.W., Chang, C. C. & Lin, C. J. (2010). A Practical
Guide to Support Vector Classification. Department
of Computer Science & Information Engineering,
National Taiwan University, Taiwan.
Khalifa, O., Khan, S., Islam, M. R., Faizal, M. & Dol, D. (2004). Text Independent Automatic Speaker Recognition. 3rd International Conference on Electrical & Computer Engineering, Dhaka, Bangladesh.
Kulkarni, P. N. & Gadhe, D. L. (2011). Comparison
between SVM & Other Classifiers for SER.
International Journal of Research and Technology,
January, 2(1), 1-6.
Koolagudi, S. G. & Rao, K. S. (2010). Real Life Emotion
Classification using VOP and Pitch Based Spectral
Features. India: Jadavpur University.
Lin, Y. L. & Wei, G. (2005). Speech Emotion Recognition based on HMM and SVM. Paper presented in 2005 at the Fourth International Conference on Machine Learning.
Ma, J., Huang, D. & Li, F. (2005). SVM based recognition of Chinese vowels. Artificial Intelligence, 3802, 812-819.
Gönen, M. & Alpaydın, E. (2011). Multiple kernel learning algorithms. Journal of Machine Learning Research, July, 12, 2211-2268.
Pao, T., Chen, Y., Yeh, J. & Li, P. (2006). Mandarin Emotional Speech Recognition based on SVM and NN. Paper presented in 2006 at the 18th International Conference on Pattern Recognition (ICPR'06) (1, pp. 1096-1100).
Ververidis, D. & Kotropoulos, C. (2006). A State of the Art
Review on Emotional Speech Databases. Presented
at 11th Australian International Conference on
Speech Science and Technology, Auckland, New
Zealand.