Article can be accessed online at http://www.publishingindia.com
Real-time Speech Emotion Recognition Using
Support Vector Machine
P. Vijayalakshmi*, A. Anny Leema**
Abstract
In this paper we present an approach for real-time emotion recognition from speech using Support Vector Machine (SVM) as a classification technique. Automatic Speech Emotion Recognition (ASER) is an upcoming research area in the field of Human Computer Interaction Intelligence (HCII). Human emotions can be detected from speech signals by extracting some of the acoustic and prosodic features like pitch, Mel Frequency Cepstral Coefficients (MFCC) and Mel Energy Spectrum Dynamic Coefficients (MEDC). Here the SVM classifier is used to classify the emotions as anger, fear, neutral, sad, disgust, happy and boredom. UGA and LDC datasets are used for offline analysis of emotions using LIBSVM kernel functions. With this analysis the machine is trained and designed for detecting emotions in real-time speech.
Keywords: Support Vector Machine, Speech Signal, Experimentation, Emotion Analysis
Introduction
Human Computer Interaction Intelligence is an emerging field which aims to improve the interaction between users and computers by making computers more responsive to the user's needs. Emotion recognition is a recent research topic in the field of HCI which aims to achieve a more natural interaction between machines and humans. Today's HCI systems can identify who is speaking and what he/she is speaking. If computers are given the ability to detect human emotions, they can also know how he/she is speaking and can respond accurately and naturally, as humans do.
Automatic emotion recognition can be done in two ways: from speech signals and from image gestures. Since speech is considered the most powerful tool for communication, dealing with speech emotions is one of the most difficult tasks. In recent years, much research has been done to recognise emotions from human speech. Many methods and classification algorithms like Neural Networks (NN), K-Nearest Neighbour (K-NN), Support Vector Machine (SVM) and Hidden Markov Model (HMM) have been developed to classify human emotions based on training datasets (Ayadi, Kamel & Karray, 2011).
However, many applications used today require affective information from real-time data, so the consideration of real-time constraints is also important for emotion recognition from speech. Good results can be obtained by using standard classifier algorithms. Some applications where real-time emotion recognition can be used are call centres, robots, gaming, and smartphones where songs can be played based on human emotion.
In this approach, basic features of speech signals like pitch, MFCC and MEDC are extracted from both offline and real-time speech, and they are classified into different emotional classes by using the SVM classifier. We use SVM since it has better classification performance than other classifiers (Kulkarni & Gadhe, 2011). Support Vector Machine is a supervised learning algorithm which addresses the general problem of learning to discriminate between positive and negative members of given n-dimensional vectors. SVM can be used for both classification and regression purposes. The classification can be done linearly or non-linearly. Here the kernel functions of SVM are used to recognise emotions with more accuracy.

* Master of Computer Applications, B.S. Abdur Rahman University, Chennai, Tamil Nadu, India. Email: [email protected]
** B.S. Abdur Rahman University, Chennai, Tamil Nadu, India. Email: [email protected]
This paper is organized as follows: the second section discusses related work, the third focuses on the implementation of the system, the next section discusses the experimentation and results obtained, and the final section summarizes the main conclusions.

Related Work

Under this point the focus is on the literature that is available for speech and emotions.

Speech

Speech is the primary means of communication between humans. It is a complex signal containing information about message, speaker, language, emotional state and so on. Speech is described in terms of the production and perception of the sounds used in vocal language. Speech production defines how spoken words are processed and how their phonetics are formulated, whereas speech perception refers to the process by which humans interpret and understand the sounds present in human language.

Emotions

Emotions are defined as changes in physical and psychological feeling that influence the behaviour and thought of humans. Emotion is associated with temperament, personality, mood, motivation, energy etc. Emotions can be categorised by grouping them along dimensions.

Dimensional Model of Emotions

Dimensional models usually conceptualize emotions by defining them in 2 or 3 dimensions. All dimensional models incorporate arousal and valence. In 1980, James Russell introduced the circumplex model of affect (Fig. 1). He mapped emotions onto two dimensional axes, with pleasure/displeasure on the horizontal axis and high/low arousal on the vertical axis.

Fig. 1: Dimensional Model of Emotions
Existing System

We survey existing emotional classification techniques and the methods used for feature extraction and selection. As discussed in research by Lin & Wei (2005), speech prosodic and acoustic features are extracted in the time domain and the frequency domain. Here, in order to reduce complexity, we choose only basic features like pitch, energy and formants.

Commonly, many classifiers are used for emotion recognition, such as Gaussian Mixture Models (GMM), Hidden Markov Models (HMM), Support Vector Machines (SVM) and K-Nearest Neighbour (K-NN). As specified by Kulkarni & Gadhe (2011), since the performance of SVM is better, it is taken in our project and the emotional states are classified in real time.

Some of the important research concerns of emotion recognition systems are discussed below:

• Emotion recognition systems are influenced by speaker and language dependent information, but these systems have to be speaker and language independent (Koolagudi & Rao, 2010).
• There is no standard speech corpus to recognise emotions. Some speech corpora are recorded using experienced artists whereas others use inexperienced ones, and some corpora have limited emotional utterances, that is, only 2-3 emotional states; they do not contain different classes of emotions (Ververidis & Kotropoulos, 2006).
• Feature extraction is one of the most important factors in emotion recognition systems. Identification and selection of particular features from a list of features is a complex task. Along with the features, computational models and techniques must also be identified which could make the performance better (Ayadi et al., 2011).
• An emotion recognition system should recognise emotions in real time. It has to be robust, and noise should be eliminated from the speech utterances, which is a complex task.

Since these are some of the concerns, proper measures have been taken to avoid these problems and to make the system more accurate.
System Implementation
Dealing with a speaker's emotions is one of the main challenges in speech technology. The design of an emotion recognition system basically contains different modules, which are depicted in Fig. 2. They are speech acquisition, feature extraction from signals, training the machine with feature sets, and classifying emotions through SVM.
The input to the system will be given as a .wav file from any emotional database. Here we use the LDC and UGA databases, which contain emotional speech utterances with different emotional states. From the given input, the speech features like pitch, MFCC and MEDC values are extracted and they are fed into the LIBSVM classifier along with their class labels. The machine is trained in such a way as to classify all possible emotions. Once the system is trained well with acted speakers, whenever real-time speech is given as an input the SVM classifier will automatically predict the emotional states by referring to the training sets. If not, the machine will again be trained iteratively until it gives accurate results.
Fig. 2: Emotion Recognition System (audio processing and segmentation → speech feature extractor → SVM trainer, using the trained speech database with emotional labels → SVM classifier → emotional states for real-time speech)
Speech Acquisition
The first module of the emotion recognition system involves preprocessing of the given speech utterance. Since the input is given as a continuous speech signal, it has to be segmented into individual frames for classification purposes. This step generally involves the digitization of the given input speech.
Feature Extraction
Feature extraction is the process by which measurements of the given input can be taken to differentiate among emotional classes. In previous works, several features have been extracted from the input speech signals, such as pitch, energy, formants, DTW etc. In order to reduce the complexity of our system and also to predict emotions in real time, we extract the features pitch, MFCC (Ma, Huang & Li, 2005) and MEDC from utterances. Many algorithms and techniques are used to extract these features.
Pitch Features
Pitch is the fundamental frequency F0 of a speech signal. It generally represents the vibration produced by the vocal cords during sound production. Usually pitch signals have two characteristics, pitch frequency and glottal air velocity. The pitch frequency is given by the number of harmonics present in the spectrum. The estimation of pitch is one of the complex tasks. Here we use the YIN pitch estimation algorithm to detect pitch values.
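To make the step concrete, here is a minimal sketch of frame-level pitch extraction with the YIN algorithm, assuming librosa is available and the input is a 16 kHz mono .wav file; the frequency range, frame length and the summary statistics returned are illustrative choices, not values fixed by the paper.

```python
import librosa
import numpy as np

def extract_pitch(wav_path, sr=16000):
    """Estimate a frame-level F0 contour with the YIN algorithm."""
    y, sr = librosa.load(wav_path, sr=sr)           # mono signal at 16 kHz
    f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr,   # typical speech F0 range
                     frame_length=1024)
    # Utterance-level pitch features: simple statistics over the contour
    return np.array([f0.mean(), f0.std(), f0.min(), f0.max()])
```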
MFCC
Mel Frequency Cepstral Coefficients are the most widely used feature in speech recognition. They were introduced by Davis and Mermelstein. The main purpose of the MFCC processor is to mimic the behaviour of the human ears. The main steps of MFCC are identified (Khalifa et al., 2004) and the block diagram for MFCC is shown in Fig. 3.

Fig. 3: MFCC and MEDC Feature Extraction

Usually the speech input is recorded with a sample rate of 16000 Hz through a microphone. The steps for calculating MFCC are described below:

• Framing: In this step, the continuous input speech is segmented into N-sample frames. The first frame consists of N samples, the second frame consists of M samples after N, the third frame contains 2M, and so on. Here we frame the signal with a time length of 20-40 ms; therefore a frame of a 16 kHz signal will have 0.025 * 16000 = 400 samples.

• Windowing: This step is used to window each individual frame in order to remove the start and end discontinuities. The Hamming window is mostly used to remove distortion.

• Fast Fourier Transform: Here the FFT algorithm is used for converting the N samples from the time domain to the frequency domain. It is used for the evaluation of the frequency spectrum of speech. In this project we use the Cooley-Tukey algorithm of FFT:

$$X_k = \sum_{n=0}^{N-1} x_n \, e^{-i 2\pi k n / N}, \qquad k = 0, \ldots, N-1.$$

• Mel Filter Bank: This step maps each frequency from the frequency spectrum to the Mel scale. The Mel filter bank will usually consist of overlapping triangular filters whose cut-off frequencies are determined by the center frequencies of the two adjacent filters. The Mel filters are graphically shown in Fig. 4.

Fig. 4: Mel Filter Bank with Overlapping Filters

• Cepstrum: Here the obtained Mel spectrum is converted back to the time domain with the help of the DCT algorithm.

The result we obtain is the Mel Frequency Cepstral Coefficient vector values.
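As an illustration of the steps above, a minimal MFCC sketch assuming a 16 kHz mono input and librosa for the Mel filter bank and DCT stages; the frame size, hop size and number of coefficients are example values rather than ones mandated by the paper.

```python
import numpy as np
import librosa

def extract_mfcc(wav_path, sr=16000, n_mfcc=13):
    """Compute MFCCs following the framing/windowing/FFT/Mel/DCT chain of Fig. 3."""
    y, _ = librosa.load(wav_path, sr=sr)

    # Framing + Hamming windowing + FFT, shown manually for a single 25 ms frame
    frame_len = int(0.025 * sr)                     # 0.025 * 16000 = 400 samples
    frame = y[:frame_len] * np.hamming(frame_len)
    magnitude = np.abs(np.fft.rfft(frame))          # frequency-domain magnitudes
    # (librosa repeats this per frame internally; `magnitude` is only illustrative)

    # Mel filter bank, log energies and DCT over the whole utterance
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=512, hop_length=160, window="hamming")
    return mfcc.mean(axis=1)                        # one utterance-level vector
```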
MEDC

Mel Energy Spectrum Dynamic Coefficient is another important spectral feature obtained from speech utterances (Khalifa et al., 2004). The process of the MEDC technique is the same as for MFCC (Fig. 3). Here the magnitude spectrum of each speech utterance is calculated with the FFT algorithm and the values are placed equally on Mel frequency scales. The mean log energies of the output obtained from the filters are then calculated:

$$E_n(i), \quad i = 1, 2, \ldots, N$$

The obtained values represent the Mel Energy Spectrum Dynamic Coefficient vector values.
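A rough sketch of the MEDC computation as described above, that is, the mean log Mel filter-bank energies with one value per filter. It assumes librosa; the FFT size and number of Mel filters are example values, and any delta (dynamic) terms are omitted since the paper does not spell them out.

```python
import numpy as np
import librosa

def extract_medc(wav_path, sr=16000, n_mels=24):
    """Mean log Mel filter-bank energies E_n(i), i = 1..N (one value per filter)."""
    y, _ = librosa.load(wav_path, sr=sr)
    mel_energies = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=512,
                                                  hop_length=160, n_mels=n_mels)
    log_energies = np.log(mel_energies + 1e-10)     # avoid log(0) on silent frames
    return log_energies.mean(axis=1)                # mean over frames, per filter
```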
SVM Training

During the training phase of the emotion recognition system, the extracted feature values are labelled into different emotional classes. Here the feature values are stored in a database with their corresponding class labels. The SVM kernel functions are used for the training process.

SVM is a very simple and efficient classifying algorithm which is used for pattern recognition (Fig. 5) (Pao et al., 2006). Support Vector Machines were introduced by Vladimir Vapnik in 1995. The main aim of this algorithm is to obtain a function f(x) which is used to determine hyperplanes, or boundaries. These hyperplanes are used to separate different classes of data input points.

Fig. 5: SVM Classifier

LIBSVM is the most widely used tool for SVM classification and regression purposes. During the training phase, feature values of speech signals obtained from emotional database speech utterances are fed into LIBSVM along with their class labels. An SVM model is developed for each emotional state by using their feature values. Here more than 80 utterances of speech from the LDC datasets are used for developing the SVM training model.
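A minimal training-and-prediction sketch for the scheme described above. It uses scikit-learn's SVC, which wraps LIBSVM, rather than the LIBSVM command-line tools; the feature helpers are the sketches shown earlier, and the file lists, labels and parameter values are purely illustrative.

```python
import numpy as np
from sklearn.svm import SVC

def utterance_features(wav_path):
    """Concatenate pitch, MFCC and MEDC features for one utterance."""
    return np.concatenate([extract_pitch(wav_path),
                           extract_mfcc(wav_path),
                           extract_medc(wav_path)])

def train_emotion_svm(wav_paths, emotion_labels):
    """Train an RBF-kernel SVM on labelled emotional utterances."""
    X = np.vstack([utterance_features(p) for p in wav_paths])
    y = np.array(emotion_labels)                    # e.g. "anger", "happy", "sad", ...
    clf = SVC(kernel="rbf", C=9.0, gamma="scale")   # C = 9 follows the paper's cost value
    clf.fit(X, y)
    return clf

def predict_emotion(clf, wav_path):
    """Classify the emotional state of a new (e.g. real-time) utterance."""
    return clf.predict(utterance_features(wav_path).reshape(1, -1))[0]
```

In a real-time setting the same predict_emotion call is simply applied to each utterance captured from the microphone.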
SVM Classification for Real Time Speech

Though SVM follows a linear classification, it can also perform non-linear mapping of data points by using its kernel functions (Hsu, Chang & Lin, 2010). The SVM has three kernel functions, namely:

1. Polynomial kernel
2. Linear kernel
3. Radial Basis kernel

In the training phase we use the Radial Basis kernel function because it restricts the training data to lie within the specified boundaries, as depicted in Fig. 6. The RBF kernel has less numerical difficulties when compared with other kernel functions.
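For reference, the Radial Basis Function kernel referred to here is conventionally defined as

$$K(x_i, x_j) = \exp\!\left(-\gamma \,\lVert x_i - x_j \rVert^2\right),$$

where $x_i$ and $x_j$ are feature vectors and $\gamma$ is the kernel width parameter which, together with the cost value c, controls how tightly the decision boundary fits the training data.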
Fig. 6: SVM Kernel Functions. (a) Radial Basis Function (b) RBF mapping

Once the training model has been developed with more training datasets, it is easy to predict the emotional states in real time. The real-time input speech is taken through a microphone since noise distortion will be less. Speech features are extracted from the signals and, with the help of LIBSVM, these feature values are compared with the predicted values of the SVM training models and the emotions are classified automatically.

Experimentation and Result

The performance of speech emotion recognition depends upon many factors such as the quality of speech that is obtained without noise, the features extracted from the speech signals and the classification method used. The spectrogram specifying the frequency for each frame is given in Fig. 7.

Fig. 7: Spectrogram of Each Frame with Corresponding Frequency

Emotional Database

Here in this project we make use of the UGA and LDC datasets (www.ldc.upenn.edu). The UGA dataset has emotions acted by students of the University of Georgia and contains nearly 800 samples. In the LDC dataset each speech utterance is recorded for one to two seconds with 60 ms segments. For each emotional class there are about 25 utterances spoken and acted by speakers.

Emotion Recognition with LIBSVM

During the training phase of the emotion recognition system, we train the classifier with the emotions Anger, Happy, Sad, Neutral, Disgust and Surprised. The LIBSVM is trained with MFCC and MEDC feature vectors using RBF kernel functions. Gender independent files are used during experimentation. The result obtained by SVM is given in Table 1 as a confusion matrix which represents the classification of each class.
Table 1: Confusion Matrix of SVM for Training Models with RBF Kernel Function, c = 9

Emotion Label    Emotion Recognition (%)
                 Happy   Sad    Fear   Surprise   Neutral   Anger
Happy            100     0      0      0          0         0
Sad              0       100    0      0          0         0
Fear             0       4.7    95.3   0          0         0
Surprise         2.9     0      18.6   78.5       0         0
Neutral          0       0      0      0          100       0
Anger            0       0      13.6   0          0         86.4
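Confusion matrices such as Table 1 can be produced directly from held-out predictions; a small sketch assuming scikit-learn and the training/prediction helpers sketched earlier, with hypothetical test file and label lists:

```python
from sklearn.metrics import confusion_matrix

EMOTIONS = ["happy", "sad", "fear", "surprise", "neutral", "anger"]

def recognition_matrix(clf, test_wavs, test_labels):
    """Row-normalised confusion matrix (%) in the style of Table 1."""
    preds = [predict_emotion(clf, p) for p in test_wavs]
    cm = confusion_matrix(test_labels, preds, labels=EMOTIONS)
    return 100.0 * cm / cm.sum(axis=1, keepdims=True)
```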
During real-time recognition, gender independent speech and gender dependent speech are taken from speakers and their feature values are fed into LIBSVM. By changing the cost value c = 9 and d = 10.05 of the RBF kernel functions (Gönen & Alpaydın, 2011), we could predict the classification of emotional states. Their average values are specified in Table 2.
Table 2: Recognition (%) for Speaker Independent and Speaker Dependent Utterances

Emotions    Speaker Independent   Speaker Dependent
Happy       56.2                  79.6
Sad         59.8                  82.5
Fear        53.6                  69.2
Surprise    55.2                  68.9
Neutral     60.3                  85.1
Anger       51.3                  74.4
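The paper reports fixed RBF parameter values (c = 9, d = 10.05); one common way of arriving at such values, though not a procedure the authors state, is a small cross-validated grid search over the cost and kernel parameters, sketched here with scikit-learn:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def tune_rbf_svm(X, y):
    """Cross-validated search over RBF cost and width; returns the best classifier."""
    param_grid = {"C": [1, 3, 9, 27],            # cost values (9 is the paper's choice)
                  "gamma": [0.001, 0.01, 0.1]}   # kernel width values
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
    search.fit(X, y)
    return search.best_estimator_
```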
Conclusion and Future Works
Today, Speech Emotion Recognition has become one of the most important research areas. It plays an important role in Human Computer Interaction. Recognising emotions in real time is a complex task. The type and number of emotional classes, feature selection and the classification algorithm are the important factors of this system. Here we have used SVM for obtaining higher classification accuracy. MFCC and MEDC feature values are extracted from speech utterances and emotional states are classified using the SVM RBF kernel function. For the training set with acted speakers we obtained a recognition rate of 81% with the LDC datasets. During real-time recognition of speech the accuracy is nearly 60%, which is obtained by changing the RBF cost value c = 9 and degree d = 10.05. Since emotion recognition is done in real time, noise distortion makes the system less accurate.

In our future work, we will experiment with the system by combining some other feature values with MFCC and MEDC in order to make the performance of the system better. This emotion recognition system is implemented in a voice mail application which queues up calls based on emotions.
References
Ayadi, M. E., Kamel, M. S. & Karray, F. (2011). Survey on
speech emotion recognition: Features, classification
schemes, and databases. Pattern Recognition, 44(3),
572-587.
Emotional Prosody Speech and Transcripts from the Linguistic Data Consortium. (2002). Retrieved from http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2002S28
Hsu, C.W., Chang, C. C. & Lin, C. J. (2010). A Practical
Guide to Support Vector Classification. Department
of Computer Science & Information Engineering,
National Taiwan University, Taiwan.
Khalifa, O., Khan, S., Islam, M. R., Faizal, M. & Dol, D. (2004). Text Independent Automatic Speaker Recognition. 3rd International Conference on Electrical & Computer Engineering, Dhaka, Bangladesh.
Kulkarni, P. N. & Gadhe, D. L. (2011). Comparison
between SVM & Other Classifiers for SER.
International Journal of Research and Technology,
January, 2(1), 1-6.
Koolagudi, S. G. & Rao, K. S. (2010). Real Life Emotion
Classification using VOP and Pitch Based Spectral
Features. India: Jadavpur University.
Lin, Y. L. & Wei, G. (2005). Speech Emotion Recognition based on HMM and SVM. Paper presented in 2005 at the Fourth International Conference on Machine Learning.
Ma, J., Huang, D. & Li, F. (2005). SVM based recognition of Chinese vowels. Artificial Intelligence, 3802, 812-819.
Gönen, M. & Alpaydın, E. (2011). Multiple kernel learning algorithms. Journal of Machine Learning Research, July, 12, 2211-2268.
Pao, T., Chen, Y., Yeh, J. & Li, P. (2006). Mandarin Emotional Speech Recognition based on SVM and NN. Paper presented in 2006 at the 18th International Conference on Pattern Recognition (ICPR'06) (1, pp. 1096-1100).
Ververidis, D. & Kotropoulos, C. (2006). A State of the Art
Review on Emotional Speech Databases. Presented
at 11th Australian International Conference on
Speech Science and Technology, Auckland, New
Zealand.