International Journal of Emerging Technology and Advanced Engineering
Website: www.ijetae.com (ISSN 2250-2459, ISO 9001:2008 Certified Journal, Volume 5, Issue 2, February 2015)

Classification Techniques for Speech Recognition: A Review

Mayur R. Gamit, M.Tech student, E&C Department, C.G.P.I.T, Bardoli, India
Prof. Kinnal Dhameliya, Assistant Professor, E&C Department, C.G.P.I.T, Bardoli, India
Dr. Ninad S. Bhatt, Associate Professor, E&C Department, CKPCET, Surat, India

Abstract— Speech processing has emerged as one of the important application areas of digital signal processing. Fields of research in speech processing include speech recognition, speaker recognition, speech synthesis, speech coding, etc. Speech recognition is the process of automatically recognizing the spoken words of a person based on the information content of the speech signal. This paper presents a brief survey of automatic speech recognition and discusses the various classification techniques that have been applied in this wide area of speech processing. The objective of this review is to summarize some of the well-known methods that are widely used in the several stages of a speech recognition system.

Keywords— Feature Extraction, Acoustic Phonetic Approach, Pattern Recognition, Artificial Intelligence Approach, Speech Recognition.

I. INTRODUCTION

Automatic speech recognition (ASR) has been among the most investigated topics in speech processing since the early 1960s [1]. Speech recognition is a popular and active area of research, used to translate words spoken by humans into a form a computer can recognize. It usually involves extracting features from the speech signal and representing them with an appropriate data model. An ASR system operates in two phases: a training phase and a testing phase. In the training phase, known speech is recorded, and a parametric representation of it is extracted and stored in a speech database. In the testing phase, features are extracted from the input speech signal, and the ASR system compares them against the reference templates to recognize the utterance.

Typical applications of speech recognition include voice-recognized passwords, voice repertory dialers, automated call-type recognition, call distribution by voice commands, directory listing retrieval, credit card sales validation, speech-to-text processing and automated data entry. Figure 1 shows the basic structure of a speech recognition system, consisting of pre-processing, feature extraction and classification blocks.

Figure 1. Basic model of a speech recognition system: input speech passes through pre-processing, feature extraction and classification to produce the recognized speech.

A fundamental distinctive unit of a language is the phoneme, and different languages contain different phoneme sets. Syllables contain one or more phonemes, while words are formed from one or more syllables, which are concatenated to form phrases and sentences. One broad classification for English is in terms of vowels, consonants, diphthongs, affricates and semi-vowels [3]. Speech recognition systems can be classified by the type of speech they handle: continuous speech, isolated words, connected words and spontaneous speech.

Both isolated and continuous speech recognition usually fall into two categories: speaker dependent and speaker independent [2]. A speaker-dependent system is trained to recognize each of the words uttered one or more times by a specific set of speakers, whereas a speaker-independent system requires no such training and recognizes words by analyzing their inherent acoustical properties. The main challenge of speech recognition lies in modeling the variation of the same word as spoken by different speakers, which depends on speaking style, accent, social dialect, gender, vocabulary size, recognition environment, etc.

II. PRE-PROCESSING

Noise interference mainly occurs while recording speech, and recognition performance degrades chiefly because of this noise. Before the speech signal is fed to the feature extraction block, the noise it contains must be removed; this is the task of pre-processing. Pre-processing separates speech from noise based on the zero-crossing rate and the short-time energy, and separating voiced and unvoiced speech using both measures together gives the best result [4]. The start and end points of the utterance are determined from the energy and zero-crossing rates, so that the output speech retains the information while the noise is eliminated.
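The paper describes this step only at a high level, so the following is a minimal illustrative sketch of energy- and ZCR-based endpoint detection: frame the signal, compute short-time energy and zero-crossing rate per frame, and keep the span of frames that look like speech. The frame sizes and both thresholds are assumptions for illustration, not values from the paper.

```python
import numpy as np

FRAME = 400   # 25 ms at 16 kHz
HOP = 160     # 10 ms at 16 kHz

def endpoint_detect(x, energy_factor=0.1, zcr_thresh=0.25):
    """Return (start, end) sample indices of the detected speech region.

    A frame counts as speech when its short-time energy exceeds a fraction
    of the peak energy and its zero-crossing rate stays below a threshold
    (high ZCR with low energy usually indicates noise or silence). Both
    threshold values are illustrative assumptions.
    """
    n = 1 + max(0, (len(x) - FRAME) // HOP)
    frames = np.stack([x[i * HOP:i * HOP + FRAME].astype(float) for i in range(n)])
    energy = (frames ** 2).sum(axis=1)                          # short-time energy
    zcr = (np.diff(np.sign(frames), axis=1) != 0).mean(axis=1)  # zero-crossing rate
    speech = np.flatnonzero((energy > energy_factor * energy.max())
                            & (zcr < zcr_thresh))
    if speech.size == 0:
        return 0, len(x)
    return speech[0] * HOP, speech[-1] * HOP + FRAME
```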
III. FEATURE EXTRACTION TECHNIQUES

In speech recognition, feature extraction is the most important phase, and system performance depends mainly on this block. Its task is to extract features from the speech signal and represent them using an appropriate data model of the input signal. Feature extraction techniques commonly used in speech recognition include the following [7] [9] [11] [4]:

A. Mel-Frequency Cepstrum Coefficients (MFCC)
B. Linear Predictive Coding (LPC)
C. Linear Prediction Cepstral Coefficients (LPCC)
D. Perceptual Linear Prediction (PLP)
E. Linear Discriminant Analysis (LDA)
F. Discrete Wavelet Transform (DWT)
G. Relative Spectral (RASTA-PLP)
H. Principal Component Analysis (PCA)

IV. CLASSIFICATION TECHNIQUES

Speech recognition follows three broad approaches [19] [21]:

A. Acoustic Phonetic Approach
B. Pattern Recognition Approach
C. Artificial Intelligence Approach

Figure 2. Classification techniques in speech recognition [3]: the acoustic-phonetic approach (rule based), the pattern recognition approach (template based and model based, covering HMM, SVM, DTW, Bayesian and VQ techniques) and the artificial intelligence approach (knowledge based).

A. Acoustic Phonetic Approach

In the acoustic-phonetic approach, speech recognition is based on finding speech sounds and attaching appropriate labels to them [2] [21]. The approach postulates that there exists a finite set of distinctive phonetic units, called phonemes, and that these units are broadly characterized by a set of acoustic properties present in the speech signal.

B. Pattern Recognition Approach

The pattern recognition approach involves two essential steps, namely pattern training and pattern testing [19]. Its essential feature is that it uses a well-formulated mathematical framework and establishes consistent speech pattern representations for reliable pattern comparison. This approach encompasses many techniques, such as HMM, DTW, SVM and VQ.

1) Hidden Markov Model (HMM)

The hidden Markov model (HMM) is the most powerful parametric model at the acoustic level and a popular statistical tool for modeling a wide range of time-series data [10]. An HMM is a doubly stochastic process with an underlying Markov process that is not observable. The semantics of the model are usually encapsulated in the hidden part; in ASR, for instance, an HMM can model a word in the task-dependent vocabulary, with each state of the hidden part representing a phoneme [1]. An HMM can be described as follows:

A set $S$ of $Q$ states, $S = \{S_1, S_2, \dots, S_Q\}$, the distinct values that the discrete hidden stochastic process can take.
An initial state probability distribution $\pi_i = P(q_1 = S_i)$, where $q_t$ denotes the state at discrete time index $t$.
A final state probability distribution $\gamma_i = P(q_T = S_i)$.
A state transition probability distribution $a_{ij} = P(q_{t+1} = S_j \mid q_t = S_i)$.
A set of probability distributions describing the statistical properties of the observations $O_t$ in each state: $b_j(k) = P(O_t = v_k \mid q_t = S_j)$.

Each of these distributions must sum to one: $\sum_i \pi_i = 1$, $\sum_j a_{ij} = 1$ for every state $S_i$, and $\sum_k b_j(k) = 1$ for every state $S_j$.

The most popular algorithms for HMMs are the forward-backward and Viterbi algorithms [1]. These belong to the class of unsupervised learning techniques, since they perform unsupervised parameter estimation of the probability distributions. The technique is widely used because of its high recognition accuracy [10].
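To make the definitions above concrete, here is a minimal sketch of the forward algorithm for a discrete HMM, which computes the likelihood of an observation sequence by summing over all hidden state paths; pi, A and B mirror the quantities $\pi$, $a_{ij}$ and $b_j(k)$ defined above, and the toy model values are invented purely for illustration.

```python
import numpy as np

def forward_likelihood(pi, A, B, obs):
    """Forward algorithm for a discrete HMM.

    pi  : (Q,)   initial state probabilities, pi[i] = P(q_1 = S_i)
    A   : (Q, Q) transition matrix, A[i, j] = P(q_{t+1} = S_j | q_t = S_i)
    B   : (Q, K) emission matrix, B[j, k] = P(O_t = v_k | q_t = S_j)
    obs : sequence of observation symbol indices in [0, K)

    Returns P(obs | model), summing over all hidden state paths.
    """
    alpha = pi * B[:, obs[0]]            # initialize with the first symbol
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]    # propagate states, absorb next symbol
    return alpha.sum()

# Toy 2-state, 3-symbol model (values invented for illustration only).
pi = np.array([0.6, 0.4])
A  = np.array([[0.7, 0.3],
               [0.4, 0.6]])
B  = np.array([[0.5, 0.4, 0.1],
               [0.1, 0.3, 0.6]])
print(forward_likelihood(pi, A, B, [0, 1, 2]))
```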
2) Dynamic Time Warping (DTW)

The time alignment of different utterances is the core problem for distance measurement in speech recognition; even a small shift can lead to incorrect identification. Dynamic time warping is an algorithm for measuring the similarity between two sequences that may vary in time or speed. DTW finds an optimal match between two given sequences under certain restrictions by warping them nonlinearly in the time dimension [19]. It was recognized as one of the most suitable methods for speech recognition because of its capability to cope with different speaking speeds. Various sections of the utterances are stretched and compressed so as to find the alignment that yields the best possible match between test and reference vectors on a feature-by-feature basis [6]. The local distance measure is the distance between the features at a pair of frames, while the global distance accumulates from the beginning of the utterance up to the last pair of frames. The local distance is the Euclidean distance [17],

$d(x_i, y_j) = \sqrt{\sum_{k} (x_{ik} - y_{jk})^2}$,

and the global (cumulative) distance follows the standard DTW recursion

$D(i, j) = d(x_i, y_j) + \min\{D(i-1, j),\; D(i, j-1),\; D(i-1, j-1)\}$.
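The paper gives no pseudocode, so this is a minimal dynamic-programming sketch of the recursion above: it takes two sequences of per-frame feature vectors (for example MFCCs) and returns their global DTW distance.

```python
import numpy as np

def dtw_distance(X, Y):
    """Global DTW distance between feature sequences X (n, d) and Y (m, d).

    D[i, j] holds the cheapest cumulative cost of aligning the first i
    frames of X with the first j frames of Y, using the recursion
    D(i, j) = d(x_i, y_j) + min(D(i-1, j), D(i, j-1), D(i-1, j-1)).
    """
    n, m = len(X), len(Y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(X[i - 1] - Y[j - 1])   # local Euclidean distance
            D[i, j] = cost + min(D[i - 1, j],            # stretch test frame
                                 D[i, j - 1],            # stretch reference frame
                                 D[i - 1, j - 1])        # aligned step
    return D[n, m]

# Recognition then amounts to choosing the reference template with the
# smallest DTW distance to the test utterance.
```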
3) Vector Quantization (VQ)

Vector quantization is a pattern classification technique applied to speech data to form a representative set of feature vectors. It is a lossy data compression method based on the principle of block coding and is a fixed-to-fixed length algorithm [18]. VQ maps vectors from a large vector space to a finite number of regions in that space; each region is called a cluster and is represented by its center, called a codeword, and the collection of all codewords forms a codebook. Features are extracted from the training and testing data, and a VQ codebook model is built for all speech samples [5]: in the training phase, a speech-specific codebook is formed for each uttered word by clustering its training acoustic vectors, with each group represented by its centroid point. In the testing phase, the distance between the input feature vectors and the codewords is computed as the Euclidean distance

$d(x, c) = \sqrt{\sum_{i=1}^{n} (x_i - c_i)^2}$,

and the word whose codebook yields the minimum distance is selected as the recognized word.
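As an illustration of the training and testing steps just described, the sketch below builds a per-word codebook with plain k-means clustering (the paper does not prescribe a clustering algorithm; the LBG algorithm is a common alternative) and recognizes a test utterance as the word whose codebook gives the minimum average distortion.

```python
import numpy as np

def train_codebook(vectors, k=16, iters=20, seed=0):
    """Build a k-codeword codebook from training vectors (n, d) via k-means.

    Assumes at least k training vectors; k, iters and seed are illustrative.
    """
    vectors = np.asarray(vectors, dtype=float)
    rng = np.random.default_rng(seed)
    code = vectors[rng.choice(len(vectors), k, replace=False)]
    for _ in range(iters):
        # Assign each vector to its nearest codeword (Euclidean distance).
        d = np.linalg.norm(vectors[:, None, :] - code[None, :, :], axis=2)
        nearest = d.argmin(axis=1)
        # Move every codeword to the centroid of its cluster.
        for j in range(k):
            members = vectors[nearest == j]
            if len(members):
                code[j] = members.mean(axis=0)
    return code

def distortion(vectors, code):
    """Average distance from each vector to its nearest codeword."""
    d = np.linalg.norm(np.asarray(vectors, float)[:, None, :] - code[None, :, :], axis=2)
    return d.min(axis=1).mean()

def recognize(test_vectors, codebooks):
    """Pick the word whose codebook gives the minimum average distortion."""
    return min(codebooks, key=lambda word: distortion(test_vectors, codebooks[word]))
```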
4) Support Vector Machine (SVM)

The support vector machine is a simple and effective algorithm for classification in speech or speaker recognition [14]. An SVM is a binary nonlinear classifier capable of deciding whether an input vector x belongs to class 1 or class 2 [15]. A decision boundary with a margin, known as the soft margin, is determined, and the distance R between the sample points and the margin is calculated.

Figure 3. Decision logic in SVM: samples in the feature space, the soft margin and the margin-to-sample distance R feed the decision logic.

As shown in Figure 3, the feature space consists of the features extracted from the speech signal. The decision boundary, or soft margin, is determined to separate the classes; the distance between the margin and the samples is computed, and the samples nearest to the margin, i.e. those with the least distance, determine the classifier.

5) Polynomial Classifier

Polynomials have excellent properties as classifiers. By the Weierstrass approximation theorem, polynomial classifiers are universal approximators to the Bayes classifier [13]. The structure of the classifier is shown in Figure 4. The input feature vectors $x_1, x_2, \dots, x_N$ are fed to the classifier; the speaker model is given by a weight vector $w$, and $p(x)$ is the vector of polynomial basis terms of the input feature vector [12]. For each input feature vector $x_i$ the score $w^{T} p(x_i)$ is produced, and the outputs are averaged to obtain the final score.

Figure 4. Structure of the polynomial classifier [12]: the basis expansion p(x) of each input x_1 … x_N is scored against the speaker model and averaged to give the output score.

The score equation is given as follows [13]:

$s = \frac{1}{N} \sum_{i=1}^{N} w^{T} p(x_i)$

where $x_i$ is an input test feature vector, $w$ is the speaker model and $p(x_i)$ is the vector of polynomial basis terms of the input test vector.

C. Artificial Intelligence Approach

The artificial intelligence approach attempts to mechanize the recognition procedure according to the way a person applies intelligence in visualizing, analyzing and finally making a decision on the measured acoustic features. It is a hybrid of the acoustic-phonetic and pattern recognition approaches [7] [19]; a hybrid of the hidden Markov model and artificial neural networks has also been applied to speech recognition [1]. Methods in this category include the multilayer perceptron (MLP), the self-organizing map (SOM), the back-propagation neural network (BPNN) and time-delay neural networks (TDNNs) [1] [20].

1) Time-Delay Neural Network (TDNN)

The time-delay neural network is an advanced version of the artificial neural network. TDNNs represent an effective attempt to train a static multilayer perceptron (MLP) for time-sequence processing. An example of a TDNN is shown in Figure 5. The input layer is enlarged to accept as many input patterns as the fixed sequence length to be processed at each time step, and the network has input, hidden and output layers. The input vector enters the network at the leftmost set of input neurons, and at each time step the inputs are shifted to the right through unit time delays. The outputs of the input layer feed the hidden layer, and the same procedure applies to all subsequent layers; finally, the output layer produces the result. This architecture gives the network the ability to handle more complicated time dependences, and the back-propagation algorithm can be used to train it. Recurrent neural networks (RNNs) take the artificial neural network concept further still.

Figure 5. Time-delay neural network [1]: delayed copies (τ) of the inputs I_t, I_{t-1}, I_{t-2} and of the hidden activations H_t … H_{t-3} feed the output layer(s).

2) Self-Organizing Map (SOM)

Kohonen proposed a neural network architecture that automatically develops self-organizing properties during unsupervised learning, called the self-organizing map (SOM) [8]. The SOM uses the Euclidean distance to measure the distance between data vectors, and the input vectors are normalized between -1 and +1 before being fed into the network. The learning algorithm is as follows [8]. Each node's weights are initialized between 0 and 1. A vector is chosen at random from the training data, and every node is examined to determine whose weights most resemble the input vector; the winning node is known as the best matching unit (BMU). The radius of the neighborhood of the BMU is then calculated, and the closer a node lies to the BMU, the more strongly its weights are updated. This procedure repeats until the map converges toward the input vectors. In speech recognition this technique is used to classify features, reducing the size of the feature vectors and the complexity of the system.

3) Multilayer Perceptron (MLP)

The MLP is a type of neural network with one input layer, one or more hidden layers and one output layer, each containing neurons [4]. In the example of Figure 6 there are three input neurons, four hidden neurons and three output neurons; the weights w_ij connect the input layer to the hidden layer, while u_ij connect the hidden layer to the output layer. The input is presented at the input neurons, whose outputs are multiplied by the weights and fed as inputs to the hidden neurons, and the same operation is repeated between the hidden and output layers. This technique is commonly used in speech recognition systems [16].

Figure 6. Multilayer perceptron [9]: inputs pass through weights w_ij to the hidden layer and through u_ij to the outputs.

V. PERFORMANCE MEASURING PARAMETERS

The performance of a speech recognition system is often described in terms of accuracy and speed. Accuracy may be measured by the word error rate (WER), while speed is measured by the real-time factor. Single word error rate (SWER) and command success rate (CSR) are other measures of accuracy [2] [3].

A. Word Error Rate (WER)

The word error rate is a common metric of speech recognition performance. The general difficulty in measuring it is that the recognized word sequence can have a different length from the reference word sequence. The WER is therefore derived from the Levenshtein distance, working at the word level instead of the phoneme level, and can be computed as [2]

WER = (S + D + I) / N    (4)

where S is the number of substitutions, D the number of deletions, I the number of insertions and N the number of words in the reference. When reporting the performance of a speech recognition system, the word recognition rate (WRR) is sometimes used instead of the WER:

WRR = 1 - WER = (N - S - D - I) / N = (H - I) / N    (5)

where H = N - S - D is the number of correctly recognized words.
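Since the WER rests on the word-level Levenshtein distance, the following sketch counts substitutions, deletions and insertions with a standard dynamic program and evaluates equations (4) and (5) on a toy example (the sentences are invented for illustration).

```python
def word_error_rate(reference, hypothesis):
    """WER = (S + D + I) / N via word-level Levenshtein distance.

    reference, hypothesis: lists of words. The DP table d[i][j] holds the
    minimum number of edits turning the first i reference words into the
    first j hypothesis words.
    """
    n, m = len(reference), len(hypothesis)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                      # i deletions
    for j in range(m + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,   # substitution / match
                          d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1)         # insertion
    return d[n][m] / n

ref = "call the directory listing service".split()
hyp = "call a directory service".split()
wer = word_error_rate(ref, hyp)
print(f"WER = {wer:.2f}, WRR = {1 - wer:.2f}")   # equations (4) and (5)
```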
VI. CONCLUSION

In this paper, the different classification techniques used in speech recognition were introduced, including several advanced classification concepts that have recently been applied in speech recognition systems. Problems persist because variation of speech over time and environmental noise make high recognition accuracy difficult to achieve. The classification techniques described here can be helpful in building speech recognition systems.

REFERENCES

[1] Xian Tang, "Hybrid Hidden Markov Model and Artificial Neural Network for Automatic Speech Recognition," Pacific-Asia Conference on Circuits, Communication and System, IEEE Computer Society, 2009.
[2] Nidhi Desai, Kinnal Dhameliya, "Feature Extraction and Classification Techniques for Speech Recognition: A Review," International Journal of Emerging Technology and Advanced Engineering (IJETAE), Vol. 3, Issue 12, December 2013.
[3] Shanthi Therese S., Chelpa Lingam, "Review of Feature Extraction Techniques in Automatic Speech Recognition," International Journal of Scientific Engineering and Technology (IJSET), Vol. 2, Issue 6, pp. 479-484, June 2013.
[4] Bishnu Prasad Das, Ranjan Parekh, "Recognition of Isolated Words using Features based on LPC, MFCC, ZCR and STE with Neural Network Classifiers," International Journal of Modern Engineering Research (IJMER), Vol. 2, Issue 3, pp. 854-858, June 2012.
[5] Revathi, Y. Venkataramani, "Speaker Independent Continuous Speech and Isolated Digit Recognition using VQ and HMM," IEEE conference, pp. 198-202, 2011.
[6] P. G. N. Priyadarshani, N. G. J. Dias, Amal Punchihewa, "Dynamic Time Warping Based Speech Recognition for Isolated Sinhala Words," IEEE, pp. 892-895, 2012.
[7] Santosh K. Gaikwad, Bharti W. Gawali, Pravin Yannawar, "A Review on Speech Recognition Technique," International Journal of Computer Applications (IJCA), Vol. 10, Issue 3, November 2010.
[8] Goh Kia Eng, Abdul Manan Ahmad, "Malay Speech Recognition using Self-Organizing Map and Multilayer Perceptron," Proceedings of the Postgraduate Annual Research Seminar, 2005.
[9] S. Karpagavalli, P. V. Sabitha, "Isolated Tamil Words Speech Recognition using Linear Predictive Coding and Neural Networks," International Journal of Computer Science and Management Research, Vol. 1, Issue 5, December 2012.
[10] Ahmad A. M. Abushariah, Teddy S. Gunawan, Mohammad A. M. Abushariah, "English Digit Speech Recognition System Based on Hidden Markov Model," International Conference on Computer and Communication Engineering, IEEE, May 2010.
[11] Utpal Bhattacharjee, "A Comparative Study of LPCC and MFCC Features for the Recognition of Assamese Phonemes," International Journal of Engineering Research and Technology (IJERT), Vol. 2, Issue 1, January 2013.
[12] W. M. Campbell, K. T. Assaleh, C. C. Broun, "Speaker Recognition with Polynomial Classifiers," IEEE Transactions on Speech and Audio Processing, Vol. 10, No. 4, May 2002.
[13] Hemant A. Patil, Robin Jain, Prakhar Jain, "Identification of Speakers from their Hum," Springer-Verlag Berlin Heidelberg, pp. 461-468, 2008.
[14] C. Sunitha Ram, R. Ponnusamy, "An Effective Automatic Speech Emotion Recognition for Tamil Language using Support Vector Machine," International Conference on Issues and Challenges in Intelligent Computing Techniques, 2014.
[15] Huo Chunbao, Zhang Caijuan, "The Research of Speaker Recognition Based on GMM and SVM," International Conference on System Science and Engineering, IEEE, pp. 373-375, July 2012.
[16] A. Hmich, A. Badri, A. Sahel, "Automatic Speaker Identification by using Neural Network," IEEE conference, 2010.
[17] L. Muda, M. Begam, I. Elamvazuthi, "Voice Recognition Algorithms using Mel Frequency Cepstral Coefficient (MFCC) and Dynamic Time Warping (DTW) Techniques," Journal of Computing, Vol. 2, Issue 3, March 2010.
[18] Md. R. Hasan, M. Jamil, Md. G. Rabbani, Md. S. Rahman, "Speaker Identification using Mel Frequency Cepstral Coefficients," 3rd International Conference on Electrical & Computer Engineering, Dhaka, December 2004.
[19] Sanjivani S. Bhabad, Gajanan K. Kharate, "Overview of Technical Progress in Speech Recognition," International Journal of Advanced Research in Computer Science and Software Engineering (IJARCSSE), Vol. 3, Issue 3, March 2013.
[20] Divyesh S. Mistry, A. V. Kulkarni, "Overview: Speech Recognition Technology, Mel-Frequency Cepstral Coefficients (MFCC), Artificial Neural Network (ANN)," International Journal of Engineering Research & Technology (IJERT), Vol. 2, Issue 10, October 2013.
[21] Ranu Dixit, Navdeep Kaur, "Speech Recognition Using Stochastic Approach: A Review," International Journal of Innovative Research in Science, Engineering and Technology (IJIRSET), Vol. 2, Issue 2, February 2013.