Speaker Adaptation of HMMs Using Linear Regression (Part 2 of 2)
Authors: C.J. Leggetter, P.C. Woodland
Presented by: 陳亮宇
Date: 2007/02/13
Based on: CUED/F-INFENG/TR.181, June 1994

Outline
1. Special cases of MLLR
2. Implementation issues
3. Evaluation of MLLR adaptation
4. Conclusions

1. Special cases of MLLR
– Least squares regression
– Reducing the number of regression parameters:
  – Regression matrix structure
  – Full matrix without offset
  – Diagonal scaling matrix

1.1 Least squares regression (1 of 3)
If all states of a regression class have the same covariance matrix (Σ_s = Σ for every state s in the class), the maximum likelihood estimate of the regression matrix reduces to a least squares estimate.
Assumption: each state has only one mixture component.

1.1 Least squares regression (2 of 3)
If, in addition, each frame is assigned to exactly one mixture component (state), the occupation probabilities become 0 or 1. The objective therefore reduces to minimising the total squared error

  E = sum_t (o_t - W ξ_t)^T (o_t - W ξ_t),

where o_t is the observation vector at time t and ξ_t is the extended mean vector of the state occupied at time t.

1.1 Least squares regression (3 of 3)
Define the matrices X and Y as

  X = [ξ_1 ξ_2 ... ξ_T]   (extended mean vectors, one column per frame)
  Y = [o_1 o_2 ... o_T]   (observation vectors, one column per frame).

Then the previous equation simplifies to minimising the squared error between Y and W X, and the least squares estimate is

  W = Y X^T (X X^T)^{-1}.

1.2 Regression matrix structure
The n x (n+1) regression matrix changes the mean in two ways:
– the first column provides an offset element (shifting);
– the remaining n x n portion provides a scaling based on the current mean values.

      [ w_11   w_12  ...  w_1,n+1 ]
  W = [  ...    ...         ...   ]
      [ w_n1   w_n2  ...  w_n,n+1 ]

Here the first column (w_11, ..., w_n1), applied to the offset term of the extended mean vector, provides the shift.

1.2.1 Full matrix without offset
If the offset term ω in the extended mean vector is set to 0, the effect of the offset (shifting) is ignored.
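As a rough illustration of the least squares estimate and the offset/scaling split described above, here is a NumPy sketch. All numbers are synthetic and the variable names are my own; this is not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 3, 500                       # feature dimension, number of frames

# "True" n x (n+1) regression matrix (made-up numbers); its first
# column acts on the offset term of the extended mean vector.
W_true = rng.normal(size=(n, n + 1))

# Extended mean vectors xi = [omega, mu_1, ..., mu_n]^T with omega = 1,
# one column per frame (each frame assigned to exactly one mean).
mu = rng.normal(size=(n, T))
X = np.vstack([np.ones((1, T)), mu])                  # (n+1) x T
Y = W_true @ X + 0.01 * rng.standard_normal((n, T))   # observations, n x T

# Least squares estimate: W_hat = Y X^T (X X^T)^{-1}
W_hat = Y @ X.T @ np.linalg.inv(X @ X.T)
print(np.allclose(W_hat, W_true, atol=0.05))          # True: W is recovered

# Setting the offset term omega = 0 removes the first column's
# contribution, leaving scaling only (no shift):
X0 = X.copy()
X0[0] = 0.0
mean_no_offset = W_true @ X0
```

With enough frames the noisy observations recover W_true to within the noise level; dropping ω shows that only the n x n scaling portion then acts on the means.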
If there is sufficient inter-dependency between the mean components, the effect of ignoring the offset column may be small.

1.2.2 Using a diagonal scaling matrix (1 of 3)
If the scaling portion of the regression matrix is diagonal, all features are treated as independent: each adapted mean component depends only on the corresponding original component (plus the offset term).

1.2.2 Using a diagonal scaling matrix (2 of 3)
Rewrite the nonzero entries of the regression matrix as a vector ŵ_s. The equation to be differentiated can then be written in terms of ŵ_s and n x n matrices built from the adaptation statistics.

1.2.2 Using a diagonal scaling matrix (3 of 3)
Take the resulting expression h(o_t, s), differentiate it with respect to ŵ_s, set the derivative to zero, and solve for ŵ_s in closed form. The solution extends to multiple states per regression class, and further to multiple mixtures and observations.

2. Implementation issues
– Estimation of the transforms
– Mixture component alignments
– Computation

2.1 Estimation of the transforms
The estimation proceeds in two phases: accumulation of statistics, then using the statistics to calculate the transforms.

  for each speech frame in the adaptation data:
      for each mixture component in the adaptation data:
          find the probability of the frame belonging to the mixture
          determine the W associated with the mixture
          record the probability, frame and mixture in an accumulator
  for each regression matrix W:
      use the accumulator contents to calculate W

2.2 Mixture component alignments (1 of 2)
Apply the forward/backward algorithm: the forward pass computes α and the backward pass computes β.

2.2 Mixture component alignments (2 of 2)
The total likelihood of the models generating the observation sequence can be found from either α or β. The probability of occupying state i at time t is

  γ_t(i) = α_t(i) β_t(i) / P(O | λ).

The probability of occupying mixture component j of state i at time t follows by weighting γ_t(i) with that component's share of the state's output probability.

2.3 Computation
Adapting M mixture components using R regression matrices, with observation vectors of length n, requires approximately:
– 4n^2 M + nM + n^2 multiplications
– nR matrix inversions

By using a high degree of tying (fewer regression classes), the number of inversions can be kept small. The G matrices may be ill-conditioned, so SVD is suggested for computing the matrix inversions.
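The forward/backward quantities of Section 2.2 can be sketched for a toy discrete-output HMM. All probabilities below are invented for illustration; a real system would use Gaussian mixture output densities and would further split γ across mixture components.

```python
import numpy as np

# Toy discrete-output HMM: 2 states, 3 output symbols (made-up numbers).
pi = np.array([0.6, 0.4])              # initial state probabilities
A = np.array([[0.7, 0.3],              # state transition matrix
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],         # emission probabilities per state
              [0.1, 0.3, 0.6]])
obs = [0, 1, 2, 1, 0]                  # observation sequence
T, N = len(obs), len(pi)

# Forward pass: alpha[t, i] = P(o_1..o_t, q_t = i)
alpha = np.zeros((T, N))
alpha[0] = pi * B[:, obs[0]]
for t in range(1, T):
    alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]

# Backward pass: beta[t, i] = P(o_{t+1}..o_T | q_t = i)
beta = np.zeros((T, N))
beta[-1] = 1.0
for t in range(T - 2, -1, -1):
    beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])

# The total likelihood is the same whether computed from alpha or beta.
lik_from_alpha = alpha[-1].sum()
lik_from_beta = (pi * B[:, obs[0]] * beta[0]).sum()

# State occupation probability: gamma[t, i] = alpha[t,i]*beta[t,i] / P(O)
gamma = alpha * beta / lik_from_alpha
print(np.allclose(lik_from_alpha, lik_from_beta))   # True
print(np.allclose(gamma.sum(axis=1), 1.0))          # True: rows sum to 1
```

The γ values computed this way are exactly the per-frame occupation probabilities that the accumulators in Section 2.1 collect before each regression matrix is estimated.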
3. Evaluation: experimental setup
– ARPA Resource Management (RM1) database
– 25 ms frames with a 10 ms frame advance; 39-dimensional MFCC feature vectors
– Speaker-independent (SI) system trained on 3990 utterances
– Triphone models, 2 mixture components per state, single-state silence model
– Variable amount of adaptation data (10–600 utterances)
– 40 test utterances from each of the 12 test speakers
– Adaptation uses forward/backward alignment
– Regression tree built by a top-down approach, split according to phonetic definitions (maximum 47 classes)
– Silence model was not adapted
– Static supervised adaptation, 1 iteration

3.1 Exp 1: number of regression classes
Note: 40 utterances were used for adaptation. [Results figure; the best result is marked.]

3.2 Exp 2: amount of adaptation data
[Results figure.]

3.3 Exp 3: diagonal regression matrix
– 10 adaptation utterances were used
– With enough data, the full matrix is more effective than the diagonal matrix
– For the full matrix, the main diagonal terms are dominant, but the off-diagonal terms account for the inter-dependencies between components
– More data is needed for the diagonal matrix to achieve similar performance; the full matrix is therefore preferred

4. Conclusions
– MLLR uses a small amount of adaptation data to adapt a large number of mixture means
– Both full and diagonal regression matrices can improve the performance of an SI system, but the full matrix is more effective
– Best performance is achieved by matching the number of regression classes to the amount of adaptation data