Speaker Adaptation of HMM using Linear Regression (Part 2 of 2)
Authors: C.J. Leggetter, P.C. Woodland
Presented by: 陳亮宇
Date: 2007/02/13
Technical report: CUED/F-INFENG/TR.181, June 1994
Outline
1. Special cases of MLLR
2. Implementation issues
3. Evaluation of MLLR adaptation
4. Conclusions
1. Special cases of MLLR

- Least squares regression
- Reducing the number of regression parameters:
  – Regression matrix structure
  – Full matrix without offset
  – Diagonal scaling matrix
1.1 Least squares regression (1 of 3)

- If all states of a regression class share the same covariance matrix, the maximum likelihood estimate of the regression matrix reduces to a least squares estimate.
- Assumption: each state has only one mixture component.
1.1 Least squares regression (2 of 3)

- If each frame is assigned to exactly one mixture component (state), the occupation probabilities reduce to 0/1 indicators.
- The estimation equation therefore simplifies to a sum over the frames assigned to each mixture.
1.1 Least squares regression (3 of 3)

- Define the matrices X and Y (X collects the extended mean vectors, Y the corresponding observation vectors).
- The previous equation can then be written compactly in matrix form.
- Solving it gives the least squares estimate.
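The rendered equations did not survive extraction. As a hedged reconstruction from the slide's definitions, with X = [ξ₁ ⋯ ξ_T] holding the extended mean vectors as columns and Y = [o₁ ⋯ o_T] the observations, the least squares estimate takes the standard form:

```latex
% Hedged reconstruction; X is (n+1) x T, Y is n x T
\hat{W} \;=\; \arg\min_{W}\, \lVert Y - W X \rVert_F^{2}
        \;=\; Y X^{\top} \left( X X^{\top} \right)^{-1}
```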
1.2 Regression matrix structure

- The n x (n+1) regression matrix changes the mean in 2 ways:
  – The first column provides an offset element (shifting).
  – The remaining n x n portion provides a scaling based on the current mean values.

        | w11   w12   ...  w1,n+1 |
  W  =  |  .     .    ...    .    |     (first column: offset)
        | wn1   wn2   ...  wn,n+1 |
1.2.1 Full matrix without offset

- If the offset term ω = 0, the effect of the offset (shifting) is ignored.
- If there is sufficient inter-dependency between the mean components, the effect of ignoring the offset column may be small.
1.2.2 Using a diagonal scaling matrix (1 of 3)

- If the scaling portion of the regression matrix is diagonal, all features are treated as independent.
- The adaptation equation then decouples into a per-dimension scale and offset.
1.2.2 Using a diagonal scaling matrix (2 of 3)

- Rewrite the regression matrix as a vector (stacking the diagonal scaling terms and the offset terms).
- The equation for differentiation then becomes a linear system whose accumulator terms are n x n matrices.
1.2.2 Using a diagonal scaling matrix (3 of 3)

- Take the function h(o_t, s), differentiate with respect to ŵ_s, and set the derivative to zero.
- Solve for ŵ_s.
- This extends to multiple states per regression class, and further to multiple mixtures and observations.
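The closed-form solution on the slide was lost with the equation images, but the decoupling it describes can be sketched per dimension. In this hedged numpy sketch (the hard frame-to-mixture alignment, the synthetic data, and all names are illustrative assumptions), each dimension i is solved by its own 2x2 normal equations:

```python
import numpy as np

rng = np.random.default_rng(1)
T, M, n = 400, 5, 3

mus = rng.standard_normal((M, n))             # mixture means in one regression class
true_w = np.array([1.5, 0.8, 1.1])            # per-dimension scales (diagonal of A)
true_b = np.array([0.3, -0.2, 0.5])           # per-dimension offsets
assign = rng.integers(0, M, size=T)           # hard frame-to-mixture alignment
obs = true_w * mus[assign] + true_b + 0.01 * rng.standard_normal((T, n))

# With a diagonal scaling matrix, dimension i decouples into a scalar
# regression o_{t,i} ~ w_i * mu_{m(t),i} + b_i, solved per dimension.
w_hat = np.empty(n)
b_hat = np.empty(n)
for i in range(n):
    X = np.column_stack([mus[assign, i], np.ones(T)])  # design: [mu_i, 1]
    G = X.T @ X                                        # small 2x2 "G matrix"
    z = X.T @ obs[:, i]
    w_hat[i], b_hat[i] = np.linalg.solve(G, z)
```

Note that separating the scale from the offset requires mixtures with distinct mean values in each dimension; with a single mixture the 2x2 system would be singular.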
2. Implementation Issues

- Estimation of the transforms
- Mixture component alignments
- Computation
2.1 Estimation of the transforms

The estimation proceeds in two phases: accumulation of statistics, then using the statistics to calculate the transform.

Accumulation of statistics:

    For each speech frame in the adaptation data
        For each mixture component in the adaptation data
        {
            Find probability of frame belonging to mixture
            Determine W associated with the mixture
            Record probability, frame, mixture in accumulator
        }

Using the statistics to calculate the transform:

    For each regression matrix W
        Use accumulator contents to calculate W
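The two phases above can be sketched in numpy. This is a hedged sketch assuming identity covariances, in which case the ML estimate reduces to the least-squares form W = Z G⁻¹; the function and variable names are illustrative, not from the paper:

```python
import numpy as np

def estimate_mllr_transform(frames, means, gammas):
    """Estimate one n x (n+1) MLLR mean transform for a regression class.

    Hedged sketch assuming identity covariances, so the ML estimate
    reduces to the least-squares form  W = Z G^{-1}.

    frames: (T, n) observation vectors
    means:  (M, n) mixture means in this regression class
    gammas: (T, M) frame-to-mixture occupation probabilities
    """
    T, n = frames.shape
    M = means.shape[0]
    xis = np.column_stack([np.ones(M), means])     # extended means, (M, n+1)

    # Phase 1: accumulate statistics over all frames and mixtures.
    Z = np.zeros((n, n + 1))
    G = np.zeros((n + 1, n + 1))
    for t in range(T):
        for m in range(M):
            g = gammas[t, m]                       # occupation probability
            Z += g * np.outer(frames[t], xis[m])
            G += g * np.outer(xis[m], xis[m])

    # Phase 2: use the accumulated statistics to compute the transform.
    return Z @ np.linalg.inv(G)
```

Accumulating Z and G frame by frame is what the slide's accumulator records; the transform itself is computed only once, after all adaptation data has been seen.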
2.2 Mixture component alignments (1 of 2)

- Apply the forward/backward algorithm:
  – Forward probabilities α
  – Backward probabilities β
2.2 Mixture component alignments (2 of 2)

- The total likelihood of the models generating the observation sequence can be found from either α or β.
- The probability of occupying state i at time t follows from α and β.
- The probability of occupying mixture component j of state i at time t follows similarly.
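The equations themselves were lost in extraction; the standard forward/backward quantities they correspond to are, as a hedged reconstruction (with c_{ij} the mixture weights and b_{ij}(·) the component densities, both assumed notation):

```latex
% Total likelihood from the forward probabilities:
P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)

% Probability of occupying state i at time t:
\gamma_t(i) = \frac{\alpha_t(i)\,\beta_t(i)}{P(O \mid \lambda)}

% Probability of occupying mixture component j of state i at time t:
\gamma_t(i,j) = \gamma_t(i)\,
    \frac{c_{ij}\, b_{ij}(o_t)}{\sum_{k} c_{ik}\, b_{ik}(o_t)}
```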
2.3 Computation

- For adapting M mixture components using R regression matrices, with an observation vector of length n, the computations required are:
  – 4n²M + nM + n² multiplications
  – nR matrix inversions
- By using a high degree of tying (fewer regression classes), the number of inversions can be kept small.
- The G matrices may be ill-conditioned:
  – SVD is suggested for computing the matrix inversion.
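The SVD suggestion can be illustrated with numpy; the tiny 2x2 G below is a contrived ill-conditioned example and the truncation tolerance is an assumed choice, not a value from the paper:

```python
import numpy as np

# A contrived ill-conditioned "G matrix": nearly rank-deficient accumulator.
G = np.array([[1.0, 1.0 - 1e-12],
              [1.0 - 1e-12, 1.0]])
z = np.array([2.0, 2.0])

# Direct inversion amplifies noise when G is near-singular; an SVD-based
# pseudo-inverse truncates the tiny singular values instead.
U, s, Vt = np.linalg.svd(G)
tol = s.max() * 1e-8                    # assumed truncation tolerance
s_inv = np.where(s > tol, 1.0 / s, 0.0)
G_pinv = Vt.T @ (s_inv[:, None] * U.T)

w = G_pinv @ z                          # stable solution of G w = z
```

`np.linalg.pinv(G, rcond=1e-8)` performs the same truncated-SVD pseudo-inversion in one call.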
3. Evaluation: experimental setup

- ARPA Resource Management RM1 database
- 25 ms frames, with a frame advance of 10 ms
- 39 MFCC coefficients
- The SI system was trained using 3,990 utterances
- Triphone models; each state has 2 mixtures
- Single-state silence model
- Variable amount of adaptation utterances (10 ~ 600)
- 40 utterances from each of the 12 speakers for testing
- Adaptation uses forward/backward alignment
- Regression tree built by a top-down approach, split according to phonetic definitions (max 47 classes)
- The silence model was not adapted
- Static supervised adaptation, 1 iteration
3.1 Exp 1: number of regression classes

[Results figure: performance versus number of regression classes, with the best result marked. Note: 40 utterances were used for adaptation.]
3.2 Exp 2: amount of adaptation data
3.3 Exp 3: diagonal regression matrix

- 10 utterances were used.
- With enough data, the full matrix is more effective than the diagonal matrix.
- For the full matrix, the main diagonal terms are dominant, but the off-diagonal terms are responsible for the interdependencies between components.
- More data is needed for the diagonal matrix to achieve similar performance.
- Therefore, the full matrix is better.
4. Conclusions

- MLLR uses a small amount of adaptation data to adapt a large number of mixture means.
- Both full and diagonal regression matrices can improve performance of the SI system, but the full matrix is more effective.
- Best performance is achieved by matching the number of regression classes to the amount of adaptation data.