Score Normalization techniques for text-independent speaker verification
Sai Nitish Satyavolu
IIT Kanpur
Overview of Speaker Verification
Figure 1: Block diagram of a speaker verification system. [1]
• Background model: A model trained with a large amount of data; used to represent general, speaker-independent characteristics.
• Speaker model: A model trained with speech data from a particular speaker; used to represent the characteristics of that speaker.
• Score normalization: A transformation applied to the scores to improve the performance of the speaker verification system.
• Decision making: Accepting or rejecting the claimant speaker based on a decision threshold.
Need for score normalization
Accepting or rejecting a claimant speaker purely based on the log likelihood scores from the UBM (background model) and the hypothesized speaker model is highly error-prone. This is due to several reasons:
1. Limited modeling capability of the models.
2. Variations in inter-speaker scores.
The true and imposter score distributions of a particular speaker may not coincide with the respective score distributions of another speaker. This leads to large errors when decisions are made based on a global, speaker-independent threshold.
3. Variations in inter-session scores.
Test utterances from the same speaker may be scored differently against a model corresponding to the same or a different speaker. Hence, there can be overlap between the imposter and true score distributions.
Problems 2 and 3 can be mitigated to some extent using score normalization techniques.
Score normalization techniques
• Z-norm (or Zero normalization)
  – Compensates for inter-speaker score variation.
  – Normalization statistics are computed offline.
  – Allows use of a global, speaker-independent decision threshold.
• T-norm (or Test normalization)
  – Compensates for inter-session score variation.
  – Normalization statistics are computed online.
  – Attempts to reduce the overlap between the imposter and true score distributions of each client speaker.
Z-norm (or Zero Normalization)
• Allows us to use a global, speaker-independent decision
threshold by aligning the imposter score distributions of
all client speakers to zero mean and scaling them to unit
variance.
• Here, imposter score distributions are aligned instead of
true score distributions as practically it is not possible to
accurately compute the mean and variance parameters of
the true score distributions, owing to the non-availability of
sufficient true utterances.
• On the other hand, there is no limit to the availability of
imposter utterances, and hence the normalization statistics
of imposter distributions can be estimated accurately.
Z-norm - Implementation
Normalization formula:
LLRnorm(Ytest, S) = (LLR(Ytest, S) − μ(S)) / σ(S)
Where,
LLR = Log likelihood ratio
Ytest = Test utterance
S = Claimed speaker
𝜇, 𝜎 = Normalization statistics
To estimate the normalization statistics, a set of imposter trials is scored against the target model, and the mean and standard deviation of these scores are computed.
Z-norm – Implementation
Algorithm (In pseudo code):
%Estimate normalization statistics
for each client_speaker do
    I[] = score imposter trials against client_speaker;
    μ(client_speaker) = mean(I);
    σ(client_speaker) = std(I);
end
%Perform score normalization
for each test_trial do
    LLR = LogLikelihood(test_trial, S);
    LLRnorm = (LLR − μ(S))/σ(S);
end
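The pseudocode above can be sketched in Python as follows. This is only a minimal illustration: the imposter scores are synthetic numbers, and a real system would obtain them by scoring actual imposter trials against the client's model.

```python
import statistics

def estimate_znorm_stats(imposter_scores):
    """Z-norm statistics for one client speaker: mean and standard
    deviation of imposter-trial scores against that speaker's model
    (computed offline, once per client)."""
    mu = statistics.mean(imposter_scores)
    sigma = statistics.stdev(imposter_scores)
    return mu, sigma

def znorm(raw_llr, mu, sigma):
    """Normalize a raw LLR score with the claimed speaker's statistics."""
    return (raw_llr - mu) / sigma

# Hypothetical imposter scores against one client model (offline step)
imposter_scores = [-2.1, -1.7, -2.5, -1.9, -2.3, -1.6]
mu, sigma = estimate_znorm_stats(imposter_scores)

# Online step: normalize the raw LLR of a test trial
print(znorm(0.4, mu, sigma))
```

Because μ and σ are fixed per client, the normalized imposter distribution of every client has zero mean and unit variance, which is what makes a single global threshold workable.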
T-norm (or Test Normalization)
• In T-norm, the normalization statistics are estimated for
each test utterance individually. Hence, it is an online
approach.
• These normalization statistics are estimated on the very utterance used for the target speaker test. Therefore, the acoustic mismatch between the test utterance and the normalization utterances that can occur in Z-norm is avoided.
• For estimating the normalization statistics, each test
utterance is scored against a fixed set of imposter speaker
models. From this set of scores, the mean and standard
deviation are computed.
• Better performance can be achieved by allowing each client
speaker to have its own set of imposter models. This
technique is called speaker-adaptive T-norm (or AT-norm).
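One simple way to obtain client-specific imposter sets, sketched below, is to keep for each client the N imposter models that score highest against that client's enrollment data. Note that this selection rule is an illustrative assumption; the method in [1] selects cohorts via iterative normalization of confusion matrices.

```python
def select_cohort(enrollment_scores, n):
    """Pick the n imposter models scoring highest against a client's
    enrollment data; a higher score means an acoustically closer imposter.
    enrollment_scores: dict mapping imposter model id -> score.
    (Illustrative selection rule, not the method of [1].)"""
    ranked = sorted(enrollment_scores, key=enrollment_scores.get, reverse=True)
    return ranked[:n]

# Hypothetical scores of four imposter models against one client's data
scores = {"imp_a": -1.9, "imp_b": -0.7, "imp_c": -2.4, "imp_d": -1.1}
print(select_cohort(scores, 2))  # → ['imp_b', 'imp_d']
```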
T-norm - Implementation
Normalization formula:
LLRnorm(Ytest, S) = (LLR(Ytest, S) − μ(Ytest)) / σ(Ytest)
Where,
LLR = Log likelihood ratio
Ytest = Test utterance
S = Claimed speaker
𝜇, 𝜎 = Normalization statistics
Note that in T-norm, the normalization statistics depend on the
test utterance. This is in contrast to Z-norm where they depend
on the claimed speaker identity.
T-norm - Implementation
Algorithm (in pseudocode):
for each test_trial do
    %Estimate normalization statistics
    I[] = score test_trial against all imposter models;
    μ(test_trial) = mean(I);
    σ(test_trial) = std(I);
    %Perform score normalization
    LLR = LogLikelihood(test_trial, S);
    LLRnorm = (LLR − μ(test_trial))/σ(test_trial);
end
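As with Z-norm, the T-norm pseudocode can be sketched in Python. Again this is a minimal illustration with synthetic numbers: `cohort_scores` stands in for the scores of the test utterance against a fixed set of imposter models, which a real system would compute at test time.

```python
import statistics

def tnorm(raw_llr, cohort_scores):
    """T-norm: normalize the raw LLR of a test utterance using the
    mean/std of that same utterance's scores against a fixed imposter
    cohort. Statistics are estimated online, per test utterance."""
    mu = statistics.mean(cohort_scores)
    sigma = statistics.stdev(cohort_scores)
    return (raw_llr - mu) / sigma

# Hypothetical scores of one test utterance against the imposter cohort
cohort_scores = [-1.2, -0.8, -1.5, -1.0, -0.9]
print(tnorm(0.6, cohort_scores))
```

Because the statistics are tied to the test utterance rather than the claimed speaker, session effects (channel, noise) shift the cohort scores along with the target score and are largely cancelled by the normalization.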
Experiment
The Z-norm and T-norm techniques were evaluated on the NIST 2004 Speaker Recognition Evaluation (SRE) data. [2]
Details of the experiment are as follows:
Features used: MFCC features
Model used: 128 mixture GMM
Scoring measure: Log Likelihood ratio
Number of client speakers: 40 (all female)
Results
Figure 2: DET curves for the NIST 2004 SRE
Results (cont.)
Method     EER (%)   Min. DCF
Baseline   17.8      0.1717
Z-norm     20.0      0.1854
T-norm     14.7      0.1438
Results (cont.)
Figure 3: Plots showing imposter and true score distributions
of baseline and Z-norm techniques.
Conclusion
• As evident from the DET plot, T-norm gives a significant performance improvement over the baseline technique. Z-norm, however, shows no improvement; its EER is higher than the baseline value.
• Better results could be obtained by fusing the two
normalization techniques described.
• Speaker adaptive T-norm [1] performs better than T-norm
and Z-norm. Its success depends on the method of
selection of client-specific imposter speaker sets.
References
[1] Srikanth N and Rajesh M Hegde, "On-line client-wise cohort set selection for speaker verification using iterative normalization of confusion matrices", pp. 576-580, 2010 European Signal Processing Conference (EUSIPCO-2010), August 2010, Aalborg, Denmark.
[2] NIST Speech Group, "The 2004 NIST Speaker Recognition Evaluation Plan", www.itl.nist.gov/iad/mig/tests/sre/2004/