Score Normalization Techniques for Text-Independent Speaker Verification
Sai Nitish Satyavolu, IIT Kanpur

Overview of Speaker Verification

Figure 1: Block diagram of a speaker verification system. [1]

• Background model: A model trained with a large amount of data; used to represent general, speaker-independent characteristics.
• Speaker model: A model trained with speech data from a particular speaker; used to represent the characteristics of that speaker.
• Score normalization: A transformation applied to the scores to improve the performance of the speaker verification system.
• Decision making: Accepting or rejecting the claimant speaker based on a decision threshold.

Need for Score Normalization

Accepting or rejecting a claimant speaker purely on the basis of the log-likelihood scores from the UBM (background model) and the hypothesized speaker model is highly error-prone, for several reasons:

1. Limited modeling capability of the models.
2. Variation in inter-speaker scores. The true and imposter score distributions of one speaker may not coincide with the corresponding distributions of another speaker. This leads to large errors when the decision is made with a global, speaker-independent threshold.
3. Variation in inter-session scores. Test utterances from the same speaker may be scored differently against a model corresponding to the same or a different speaker. Hence, the imposter and true score distributions can overlap.

Problems 2 and 3 can be mitigated to some extent using score normalization techniques.

Score Normalization Techniques

• Z-norm (zero normalization): Compensates for inter-speaker score variation. Normalization statistics are computed offline. Allows the use of a global, speaker-independent decision threshold.
• T-norm (test normalization): Compensates for inter-session score variation. Normalization statistics are computed online.
It attempts to reduce the overlap between the imposter and true score distributions of each client speaker.

Z-norm (Zero Normalization)

• Allows a global, speaker-independent decision threshold by aligning the imposter score distributions of all client speakers to zero mean and scaling them to unit variance.
• The imposter score distributions are aligned rather than the true score distributions because, in practice, the mean and variance of the true score distributions cannot be estimated accurately, owing to the scarcity of true utterances.
• Imposter utterances, on the other hand, are available in unlimited supply, so the normalization statistics of the imposter distributions can be estimated accurately.

Z-norm – Implementation

Normalization formula:

    LLR_norm(Y_test, S) = [LLR(Y_test, S) − μ(S)] / σ(S)

where
    LLR    = log-likelihood ratio
    Y_test = test utterance
    S      = claimed speaker
    μ, σ   = normalization statistics

To estimate the normalization statistics, a set of imposter trials is scored against the target model, and the mean and standard deviation of the resulting scores are computed.

Algorithm (pseudocode):

    % Estimate normalization statistics
    for each client_speaker do
        I[] = score imposter trials against client_speaker;
        μ(client_speaker) = mean(I);
        σ(client_speaker) = std(I);
    end

    % Perform score normalization
    for each test_trial do
        LLR = LogLikelihood(test_trial, S);
        LLR_norm = (LLR − μ(S)) / σ(S);
    end

T-norm (Test Normalization)

• In T-norm, the normalization statistics are estimated for each test utterance individually; hence, it is an online approach.
• The normalization statistics are estimated on the same utterance that is scored against the target speaker. An acoustic mismatch between the test utterance and the normalization utterances, which is possible in Z-norm, is therefore avoided.
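Both normalizations reduce to subtracting a mean and dividing by a standard deviation; only the source of the statistics differs. A minimal NumPy sketch of this shared computation, with made-up scores and hypothetical helper names (this is not the code used in the experiment below):

```python
import numpy as np

def norm_stats(imposter_scores):
    """Mean and standard deviation of a set of imposter scores.
    For Z-norm these are scores of imposter utterances against the
    client model (computed offline, per client speaker); for T-norm
    they are scores of the test utterance against a fixed set of
    imposter models (computed online, per test trial)."""
    s = np.asarray(imposter_scores, dtype=float)
    return s.mean(), s.std()

def normalize(llr, mu, sigma):
    """Apply mean/variance normalization to a raw LLR score."""
    return (llr - mu) / sigma

# --- Z-norm: made-up imposter-trial scores vs. one client model ---
z_mu, z_sigma = norm_stats([-1.2, -0.8, -1.5, -0.9, -1.1])
z_score = normalize(0.4, z_mu, z_sigma)  # normalized test-trial score

# --- T-norm: made-up scores of one test utterance vs. imposter models ---
t_mu, t_sigma = norm_stats([-0.5, -1.0, -0.7, -0.6, -0.7])
t_score = normalize(0.4, t_mu, t_sigma)
```

The only structural difference between the two techniques is when `norm_stats` runs: once per client speaker before deployment for Z-norm, versus once per test utterance at verification time for T-norm.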
• To estimate the normalization statistics, each test utterance is scored against a fixed set of imposter speaker models, and the mean and standard deviation of this set of scores are computed.
• Better performance can be achieved by allowing each client speaker its own set of imposter models. This technique is called speaker-adaptive T-norm (AT-norm).

T-norm – Implementation

Normalization formula:

    LLR_norm(Y_test, S) = [LLR(Y_test, S) − μ(Y_test)] / σ(Y_test)

where
    LLR    = log-likelihood ratio
    Y_test = test utterance
    S      = claimed speaker
    μ, σ   = normalization statistics

Note that in T-norm the normalization statistics depend on the test utterance, in contrast to Z-norm, where they depend on the claimed speaker identity.

Algorithm (pseudocode):

    for each test_trial do
        % Estimate normalization statistics
        I[] = score test_trial against all imposter models;
        μ(test_trial) = mean(I);
        σ(test_trial) = std(I);

        % Perform score normalization
        LLR = LogLikelihood(test_trial, S);
        LLR_norm = (LLR − μ(test_trial)) / σ(test_trial);
    end

Experiment

The Z-norm and T-norm techniques were evaluated on the NIST 2004 SRE. [2] Details of the experiment:

    Features:                  MFCC
    Model:                     128-mixture GMM
    Scoring measure:           log-likelihood ratio
    Number of client speakers: 40 (all female)

Results

Figure 2: DET curves for the NIST 2004 SRE.

    Method     EER (%)    Min. DCF
    Baseline   17.8       0.1717
    Z-norm     20.0       0.1854
    T-norm     14.7       0.1438

Figure 3: Imposter and true score distributions for the baseline and Z-norm techniques.

Conclusion

• As evident from the DET plot, T-norm gives a significant performance improvement over the baseline, whereas Z-norm does not: its EER is higher than the baseline value.
• Better results could be obtained by fusing the two normalization techniques described here.
• Speaker-adaptive T-norm [1] performs better than both T-norm and Z-norm.
The success of AT-norm depends on how the client-specific imposter speaker sets are selected.

References

[1] Srikanth N. and Rajesh M. Hegde, "Online client-wise cohort set selection for speaker verification using iterative normalization of confusion matrices," Proc. European Signal Processing Conference (EUSIPCO-2010), pp. 576–580, Aalborg, Denmark, August 2010.
[2] NIST Speech Group, "The 2004 NIST Speaker Recognition Evaluation Plan," www.itl.nist.gov/iad/mig/tests/sre/2004/