Speaker Recognition Using Deep Neural Networks with Reduced Complexity
The goal of this research is to develop a small-footprint, text-independent speaker recognition system for a closed set of a relatively small number of speakers (e.g., 15-20). The problem was inspired by a potential application on the International Space Station (ISS): determining which astronaut is speaking at a given time. This research uses the so-called direct DNN approach, in which the posterior probabilities at the output layer are used to determine the identity of the speaker. Consistent with the small-footprint design goal, a baseline DNN model was developed with just enough hidden layers and hidden units per layer to keep the total number of parameters low, and with careful design to avoid the common problem of overfitting and to optimize algorithmic aspects including context-based training, activation functions, regularization, and learning rate. This baseline model was evaluated on two commercially available databases, the clean-speech TIMIT database and the multi-handset HTIMIT database, as well as on a noise-added TIMIT database that we created using four types of noise at three different signal-to-noise ratios (SNRs). The speaker recognition accuracy of the baseline is 100% for TIMIT, 96.75% for HTIMIT, and 100%, 98.75%, and 98.125% for the noise-added TIMIT database at 20 dB, 10 dB, and 5 dB SNR, respectively. This demonstrates that the baseline system performs error-free on relatively clean speech and robustly under telephone-handset variability and acoustic background noise. The baseline model has a total of 2.4M parameters. The rest of the work was devoted to reducing the complexity of the DNN system by reducing the number of parameters without causing significant loss in performance. Initially, we used an adaptive pruning method in which the parameters of all the layers are pruned simultaneously and the pruned system is retrained. The performance of this technique was evaluated on all the above-mentioned speech databases.
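The core operation of that pruning step, zeroing the smallest-magnitude weights in every layer at once before retraining, can be sketched as follows. This is a minimal NumPy illustration under assumed names (`prune_all_layers`, a list of weight matrices, a single sparsity level); it is not the thesis code, and the adaptive method additionally retrains the pruned network.

```python
import numpy as np

def prune_all_layers(weights, sparsity):
    """Magnitude-prune all layers simultaneously: in each weight
    matrix, zero out the `sparsity` fraction of entries with the
    smallest absolute value.  Returns pruned copies and the binary
    masks that a subsequent retraining pass would keep fixed."""
    pruned, masks = [], []
    for W in weights:
        k = int(sparsity * W.size)          # number of weights to remove
        if k == 0:
            mask = np.ones_like(W, dtype=bool)
        else:
            thresh = np.sort(np.abs(W).ravel())[k - 1]
            mask = np.abs(W) > thresh       # keep only larger magnitudes
        masks.append(mask)
        pruned.append(W * mask)
    return pruned, masks
```

In a full pipeline the masks are reapplied after every retraining update so that pruned connections stay at zero.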
We then developed a novel, enhanced pruning technique called Sequential Layer-Specific (SLS) pruning. SLS pruning is performed sequentially in multiple stages and in a layer-specific manner, with retraining after each pruning stage, while ensuring no or only minor performance loss at each stage. The SLS pruning technique is significantly more effective than the adaptive pruning technique in terms of both model-complexity reduction and speaker recognition performance loss. For the SLS-pruned model, the speaker recognition accuracy is 100% for the TIMIT database with a 31X complexity reduction, and 94.75% for the multi-handset HTIMIT database with a 4.5X complexity reduction. For the noise-added TIMIT database, a 1.7X complexity reduction incurs no additional drop in speaker recognition accuracy relative to the baseline DNN at both 5 dB and 10 dB SNR, and 99.37% accuracy is achieved with a 3X complexity reduction at 20 dB SNR. For cases where the speaker recognition accuracy is less than 100%, a higher "accuracy" is obtained using the "Top-two" performance metric, in which recognition success is declared if the correct speaker lies in the top two choices predicted by the DNN model.
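The staged, layer-by-layer control loop described above can be sketched as follows. This is an illustrative outline, not the thesis implementation: `sls_prune`, `magnitude_prune`, and the `evaluate`/`retrain` callables are assumed names standing in for the real accuracy measurement and fine-tuning steps, and the stage size and tolerance are arbitrary.

```python
import numpy as np

def magnitude_prune(W, sparsity):
    """Zero out the `sparsity` fraction of smallest-magnitude entries of W."""
    k = int(sparsity * W.size)
    if k == 0:
        return W.copy()
    thresh = np.sort(np.abs(W).ravel())[k - 1]
    return np.where(np.abs(W) > thresh, W, 0.0)

def sls_prune(weights, evaluate, retrain, step=0.1, tol=0.005):
    """Sequential Layer-Specific (SLS) pruning sketch: prune one layer
    at a time, in stages of increasing sparsity, retraining after each
    stage and keeping a stage only if accuracy stays within `tol` of
    the unpruned baseline."""
    baseline = evaluate(weights)
    for i in range(len(weights)):           # layer-specific: one layer at a time
        sparsity = step
        while sparsity < 1.0:
            candidate = list(weights)
            candidate[i] = magnitude_prune(weights[i], sparsity)
            retrain(candidate)              # recover performance after pruning
            if evaluate(candidate) < baseline - tol:
                break                       # reject stage; keep previous weights
            weights = candidate             # accept stage, try deeper pruning
            sparsity += step
    return weights
```

Pruning each layer to its own tolerated sparsity is what lets SLS reach very different compression factors per database (e.g., 31X on clean TIMIT versus 4.5X on HTIMIT) without exceeding the allowed accuracy loss.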