Lightweight CNN for Robust Voice Activity Detection

Published in International Conference on Speech and Computer (SPECOM), 2020

Recommended citation: T. Alam and A. Khan, “Lightweight CNN for Robust Voice Activity Detection,” in International Conference on Speech and Computer (SPECOM). Springer, 2020, pp. 1–12. https://link.springer.com/chapter/10.1007/978-3-030-60276-5_1

Abstract. Voice activity detection (VAD) is an important preprocessing step in many speech-related applications. Convolutional neural networks (CNNs) are widely used for audio classification tasks and have been adopted successfully for VAD. In this work, we propose a lightweight CNN architecture for real-time voice activity detection. We use strong data augmentation and regularization to improve the performance of the model. Using a knowledge distillation approach, we transfer knowledge from a larger CNN model, which leads to better generalization and more robust performance of the proposed CNN architecture in noisy conditions. The resulting network obtains a 62.6% relative improvement in equal error rate (EER) compared to a deep feedforward neural network (DNN) of comparable parameter count on a noisy test dataset.
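The abstract does not spell out the distillation objective, so the following is only a minimal sketch of one common formulation (temperature-softened teacher targets combined with the usual cross-entropy on hard labels), not necessarily the loss used in the paper. The function names, the two-class speech/non-speech logits, and the `temperature` and `alpha` values are all assumptions made for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits, with optional temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Hypothetical per-frame distillation loss (assumed form, not from the paper).

    Combines:
      - KL(teacher || student) on temperature-softened distributions,
        scaled by temperature**2 as is conventional, and
      - cross-entropy between the student's prediction and the hard label.
    `alpha` weights the soft-target term against the hard-label term.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student_soft = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps)
             for pt, ps in zip(p_teacher, p_student_soft) if pt > 0)

    p_student = softmax(student_logits)
    cross_entropy = -math.log(p_student[label])

    return alpha * temperature ** 2 * kl + (1 - alpha) * cross_entropy

# Example frame: logits are ordered [non-speech, speech]; label 1 means speech.
loss = distillation_loss(student_logits=[0.2, 0.9],
                         teacher_logits=[-1.0, 2.5],
                         label=1)
```

In this formulation the soft-target term vanishes when the student reproduces the teacher's logits exactly, leaving only the weighted hard-label cross-entropy.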