Two-Dimensional Convolutional Recurrent Neural Networks for Speech Activity Detection

Date

2019-09

Publisher

International Speech Communication Association

Type

Conference

Peer reviewed

Yes

Abstract

Speech Activity Detection (SAD) plays an important role in mobile communications and automatic speech recognition (ASR). Developing efficient SAD systems for real-world applications is a challenging task due to the presence of noise. We propose a new approach to SAD where we treat it as a two-dimensional multilabel image classification problem. To classify the audio segments, we compute their Short-time Fourier Transform spectrograms and classify them with a Convolutional Recurrent Neural Network (CRNN), traditionally used in image recognition. Our CRNN uses a sigmoid activation function, max-pooling in the frequency domain, and a convolutional operation as a moving average filter to remove misclassified spikes. On the development set of Task 1 of the 2019 Fearless Steps Challenge, our system achieved a decision cost function (DCF) of 2.89%, a 66.4% improvement over the baseline. Moreover, it achieved a DCF score of 3.318% on the evaluation dataset of the challenge, ranking first among all submissions.
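The spike-removal and scoring steps mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract places the moving average inside the network as a convolutional operation, whereas the sketch below applies it as plain post-processing on frame-level speech probabilities. The function names, the window width, the 0.5 threshold, and the DCF weights (0.75 for missed speech, 0.25 for false alarms, the weighting commonly associated with the Fearless Steps SAD task) are all assumptions that do not appear in this record.

```python
def moving_average(probs, width=3):
    """Smooth frame-level speech probabilities with a centred moving average.

    The window is truncated at the sequence edges (assumed behaviour).
    """
    half = width // 2
    smoothed = []
    for i in range(len(probs)):
        lo = max(0, i - half)
        hi = min(len(probs), i + half + 1)
        window = probs[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed


def decide(probs, threshold=0.5):
    """Turn probabilities into binary speech (1) / non-speech (0) labels."""
    return [1 if p >= threshold else 0 for p in probs]


def dcf(reference, hypothesis, w_fn=0.75, w_fp=0.25):
    """Weighted sum of miss and false-alarm rates (weights are assumed)."""
    speech = sum(reference)
    nonspeech = len(reference) - speech
    fn = sum(1 for r, h in zip(reference, hypothesis) if r == 1 and h == 0)
    fp = sum(1 for r, h in zip(reference, hypothesis) if r == 0 and h == 1)
    p_fn = fn / speech if speech else 0.0
    p_fp = fp / nonspeech if nonspeech else 0.0
    return w_fn * p_fn + w_fp * p_fp


# Toy data: five non-speech frames followed by five speech frames,
# with one isolated false spike at index 2.
reference = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
raw_probs = [0.1, 0.1, 0.9, 0.1, 0.1, 0.9, 0.9, 0.9, 0.9, 0.9]

raw_labels = decide(raw_probs)
smooth_labels = decide(moving_average(raw_probs, width=3))

print("DCF without smoothing:", dcf(reference, raw_labels))
print("DCF with smoothing:   ", dcf(reference, smooth_labels))
```

In this toy case the isolated spike is averaged below the threshold, so the smoothed labels match the reference. A wider window suppresses longer spikes but also blurs genuine speech boundaries, so the width would need tuning on development data.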

Keywords

Speech activity detection, Voice activity detection, Convolutional recurrent neural networks

Citation

Vafeiadis, A., Fanioudakis, E., Potamitis, I., Votis, K., Giakoumis, D., Tzovaras, D., Chen, L., Hamzaoui, R. (2019) Two-dimensional convolutional recurrent neural networks for speech activity detection. 20th Annual Conference of the International Speech Communication Association (INTERSPEECH 2019), Graz, Austria, Sep. 2019.
