TUTCRIS - Tampereen teknillinen yliopisto

Sound Event Localization and Detection of Overlapping Sources Using Convolutional Recurrent Neural Networks

Research output: peer-reviewed

Standard

Sound Event Localization and Detection of Overlapping Sources Using Convolutional Recurrent Neural Networks. / Adavanne, Sharath; Politis, Archontis; Nikunen, Joonas; Virtanen, Tuomas.

In: IEEE Journal of Selected Topics in Signal Processing, Vol. 13, No. 1, 03.2019, p. 34-48.

Harvard

Adavanne, S, Politis, A, Nikunen, J & Virtanen, T 2019, 'Sound Event Localization and Detection of Overlapping Sources Using Convolutional Recurrent Neural Networks', IEEE Journal of Selected Topics in Signal Processing, vol. 13, no. 1, pp. 34-48. https://doi.org/10.1109/JSTSP.2018.2885636

Author

Adavanne, Sharath ; Politis, Archontis ; Nikunen, Joonas ; Virtanen, Tuomas. / Sound Event Localization and Detection of Overlapping Sources Using Convolutional Recurrent Neural Networks. In: IEEE Journal of Selected Topics in Signal Processing. 2019 ; Vol. 13, No. 1. pp. 34-48.

BibTeX - Download

@article{eeca350eb1dc4dac90a9876fde79e75c,
title = "Sound Event Localization and Detection of Overlapping Sources Using Convolutional Recurrent Neural Networks",
abstract = "In this paper, we propose a convolutional recurrent neural network for joint sound event localization and detection (SELD) of multiple overlapping sound events in three-dimensional (3D) space. The proposed network takes a sequence of consecutive spectrogram time-frames as input and maps it to two outputs in parallel. As the first output, the sound event detection (SED) is performed as a multi-label classification task on each time-frame producing temporal activity for all the sound event classes. As the second output, localization is performed by estimating the 3D Cartesian coordinates of the direction-of-arrival (DOA) for each sound event class using multi-output regression. The proposed method is able to associate multiple DOAs with respective sound event labels and further track this association with respect to time. The proposed method uses separately the phase and magnitude component of the spectrogram calculated on each audio channel as the feature, thereby avoiding any method- and array-specific feature extraction. The method is evaluated on five Ambisonic and two circular array format datasets with different overlapping sound events in anechoic, reverberant and real-life scenarios. The proposed method is compared with two SED, three DOA estimation, and one SELD baselines. The results show that the proposed method is generic and applicable to any array structures, robust to unseen DOA values, reverberation, and low SNR scenarios. The proposed method achieved a consistently higher recall of the estimated number of DOAs across datasets in comparison to the best baseline. Additionally, this recall was observed to be significantly better than the best baseline method for a higher number of overlapping sound events.",
keywords = "Direction-of-arrival estimation, Estimation, Task analysis, Azimuth, Microphone arrays, Recurrent neural networks, Sound event detection, direction of arrival estimation, convolutional recurrent neural network",
author = "Sharath Adavanne and Archontis Politis and Joonas Nikunen and Tuomas Virtanen",
note = "EXT={"}Politis, Archontis{"}",
year = "2019",
month = "3",
doi = "10.1109/JSTSP.2018.2885636",
language = "English",
volume = "13",
pages = "34--48",
journal = "IEEE Journal of Selected Topics in Signal Processing",
issn = "1932-4553",
publisher = "Institute of Electrical and Electronics Engineers",
number = "1",

}

RIS (suitable for import to EndNote) - Download

TY - JOUR

T1 - Sound Event Localization and Detection of Overlapping Sources Using Convolutional Recurrent Neural Networks

AU - Adavanne, Sharath

AU - Politis, Archontis

AU - Nikunen, Joonas

AU - Virtanen, Tuomas

N1 - EXT="Politis, Archontis"

PY - 2019/3

Y1 - 2019/3

N2 - In this paper, we propose a convolutional recurrent neural network for joint sound event localization and detection (SELD) of multiple overlapping sound events in three-dimensional (3D) space. The proposed network takes a sequence of consecutive spectrogram time-frames as input and maps it to two outputs in parallel. As the first output, the sound event detection (SED) is performed as a multi-label classification task on each time-frame producing temporal activity for all the sound event classes. As the second output, localization is performed by estimating the 3D Cartesian coordinates of the direction-of-arrival (DOA) for each sound event class using multi-output regression. The proposed method is able to associate multiple DOAs with respective sound event labels and further track this association with respect to time. The proposed method uses separately the phase and magnitude component of the spectrogram calculated on each audio channel as the feature, thereby avoiding any method- and array-specific feature extraction. The method is evaluated on five Ambisonic and two circular array format datasets with different overlapping sound events in anechoic, reverberant and real-life scenarios. The proposed method is compared with two SED, three DOA estimation, and one SELD baselines. The results show that the proposed method is generic and applicable to any array structures, robust to unseen DOA values, reverberation, and low SNR scenarios. The proposed method achieved a consistently higher recall of the estimated number of DOAs across datasets in comparison to the best baseline. Additionally, this recall was observed to be significantly better than the best baseline method for a higher number of overlapping sound events.

AB - In this paper, we propose a convolutional recurrent neural network for joint sound event localization and detection (SELD) of multiple overlapping sound events in three-dimensional (3D) space. The proposed network takes a sequence of consecutive spectrogram time-frames as input and maps it to two outputs in parallel. As the first output, the sound event detection (SED) is performed as a multi-label classification task on each time-frame producing temporal activity for all the sound event classes. As the second output, localization is performed by estimating the 3D Cartesian coordinates of the direction-of-arrival (DOA) for each sound event class using multi-output regression. The proposed method is able to associate multiple DOAs with respective sound event labels and further track this association with respect to time. The proposed method uses separately the phase and magnitude component of the spectrogram calculated on each audio channel as the feature, thereby avoiding any method- and array-specific feature extraction. The method is evaluated on five Ambisonic and two circular array format datasets with different overlapping sound events in anechoic, reverberant and real-life scenarios. The proposed method is compared with two SED, three DOA estimation, and one SELD baselines. The results show that the proposed method is generic and applicable to any array structures, robust to unseen DOA values, reverberation, and low SNR scenarios. The proposed method achieved a consistently higher recall of the estimated number of DOAs across datasets in comparison to the best baseline. Additionally, this recall was observed to be significantly better than the best baseline method for a higher number of overlapping sound events.

KW - Direction-of-arrival estimation

KW - Estimation

KW - Task analysis

KW - Azimuth

KW - Microphone arrays

KW - Recurrent neural networks

KW - Sound event detection

KW - direction of arrival estimation

KW - convolutional recurrent neural network

U2 - 10.1109/JSTSP.2018.2885636

DO - 10.1109/JSTSP.2018.2885636

M3 - Article

VL - 13

SP - 34

EP - 48

JO - IEEE Journal of Selected Topics in Signal Processing

JF - IEEE Journal of Selected Topics in Signal Processing

SN - 1932-4553

IS - 1

ER -
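
The abstract above describes a convolutional recurrent neural network with two parallel per-frame output branches: multi-label sound event detection (SED) and multi-output regression of 3D Cartesian direction-of-arrival (DOA) coordinates. Purely as an illustration of that idea, the following PyTorch sketch shows what such a two-branch CRNN could look like; the layer counts, pooling sizes, and the 8-channel / 11-class / 1024-bin configuration are assumptions for the example, not the exact architecture or hyperparameters reported in the paper.

# Minimal two-branch CRNN sketch for SELD (illustrative assumptions only).
import torch
import torch.nn as nn


class SELDNetSketch(nn.Module):
    def __init__(self, n_channels=8, n_freq_bins=1024, n_classes=11):
        super().__init__()
        # Convolutional feature extractor: pool only along frequency so the
        # temporal resolution of the input frame sequence is preserved.
        self.conv = nn.Sequential(
            nn.Conv2d(n_channels, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64), nn.ReLU(), nn.MaxPool2d((1, 8)),
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64), nn.ReLU(), nn.MaxPool2d((1, 8)),
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64), nn.ReLU(), nn.MaxPool2d((1, 4)),
        )
        rnn_in = 64 * (n_freq_bins // (8 * 8 * 4))
        self.rnn = nn.GRU(rnn_in, 128, num_layers=2,
                          batch_first=True, bidirectional=True)
        # Parallel output branches, one prediction per time frame.
        self.sed_head = nn.Linear(256, n_classes)       # class activities
        self.doa_head = nn.Linear(256, 3 * n_classes)   # (x, y, z) per class

    def forward(self, x):
        # x: (batch, channels, time_frames, freq_bins), e.g. magnitude and
        # phase spectrograms stacked along the channel axis.
        z = self.conv(x)                            # (B, 64, T, F')
        b, c, t, f = z.shape
        z = z.permute(0, 2, 1, 3).reshape(b, t, c * f)
        z, _ = self.rnn(z)                          # (B, T, 256)
        sed = torch.sigmoid(self.sed_head(z))       # multi-label SED per frame
        doa = torch.tanh(self.doa_head(z))          # Cartesian DOA per class
        return sed, doa


if __name__ == "__main__":
    # Example: magnitude + phase for a 4-channel array -> 8 input channels.
    model = SELDNetSketch(n_channels=8, n_freq_bins=1024, n_classes=11)
    dummy = torch.randn(2, 8, 128, 1024)            # batch of frame sequences
    sed, doa = model(dummy)
    print(sed.shape, doa.shape)                     # (2, 128, 11), (2, 128, 33)

In this kind of setup the SED branch would typically be trained with a binary cross-entropy loss and the DOA branch with a mean-squared-error loss masked by the active classes, as a joint multi-task objective.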