Deep Neural Networks for Sound Event Detection
|Status||Published - 22 January 2019|
|Name||Tampere University Dissertations|
In this thesis, we propose to apply deep learning, a family of modern machine learning methods, to sound event detection (SED). The relationship between the time-frequency representations commonly used for SED (such as the mel spectrogram and the magnitude spectrogram) and the target sound event labels is highly complex. Deep learning methods such as deep neural networks (DNN) use a layered structure of units to extract features from the given sound representation, with increasing abstraction at each layer. This increases the network's capacity to efficiently learn the highly complex relationship between the sound representation and the target sound event labels. We found that the proposed DNN approach performs significantly better than established classifier techniques for SED such as Gaussian mixture models.
In a time-frequency representation of an audio recording, a sound event can often be recognized as a distinct pattern that may exhibit shifts in both dimensions. The intra-class variability of sound events may cause small shifts in the frequency content, and the shift in time results from the fact that a sound event can occur at any time in a given audio recording. We found that convolutional neural networks (CNN) are useful for learning shift-invariant filters, which are essential for robust modeling of sound events. In addition, we show that recurrent neural networks (RNN) are effective in modeling the long-term temporal characteristics of sound events. Finally, we combine convolutional and recurrent layers in a single classifier called the convolutional recurrent neural network (CRNN), which brings together the benefits of both and provides state-of-the-art results on multiple SED benchmark datasets.
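As an illustration only, the stacking of convolutional and recurrent layers described above could be sketched as a forward pass in NumPy. This is a minimal sketch under stated assumptions, not the architecture used in the thesis: the convolution runs along frequency only, the recurrent layer is a plain tanh RNN, the weights are random and untrained, the sigmoid output (one probability per class per frame) is a choice made for this sketch, and all names and layer sizes (`conv_relu_pool`, `crnn_forward`, `init_params`, and so on) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv_relu_pool(x, kernels, pool):
    """Convolve each frame along frequency, apply ReLU, then max-pool.

    x: (T, F) time-frequency input, kernels: (K, W). Returns (T, K * n),
    where n = (F - W + 1) // pool. Max-pooling makes the features tolerant
    to small frequency shifts of a pattern.
    """
    width = kernels.shape[1]
    windows = np.lib.stride_tricks.sliding_window_view(x, width, axis=1)
    act = np.maximum(0.0, np.einsum("tfw,kw->tkf", windows, kernels))
    n = act.shape[2] // pool
    trimmed = act[:, :, : n * pool]
    pooled = trimmed.reshape(act.shape[0], act.shape[1], n, pool).max(axis=3)
    return pooled.reshape(x.shape[0], -1)

def recurrent(feats, w_in, w_rec, b):
    """Plain tanh RNN over time; each state depends on all earlier frames."""
    h = np.zeros(w_rec.shape[0])
    states = []
    for frame in feats:
        h = np.tanh(w_in @ frame + w_rec @ h + b)
        states.append(h)
    return np.stack(states)

def init_params(n_freq, n_kernels=8, width=5, pool=4, n_hidden=16, n_classes=3):
    """Random, untrained weights; all sizes are arbitrary illustration values."""
    n_feat = n_kernels * ((n_freq - width + 1) // pool)
    return {
        "kernels": rng.normal(0.0, 0.1, (n_kernels, width)),
        "pool": pool,
        "w_in": rng.normal(0.0, 0.1, (n_hidden, n_feat)),
        "w_rec": rng.normal(0.0, 0.1, (n_hidden, n_hidden)),
        "b": np.zeros(n_hidden),
        "w_out": rng.normal(0.0, 0.1, (n_classes, n_hidden)),
        "b_out": np.zeros(n_classes),
    }

def crnn_forward(x, params):
    """Convolutional features -> recurrent states -> frame-wise probabilities."""
    feats = conv_relu_pool(x, params["kernels"], params["pool"])
    states = recurrent(feats, params["w_in"], params["w_rec"], params["b"])
    logits = states @ params["w_out"].T + params["b_out"]
    return 1.0 / (1.0 + np.exp(-logits))

# Toy input: 20 frames of a 40-band spectrogram-like array (random data).
x = rng.random((20, 40))
probs = crnn_forward(x, init_params(n_freq=40))
print(probs.shape)
```

Each row of `probs` holds the per-class activity estimates for one frame. In a trained system, these would typically be thresholded to obtain event onsets and offsets.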
Aside from learning the mappings between time-frequency representations and sound event labels, we show that deep learning methods can also be used to learn a direct mapping between the target labels and a lower-level representation such as the magnitude spectrogram or even the raw audio signal. In this thesis, we propose to integrate the feature learning capabilities of deep learning methods with empirical knowledge of human auditory perception by initializing layer weights with filterbank coefficients. This results in an optimal, ad-hoc filterbank that is obtained through gradient-based optimization of the original coefficients to improve the SED performance.
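To make the idea of filterbank-based weight initialization concrete, the following sketch builds a triangular mel filterbank, uses it as the initial weight matrix of a linear layer applied to a magnitude spectrogram, and then takes one gradient step on those weights. It is only a sketch under assumptions that are not taken from the thesis: the mel scale formula (the common 2595 log10(1 + f/700) variant), the squared-error toy loss, the random stand-in data, the learning rate, and every function name are illustrative choices.

```python
import numpy as np

def hz_to_mel(f):
    # One common mel scale formula; an assumption of this sketch.
    return 2595.0 * np.log10(1.0 + np.asarray(f) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m) / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters equally spaced on the mel scale: (n_mels, n_fft//2 + 1)."""
    n_bins = n_fft // 2 + 1
    bin_freqs = np.linspace(0.0, sample_rate / 2.0, n_bins)
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_mels + 2)
    hz_edges = mel_to_hz(mel_edges)
    fb = np.zeros((n_mels, n_bins))
    for m in range(n_mels):
        lo, ctr, hi = hz_edges[m], hz_edges[m + 1], hz_edges[m + 2]
        rising = (bin_freqs - lo) / (ctr - lo)
        falling = (hi - bin_freqs) / (hi - ctr)
        fb[m] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def loss_and_grad(weights, spec, target):
    """Mean squared error of a linear layer and its gradient w.r.t. the weights."""
    err = spec @ weights.T - target
    loss = np.mean(err ** 2)
    grad = 2.0 * err.T @ spec / err.size
    return loss, grad

rng = np.random.default_rng(0)
spec = rng.random((50, 257))    # stand-in magnitude spectrogram: 50 frames, 257 bins
target = rng.random((50, 40))   # stand-in training target, random for illustration

# Initialize the layer weights with the filterbank instead of random values.
weights = mel_filterbank(n_mels=40, n_fft=512, sample_rate=16000)
loss_before, grad = loss_and_grad(weights, spec, target)

# One gradient descent step moves the coefficients away from the pure mel shape.
weights = weights - 0.01 * grad
loss_after, _ = loss_and_grad(weights, spec, target)
print(loss_before, loss_after)
```

The point of starting from the filterbank is that the first layer begins as a perceptually motivated transform and is then free to adapt to the task. In a real system, the gradient would come from the SED training loss backpropagated through the whole network.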