Tampere University of Technology

TUTCRIS Research Portal

Time Difference of Arrival Estimation of Speech Signals Using Deep Neural Networks with Integrated Time-frequency Masking

Research output: Chapter in Book/Report/Conference proceedingConference contributionScientificpeer-review

Standard

Time Difference of Arrival Estimation of Speech Signals Using Deep Neural Networks with Integrated Time-frequency Masking. / Pertilä, Pasi; Parviainen, Mikko.

2019 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019 - Proceedings. IEEE, 2019. p. 436-440.

Research output: Chapter in Book/Report/Conference proceedingConference contributionScientificpeer-review

Harvard

Pertilä, P & Parviainen, M 2019, Time Difference of Arrival Estimation of Speech Signals Using Deep Neural Networks with Integrated Time-frequency Masking. in 2019 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019 - Proceedings. IEEE, pp. 436-440, IEEE International Conference on Acoustics, Speech, and Signal Processing, Brighton, United Kingdom, 12/05/19. https://doi.org/10.1109/ICASSP.2019.8682574

APA

Pertilä, P., & Parviainen, M. (2019). Time Difference of Arrival Estimation of Speech Signals Using Deep Neural Networks with Integrated Time-frequency Masking. In 2019 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019 - Proceedings (pp. 436-440). IEEE. https://doi.org/10.1109/ICASSP.2019.8682574

Vancouver

Pertilä P, Parviainen M. Time Difference of Arrival Estimation of Speech Signals Using Deep Neural Networks with Integrated Time-frequency Masking. In 2019 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019 - Proceedings. IEEE. 2019. p. 436-440 https://doi.org/10.1109/ICASSP.2019.8682574

Author

Pertilä, Pasi ; Parviainen, Mikko. / Time Difference of Arrival Estimation of Speech Signals Using Deep Neural Networks with Integrated Time-frequency Masking. 2019 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019 - Proceedings. IEEE, 2019. pp. 436-440

Bibtex - Download

@inproceedings{bf3f083cc3d748f1b7bb273e769e5a97,
title = "Time Difference of Arrival Estimation of Speech Signals Using Deep Neural Networks with Integrated Time-frequency Masking",
abstract = "The Time Difference of Arrival (TDoA) of a sound wavefront impinging on a microphone pair carries spatial information about the source. However, captured speech typically contains dynamic non-speech interference sources and noise. Therefore, the TDoA estimates fluctuate between speech and interference. Deep Neural Networks (DNNs) have been applied for Time-Frequency (TF) masking for Acoustic Source Localization (ASL) to filter out non-speech components from a speaker location likelihood function. However, the type of TF mask for this task is not obvious. Secondly, the DNN should estimate the TDoA values, but existing solutions estimate the TF mask instead. To overcome these issues, a direct formulation of the TF masking as a part of a DNN-based ASL structure is proposed. Furthermore, the proposed network operates in an online manner, i.e., producing estimates frame-by-frame. Combined with the use of recurrent layers it exploits the sequential progression of speaker related TDoAs. Training with different microphone spacings allows model re-use for different microphone pair geometries in inference. Real-data experiments with smartphone recordings of speech in interference demonstrate the network's generalization capability.",
keywords = "Acoustic Source Localization, Microphone Arrays, Recurrent Neural Networks, Time-Frequency Masking",
author = "Pasi Pertil{\"a} and Mikko Parviainen",
year = "2019",
month = "5",
day = "1",
doi = "10.1109/ICASSP.2019.8682574",
language = "English",
pages = "436--440",
booktitle = "2019 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019 - Proceedings",
publisher = "IEEE",

}

RIS (suitable for import to EndNote) - Download

TY - GEN

T1 - Time Difference of Arrival Estimation of Speech Signals Using Deep Neural Networks with Integrated Time-frequency Masking

AU - Pertilä, Pasi

AU - Parviainen, Mikko

PY - 2019/5/1

Y1 - 2019/5/1

N2 - The Time Difference of Arrival (TDoA) of a sound wavefront impinging on a microphone pair carries spatial information about the source. However, captured speech typically contains dynamic non-speech interference sources and noise. Therefore, the TDoA estimates fluctuate between speech and interference. Deep Neural Networks (DNNs) have been applied for Time-Frequency (TF) masking for Acoustic Source Localization (ASL) to filter out non-speech components from a speaker location likelihood function. However, the type of TF mask for this task is not obvious. Secondly, the DNN should estimate the TDoA values, but existing solutions estimate the TF mask instead. To overcome these issues, a direct formulation of the TF masking as a part of a DNN-based ASL structure is proposed. Furthermore, the proposed network operates in an online manner, i.e., producing estimates frame-by-frame. Combined with the use of recurrent layers it exploits the sequential progression of speaker related TDoAs. Training with different microphone spacings allows model re-use for different microphone pair geometries in inference. Real-data experiments with smartphone recordings of speech in interference demonstrate the network's generalization capability.

AB - The Time Difference of Arrival (TDoA) of a sound wavefront impinging on a microphone pair carries spatial information about the source. However, captured speech typically contains dynamic non-speech interference sources and noise. Therefore, the TDoA estimates fluctuate between speech and interference. Deep Neural Networks (DNNs) have been applied for Time-Frequency (TF) masking for Acoustic Source Localization (ASL) to filter out non-speech components from a speaker location likelihood function. However, the type of TF mask for this task is not obvious. Secondly, the DNN should estimate the TDoA values, but existing solutions estimate the TF mask instead. To overcome these issues, a direct formulation of the TF masking as a part of a DNN-based ASL structure is proposed. Furthermore, the proposed network operates in an online manner, i.e., producing estimates frame-by-frame. Combined with the use of recurrent layers it exploits the sequential progression of speaker related TDoAs. Training with different microphone spacings allows model re-use for different microphone pair geometries in inference. Real-data experiments with smartphone recordings of speech in interference demonstrate the network's generalization capability.

KW - Acoustic Source Localization

KW - Microphone Arrays

KW - Recurrent Neural Networks

KW - Time-Frequency Masking

U2 - 10.1109/ICASSP.2019.8682574

DO - 10.1109/ICASSP.2019.8682574

M3 - Conference contribution

SP - 436

EP - 440

BT - 2019 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019 - Proceedings

PB - IEEE

ER -