Tampere University of Technology

TUTCRIS Research Portal

Distant speech separation using predicted time-frequency masks from spatial features

Research output: Contribution to journal › Article › Scientific › peer-review

Standard

Distant speech separation using predicted time-frequency masks from spatial features. / Pertilä, Pasi; Nikunen, Joonas.

In: Speech Communication, Vol. 68, 2015, pp. 97-106.



Author

Pertilä, Pasi; Nikunen, Joonas. / Distant speech separation using predicted time-frequency masks from spatial features. In: Speech Communication. 2015; Vol. 68, pp. 97-106.

BibTeX

@article{abe28bc30e0f477b8c04b75b0193d237,
title = "Distant speech separation using predicted time-frequency masks from spatial features",
abstract = "Speech separation algorithms face the difficult task of producing a high degree of separation without introducing unwanted artifacts. The time-frequency (T-F) masking technique applies a real-valued (or binary) mask to the signal's spectrum to filter out unwanted components. The practical difficulty lies in estimating the mask. Masks engineered purely for separation performance often introduce musical-noise artifacts into the separated signal, lowering the perceptual quality and intelligibility of the output. Microphone arrays have long been studied for distant speech processing. This work uses a feed-forward neural network to map a microphone array's spatial features into a T-F mask. A Wiener filter serves as the target mask for training the neural network on speech examples in a simulated setting. The T-F masks predicted by the neural network are combined into an enhanced separation mask that exploits information about the interference between all sources. The final mask is applied to the output of a delay-and-sum beamformer (DSB). The algorithm's objective separation capability, together with the intelligibility of the separated speech, is evaluated on speech recorded from distant talkers in two rooms at two distances. The results show improvements in an instrumental intelligibility measure and in frequency-weighted SNR over a complex-valued non-negative matrix factorization (CNMF) source separation approach, spatial sound source separation, and conventional beamforming methods such as the DSB and the minimum variance distortionless response (MVDR) beamformer.",
keywords = "Beamforming, Microphone arrays, Neural networks, Speech separation, Time-frequency masking",
author = "Pasi Pertil{\"a} and Joonas Nikunen",
year = "2015",
doi = "10.1016/j.specom.2015.01.006",
language = "English",
volume = "68",
pages = "97--106",
journal = "Speech Communication",
issn = "0167-6393",
publisher = "Elsevier",
}
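The abstract's training target, an oracle Wiener mask, and the basic T-F masking operation can be illustrated with a short sketch. This is not the paper's code, only a minimal NumPy illustration under the assumption that the target and interference spectrograms are known (function names are mine):

```python
import numpy as np

def wiener_mask(target_stft, interference_stft, eps=1e-12):
    """Oracle Wiener mask: target power over total power per T-F bin.
    Values lie in [0, 1]: near 1 where the target dominates, near 0 elsewhere."""
    t_pow = np.abs(target_stft) ** 2
    i_pow = np.abs(interference_stft) ** 2
    return t_pow / (t_pow + i_pow + eps)

def apply_tf_mask(mixture_stft, mask):
    """Apply a real-valued T-F mask by element-wise multiplication;
    the mixture phase is left unchanged."""
    return mask * mixture_stft

# Toy 2x2 spectrograms: the target occupies only the first bin.
target = np.array([[3.0 + 0j, 0.0], [0.0, 0.0]])
interference = np.array([[0.0 + 0j, 2.0], [1.0, 1.0]])
mask = wiener_mask(target, interference)
enhanced = apply_tf_mask(target + interference, mask)
```

In the paper this mask is not an oracle: a feed-forward network is trained to predict it from spatial features, with the Wiener mask used only as the training target.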

RIS (suitable for import to EndNote)

TY - JOUR

T1 - Distant speech separation using predicted time-frequency masks from spatial features

AU - Pertilä, Pasi

AU - Nikunen, Joonas

PY - 2015

Y1 - 2015

N2 - Speech separation algorithms face the difficult task of producing a high degree of separation without introducing unwanted artifacts. The time-frequency (T-F) masking technique applies a real-valued (or binary) mask to the signal's spectrum to filter out unwanted components. The practical difficulty lies in estimating the mask. Masks engineered purely for separation performance often introduce musical-noise artifacts into the separated signal, lowering the perceptual quality and intelligibility of the output. Microphone arrays have long been studied for distant speech processing. This work uses a feed-forward neural network to map a microphone array's spatial features into a T-F mask. A Wiener filter serves as the target mask for training the neural network on speech examples in a simulated setting. The T-F masks predicted by the neural network are combined into an enhanced separation mask that exploits information about the interference between all sources. The final mask is applied to the output of a delay-and-sum beamformer (DSB). The algorithm's objective separation capability, together with the intelligibility of the separated speech, is evaluated on speech recorded from distant talkers in two rooms at two distances. The results show improvements in an instrumental intelligibility measure and in frequency-weighted SNR over a complex-valued non-negative matrix factorization (CNMF) source separation approach, spatial sound source separation, and conventional beamforming methods such as the DSB and the minimum variance distortionless response (MVDR) beamformer.

AB - Speech separation algorithms face the difficult task of producing a high degree of separation without introducing unwanted artifacts. The time-frequency (T-F) masking technique applies a real-valued (or binary) mask to the signal's spectrum to filter out unwanted components. The practical difficulty lies in estimating the mask. Masks engineered purely for separation performance often introduce musical-noise artifacts into the separated signal, lowering the perceptual quality and intelligibility of the output. Microphone arrays have long been studied for distant speech processing. This work uses a feed-forward neural network to map a microphone array's spatial features into a T-F mask. A Wiener filter serves as the target mask for training the neural network on speech examples in a simulated setting. The T-F masks predicted by the neural network are combined into an enhanced separation mask that exploits information about the interference between all sources. The final mask is applied to the output of a delay-and-sum beamformer (DSB). The algorithm's objective separation capability, together with the intelligibility of the separated speech, is evaluated on speech recorded from distant talkers in two rooms at two distances. The results show improvements in an instrumental intelligibility measure and in frequency-weighted SNR over a complex-valued non-negative matrix factorization (CNMF) source separation approach, spatial sound source separation, and conventional beamforming methods such as the DSB and the minimum variance distortionless response (MVDR) beamformer.

KW - Beamforming

KW - Microphone arrays

KW - Neural networks

KW - Speech separation

KW - Time-frequency masking

U2 - 10.1016/j.specom.2015.01.006

DO - 10.1016/j.specom.2015.01.006

M3 - Article

VL - 68

SP - 97

EP - 106

JO - Speech Communication

JF - Speech Communication

SN - 0167-6393

ER -
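The delay-and-sum beamformer that the abstract names as the front end can also be sketched in a few lines of NumPy in the STFT domain. This is an illustration under stated assumptions, not the paper's implementation: the steering delays toward the source are taken as known, each channel is phase-aligned, and the channels are averaged.

```python
import numpy as np

def delay_and_sum(mic_stfts, delays, freqs):
    """Frequency-domain delay-and-sum beamformer.

    mic_stfts: (M, F, T) complex STFTs from M microphones
    delays:    (M,) propagation delay in seconds from source to each mic
    freqs:     (F,) frequency of each STFT bin in Hz

    A delay of tau multiplies a spectrum by exp(-2j*pi*f*tau), so each
    channel is compensated with exp(+2j*pi*f*tau) before averaging.
    """
    steering = np.exp(2j * np.pi * freqs[None, :] * delays[:, None])  # (M, F)
    return np.mean(steering[:, :, None] * mic_stfts, axis=0)          # (F, T)

# Two mics, one 100 Hz bin: mic 1 hears the source 1 ms later than mic 0.
freqs = np.array([100.0])
tau = 1e-3
x0 = np.array([[1.0 + 0j]])                           # (F=1, T=1)
x1 = x0 * np.exp(-2j * np.pi * freqs[:, None] * tau)  # delayed copy
mics = np.stack([x0, x1])                             # (2, 1, 1)
out = delay_and_sum(mics, np.array([0.0, tau]), freqs)
```

After alignment the two channels add coherently, so the source component passes at unit gain while signals from other directions are attenuated; in the paper, the predicted T-F mask is then applied on top of this DSB output.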