Tampere University of Technology

TUTCRIS Research Portal

Vocal Effort Based Speaking Style Conversion Using Vocoder Features and Parallel Learning

Research output: Contribution to journal › Article › Scientific › peer-review

Standard

Vocal Effort Based Speaking Style Conversion Using Vocoder Features and Parallel Learning. / Seshadri, Shreyas; Juvela, Lauri; Räsänen, Okko; Alku, Paavo.

In: IEEE Access, Vol. 7, 2019, pp. 17230-17246.

Author

Seshadri, Shreyas ; Juvela, Lauri ; Räsänen, Okko ; Alku, Paavo. / Vocal Effort Based Speaking Style Conversion Using Vocoder Features and Parallel Learning. In: IEEE Access. 2019 ; Vol. 7. pp. 17230-17246.

BibTeX

@article{66c418abb3cf47b295195db9b12a090f,
title = "Vocal Effort Based Speaking Style Conversion Using Vocoder Features and Parallel Learning",
abstract = "Speaking style conversion (SSC) is the technology of converting natural speech signals from one style to another. In this study, we aim to provide a general SSC system for converting styles with varying vocal effort and focus on normal-to-Lombard conversion as a case study of this problem. We propose a parametric approach that uses a vocoder to extract speech features. These features are mapped using parallel machine learning models from utterances spoken in normal style to the corresponding features of Lombard speech. Finally, the mapped features are converted to a Lombard speech waveform with the vocoder. A total of three vocoders (GlottDNN, STRAIGHT, and Pulse model in log domain (PML)) and three machine learning mapping methods (standard GMM, Bayesian GMM, and feed-forward DNN) were compared in the proposed normal-to-Lombard style conversion system. The conversion was evaluated using two subjective listening tests measuring perceived Lombardness and quality of the converted speech signals, and by using an instrumental measure called Speech Intelligibility in Bits (SIIB) for speech intelligibility evaluation under various noise levels. The results of the subjective tests show that the system is able to convert normal speech into Lombard speech and that there is a trade-off between quality and Lombardness of the mapped utterances. The GlottDNN and PML stand out as the best vocoders in terms of quality and Lombardness, respectively, whereas the DNN is the best mapping method in terms of Lombardness. PML with the standard GMM seems to give a good compromise between the two attributes. The SIIB experiments indicate that intelligibility of converted speech compared to that of normal speech improved in noisy conditions most effectively when DNN mapping was used with STRAIGHT and PML.",
keywords = "Bayesian GMM, DNN, GlottDNN, Lombard speech, pulse model in log domain, speaking style conversion, vocal effort",
author = "Shreyas Seshadri and Lauri Juvela and Okko R{\"a}s{\"a}nen and Paavo Alku",
year = "2019",
doi = "10.1109/ACCESS.2019.2895923",
language = "English",
volume = "7",
pages = "17230--17246",
journal = "IEEE Access",
issn = "2169-3536",
publisher = "Institute of Electrical and Electronics Engineers"
}

RIS (suitable for import to EndNote)

TY - JOUR

T1 - Vocal Effort Based Speaking Style Conversion Using Vocoder Features and Parallel Learning

AU - Seshadri, Shreyas

AU - Juvela, Lauri

AU - Räsänen, Okko

AU - Alku, Paavo

PY - 2019

Y1 - 2019

AB - Speaking style conversion (SSC) is the technology of converting natural speech signals from one style to another. In this study, we aim to provide a general SSC system for converting styles with varying vocal effort and focus on normal-to-Lombard conversion as a case study of this problem. We propose a parametric approach that uses a vocoder to extract speech features. These features are mapped using parallel machine learning models from utterances spoken in normal style to the corresponding features of Lombard speech. Finally, the mapped features are converted to a Lombard speech waveform with the vocoder. A total of three vocoders (GlottDNN, STRAIGHT, and Pulse model in log domain (PML)) and three machine learning mapping methods (standard GMM, Bayesian GMM, and feed-forward DNN) were compared in the proposed normal-to-Lombard style conversion system. The conversion was evaluated using two subjective listening tests measuring perceived Lombardness and quality of the converted speech signals, and by using an instrumental measure called Speech Intelligibility in Bits (SIIB) for speech intelligibility evaluation under various noise levels. The results of the subjective tests show that the system is able to convert normal speech into Lombard speech and that there is a trade-off between quality and Lombardness of the mapped utterances. The GlottDNN and PML stand out as the best vocoders in terms of quality and Lombardness, respectively, whereas the DNN is the best mapping method in terms of Lombardness. PML with the standard GMM seems to give a good compromise between the two attributes. The SIIB experiments indicate that intelligibility of converted speech compared to that of normal speech improved in noisy conditions most effectively when DNN mapping was used with STRAIGHT and PML.

KW - Bayesian GMM

KW - DNN

KW - GlottDNN

KW - Lombard speech

KW - pulse model in log domain

KW - speaking style conversion

KW - vocal effort

U2 - 10.1109/ACCESS.2019.2895923

DO - 10.1109/ACCESS.2019.2895923

M3 - Article

VL - 7

SP - 17230

EP - 17246

JO - IEEE Access

JF - IEEE Access

SN - 2169-3536

ER -