Vocal Effort Based Speaking Style Conversion Using Vocoder Features and Parallel Learning
Research output: Contribution to journal › Article › Scientific › peer-review
|Number of pages||17|
|Publication status||Published - 2019|
|Publication type||A1 Journal article-refereed|
Speaking style conversion (SSC) is the technology of converting natural speech signals from one style to another. In this study, we aim to provide a general SSC system for converting styles with varying vocal effort and focus on normal-to-Lombard conversion as a case study of this problem. We propose a parametric approach that uses a vocoder to extract speech features. These features are mapped using parallel machine learning models from utterances spoken in normal style to the corresponding features of Lombard speech. Finally, the mapped features are converted to a Lombard speech waveform with the vocoder. A total of three vocoders (GlottDNN, STRAIGHT, and Pulse model in log domain (PML)) and three machine learning mapping methods (standard GMM, Bayesian GMM, and feed-forward DNN) were compared in the proposed normal-to-Lombard style conversion system. The conversion was evaluated using two subjective listening tests measuring perceived Lombardness and quality of the converted speech signals, and by using an instrumental measure called Speech Intelligibility in Bits (SIIB) for speech intelligibility evaluation under various noise levels. The results of the subjective tests show that the system is able to convert normal speech into Lombard speech and that there is a trade-off between quality and Lombardness of the mapped utterances. The GlottDNN and PML stand out as the best vocoders in terms of quality and Lombardness, respectively, whereas the DNN is the best mapping method in terms of Lombardness. PML with the standard GMM seems to give a good compromise between the two attributes. The SIIB experiments indicate that intelligibility of converted speech compared to that of normal speech improved in noisy conditions most effectively when DNN mapping was used with STRAIGHT and PML.
- Bayesian GMM, DNN, GlottDNN, Lombard speech, pulse model in log domain, speaking style conversion, vocal effort