Tampere University of Technology

TUTCRIS Research Portal

SylNet: An Adaptable End-to-End Syllable Count Estimator for Speech

Research output: Contribution to journal, Article, Scientific, peer-review


Original language: English
Pages (from-to): 1359-1363
Number of pages: 5
Journal: IEEE Signal Processing Letters
Issue number: 9
Publication status: Published - Sep 2019
Publication type: A1 Journal article-refereed


Automatic syllable count estimation (SCE) is used in a variety of applications, ranging from speaking rate estimation to detecting social activity from wearable microphones and developmental research that quantifies the speech heard by language-learning children in different environments. Most previously used SCE methods rely on heuristic digital signal processing (DSP), and only a few approaches, based on bi-directional long short-term memory (BLSTM) networks, have applied modern machine learning to the SCE task. This letter presents a novel end-to-end method called SylNet for automatic syllable counting from speech, built on recent developments in neural network architectures. We describe how the entire model can be optimized directly to minimize SCE error on the training data without annotations aligned at the syllable level, and how it can be adapted to new languages using limited speech data with known syllable counts. Experiments on several different languages reveal that SylNet generalizes to languages beyond its training data and further improves with adaptation. It also outperforms several previously proposed methods for syllabification, including end-to-end BLSTMs.
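
The idea of supervising only on utterance-level counts can be illustrated with a toy sketch. This is not SylNet's architecture or training procedure, which the abstract does not specify. It assumes a hypothetical model that emits one non-negative score per frame, takes the scaled sum of those scores as the count estimate, and measures error against the known count per utterance. The names `estimate_count`, `mean_abs_count_error`, and `fit_scale`, and the least-squares "adaptation" of a single scale factor, are illustrative assumptions only.

```python
from typing import Sequence


def estimate_count(frame_scores: Sequence[float], scale: float = 1.0) -> float:
    """Utterance-level estimate: scaled sum of per-frame scores (assumed model output)."""
    return scale * sum(frame_scores)


def mean_abs_count_error(
    utterances: Sequence[Sequence[float]],
    true_counts: Sequence[int],
    scale: float = 1.0,
) -> float:
    """Mean absolute difference between estimated and reference syllable counts.

    Only one reference count per utterance is needed; no frame-level
    or syllable-level alignment enters the computation.
    """
    errors = [
        abs(estimate_count(frames, scale) - count)
        for frames, count in zip(utterances, true_counts)
    ]
    return sum(errors) / len(errors)


def fit_scale(
    utterances: Sequence[Sequence[float]],
    true_counts: Sequence[int],
) -> float:
    """Toy 'adaptation': closed-form least-squares fit of a single scale factor
    on a small set of utterances with known counts (illustrative assumption)."""
    totals = [sum(frames) for frames in utterances]
    numerator = sum(t * c for t, c in zip(totals, true_counts))
    denominator = sum(t * t for t in totals)
    return numerator / denominator


# Hypothetical per-frame scores for three utterances and their known counts.
utterances = [[0.5, 0.5, 1.0], [1.0, 1.0], [0.25, 0.25]]
true_counts = [4, 4, 1]

error_before = mean_abs_count_error(utterances, true_counts)
adapted_scale = fit_scale(utterances, true_counts)
error_after = mean_abs_count_error(utterances, true_counts, adapted_scale)
```

In this toy data the unadapted estimates undercount every utterance by the same factor, so fitting one scale on a few labeled utterances removes the error. A real system would adapt many parameters and would rarely reach zero error.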


  • estimation theory, learning (artificial intelligence), natural language processing, neural net architecture, recurrent neural nets, speech processing, SylNet, adaptable end-to-end syllable count estimator, automatic syllable count estimation, wearable microphones, developmental research, language-learning children, heuristic digital signal processing methods, SCE task, automatic syllable counting, SCE error, training data, syllable level, speech data, end-to-end BLSTMs, machine learning approaches, SCE methods, speaking rate estimation, social activity detection, DSP methods, bi-directional short-term memory approaches, neural network architectures, limited speech data, Estimation, Training, Adaptation models, Speech processing, Signal processing algorithms, Training data, Channel estimation, syllable count estimation, end-to-end learning, deep learning
