Tampere University of Technology

TUTCRIS Research Portal

Robust speech recognition with spectrogram factorisation

Research output: Book/ReportDoctoral thesis

Details

Original languageEnglish
Place of PublicationTampere
PublisherTampere University of Technology
Number of pages93
ISBN (Electronic)978-952-15-3386-0
ISBN (Print)978-952-15-3371-6
StatePublished - 9 Oct 2014
Publication typeG5 Doctoral dissertation (article)

Publication series

NameTampere University of Techology. Publication
PublisherTampere University of Technology
Volume1249
ISSN (Print)1459-2045

Abstract

Communication by speech is intrinsic for humans. Since the breakthrough of mobile devices and wireless communication, digital transmission of speech has become ubiquitous. Similarly distribution and storage of audio and video data has increased rapidly. However, despite being technically capable to record and process audio signals, only a fraction of digital systems and services are actually able to work with spoken input, that is, to operate on the lexical content of speech. One persistent obstacle for practical deployment of automatic speech recognition systems is inadequate robustness against noise and other interferences, which regularly corrupt signals recorded in real-world environments. Speech and diverse noises are both complex signals, which are not trivially separable. Despite decades of research and a multitude of different approaches, the problem has not been solved to a sufficient extent. Especially the mathematically ill-posed problem of separating multiple sources from a single-channel input requires advanced models and algorithms to be solvable. One promising path is using a composite model of long-context atoms to represent a mixture of non-stationary sources based on their spectro-temporal behaviour. Algorithms derived from the family of non-negative matrix factorisations have been applied to such problems to separate and recognise individual sources like speech. This thesis describes a set of tools developed for non-negative modelling of audio spectrograms, especially involving speech and real-world noise sources. An overview is provided to the complete framework starting from model and feature definitions, advancing to factorisation algorithms, and finally describing different routes for separation, enhancement, and recognition tasks. Current issues and their potential solutions are discussed both theoretically and from a practical point of view. The included publications describe factorisation-based recognition systems, which have been evaluated on publicly available speech corpora in order to determine the efficiency of various separation and recognition algorithms. Several variants and system combinations that have been proposed in literature are also discussed. The work covers a broad span of factorisation-based system components, which together aim at providing a practically viable solution to robust processing and recognition of speech in everyday situations.

Publication forum classification

Field of science, Statistics Finland

Downloads statistics

No data available