TUTCRIS - Tampere University of Technology

Clotho: an Audio Captioning Dataset

Research output: peer-reviewed

Standard

Clotho: an Audio Captioning Dataset. / Drossos, Konstantinos; Lipping, Samuel; Virtanen, Tuomas.

IEEE 45th International Conference on Acoustics, Speech, and Signal Processing (ICASSP). IEEE, 2020. pp. 736-740 (Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing).

Harvard

Drossos, K, Lipping, S & Virtanen, T 2020, Clotho: an Audio Captioning Dataset. in IEEE 45th International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, IEEE, pp. 736-740, IEEE International Conference on Acoustics, Speech and Signal Processing. https://doi.org/10.1109/ICASSP40776.2020.9052990

APA

Drossos, K., Lipping, S., & Virtanen, T. (2020). Clotho: an Audio Captioning Dataset. In IEEE 45th International Conference on Acoustics, Speech, and Signal Processing (ICASSP) (pp. 736-740). (Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing). IEEE. https://doi.org/10.1109/ICASSP40776.2020.9052990

Vancouver

Drossos K, Lipping S, Virtanen T. Clotho: an Audio Captioning Dataset. In IEEE 45th International Conference on Acoustics, Speech, and Signal Processing (ICASSP). IEEE. 2020. p. 736-740. (Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing). https://doi.org/10.1109/ICASSP40776.2020.9052990

Author

Drossos, Konstantinos ; Lipping, Samuel ; Virtanen, Tuomas. / Clotho: an Audio Captioning Dataset. IEEE 45th International Conference on Acoustics, Speech, and Signal Processing (ICASSP). IEEE, 2020. pp. 736-740 (Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing).

BibTeX - Download

@inproceedings{bac89bfb28074b369efe4238b0048d5e,
title = "Clotho: an Audio Captioning Dataset",
abstract = "Audio captioning is the novel task of general audio content description using free text. It is an intermodal translation task (not speech-to-text), where a system accepts as an input an audio signal and outputs the textual description (i.e. the caption) of that signal. In this paper we present Clotho, a dataset for audio captioning consisting of 4981 audio samples of 15 to 30 seconds duration and 24 905 captions of eight to 20 words length, and a baseline method to provide initial results. Clotho is built with focus on audio content and caption diversity, and the splits of the data are not hampering the training or evaluation of methods. All sounds are from the Freesound platform, and captions are crowdsourced using Amazon Mechanical Turk and annotators from English speaking countries. Unique words, named entities, and speech transcription are removed with post-processing. Clotho is freely available online.",
author = "Konstantinos Drossos and Samuel Lipping and Tuomas Virtanen",
year = "2020",
doi = "10.1109/ICASSP40776.2020.9052990",
language = "English",
series = "Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing",
publisher = "IEEE",
pages = "736--740",
booktitle = "IEEE 45th International Conference on Acoustics, Speech, and Signal Processing (ICASSP)",
}

RIS (suitable for import to EndNote) - Download

TY - GEN

T1 - Clotho: an Audio Captioning Dataset

AU - Drossos, Konstantinos

AU - Lipping, Samuel

AU - Virtanen, Tuomas

PY - 2020

Y1 - 2020

N2 - Audio captioning is the novel task of general audio content description using free text. It is an intermodal translation task (not speech-to-text), where a system accepts as an input an audio signal and outputs the textual description (i.e. the caption) of that signal. In this paper we present Clotho, a dataset for audio captioning consisting of 4981 audio samples of 15 to 30 seconds duration and 24 905 captions of eight to 20 words length, and a baseline method to provide initial results. Clotho is built with focus on audio content and caption diversity, and the splits of the data are not hampering the training or evaluation of methods. All sounds are from the Freesound platform, and captions are crowdsourced using Amazon Mechanical Turk and annotators from English speaking countries. Unique words, named entities, and speech transcription are removed with post-processing. Clotho is freely available online.

AB - Audio captioning is the novel task of general audio content description using free text. It is an intermodal translation task (not speech-to-text), where a system accepts as an input an audio signal and outputs the textual description (i.e. the caption) of that signal. In this paper we present Clotho, a dataset for audio captioning consisting of 4981 audio samples of 15 to 30 seconds duration and 24 905 captions of eight to 20 words length, and a baseline method to provide initial results. Clotho is built with focus on audio content and caption diversity, and the splits of the data are not hampering the training or evaluation of methods. All sounds are from the Freesound platform, and captions are crowdsourced using Amazon Mechanical Turk and annotators from English speaking countries. Unique words, named entities, and speech transcription are removed with post-processing. Clotho is freely available online.

U2 - 10.1109/ICASSP40776.2020.9052990

DO - 10.1109/ICASSP40776.2020.9052990

M3 - Conference contribution

T3 - Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing

SP - 736

EP - 740

BT - IEEE 45th International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

PB - IEEE

ER -