Tampere University of Technology

TUTCRIS Research Portal

Visual Voice Activity Detection based on Spatiotemporal Information and Bag of Words

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Scientific › peer-review

Standard

Visual Voice Activity Detection based on Spatiotemporal Information and Bag of Words. / Patrona, Foteini; Iosifidis, Alexandros; Tefas, Anastasios; Pitas, Ioannis.

IEEE International Conference on Image Processing. 2015. p. 2334-2338.


Harvard

Patrona, F, Iosifidis, A, Tefas, A & Pitas, I 2015, Visual Voice Activity Detection based on Spatiotemporal Information and Bag of Words. in IEEE International Conference on Image Processing. pp. 2334-2338. https://doi.org/10.1109/ICIP.2015.7351219

APA

Patrona, F., Iosifidis, A., Tefas, A., & Pitas, I. (2015). Visual Voice Activity Detection based on Spatiotemporal Information and Bag of Words. In IEEE International Conference on Image Processing (pp. 2334-2338). https://doi.org/10.1109/ICIP.2015.7351219

Vancouver

Patrona F, Iosifidis A, Tefas A, Pitas I. Visual Voice Activity Detection based on Spatiotemporal Information and Bag of Words. In IEEE International Conference on Image Processing. 2015. p. 2334-2338 https://doi.org/10.1109/ICIP.2015.7351219

Author

Patrona, Foteini ; Iosifidis, Alexandros ; Tefas, Anastasios ; Pitas, Ioannis. / Visual Voice Activity Detection based on Spatiotemporal Information and Bag of Words. IEEE International Conference on Image Processing. 2015. pp. 2334-2338

BibTeX - Download

@inproceedings{31089295f843468099cc281f26136f87,
title = "Visual Voice Activity Detection based on Spatiotemporal Information and Bag of Words",
abstract = "A novel method for Visual Voice Activity Detection (V-VAD) that exploits local shape and motion information appearing at spatiotemporal locations of interest for facial region video description, and the Bag of Words (BoW) model for facial region video representation, is proposed in this paper. Facial region video classification is subsequently performed by a Single-hidden Layer Feedforward Neural (SLFN) network trained by applying the recently proposed kernel Extreme Learning Machine (kELM) algorithm on training facial videos depicting talking and non-talking persons. Experimental results on two publicly available V-VAD data sets demonstrate the effectiveness of the proposed method, which achieves better generalization performance on unseen users than recently proposed state-of-the-art methods.",
author = "Foteini Patrona and Alexandros Iosifidis and Anastasios Tefas and Ioannis Pitas",
year = "2015",
doi = "10.1109/ICIP.2015.7351219",
language = "English",
pages = "2334--2338",
booktitle = "IEEE International Conference on Image Processing",

}
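The pipeline described in the abstract (Bag of Words representation of local spatiotemporal descriptors, classified by a kernel Extreme Learning Machine) can be sketched as below. This is an illustrative sketch only: the random codebook, RBF kernel, toy "video" generator, and all parameter values are assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf_kernel(A, B, gamma=0.5):
    # Pairwise RBF kernel between rows of A and rows of B.
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

def bow_histogram(descriptors, codebook):
    # Assign each local descriptor to its nearest codeword and
    # return the normalized histogram of assignments (the BoW vector).
    d2 = np.sum((descriptors[:, None, :] - codebook[None, :, :]) ** 2, axis=2)
    hist = np.bincount(np.argmin(d2, axis=1), minlength=len(codebook)).astype(float)
    return hist / hist.sum()

def make_video(talking):
    # Toy stand-in for local spatiotemporal descriptors of a facial video:
    # talking videos cluster near +1, non-talking near -1 (illustrative only).
    center = 1.0 if talking else -1.0
    return center + 0.3 * rng.standard_normal((40, 8))

codebook = rng.standard_normal((16, 8))  # stand-in for a learned codebook
X_train = np.array([bow_histogram(make_video(t), codebook) for t in [1, 0] * 20])
y_train = np.array([1.0, -1.0] * 20)     # +1 talking, -1 non-talking

# Kernel ELM training: output weights beta = (I/C + K)^-1 * targets,
# where K is the training kernel matrix and C a regularization parameter.
C = 100.0
K = rbf_kernel(X_train, X_train)
beta = np.linalg.solve(np.eye(len(K)) / C + K, y_train)

# Classify an unseen facial video by the sign of the kernel expansion.
x = bow_histogram(make_video(talking=True), codebook)[None, :]
score = float(rbf_kernel(x, X_train) @ beta)
label = "talking" if score > 0 else "non-talking"
```

The closed-form solve is what distinguishes the kELM from iteratively trained networks: training reduces to one regularized linear system over the kernel matrix.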

RIS (suitable for import to EndNote) - Download

TY - GEN

T1 - Visual Voice Activity Detection based on Spatiotemporal Information and Bag of Words

AU - Patrona, Foteini

AU - Iosifidis, Alexandros

AU - Tefas, Anastasios

AU - Pitas, Ioannis

PY - 2015

Y1 - 2015

N2 - A novel method for Visual Voice Activity Detection (V-VAD) that exploits local shape and motion information appearing at spatiotemporal locations of interest for facial region video description, and the Bag of Words (BoW) model for facial region video representation, is proposed in this paper. Facial region video classification is subsequently performed by a Single-hidden Layer Feedforward Neural (SLFN) network trained by applying the recently proposed kernel Extreme Learning Machine (kELM) algorithm on training facial videos depicting talking and non-talking persons. Experimental results on two publicly available V-VAD data sets demonstrate the effectiveness of the proposed method, which achieves better generalization performance on unseen users than recently proposed state-of-the-art methods.

AB - A novel method for Visual Voice Activity Detection (V-VAD) that exploits local shape and motion information appearing at spatiotemporal locations of interest for facial region video description, and the Bag of Words (BoW) model for facial region video representation, is proposed in this paper. Facial region video classification is subsequently performed by a Single-hidden Layer Feedforward Neural (SLFN) network trained by applying the recently proposed kernel Extreme Learning Machine (kELM) algorithm on training facial videos depicting talking and non-talking persons. Experimental results on two publicly available V-VAD data sets demonstrate the effectiveness of the proposed method, which achieves better generalization performance on unseen users than recently proposed state-of-the-art methods.

U2 - 10.1109/ICIP.2015.7351219

DO - 10.1109/ICIP.2015.7351219

M3 - Conference contribution

SP - 2334

EP - 2338

BT - IEEE International Conference on Image Processing

ER -