Multimodal Video Analysis and Modeling
Research output: Book/Report › Doctoral thesis › Collection of Articles
Publisher: Tampere University of Technology
Number of pages: 69
Publication status: Published - 18 Nov 2016
Publication type: G5 Doctoral dissertation (article)
Series name: Tampere University of Technology. Publication
Fusion of information from multiple modalities is applied to recording environment classification from video and audio, as well as to sport type classification from a set of multi-device videos, the corresponding audio, and recording device motion sensor data. The environment classification combines support vector machine (SVM) classifiers trained on various global low-level visual features with audio event histogram-based environment classification using k-nearest neighbors (k-NN). Rule-based fusion schemes with genetic algorithm (GA)-optimized modality weights are compared to training an SVM classifier to perform the multimodal fusion. A comprehensive selection of fusion strategies is compared for the task of classifying the sport type of a set of recordings from a common event. These include fusion prior to, simultaneously with, and after classification; various approaches to using modality quality estimates; and fusing soft confidence scores as well as crisp single-class predictions. Additionally, different strategies are examined for aggregating the decisions of single videos into a collective prediction for the set of videos recorded concurrently with multiple devices. In both tasks, multimodal analysis shows a clear advantage over separate classification of the modalities.
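As an illustration of the rule-based late fusion described above, the following sketch fuses per-class confidence scores from a visual and an audio classifier using modality weights chosen on held-out data. The class names, the weighted-sum rule, and the simple random weight search standing in for the GA optimization are assumptions made for this example only, not details taken from the thesis.

```python
# Minimal sketch of rule-based late fusion with modality weights.
# The classes, scores, and the random weight search (a stand-in for
# GA optimization) are illustrative assumptions, not the thesis method.
import numpy as np

CLASSES = ["indoor", "outdoor", "vehicle"]  # hypothetical environment classes

def fuse_scores(visual_scores, audio_scores, w_visual, w_audio):
    """Weighted-sum rule fusion of per-class confidence scores."""
    fused = w_visual * visual_scores + w_audio * audio_scores
    return fused / fused.sum(axis=1, keepdims=True)  # renormalize per sample

def accuracy(scores, labels):
    return np.mean(np.argmax(scores, axis=1) == labels)

def search_weights(visual_scores, audio_scores, labels, n_trials=200, seed=0):
    """Random search over modality weights on held-out data
    (used here in place of the GA-based weight optimization)."""
    rng = np.random.default_rng(seed)
    best_w, best_acc = (0.5, 0.5), -1.0
    for _ in range(n_trials):
        w_v = rng.uniform(0.0, 1.0)
        w = (w_v, 1.0 - w_v)
        acc = accuracy(fuse_scores(visual_scores, audio_scores, *w), labels)
        if acc > best_acc:
            best_w, best_acc = w, acc
    return best_w, best_acc

if __name__ == "__main__":
    rng = np.random.default_rng(42)
    n, k = 100, len(CLASSES)
    labels = rng.integers(0, k, size=n)
    # Synthetic per-class confidences from the two modality classifiers,
    # biased toward the true class to mimic real classifier outputs.
    visual = rng.dirichlet(np.ones(k), size=n)
    audio = rng.dirichlet(np.ones(k), size=n)
    visual[np.arange(n), labels] += 0.5
    audio[np.arange(n), labels] += 0.3
    visual /= visual.sum(axis=1, keepdims=True)
    audio /= audio.sum(axis=1, keepdims=True)
    (w_v, w_a), acc = search_weights(visual, audio, labels)
    print(f"best weights: visual={w_v:.2f}, audio={w_a:.2f}, fused accuracy={acc:.2f}")
```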
Another part of the work investigates cross-modal pattern analysis and audio-based video editing. This study examines the feasibility of automatically timing shot cuts of multi-camera concert recordings according to music-related cutting patterns learnt from professional concert videos. Cut timing is a crucial part of the automated creation of multi-camera mashups, in which shots from multiple recording devices at a common event are alternated with the aim of mimicking a professionally produced video. In the framework, separate statistical models are formed for typical patterns of beat-quantized cuts in short segments, for the differences in beats between consecutive cuts, and for the relative deviation of cuts from exact beat times. Based on music meter and audio change point analysis of a new recording, the models can be used to synthesize cut times. In a user study, the proposed framework clearly outperforms a baseline automatic method with comparably advanced audio analysis and wins 48.2% of comparisons against hand-edited videos.
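The sketch below illustrates the general idea of synthesizing beat-quantized cut times from simple statistical models of cut spacing and timing deviation. The beat grid, the distribution over cut gaps in beats, and the deviation model are hypothetical placeholders chosen for the example; they are not the models estimated in the thesis.

```python
# Minimal sketch of sampling beat-quantized cut times. All numeric values
# and distributions are illustrative assumptions, not the learnt models.
import numpy as np

def synthesize_cuts(beat_times, rng=None):
    """Place cuts on a beat grid, drawing the gap between consecutive cuts
    (in beats) from a categorical model and adding a small relative
    deviation from the exact beat time."""
    rng = rng or np.random.default_rng(0)
    # Hypothetical model: gaps of 4 or 8 beats between cuts are most likely.
    gap_choices = np.array([2, 4, 8, 16])
    gap_probs = np.array([0.15, 0.45, 0.30, 0.10])
    beat_period = float(np.median(np.diff(beat_times)))
    cuts, i = [], 0
    while i < len(beat_times):
        # Relative deviation from the exact beat time (fraction of a beat).
        deviation = rng.normal(0.0, 0.05) * beat_period
        cuts.append(beat_times[i] + deviation)
        i += rng.choice(gap_choices, p=gap_probs)
    return np.array(cuts)

if __name__ == "__main__":
    # Synthetic beat grid: 120 BPM for one minute.
    beats = np.arange(0.0, 60.0, 0.5)
    cut_times = synthesize_cuts(beats)
    print(np.round(cut_times[:8], 2))
```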