Personalized Assessment of Arousal and Valence from Videos
Abstract
Human behavior is influenced by numerous subjective factors such as environment, culture, hormones, and genes. This makes the development of a one-size-fits-all behavioral model challenging, especially in the domain of affect recognition. In this paper we present a method to classify and assess arousal and valence from video in a personalized way. We represent the information inherent in the video through three semantically distinct types of signals: motion, appearance, and physiology. We use single- and multi-stream LSTM models for data fusion and classification, and compare our results against published values on a publicly available dataset of 40 subjects. We further demonstrate that the personalized approach achieves better performance (arousal: 78.16% avg. accuracy; valence: 89.22% avg. accuracy), while providing more insight into the role of each signal group. For arousal classification, we can distinguish between subjects whose expressions are dominated by motion and those who exhibit more static expressions. Fusion of all three signal types yielded an advantage for only a few subjects, a limitation that may be related to the short duration of the video recordings.
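To make the fusion idea concrete, the following is a minimal sketch of a multi-stream LSTM classifier that fuses motion, appearance, and physiology sequences by concatenating each stream's final hidden state. The framework (PyTorch), feature dimensions, hidden size, and concatenation-based fusion are illustrative assumptions, not the exact architecture used in the paper.

```python
# Hypothetical sketch: multi-stream LSTM fusion for binary arousal/valence classification.
# Feature dimensions, hidden size, and fusion-by-concatenation are assumptions for illustration.
import torch
import torch.nn as nn

class MultiStreamLSTM(nn.Module):
    def __init__(self, feat_dims=(32, 128, 8), hidden=64, num_classes=2):
        super().__init__()
        # One LSTM per signal group: motion, appearance, physiology.
        self.streams = nn.ModuleList(
            nn.LSTM(input_size=d, hidden_size=hidden, batch_first=True) for d in feat_dims
        )
        self.classifier = nn.Linear(hidden * len(feat_dims), num_classes)

    def forward(self, motion, appearance, physiology):
        # Each input has shape (batch, time, features); keep the last hidden state per stream.
        finals = []
        for lstm, x in zip(self.streams, (motion, appearance, physiology)):
            _, (h_n, _) = lstm(x)
            finals.append(h_n[-1])           # (batch, hidden)
        fused = torch.cat(finals, dim=-1)    # late fusion by concatenation
        return self.classifier(fused)        # logits for low/high arousal (or valence)

# Example forward pass with random per-frame features for a batch of 4 clips, 50 frames each.
model = MultiStreamLSTM()
logits = model(torch.randn(4, 50, 32), torch.randn(4, 50, 128), torch.randn(4, 50, 8))
print(logits.shape)  # torch.Size([4, 2])
```

A single-stream variant would simply use one of these LSTM branches on its own; in a personalized setting, a separate model of this kind could be trained or fine-tuned per subject.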