Detecting discussion scenes in instructional videos
Abstract
This paper addresses the problem of detecting discussion scenes in instructional videos using statistical approaches. Specifically, given a series of speech segments separated from the audio tracks of educational videos, we first model the instructor's voice with a Gaussian mixture model (GMM); a four-state transition machine then extracts discussion scenes in real time from detected instructor-student speaker change points. Meanwhile, the GMM is continuously updated to accommodate variation in the instructor's voice over time. Promising experimental results have been achieved on five educational videos from the IBM MicroMBA program, and interesting instruction and teaching patterns have been observed. The extracted scene information can facilitate semantic indexing and structuring of instructional video content.
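For illustration only, the following is a minimal sketch of the detection loop described above, assuming scikit-learn's GaussianMixture over precomputed per-segment acoustic features (e.g., MFCC frames). The abstract does not specify the four states, the decision threshold, or the transition rules, so the ones below (a simple debounce-style machine) are hypothetical; the online GMM adaptation and any retraction of tentative labels are likewise omitted.

```python
# Hypothetical sketch of GMM-based instructor modeling plus a four-state
# machine for discussion-scene detection. State names, threshold, and
# transition logic are assumptions, not the paper's actual design.
import numpy as np
from sklearn.mixture import GaussianMixture

LECTURE, MAYBE_DISCUSSION, DISCUSSION, MAYBE_LECTURE = range(4)

def fit_instructor_model(instructor_frames, n_components=8):
    """Fit a GMM to feature frames known to come from the instructor."""
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag",
                          random_state=0)
    gmm.fit(instructor_frames)
    return gmm

def is_instructor(gmm, segment_frames, threshold=-30.0):
    # Average per-frame log-likelihood under the instructor GMM; segments
    # scoring below the (illustrative) threshold are treated as student speech.
    return gmm.score(segment_frames) >= threshold

def label_segments(gmm, segments):
    """Run a debounce-style four-state machine over segments, in order.

    One off-instructor segment is tentative (MAYBE_DISCUSSION); a second
    confirms DISCUSSION. Inside a discussion, one instructor segment is a
    tentative ending (MAYBE_LECTURE); a second confirms a return to LECTURE.
    """
    state, labels = LECTURE, []
    for frames in segments:
        inst = is_instructor(gmm, frames)
        if state == LECTURE:
            state = LECTURE if inst else MAYBE_DISCUSSION
        elif state == MAYBE_DISCUSSION:
            state = LECTURE if inst else DISCUSSION
        elif state == DISCUSSION:
            state = MAYBE_LECTURE if inst else DISCUSSION
        else:  # MAYBE_LECTURE
            state = LECTURE if inst else DISCUSSION
        labels.append("discussion" if state != LECTURE else "lecture")
    return labels

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-in 13-dim "MFCC" frames: instructor near mean 0, students near 3.
    instructor = rng.normal(0.0, 1.0, size=(500, 13))
    segments = [rng.normal(m, 1.0, size=(50, 13)) for m in (0, 0, 3, 3, 0, 0)]
    gmm = fit_instructor_model(instructor)
    print(label_segments(gmm, segments))
    # -> ['lecture', 'lecture', 'discussion', 'discussion', 'discussion', 'lecture']
```

The two "maybe" states act as a one-segment hysteresis, so a single misclassified segment does not immediately open or close a discussion scene; a production system would also calibrate the likelihood threshold on held-out data and periodically refit the GMM on segments confidently attributed to the instructor.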