We study media presence detection, that is, learning to recognize whether a sound segment (typically a few seconds long) of a long recorded stream contains media (TV) sound. This problem is difficult because non-media sound sources can be quite diverse (e.g. human voicing, non-vocal sounds and non-human sounds), and the recorded sound can be a mixture of media and non-media sound.
Unlike speech recognition, where the recognizer needs to detect local phonetic variation, the key features that distinguish media from non-media sounds are nonlocal. Motivated by this, we propose a hierarchical model that jointly learns a representation of each pre-chunked segment within a long recorded stream, and encourages every local representation to be insensitive to variations within its segment. We further explore the effects of techniques including stream-based normalization and iterative imputation of missing labels in the training dataset. Experimental results indicate that our proposed context-based methods are effective for media presence detection.
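As an illustration of one technique named above, the following is a minimal sketch of stream-based normalization, assuming it means standardizing each feature dimension with statistics computed over the whole recorded stream rather than over a single segment. The abstract does not specify the exact procedure, so the function name `normalize_stream`, the `eps` parameter, and the toy feature values are hypothetical.

```python
import math


def normalize_stream(frames, eps=1e-8):
    """Standardize each feature dimension using whole-stream statistics.

    frames: list of equal-length lists of floats, one feature vector per frame.
    Returns a new list of lists; the input is not modified.
    NOTE: this is an assumed reading of "stream-based normalization",
    not the procedure confirmed by the paper.
    """
    if not frames:
        return []
    n = len(frames)
    dim = len(frames[0])
    # Per-dimension mean over every frame in the stream.
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    # Per-dimension population standard deviation over the stream.
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n)
        for d in range(dim)
    ]
    # eps guards against division by zero for constant dimensions.
    return [
        [(f[d] - means[d]) / (stds[d] + eps) for d in range(dim)]
        for f in frames
    ]


# Toy stream: three frames, two feature dimensions (values are made up).
stream = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]]
normalized = normalize_stream(stream)
```

Computing the statistics per stream, instead of per segment, keeps the relative level differences between segments of the same recording, which is one plausible way to expose nonlocal context to a downstream classifier.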
Hierarchical Residual-pyramidal Model for Large Context Based Media Presence Detection
2019