Long-term social interaction context: The key to egocentric addressee detection
2024
As embodied agents learn to interact, it is crucial for them to understand when to respond, what to say, and to whom. While advances in natural-language processing and speech technologies have enabled conversational agents to focus on what to say, they still struggle to determine when and to whom they should respond. In this paper, we address the addressee detection (Talking-To-Me, TTM) problem from an egocentric view. Instead of relying solely on short-term audio and video data, we propose SICNet, a simple architecture with self- and cross-modality attention that leverages long-term social interaction context. With this long-term information, our approach achieves a mean Average Precision (mAP) of 68.98% on the Ego4D TTM task, surpassing the previous state-of-the-art single-task model by 10.07%. We also conduct a detailed ablation study to demonstrate the effectiveness of each component of the long-term social interaction context.
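To make the terms self-modality and cross-modality attention concrete, the following is a minimal NumPy sketch of generic scaled dot-product attention applied within one modality and across two. It is not the SICNet implementation: the token counts, the feature width of 16, the random inputs, and the function names `softmax` and `attention` are all assumptions chosen for illustration, and learned projection matrices are omitted for brevity.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(queries, keys, values):
    """Scaled dot-product attention without learned projections.

    queries: (n_q, d), keys: (n_k, d), values: (n_k, d)
    Returns the attended features (n_q, d) and the weights (n_q, n_k).
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    return weights @ values, weights


# Hypothetical inputs: 6 audio tokens and 10 video tokens, each of width 16.
rng = np.random.default_rng(0)
audio = rng.normal(size=(6, 16))
video = rng.normal(size=(10, 16))

# Self-modality attention: audio tokens attend to other audio tokens.
audio_self, self_weights = attention(audio, audio, audio)

# Cross-modality attention: audio tokens query the video tokens.
audio_cross, cross_weights = attention(audio, video, video)
```

In this sketch each row of `cross_weights` is a distribution over the video tokens, so every audio token receives a weighted summary of the visual stream. A longer context would simply correspond to more tokens along the first axis of each input.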