GRILLBot-v2: Generative models for multi-modal task-oriented assistance
We present our Alexa TaskBot Challenge virtual assistant, GRILLBot-v2, which advances multi-modal task-oriented conversations by leveraging generative language models and automatic extraction and augmentation of interactive task data. GRILLBot-v2 is a conversational system based on open-source software and publicly available data. Manually crafting engaging conversational content is expensive and time-consuming. Conversely, relying solely on generative models risks hallucinations and loss of conversational history. Therefore, we propose advances in task-oriented automated data extraction and augmentation to create multi-modal corpora, including tasks, domain knowledge, and associated videos. Specifically, we include rich content such as custom jokes, system-initiative questions, and multi-modal content. Furthermore, this rich data allows us to ground our generative models to create complex and interactive conversations. For example, we make meaningful improvements to task QA by grounding answer generation on conversation history and our knowledge and task corpora, achieving a factoid accuracy of 0.94 on our test set. We also use structured information within tasks (e.g., category tags for recipes) to create a mixed-initiative approach that guides users to more relevant search results, increasing the average conversation rating by over ∼15% (guided search) during the semi-finals. The success of GRILLBot-v2 during the competition motivates using synthetic content from generative models for tasks by grounding them on relevant knowledge and task data.
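As a rough illustration of the guided-search idea mentioned above (using category tags to steer users toward more relevant results), the following is a minimal sketch and not the GRILLBot-v2 implementation. The `Task` type, the `guiding_question` and `filter_by_tag` helpers, the thresholds, and the example recipes are all hypothetical names and values introduced here for illustration only.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Task:
    """Hypothetical search result: a task title plus its category tags."""
    title: str
    tags: List[str] = field(default_factory=list)


def guiding_question(results: List[Task],
                     max_options: int = 3,
                     min_results: int = 4) -> Optional[str]:
    """Return a system-initiative question when results are broad, else None.

    Assumption: a result set is "broad" when it has at least `min_results`
    items and at least two tags that each cover only part of the results.
    """
    if len(results) < min_results:
        return None
    # Count each tag once per task; dict.fromkeys de-duplicates in order.
    counts = Counter(tag for task in results for tag in dict.fromkeys(task.tags))
    # A tag shared by every result cannot narrow the search, so skip it.
    options = [tag for tag, n in counts.most_common() if n < len(results)]
    options = options[:max_options]
    if len(options) < 2:
        return None
    listed = ", ".join(options[:-1]) + " or " + options[-1]
    return "I found several options. Are you interested in " + listed + "?"


def filter_by_tag(results: List[Task], tag: str) -> List[Task]:
    """Keep only the results that carry the tag the user picked."""
    return [task for task in results if tag in task.tags]


# Made-up example data, for illustration only.
results = [
    Task("Veggie chili", ["vegetarian", "dinner"]),
    Task("Beef chili", ["dinner", "beef"]),
    Task("Chili dip", ["appetizer", "vegetarian"]),
    Task("White chicken chili", ["dinner", "chicken"]),
]
question = guiding_question(results)
narrowed = filter_by_tag(results, "vegetarian")
```

In this sketch the system asks a question only when the tags can actually split the result set, and the user's answer is applied as a simple filter. A deployed system would presumably combine such a filter with ranking and with the conversation history.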