EMMA: A foundation model for embodied, interactive, multimodal task completion in 3D environments
2023
In this technical report, we present EMMA, a foundation model for embodied, interactive, and multimodal task completion in 3D environments. Unlike previous Vision+Language (V+L) models, EMMA is an encoder-decoder architecture that encodes both images and videos (i.e., sequences of frames) and generates natural language tokens conditioned on specific task prompts. By treating every task as a natural language generation task, EMMA learns a language of actions that can be reused across the different tasks in the pipeline of an embodied AI system. We perform an extensive experimental evaluation to demonstrate the performance of our foundation model. First, despite being substantially smaller than other V+L models, EMMA achieves competitive (or superior) performance on several state-of-the-art V+L benchmarks, demonstrating the value of our model design and multitask pretraining regime. Additionally, we show that a model trained on Alexa Arena data can perform zero-shot cross-domain transfer when asked to complete the same tasks in the real world. Moreover, EMMA shows strong generalization performance on novel missions with real users, achieving an average score of 4.06 (out of 5) over the generalization phase, which ran from 16 to 22 March 2023.
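To make the unified text-generation framing concrete, the snippet below is a minimal sketch (not the authors' code) of how heterogeneous embodied tasks can be serialized into a single (task prompt + visual context) → target-text format, so that one encoder-decoder model can be trained on all of them. The task prompts, frame paths, and action tokens shown are illustrative assumptions, not EMMA's actual vocabulary.

```python
# Minimal sketch: casting different embodied-AI tasks into one text-to-text format.
# All names and action tokens below are hypothetical placeholders.
from dataclasses import dataclass
from typing import List


@dataclass
class MultimodalExample:
    task_prompt: str    # natural-language prompt identifying the task
    frames: List[str]   # image or video frames to be visually encoded
    target_text: str    # target output, including actions expressed as language


examples = [
    MultimodalExample(
        task_prompt="Caption the image.",
        frames=["frame_000.png"],
        target_text="a robot arm placing a mug on the counter",
    ),
    MultimodalExample(
        task_prompt="Instruction: put the mug in the sink. What should the agent do next?",
        frames=["frame_000.png", "frame_001.png", "frame_002.png"],
        target_text="goto sink . place mug",  # a "language of actions"
    ),
]

for ex in examples:
    print(f"{ex.task_prompt!r} + {len(ex.frames)} frame(s) -> {ex.target_text!r}")
```

Because every example shares this prompt-to-text interface, captioning, grounding, and action prediction can all be supervised with the same token-level generation loss.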