EMMA: A foundation model for embodied, interactive, multimodal task completion in 3D environments

2023
In this technical report, we present EMMA, a foundation model for embodied, interactive, and multimodal task completion in 3D environments. Unlike previous Vision+Language (V+L) models, EMMA is an encoder-decoder architecture that encodes both images and videos (i.e., sequences of frames) and generates natural language tokens conditioned on specific task prompts. By treating every task as a natural language generation task, EMMA learns a language of actions that can be used for different tasks in the pipeline of an embodied AI system. We perform an extensive experimental evaluation to demonstrate the performance of our foundation model. First, despite being substantially smaller than other V+L models, EMMA is competitive with (or superior to) them on several state-of-the-art V+L benchmarks, demonstrating the value of our model design and multitask pretraining regime. Additionally, we show that a model trained on Alexa Arena data can perform zero-shot cross-domain transfer when asked to perform the same tasks in the real world. Moreover, EMMA shows strong generalization performance in novel missions with real users, achieving an average score of 4.06 (out of 5) over the generalization phase, which ran from the 16th to the 22nd of March 2023.
