Embodied Artificial Intelligence is an emerging field focused on creating intelligent agents that can perceive and navigate their environment and manipulate objects within it. While current smart assistants are limited to speech- and text-based interactions, developing embodied agents that can engage in natural dialogue and complete physical tasks presents a significant challenge, but one necessary to the advancement of AI. We tackle this challenge within the Amazon Arena Environment by building ScottyBot, an agent that grounds dialogue context in both images and actions. ScottyBot can hold conversations to complete missions and deliver on users’ requests. Our modular, full-stack approach integrates state-of-the-art natural language and computer vision models to track and execute interactions and to maintain natural conversations.