Alexa Prize SimBot Challenge

Member of the EMMA finalist team

I was part of the Heriot-Watt University team for the Alexa Prize SimBot Challenge. We were one of the finalist teams. :tada:

About the challenge

The SimBot Challenge is a voice-controlled game set in the simulated environment of Alexa Arena. The objective of the game is to complete various missions by directing a virtual agent. As players cannot directly interact with the environment, the agent executes their instructions while also communicating to offer suggestions or ask for clarifications. During the competition, real users got to play the game on Alexa Echo Show devices and rate the bots based on their overall experience.

Our approach

Our approach is based on EMMA, a model with the following architecture:

EMMA is a transformer-based encoder-decoder model. Vision and language inputs are embedded through modality-specific layers, concatenated into a sequence of embeddings, and fed into a single-stream encoder. All tasks are formulated as language generation by incorporating sentinel tokens that allow referencing specific image frames and regions.
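To illustrate the sentinel-token formulation, here is a minimal sketch of how an input sequence can reference specific image frames and regions so that every task reduces to language generation. The token names (`<frame_i>`, `<vis_token_j>`) and the helper function are hypothetical placeholders, not EMMA's actual vocabulary:

```python
# Hypothetical sketch of formulating a grounded task as language generation
# with sentinel tokens. Token names (<frame_i>, <vis_token_j>) are
# illustrative, not EMMA's actual vocabulary.

def build_sequence(instruction, num_frames, regions_per_frame):
    """Interleave frame and region sentinel tokens with the instruction."""
    visual_tokens = []
    for f in range(1, num_frames + 1):
        visual_tokens.append(f"<frame_{f}>")
        for r in regions_per_frame[f - 1]:
            visual_tokens.append(f"<vis_token_{r}>")
    # Vision and language parts are concatenated into a single sequence
    # for the single-stream encoder.
    return visual_tokens + instruction.split()

sequence = build_sequence(
    "pick up the mug",
    num_frames=2,
    regions_per_frame=[[1, 2], [3]],
)
print(sequence)
# The decoder can then generate outputs that reference these sentinels,
# e.g. "goto <frame_2> <vis_token_3>" to ground an action in a region.
```

Because frames and regions are addressable as tokens, action prediction, object grounding, and clarification can all share one generation interface.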

The model is pretrained on multiple vision-and-language tasks in order to obtain language grounding capabilities. The downstream instruction-following task is decomposed into Visual Ambiguity Detection and Policy, decoupling clarification requests from action prediction.
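The decoupling above can be sketched as a two-stage control flow. In the real system both stages are trained models; the function bodies below are hypothetical stubs that only mimic the interface:

```python
# Hypothetical sketch of decoupling clarification from action prediction.
# Both stages are learned models in the actual system; simple stubs stand
# in for them here.

def detect_ambiguity(instruction, visible_objects):
    """Return candidate referents if the instruction is ambiguous."""
    matches = [o for o in visible_objects if o.split("_")[0] in instruction]
    return matches if len(matches) > 1 else []

def policy(instruction, visible_objects):
    """Stand-in for the action-prediction model."""
    return f"goto {visible_objects[0]}"

def step(instruction, visible_objects):
    candidates = detect_ambiguity(instruction, visible_objects)
    if candidates:
        # Ask a clarification question instead of acting.
        return f"Which one do you mean: {', '.join(candidates)}?"
    return policy(instruction, visible_objects)

print(step("pick up the mug", ["mug_red", "mug_blue"]))
# Ambiguous: two mugs are visible, so the agent asks for clarification.
print(step("pick up the mug", ["mug_red", "table"]))
# Unambiguous: the policy predicts an action directly.
```

Keeping the two decisions separate means the policy never has to trade off acting against asking, and the clarification model can be evaluated and improved in isolation.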

To ensure the robustness of our system when interacting with real users, we developed additional functionalities through separate modules. These include out-of-domain detection to catch adversarial or out-of-scope user utterances, in-game QA to answer frequent questions about the game or the agent, and a symbolic search and memory module to efficiently locate objects in the environment. Check our paper for more details!
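A simplified view of how a user utterance might flow through these modules before reaching the instruction-following model. The keyword checks and the memory format are illustrative stubs, not the deployed classifiers:

```python
# Hypothetical dispatcher sketching the flow of a user utterance through
# the auxiliary modules. The keyword checks are illustrative stubs, not
# the deployed classifiers.

GAME_QA = {"what can you do": "I can follow your instructions in the Arena!"}

def out_of_domain(utterance):
    return "weather" in utterance  # stub for the OOD classifier

def route(utterance, known_locations):
    if out_of_domain(utterance):
        return "Sorry, I can only help with missions in the Alexa Arena."
    if utterance in GAME_QA:                      # in-game QA module
        return GAME_QA[utterance]
    for obj, room in known_locations.items():     # symbolic search & memory
        if f"where is the {obj}" in utterance:
            return f"I last saw the {obj} in the {room}."
    return "policy: execute instruction"          # fall through to EMMA

memory = {"mug": "break room"}
print(route("what is the weather", memory))
print(route("where is the mug", memory))
print(route("pick up the mug", memory))
```

Handling out-of-scope requests and frequent questions before the policy keeps the learned model focused on instructions it was actually trained for.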

High-level overview of the Alexa Skill showcasing the flow of a new request through all components.

Takeaways

  • We developed an encoder-decoder model that encodes both images and videos (i.e., sequences of frames) and generates natural language tokens conditioned on specific task prompts.
  • We observed strong generalization performance to unseen missions, as well as zero-shot transfer from the Alexa Arena to real-world scenes.
  • One limitation of our system is that it was trained with the expectation that users would provide concrete instructions. As a result, it sometimes struggles when users instead give high-level mission goals.
  • After the challenge, we focused on the core EMMA component, showing further benefits from joint fine-tuning across the downstream embodied tasks. This work was accepted as a paper at EMNLP 2023.

References

2023

  1. EMMA: A Foundation Model for Embodied, Interactive, Multimodal Task Completion in 3D Environments
    Amit Parekh, Malvina Nikandrou, Georgios Pantazopoulos, and 5 more authors
    Alexa Prize SimBot Challenge Proceedings, 2023