Robotics

Amazon releases code, datasets for developing embodied AI agents

With Alexa Arena, developers can create simulated missions in which humans interact with virtual robots, providing a natural way to build generalizable AI models.

By Govind Thattai, Qiaozi (QZ) Gao

March 20, 2023

Alexa Arena is a new embodied-AI framework developed to push the boundaries of human-robot interaction. It offers an interactive, user-centric way to create robotic tasks that involve navigating multiroom simulated environments and manipulating a wide range of objects in real time. In a gamelike setting, users can interact with virtual robots through natural-language dialogue, helping the robots complete their tasks. The framework currently includes a large set of multiroom layouts for a home, a warehouse, and a lab.

Arena enables the training and evaluation of embodied-AI models, along with the generation of new training data based on the human-robot interactions. It can thus contribute to the development of generalizable embodied agents with a wide variety of AI capabilities, such as task planning, visual dialogue, multimodal reasoning, task completion, teachable AI, and conversational understanding.

We have publicly released (a) the code repository for Arena, which includes the simulation engine artifacts and a machine learning (ML) toolbox for model training and visual inferencing; (b) comprehensive datasets for training embodied agents; and (c) benchmark ML models that incorporate vision and language planning for task completion. We have also launched a new leaderboard for Arena to evaluate the performance of embodied agents on unseen tasks.

The simulation engine of Alexa Arena is built using the Unity game engine and includes 330+ assets spanning both commonplace objects in homes (such as refrigerators and chairs) and uncommon objects (such as forklifts and floppy disks). Arena also features more than 200,000 multiroom scenes, each with a unique combination of room specifications and furniture arrangement.

In addition, each scene can randomize the robot’s initial location, the placement of movable objects (such as computers and books), floor materials, wall colors, etc., to provide the rich set of visual variations needed to train embodied agents through both supervised and reinforcement learning methods.
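As a rough illustration of this kind of per-scene randomization, the sketch below samples one variation from a seed. It is not Arena's API: `SceneVariation`, `sample_variation`, the spawn points, and the material and color lists are all hypothetical names chosen for the example.

```python
import random
from dataclasses import dataclass

# Hypothetical option lists; Arena's real materials and colors are not given here.
FLOOR_MATERIALS = ["wood", "tile", "carpet"]
WALL_COLORS = ["white", "beige", "light blue"]


@dataclass
class SceneVariation:
    robot_start: tuple       # (x, y) position of the robot
    object_positions: dict   # movable object name -> (x, y) position
    floor_material: str
    wall_color: str


def sample_variation(movable_objects, spawn_points, seed):
    """Sample one visual variation of a scene, reproducibly from a seed."""
    rng = random.Random(seed)
    # Draw distinct positions: one for the robot, one per movable object.
    slots = rng.sample(spawn_points, k=len(movable_objects) + 1)
    return SceneVariation(
        robot_start=slots[0],
        object_positions=dict(zip(movable_objects, slots[1:])),
        floor_material=rng.choice(FLOOR_MATERIALS),
        wall_color=rng.choice(WALL_COLORS),
    )


spawn_points = [(x, y) for x in range(3) for y in range(3)]
variation = sample_variation(["computer", "book"], spawn_points, seed=7)
print(variation)
```

Seeding the generator makes each variation reproducible, which is useful when the same scene needs to be revisited during evaluation.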

An example of a game built with Arena, with, at left, a virtual room seen from a simulated robot’s perspective and, at right, dialogue between the robot and the human operator.

To make games more engaging, Arena includes live background animations and sounds, user-friendly graphics, smooth robot navigation with live visuals, support for multiple viewpoints that can be switched between first-person and third-person cameras, hazards and preconditions that can be incorporated into task completion criteria, a mini-map showing the location of the robot within a scene, and a configurable hint-generation mechanism. After the execution of every action in the environment, Arena generates a rich set of metadata, such as images from RGB and depth cameras, segmentation maps, robot location, and error codes.

Long-horizon robotic tasks (such as “make a hot cup of tea”) can be authored in Arena, using a new challenge definition format (CDF) to specify the initial states of objects (such as “cabinet doors are closed”), goal conditions to be satisfied (such as “cup is filled with milk or water”), and textual hints planted at specific locations in the scene (such as “check the fridge for milk”).
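To make the three ingredients of a challenge definition concrete, here is a minimal sketch that models them as plain Python data. It does not reproduce the actual CDF schema, which is not shown in this post: the `Challenge` and `Hint` classes, their field names, and the `goals_satisfied` helper are assumptions made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Hint:
    location: str  # where in the scene the hint is planted (hypothetical field)
    text: str


@dataclass
class Challenge:
    task: str
    # (object, property) -> value at the start of the mission
    initial_states: dict
    # Every entry must hold; each entry lists acceptable alternatives.
    goal_conditions: list
    hints: list = field(default_factory=list)


def goals_satisfied(challenge, world_state):
    """Return True if every goal has at least one alternative met."""
    return all(
        any(world_state.get(key) == value for key, value in alternatives)
        for alternatives in challenge.goal_conditions
    )


tea = Challenge(
    task="make a hot cup of tea",
    initial_states={("cabinet", "door"): "closed"},
    goal_conditions=[
        # "cup is filled with milk or water"
        [(("cup", "filled_with"), "milk"), (("cup", "filled_with"), "water")],
    ],
    hints=[Hint(location="kitchen counter", text="check the fridge for milk")],
)

world = dict(tea.initial_states)
print(goals_satisfied(tea, world))   # goal not yet met
world[("cup", "filled_with")] = "water"
print(goals_satisfied(tea, world))   # goal met
```

Representing each goal as a list of alternatives is one simple way to express "milk or water" conditions; the real format may encode this differently.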

The Arena framework powers the Alexa Prize Simbot challenge, in which 10 university teams are competing to develop embodied-AI agents that complete tasks with guidance from Alexa customers. Customers with Echo Show or Fire TV devices interact with the agents through voice commands, helping the robots achieve goals displayed on-screen. The challenge finals will take place in early May 2023.

The code repository for Arena includes two datasets: (a) an instruction-following dataset containing 46,000 human-annotated dialogue instructions, along with ground-truth action trajectories and robot-view images, and (b) a vision dataset containing 660,000 images from Arena scenes spanning 160+ semantic-object groups, collected by navigating the robot to various virtual locations and capturing images of the objects there from different perspectives and distances.

The data collection methodology that we used to create the instruction-following dataset is similar to the two-step procedure that we adopted in our earlier work on DialFRED, where we used demonstrative videos (generated by a symbolic planner) to create crowd-sourced natural-language instructions in the form of multiturn Q&A dialogues.

Sample data from the Arena dataset.

Using the datasets mentioned above, we trained two embodied-agent models as benchmarks for Arena tasks. One is a neuro-symbolic model that uses the contextual history of past actions and a dedicated vision model:

Overview of the neuro-symbolic approach.

The other is an embodied vision-language (EVL) model that incorporates a joint vision-language encoder and a multihead model for task planning and mask prediction:

Overview of the embodied vision-language (EVL) model.

To evaluate our benchmarks, we used a metric called mission success rate (MSR): the number of successfully completed tasks in the evaluation set divided by the total number of tasks in that set.
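Written as code, MSR is a simple ratio. The sketch below assumes each task in the evaluation set has already been reduced to a success flag; the function name `mission_success_rate` and the example outcomes are hypothetical and are not taken from the Arena toolbox or its results.

```python
from typing import Iterable


def mission_success_rate(outcomes: Iterable[bool]) -> float:
    """Ratio of successfully completed tasks to total tasks."""
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("evaluation set is empty")
    # True counts as 1 and False as 0, so sum() gives the number of successes.
    return sum(outcomes) / len(outcomes)


# Made-up outcomes for four evaluation tasks (three successes, one failure).
print(mission_success_rate([True, False, True, True]))
```

Because every task counts equally, MSR does not give partial credit for missions in which only some goal conditions are met.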
