Amazon releases largest dataset for training "pick and place" robots

Dataset of images collected in an industrial setting features more than 190,000 objects, orders of magnitude more than previous datasets.

By Mani Nambi, Chaitanya Mitash, Fan Wang

April 10, 2023

In an effort to improve the performance of robots that pick, sort, and pack products in warehouses, Amazon has publicly released the largest dataset of images captured in an industrial product-sorting setting. Where the largest previous dataset of industrial images featured on the order of 100 objects, the Amazon dataset, called ARMBench, features more than 190,000 objects. As such, it could be used to train “pick and place” robots that are better able to generalize to new products and contexts.

We describe ARMBench in a paper we will present later this spring at the International Conference on Robotics and Automation (ICRA).

The scenario in which the ARMBench images were collected involves a robotic arm that must retrieve a single item from a bin full of items and transfer it to a tray on a conveyor belt. The variety of objects and their configurations and interactions in the context of the robotic system make this a uniquely challenging task.

The ARMBench pick-and-place scenario.

ARMBench contains image sets for three separate tasks: (1) object segmentation, or identifying the boundaries of different products in the same bin; (2) object identification, or determining which product image in a reference database corresponds to the highlighted product in an image; and (3) defect detection, or determining when the robot has committed an error, such as picking up multiple items rather than one or damaging an item during transfer.

The images in the dataset fall into three different categories:

the pick image is a top-down image of a bin filled with items, prior to robotic handling;
transfer images are captured from multiple viewpoints as the robot transfers an item to the tray;
the place image is a top-down image of the tray in which the selected item is placed.

Examples of, from left, a pick image, a transfer image, and a place image.
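As a rough illustration of how the three image categories relate, the images from one pick-and-place activity could be grouped into a single record. This is only a sketch: the class name, field names, and file paths below are hypothetical and are not taken from the ARMBench release, whose actual file layout may differ.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PickPlaceActivity:
    """Images recorded for one pick-and-place activity (hypothetical layout)."""
    activity_id: str
    pick_image: str  # top-down view of the bin before robotic handling
    transfer_images: List[str] = field(default_factory=list)  # multiple viewpoints during transfer
    place_image: str = ""  # top-down view of the tray after placement

    def all_images(self) -> List[str]:
        """Return the image paths in capture order: pick, transfer, place."""
        images = [self.pick_image, *self.transfer_images]
        if self.place_image:
            images.append(self.place_image)
        return images


# Made-up example paths, for illustration only.
activity = PickPlaceActivity(
    activity_id="example-0001",
    pick_image="example-0001/pick.jpg",
    transfer_images=[
        "example-0001/transfer_0.jpg",
        "example-0001/transfer_1.jpg",
        "example-0001/transfer_2.jpg",
    ],
    place_image="example-0001/place.jpg",
)
```

Keeping the images of one activity together makes it easier to feed the same activity to the different tasks.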

The object segmentation dataset contains more than 50,000 images, with anywhere from one to 50 manual object segmentations per image, for an average of about 10.5. The high degree of clutter, combined with the variety of the objects — some of which are even transparent or reflective — makes this a challenging and unique benchmark.

An example of an image from the object segmentation dataset, in which all the items in a bin have been hand-segmented.
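To get a feel for the scale of the annotation effort, the two figures quoted above can be multiplied. This is a back-of-the-envelope estimate that treats "more than 50,000" and "about 10.5" as exact values, so the result is approximate.

```python
# Figures quoted in the article, treated as exact for a rough estimate.
num_images = 50_000             # "more than 50,000 images"
avg_segments_per_image = 10.5   # "an average of about 10.5"

# Approximate total number of hand-drawn object segmentations.
approx_total_segments = round(num_images * avg_segments_per_image)
print(f"Roughly {approx_total_segments:,} manual segmentations")
```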

The object identification dataset contains more than 235,000 labeled “pick activities”; each pick activity includes a pick image and three transfer images. There are also reference images and text descriptions of more than 190,000 products; in the object identification task, a model must learn to match one of these reference products to an object highlighted in pick and transfer images.

Some of the challenges posed by this task include differentiating between similar-looking products, matching across large variations in viewpoints, and fusing multimodal information such as images and text to make predictions.
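One common way to frame such a matching task is as nearest-neighbor retrieval over embeddings. The sketch below is not the method from our paper. It assumes that some encoder (not shown) maps query images, reference images, and reference text into one shared vector space, and it fuses the image and text similarities with a hypothetical weight `image_weight`. All vectors and product IDs are toy values.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def identify(
    query_views: Sequence[Sequence[float]],
    references: Dict[str, Dict[str, Sequence[float]]],
    image_weight: float = 0.7,  # hypothetical fusion weight, not from the paper
) -> str:
    """Return the reference ID with the highest fused score, averaged over views."""

    def score(ref: Dict[str, Sequence[float]]) -> float:
        total = 0.0
        for view in query_views:
            image_sim = cosine_similarity(view, ref["image"])
            text_sim = cosine_similarity(view, ref["text"])
            # Late fusion: weighted sum of image and text similarity.
            total += image_weight * image_sim + (1.0 - image_weight) * text_sim
        return total / len(query_views)

    return max(references, key=lambda pid: score(references[pid]))


# Toy embeddings: one pick view and two transfer views of the same object.
references = {
    "product-A": {"image": [1.0, 0.0, 0.0], "text": [0.9, 0.1, 0.0]},
    "product-B": {"image": [0.0, 1.0, 0.0], "text": [0.1, 0.9, 0.0]},
}
query_views = [[0.8, 0.2, 0.0], [0.9, 0.0, 0.1], [0.7, 0.3, 0.0]]
best_match = identify(query_views, references)
```

Averaging over the views is one simple way to combine evidence from different viewpoints; with more than 190,000 reference products, a real system would also need an efficient index rather than this exhaustive loop.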

An example of a pick image *(left)* from the object identification dataset and a set of reference images, one of which is a match for the highlighted object.

The defect detection dataset includes both still images and videos. The still images — more than 19,000 of them — were captured during the transfer phase and are intended to train defect detection models, which determine when a robot arm has inadvertently damaged an object or picked up more than one object.

The 4,000 videos document pick-and-place activities that resulted in damage to a product. Certain types of product damage are best diagnosed through video, as they can occur at any point in the transfer process; multipick errors, by contrast, necessarily occur at the beginning of transfer and are visible in images. The dataset also contains images and videos for over 100,000 pick-and-place activities without any defects.
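The difference between the two defect types can be illustrated with a small sketch. It assumes a hypothetical per-frame classifier (not shown) that outputs a defect score between 0 and 1; the function names, the threshold, and the scores are made up for illustration and do not describe the models in our paper.

```python
from typing import Sequence


def has_damage(frame_scores: Sequence[float], threshold: float = 0.5) -> bool:
    """Damage can occur at any point in the transfer, so check every frame."""
    return any(score >= threshold for score in frame_scores)


def has_multipick(first_frame_score: float, threshold: float = 0.5) -> bool:
    """A multipick occurs at the start of transfer, so one early image suffices."""
    return first_frame_score >= threshold


# Made-up damage scores for four frames of one transfer video:
# the damage only becomes visible in the third frame.
damage_scores = [0.05, 0.10, 0.85, 0.20]

flagged_from_video = has_damage(damage_scores)            # uses all frames
flagged_from_first_frame = has_damage(damage_scores[:1])  # uses only the first frame
```

In this toy case the video-level check flags the activity, while a check restricted to the first frame misses it.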

The stringent accuracy requirements for defect detection in warehouse settings require exploration and improvement of several key computer vision technologies, such as image classification, anomaly detection, and detection of defect events in videos.

In our paper, we describe several approaches we adopted to building models for the ARMBench tasks, and we report our models’ performance on those tasks, to provide other researchers with performance benchmarks.

Examples of defects captured in the dataset. Images *a – c* show product damage, whereas *d – f* show multipick defects.

We intend to continue to expand the number of images and videos in the ARMBench dataset and the range of products they depict. It is our hope that ARMBench can help improve the utility of robots that relieve warehouse workers — such as the hundreds of thousands of employees at Amazon fulfillment centers — from repetitive tasks.

We also hope that the scale and diversity of the ARMBench data and the quality of its annotations will make it useful for training other types of computer vision models — not just those that help control warehouse robots.

About the Authors

Mani Nambi

Manikantan Nambi is a senior applied scientist with Amazon Robotics.

Chaitanya Mitash

Chaitanya Mitash is a computer vision scientist at Amazon.

Fan Wang
