The majority of modern robot learning methods focus on learning a set of pre-defined tasks, with limited or no generalization to new tasks. Extending the robot's skill set to novel tasks requires gathering extensive training data for each additional task. In this paper, we address the problem of teaching a robot new repetitive tasks (e.g., packing) from a human demonstration video. Doing so requires understanding the human video and identifying which object is being manipulated (the pick object) and where it is being placed (the placement slot). In addition, it requires re-identifying the pick object and the placement slots at inference time, along with their relative poses, to enable robot execution of the task.
To tackle this, we propose SLeRP, a modular system that leverages several advanced visual foundation models and a novel slot-level placement detector, SlotNet, eliminating the need for expensive video demonstrations for training. We evaluate our system on a new benchmark of real-world videos. Results show that SLeRP outperforms several baselines and can be deployed on a real robot.

System overview. SLeRP begins by analyzing the input human video, tracking the object throughout the sequence and identifying the placement slot. Next, we re-identify the object and the slot in the robot's view by correlating the human-view and robot-view images. Using depth images, we reconstruct the observations in 3D and compute a single 6-DoF object transformation \(T\) in the robot's view, enabling the robot to transfer the object into the slot. If more than one slot is present, we detect all applicable slots and compute one 6-DoF object transformation per slot. Finally, these 6-DoF object transformations are passed to the downstream robot planning and control pipeline for real-robot pick-and-place execution.
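Once corresponding 3D points are available in the robot's view, a 6-DoF object transformation of the kind described above can be estimated with a standard rigid alignment. The sketch below is not the paper's implementation; it is a minimal illustration using the Kabsch algorithm, assuming we already have paired 3D points on the object and their target locations in the slot.

```python
import numpy as np

def rigid_transform_3d(src, dst):
    """Estimate the 6-DoF rigid transform mapping src onto dst (Kabsch).

    src, dst: (N, 3) arrays of corresponding 3D points, e.g. object points
    and their target locations in the placement slot.
    Returns a 4x4 homogeneous transformation matrix.
    """
    src_c = src.mean(axis=0)
    dst_c = dst.mean(axis=0)
    # Cross-covariance of the centered point sets.
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    # Guard against a reflection (det = -1) in the recovered rotation.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T
```

With multiple detected slots, this estimation would simply be repeated once per slot to yield one transformation for each.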
Given the input human video (bottom), we run a state-of-the-art hand-object detector and tracker to obtain the pick object mask, and we train a novel network, SlotNet, to identify the slot mask.
Given the object and slot masks detected in the human video, we first re-identify the corresponding object and slot in the robot view, and also find all similar empty slots. With the corresponding object and slot masks, we first compute 2D keypoint matches between local patches of the detected object and slot, and then lift the observations to 3D to compute the 6-DoF transforms.
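Lifting matched 2D keypoints to 3D can be done by back-projecting pixels through a pinhole camera model using the depth image. The following is a minimal sketch under that assumption; the function name and argument layout are illustrative, not taken from the paper.

```python
import numpy as np

def backproject_keypoints(uv, depth, K):
    """Lift 2D pixel keypoints to 3D points in the camera frame.

    uv:    (N, 2) array of (u, v) pixel coordinates.
    depth: (H, W) depth map in meters, aligned with the color image.
    K:     3x3 pinhole camera intrinsics matrix.
    Returns an (N, 3) array of 3D points.
    """
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    u, v = uv[:, 0], uv[:, 1]
    # Read the metric depth at each keypoint location.
    z = depth[v.astype(int), u.astype(int)]
    # Invert the pinhole projection: x = (u - cx) * z / fx, etc.
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)
```

Applying this to the matched keypoints in both views yields the 3D correspondences from which the 6-DoF transforms are computed.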
Qualitative comparison. We compare our method to the baselines and present side-by-side results on three examples. For each example, the first column shows the input human video at the top and the robot-view image at the bottom. The top row displays 2D object and slot re-identification results, while the bottom row shows 6-DoF relative pose predictions obtained by projecting the object point cloud onto the slots. Unlike the baselines, which can only predict a single slot, our approach can also identify multiple slots. These results demonstrate that our system outperforms the baselines, achieving accurate slot and transformation predictions.