The majority of modern robot learning methods focus on learning a set of pre-defined tasks, with limited or no generalization to new tasks. Extending the robot's skill set to novel tasks requires gathering extensive training data for each additional task. In this paper, we address the problem of teaching a robot new repetitive tasks (e.g., packing) from a human demonstration video. Doing so requires understanding the human video and identifying which object is being manipulated (the pick object) and where it is being placed (the placement slot). In addition, it requires re-identifying the pick object and the placement slots at inference time, along with their relative poses, to enable robot execution of the task.
To tackle this, we propose SLeRP, a modular system that leverages several advanced visual foundation models and a novel slot-level placement detector, SlotNet, eliminating the need for expensive video demonstrations for training. We evaluate our system on a new benchmark of real-world videos. Results show that SLeRP outperforms several baselines and can be deployed on a real robot.

System overview. SLeRP begins by analyzing the input human video, tracking the object throughout the sequence and identifying the placement slot. Next, we re-identify the object and the slot in the robot's view by correlating the human-view and robot-view images. Using depth images, we reconstruct the observations in 3D and compute a single 6-DoF object transformation \(T\) in the robot's view, enabling the robot to transfer the object into the slot. If more than one slot is present, we detect all applicable slots and compute one 6-DoF object transformation per slot. Finally, these 6-DoF object transformations are passed to the downstream robot planning and control pipeline for real-robot pick-and-place execution.
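Once corresponding 3D points are available in the robot's view, a 6-DoF object transformation of the kind described above can be estimated with a standard rigid alignment. The sketch below is not the paper's implementation; it is a minimal illustration using the Kabsch algorithm, assuming we already have paired 3D points on the object and their target locations in the slot.

```python
import numpy as np

def rigid_transform_3d(src, dst):
    """Estimate the 6-DoF rigid transform mapping src onto dst (Kabsch).

    src, dst: (N, 3) arrays of corresponding 3D points, e.g. object points
    and their target locations in the placement slot.
    Returns a 4x4 homogeneous transformation matrix.
    """
    src_c = src.mean(axis=0)
    dst_c = dst.mean(axis=0)
    # Cross-covariance of the centered point sets.
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    # Guard against a reflection (det = -1) in the recovered rotation.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T
```

With multiple detected slots, this estimation would simply be repeated once per slot to yield one transformation for each.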
Given the input human video (bottom), we run a state-of-the-art hand-object detector and tracker to obtain the pick object mask, and we train a novel network, SlotNet, to identify the slot mask.
Given the object and slot masks detected in the human video, we first re-identify the corresponding object and slot in the robot view, and also find all similar empty slots. With the corresponding object and slot masks, we first compute 2D keypoint matches between local patches of the detected object and slot, and then lift the observations to 3D to compute the 6-DoF transforms.
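Lifting matched 2D keypoints to 3D can be done by back-projecting pixels through a pinhole camera model using the depth image. The following is a minimal sketch under that assumption; the function name and argument layout are illustrative, not taken from the paper.

```python
import numpy as np

def backproject_keypoints(uv, depth, K):
    """Lift 2D pixel keypoints to 3D points in the camera frame.

    uv:    (N, 2) array of (u, v) pixel coordinates.
    depth: (H, W) depth map in meters, aligned with the color image.
    K:     3x3 pinhole camera intrinsics matrix.
    Returns an (N, 3) array of 3D points.
    """
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    u, v = uv[:, 0], uv[:, 1]
    # Read the metric depth at each keypoint location.
    z = depth[v.astype(int), u.astype(int)]
    # Invert the pinhole projection: x = (u - cx) * z / fx, etc.
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)
```

Applying this to the matched keypoints in both views yields the 3D correspondences from which the 6-DoF transforms are computed.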
Qualitative comparison. We compare our method to the baselines and present side-by-side results on three examples. For each example, the first column shows the input human video at the top and the robot-view image at the bottom. The top row displays 2D object and slot re-identification results, while the bottom row shows 6-DoF relative pose predictions obtained by projecting the object point cloud onto the slots. Unlike the baselines, which can only predict a single slot, our approach can also identify multiple slots. These results demonstrate that our system outperforms the baselines, achieving accurate slot and transformation predictions.