We present a framework for specifying tasks involving spatial relations between objects using only ~5-10 demonstrations, and for then executing such tasks given point cloud observations of a novel pair of objects in arbitrary initial poses. Our method uses Neural Descriptor Fields (NDFs) to achieve this by assigning a consistent local coordinate frame to the task-relevant parts of objects in the demonstrations and localizing the corresponding coordinate frame on unseen object instances. We propose an optimization method that uses multiple NDFs and a single annotated 3D keypoint in one of the demonstrations to directly assign a set of consistent coordinate frames to the task-relevant object parts. We also propose an energy-based learning scheme to model the joint configuration of the objects that satisfies a desired relational task. We validate our pipeline on three multi-object rearrangement tasks in simulation and on a real robot. Results demonstrate that our method can infer relative transformations that satisfy the desired relation between novel objects in unseen initial poses using just a few demonstrations.
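To make the frame-localization idea concrete, here is a minimal sketch, not the actual method: a real NDF is a trained neural network conditioned on an object point cloud and the optimization is over a full SE(3) pose, whereas this toy uses a fixed random linear map as a stand-in descriptor and optimizes only a translation. The names `ndf_descriptor` and `localize_frame` are illustrative inventions. The core loop is the same in spirit: descend on the squared error between descriptors of transformed query points and the descriptors recorded at the demonstration's annotated keypoint.

```python
import numpy as np

# Hypothetical stand-in for a trained Neural Descriptor Field: maps 3D query
# points to descriptor vectors. A real NDF is a neural network conditioned on
# a point cloud; a fixed random projection suffices to illustrate the loop.
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 3))

def ndf_descriptor(points):
    """Toy descriptor: linear features of the 3D points (shape: N x 8)."""
    return points @ W.T

def localize_frame(target_desc, query_pts, n_iters=2000, lr=1e-2):
    """Find a translation t so that descriptors of (query_pts + t) match
    target_desc, by gradient descent on the squared descriptor error.
    (Translation-only here; the full method optimizes an SE(3) pose.)"""
    t = np.zeros(3)
    for _ in range(n_iters):
        err = ndf_descriptor(query_pts + t) - target_desc   # N x 8 residual
        grad = 2.0 * (err @ W).sum(axis=0)                  # d||err||^2 / dt
        t -= lr * grad / len(query_pts)
    return t

# "Demonstration": descriptors recorded at query points placed around an
# annotated 3D keypoint, offset from the origin by demo_offset.
query = rng.standard_normal((16, 3)) * 0.05
demo_offset = np.array([0.3, -0.1, 0.2])
target = ndf_descriptor(query + demo_offset)

# At test time, recover the keypoint's frame (here, its translation).
t_hat = localize_frame(target, query)
```

Because the toy descriptor is linear, the objective is quadratic and gradient descent recovers `demo_offset` exactly; with a learned NDF the same loop runs via automatic differentiation through the network.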
This work is supported by the NSF Institute for AI and Fundamental Interactions, DARPA Machine Common Sense, NSF grant 2214177, AFOSR grant FA9550-22-1-0249, ONR grant N00014-22-1-2740, the MIT-IBM Watson AI Lab, the MIT Quest for Intelligence, and Sony. Anthony Simeonov and Yilun Du are supported in part by NSF Graduate Research Fellowships. We thank the members of the Improbable AI Lab and the Learning and Intelligent Systems Lab for helpful discussions and feedback on the paper. This webpage template was recycled from here.