Shelving, Stacking, Hanging: Relational Pose
Diffusion for Multi-modal Rearrangement

CoRL 2023

Anthony Simeonov, Ankit Goyal*, Lucas Manuelli*, Lin Yen-Chen,
Alina Sarmiento, Alberto Rodriguez, Pulkit Agrawal**, Dieter Fox**
Massachusetts Institute of Technology, NVIDIA Research, Improbable AI Lab

*Equal contribution, **Equal advising
Work done in part during NVIDIA Research internship


We propose a system for rearranging objects in a scene to achieve a desired object-scene placing relationship, such as a book inserted in an open slot of a bookshelf. The pipeline generalizes to novel geometries, poses, and layouts of both scenes and objects, and is trained from demonstrations to operate directly on 3D point clouds. Our system overcomes challenges associated with the existence of many geometrically-similar rearrangement solutions for a given scene. By leveraging an iterative pose de-noising training procedure, we can fit multi-modal demonstration data and produce multi-modal outputs while remaining precise and accurate. We also show the advantages of conditioning on relevant local geometric features while ignoring irrelevant global structure that harms both generalization and precision. We demonstrate our approach on three distinct rearrangement tasks that require handling multi-modality and generalization over object shape and pose in both simulation and the real world.

Test-time Inference via Iterative Pose De-noising

Iterative pose de-noising for unseen simulated objects at test-time. Starting from diverse initial guess configurations of the objects relative to the scene, the inference process converges to a diverse set of final output rearrangement solutions.
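The inference loop described above can be illustrated with a minimal toy sketch. This is not the RPDiff implementation: `toy_predictor` is a hypothetical stand-in for the learned network that simply moves the object part-way toward the nearest of several assumed placement "slots", and all names and parameters here are chosen for illustration only. It shows how repeatedly applying small corrective transforms from diverse initial guesses can settle on different valid placements.

```python
import numpy as np

def toy_predictor(obj_pts, slots, step=0.5):
    """Hypothetical stand-in for the learned network: returns a 4x4
    transform that moves the object part-way toward the nearest slot."""
    centroid = obj_pts.mean(axis=0)
    nearest = slots[np.argmin(np.linalg.norm(slots - centroid, axis=1))]
    T = np.eye(4)
    T[:3, 3] = step * (nearest - centroid)
    return T

def apply_transform(T, pts):
    """Apply a 4x4 homogeneous transform to an (N, 3) array of points."""
    return pts @ T[:3, :3].T + T[:3, 3]

def iterative_denoise(obj_pts, slots, n_steps=20):
    """Repeatedly predict and apply a corrective transform to the object."""
    T_total = np.eye(4)
    for _ in range(n_steps):
        T_step = toy_predictor(obj_pts, slots)
        obj_pts = apply_transform(T_step, obj_pts)
        T_total = T_step @ T_total  # accumulate the overall object motion
    return obj_pts, T_total

rng = np.random.default_rng(0)
# Three equally valid placement locations (assumed for this toy scene).
slots = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
obj = rng.normal(scale=0.02, size=(64, 3))

# Diverse initial guesses: the same object shifted to random start positions.
init_offsets = rng.uniform(-0.5, 2.5, size=(8, 3)) * np.array([1.0, 0.2, 0.2])
finals = np.array([
    iterative_denoise(obj + off, slots)[0].mean(axis=0) for off in init_offsets
])
```

Each row of `finals` is the object centroid after de-noising from one initial guess. In this toy setup every run ends at one of the slots, and which slot depends on where the run started.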




Real-world Multi-modal Rearrangement via Pick-and-Place

Rearrangement in the real world using the Franka Panda arm. Each task features scene objects that offer multiple placement locations. RPDiff is used to produce a set of candidate placements and one of the predicted solutions is executed. Multiple executions in sequence show the ability to find multiple diverse solutions. Our neural network is trained in simulation and directly deployed in the real world (we do observe some performance gap due to sim2real distribution shift).




Training by perturbing object-scene point clouds and predicting corrective SE(3) transforms

We train on examples of point clouds showing properly configured object-scene pairs, obtained from procedurally generated rearrangement demonstrations on simulated objects. Training targets are generated by applying sequences of perturbation transforms to the object point cloud. The network is trained to take in the noised object-scene point cloud and predict an SE(3) transform that, when applied to the object, takes a step back toward the original configuration. We crop the scene point cloud to improve generalization and precision: cropping ignores faraway details that are irrelevant for prediction and lets the network re-use features that describe local scene geometry across instances and spatial regions.
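A minimal sketch of how one such training example could be assembled is given below. It is an illustration under stated assumptions rather than the released training code: the perturbation distribution, the number of perturbation steps, the axis-aligned crop centered on the noised object, and all function names (`random_se3`, `crop_scene`, `make_training_example`) are hypothetical choices made for this example.

```python
import numpy as np

def random_se3(rng, max_angle=0.5, max_trans=0.1):
    """Sample a small random rigid transform as a 4x4 matrix (assumed noise model)."""
    axis = rng.normal(size=3)
    axis /= np.linalg.norm(axis)
    angle = rng.uniform(-max_angle, max_angle)
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    # Rodrigues' formula for the rotation about `axis` by `angle`.
    R = np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = rng.uniform(-max_trans, max_trans, size=3)
    return T

def apply_transform(T, pts):
    """Apply a 4x4 homogeneous transform to an (N, 3) array of points."""
    return pts @ T[:3, :3].T + T[:3, 3]

def crop_scene(scene_pts, center, half_extent):
    """Keep only scene points inside an axis-aligned box around `center`."""
    mask = np.all(np.abs(scene_pts - center) <= half_extent, axis=1)
    return scene_pts[mask]

def make_training_example(obj_goal, scene, rng, n_steps=3, half_extent=0.3):
    """Perturb the goal object with a sequence of transforms; the target
    undoes the last perturbation, i.e. one step back toward the goal."""
    steps = [random_se3(rng) for _ in range(n_steps)]
    obj_prev = obj_goal
    for T in steps[:-1]:
        obj_prev = apply_transform(T, obj_prev)
    obj_noised = apply_transform(steps[-1], obj_prev)
    T_target = np.linalg.inv(steps[-1])
    # Crop the scene around the noised object to drop faraway geometry.
    scene_crop = crop_scene(scene, obj_noised.mean(axis=0), half_extent)
    return obj_noised, scene_crop, T_target, obj_prev

rng = np.random.default_rng(0)
obj_goal = rng.normal(scale=0.05, size=(128, 3))   # object in its placed pose
scene = rng.uniform(-1.0, 1.0, size=(2000, 3))     # placeholder scene points
obj_noised, scene_crop, T_target, obj_prev = make_training_example(obj_goal, scene, rng)
```

Applying `T_target` to `obj_noised` recovers `obj_prev`, the object one perturbation step closer to the demonstrated configuration, which is the kind of supervision signal the paragraph above describes.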

Our Related Projects

Check out our related projects on the topic of object rearrangement and local scene conditioning

We use neural descriptor fields (NDFs) to represent pairs of objects, label task-relevant local coordinate frames, and detect corresponding coordinate frames on unseen instances in arbitrary initial poses. NDFs enable this to work for novel object instances from just a handful of demonstrations. This enables relational rearrangement with pairs of unseen objects by aligning the detected frames to each other.
An end-to-end method for object rearrangement of unknown objects given an RGBD image of the original and final scenes. IFOR first learns an optical flow model to estimate relative object transformations from synthetic data. This flow is then used in an iterative minimization algorithm to achieve accurate positioning of previously unseen objects in cluttered scenes, and in the real world (while training only on synthetic data).
Local Neural Descriptor Fields (L-NDF) utilizes neural descriptors defined on the local geometry of the object to transfer manipulation demonstrations to novel objects at test time. By encoding geometry that is local to a spatial region rather than associated with an entire object, L-NDF performs better with clutter and occlusions, and enables interacting with familiar object parts that are attached to unfamiliar objects.

External Related Projects

Check out other projects related to diffusion models, iterative prediction, and rearrangement

Combines a diffusion model and an object-centric transformer to construct structures given partial-view point clouds and high-level language goals, such as "set the table" and "make a line". Using one multi-task model, this allows building physically-valid structures without step-by-step instructions.
A data-driven transformer-based iterative method for learning regular rearrangement of objects in messy rooms. Partly inspired by diffusion models, LEGO-Net starts with an initial messy state and iteratively denoises the position and orientation of objects to a regular state while reducing distance traveled.
A method for learning data-driven SE(3) cost functions as diffusion models, which allows integration with other costs into a single differentiable objective function. Specifically focused on learning SE(3) diffusion models for 6-DoF grasping, this framework enables joint grasp and motion optimization without needing to decouple grasp selection from trajectory generation.
A formulation for image restoration that avoids the "regression to the mean" effect by gradually improving image quality, similar to generative denoising diffusion models. Rather than predicting in a single step, which can result in an aggregation of all possible restoration explanations, InDI instead iteratively improves the image in small steps, resulting in better perceptual quality.



@article{simeonov2023rpdiff,
  author  = {Simeonov, Anthony and Goyal, Ankit and Manuelli, Lucas and Yen-Chen, Lin and Sarmiento, Alina and Rodriguez, Alberto and Agrawal, Pulkit and Fox, Dieter},
  title   = {Shelving, Stacking, Hanging: Relational Pose Diffusion for Multi-modal Rearrangement},
  journal = {Conference on Robot Learning},
  year    = {2023}
}


We would like to thank NVIDIA Seattle Robotics Lab members and the MIT Improbable AI Lab for their valuable feedback and support in developing this project. In particular, we would like to acknowledge Idan Shenfeld, Anurag Ajay, and Antonia Bronars for helpful suggestions on improving the clarity of the draft. This work was partly supported by Sony Research Awards and Amazon Research Awards. Anthony Simeonov is supported in part by the NSF Graduate Research Fellowship.

Send feedback and questions to Anthony Simeonov

Website template recycled from SIREN