SceneFun3D: Fine-Grained Functionality and Affordance Understanding in 3D Scenes
- Alexandros Delitzas,
- Ayça Takmaz,
- Federico Tombari,
- Robert Sumner,
- Marc Pollefeys,
- Francis Engelmann
CVPR 2024
Existing 3D scene understanding methods are heavily focused on 3D semantic and instance segmentation. However, identifying objects and their parts is only an intermediate step towards a more fine-grained goal: effectively interacting with the functional interactive elements (e.g., handles, knobs, buttons) in the scene to accomplish diverse tasks. To this end, we introduce SceneFun3D, a large-scale dataset with more than 14.8k highly accurate interaction annotations for 710 high-resolution real-world 3D indoor scenes. We accompany the annotations with motion parameter information describing how to interact with these elements, and with a diverse set of natural language descriptions of tasks that involve manipulating them in the scene context. To showcase the value of our dataset, we introduce three novel tasks, namely functionality segmentation, task-driven affordance grounding, and 3D motion estimation, and adapt existing state-of-the-art methods to tackle them. Our experiments show that solving these tasks in real 3D scenes remains challenging despite recent progress in closed-set and open-set 3D scene understanding methods.
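To make the annotation structure concrete, the sketch below shows one plausible way a single interaction record could be represented in code, combining a functional-element mask, its motion parameters, and grounded task descriptions. All class and field names (`InteractionAnnotation`, `MotionParams`, etc.) and the example values are hypothetical illustrations, not the schema of the released dataset or its toolkit.

```python
from dataclasses import dataclass, field

# Hypothetical record layout for one SceneFun3D-style interaction annotation.
# Field names are illustrative only, not the dataset's actual schema.

@dataclass
class MotionParams:
    """Describes how a functional element moves when operated."""
    motion_type: str                     # e.g., "rotational" (door handle) or "translational" (drawer)
    axis: tuple[float, float, float]     # motion axis direction in scene coordinates
    origin: tuple[float, float, float]   # a point the axis passes through

@dataclass
class InteractionAnnotation:
    """One functional interactive element in a 3D indoor scene."""
    scene_id: str              # which of the 710 scenes this element belongs to
    element_label: str         # e.g., "handle", "knob", "button"
    point_indices: list[int]   # points of the scene cloud covered by the element
    motion: MotionParams       # how to actuate the element
    task_descriptions: list[str] = field(default_factory=list)
    # Natural-language tasks grounded on this element,
    # e.g., "open the top drawer of the nightstand".

# Example instance (all values invented for illustration):
example = InteractionAnnotation(
    scene_id="scene_0042",
    element_label="handle",
    point_indices=[10321, 10322, 10330],
    motion=MotionParams(
        motion_type="translational",
        axis=(0.0, 1.0, 0.0),
        origin=(1.2, 0.4, 0.9),
    ),
    task_descriptions=["open the top drawer of the nightstand"],
)
print(example.element_label, example.motion.motion_type)
```

Under this framing, the three proposed tasks map naturally onto the record: functionality segmentation predicts `point_indices` and `element_label`, task-driven affordance grounding locates the element given one of the `task_descriptions`, and 3D motion estimation recovers the `MotionParams`.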