Text-driven object affordance for guiding grasp-type recognition in multimodal robot teaching

Machine Vision and Applications, Vol. 34(4)

In robot teaching, the grasping strategies taught to robots by users are critical information, because these strategies contain the implicit knowledge necessary to successfully perform a series of manipulations; however, little practical knowledge exists on how to utilize linguistic information to support grasp-type recognition in multimodal teaching. This study focused on the effect of text-driven object affordance—a prior distribution of grasp types for each object—on image-based grasp-type recognition. To this end, we created datasets of first-person grasping-hand images labeled with grasp types and object names and tested whether object affordance enhanced the performance of image-based recognition. We evaluated two scenarios, with real and illusory objects to be grasped, considering a mixed-reality teaching condition in which the lack of visual object information can make image-based recognition challenging. The results show that object affordance guided image-based recognition in both scenarios, increasing recognition accuracy by (1) excluding unlikely grasp types from the candidates and (2) enhancing the likely grasp types. Additionally, the "enhancing effect" was more pronounced when the grasp-type bias for each object in the test dataset was greater. These results indicate the effectiveness of object affordance for guiding grasp-type recognition in multimodal robot teaching applications. The contributions of this study are (1) demonstrating the effectiveness of object affordance in guiding grasp-type recognition both with and without real objects in the images, (2) identifying the conditions under which the merits of object affordance are pronounced, and (3) providing a dataset of first-person grasping images labeled with the possible grasp types for each object.
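To illustrate the idea of fusing a text-driven affordance prior with image-based recognition, the minimal sketch below combines an object-specific prior distribution over grasp types with the probabilities produced by an image classifier and renormalizes the result. The grasp-type vocabulary, prior values, and function names are illustrative assumptions, not the pipeline used in the paper.

```python
import numpy as np

# Hypothetical grasp-type vocabulary and text-driven affordance prior:
# for each object name, an assumed probability distribution over grasp types.
GRASP_TYPES = ["power", "precision", "lateral", "tripod"]
AFFORDANCE_PRIOR = {
    "mug":   np.array([0.60, 0.10, 0.05, 0.25]),
    "pen":   np.array([0.05, 0.55, 0.10, 0.30]),
    "plate": np.array([0.10, 0.15, 0.70, 0.05]),
}

def fuse_with_affordance(image_probs: np.ndarray, object_name: str,
                         epsilon: float = 1e-8) -> np.ndarray:
    """Combine image-based grasp-type probabilities with the text-driven
    prior for the named object, then renormalize (Bayesian-style fusion).

    Grasp types with a near-zero prior are effectively excluded, and
    grasp types favored by the prior are enhanced.
    """
    prior = AFFORDANCE_PRIOR.get(object_name)
    if prior is None:
        # Unknown object: fall back to the image-only prediction.
        return image_probs
    fused = image_probs * prior + epsilon
    return fused / fused.sum()

# Example: an ambiguous image-based prediction is disambiguated by the prior.
image_probs = np.array([0.30, 0.35, 0.20, 0.15])   # e.g., softmax output of a CNN
fused = fuse_with_affordance(image_probs, "mug")
print(GRASP_TYPES[int(np.argmax(fused))])          # -> "power"
```

In this toy example the image-only prediction is nearly uniform, but multiplying by the object's prior suppresses grasp types that are implausible for a mug and promotes the one the prior favors, mirroring the "excluding" and "enhancing" effects described above.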