A Multimodal Learning-from-Observation Towards All-at-once Robot Teaching using Task Cohesion

2022 IEEE/SICE International Symposium on System Integration

Published by IEEE

Multimodal Learning-from-Observation (LfO) is a promising approach to robot teaching that extracts what-to-do from verbal instructions and how-to-do from human demonstrations, enabling the teaching of sequential operations. While previous studies have focused on step-by-step instruction, all-at-once teaching lets users convey the desired behavior more naturally. However, all-at-once teaching must bridge the gap between verbal instruction and robot execution by identifying which instruction corresponds to which section of the demonstration. To this end, we introduce the notion of task cohesion, which connects verbal instructions to robot execution through the concept of physical/semantic state transitions. We address both the grounding problem and the over-/under-segmentation of language and demonstration with a recursive dynamic programming formulation that segments the demonstration and grounds the segments to the language by minimizing an alignment cost. The what-to-do is then obtained from the language, and the how-to-do is obtained by extracting the parameters required for robot execution, based on the task cohesion, from the demonstration segments to which the language is grounded. The contributions of this study are: (1) introducing the notion of task cohesion, (2) proposing a recursive dynamic programming approach to align verbal instructions with human demonstrations, and (3) demonstrating the effectiveness of multimodal all-at-once teaching by integrating them.
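To illustrate the alignment idea in the abstract, the following is a minimal sketch (not the authors' implementation) of grounding N verbal instructions to a demonstration of T frames by dynamic programming: each instruction is assigned one contiguous segment, and the segmentation that minimizes the total grounding cost is recovered. The function name `grounding_cost` is a hypothetical placeholder for a learned language/vision matching score.

```python
import numpy as np

def align(num_instructions, num_frames, grounding_cost):
    """Return segment boundaries [(start, end), ...] that minimize the total cost.

    grounding_cost(i, s, e) -> cost of grounding instruction i to frames [s, e).
    """
    INF = float("inf")
    # dp[i][t] = best cost of grounding the first i instructions to frames [0, t)
    dp = np.full((num_instructions + 1, num_frames + 1), INF)
    back = np.zeros((num_instructions + 1, num_frames + 1), dtype=int)
    dp[0][0] = 0.0
    for i in range(1, num_instructions + 1):
        for t in range(i, num_frames + 1):
            for s in range(i - 1, t):  # candidate split point; segment is [s, t)
                c = dp[i - 1][s] + grounding_cost(i - 1, s, t)
                if c < dp[i][t]:
                    dp[i][t] = c
                    back[i][t] = s
    # Backtrack the segment boundaries from the final state
    segments, t = [], num_frames
    for i in range(num_instructions, 0, -1):
        s = back[i][t]
        segments.append((s, t))
        t = s
    return list(reversed(segments))
```

In use, `grounding_cost` could be, for example, a negative similarity between an instruction's predicted task and per-frame observations of the demonstration; because merging or splitting segments changes the total cost, the recursion naturally discourages over- and under-segmentation, which is the role the paper assigns to its recursive dynamic programming step.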