Memory-Guided Semantic Learning Network for Temporal Sentence Grounding
- Daizong Liu
- Xiaoye Qu
- Xing Di
- Yu Cheng
- Zichuan Xu
- Pan Zhou
Thirty-Sixth AAAI Conference on Artificial Intelligence (AAAI 2022)
Temporal sentence grounding (TSG) is crucial and fundamental for video understanding. Although existing methods train well-designed deep networks with large amounts of data, we find that they can easily forget rarely appearing cases during training because of the imbalanced data distribution, which harms model generalization and leads to undesirable performance. To tackle this issue, we propose a memory-augmented network, called Memory-Guided Semantic Learning Network (MGSL-Net), that learns and memorizes the rarely appearing content in the TSG task. Specifically, MGSL-Net consists of three main parts: a cross-modal interaction module, a memory augmentation module, and a heterogeneous attention module. We first align the given video-query pair with a cross-modal graph convolutional network, and then use the memory module to record the cross-modal shared semantic features in the domain-specific persistent memory. During training, the memory slots are dynamically associated with both common and rare cases, alleviating the forgetting issue. In testing, the rare cases can thus be enhanced by retrieving the stored memories, resulting in better generalization. Finally, the heterogeneous attention module is used to integrate the enhanced multi-modal features in both the video and query domains. Experimental results on three benchmarks show the superiority of our method in both effectiveness and efficiency: it substantially improves accuracy not only on the entire dataset but also on the rare cases.
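The abstract does not specify how stored memories are retrieved, so the following is only an illustrative sketch of one common way to read from a persistent memory: each feature attends over the memory slots by cosine similarity, and the weighted sum of slots is concatenated with the original feature. The function name `read_memory`, the slot count, the feature size, and the concatenation step are all assumptions made for illustration, not the MGSL-Net formulation.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def read_memory(features, memory):
    """Generic attention-based memory read (hypothetical, not the paper's exact method).

    features: (T, d) array of cross-modal features, one row per video clip.
    memory:   (N, d) array of persistent memory slots.
    Returns the enhanced features (T, 2d) and the read weights (T, N).
    """
    # Cosine similarity between every feature and every memory slot.
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    m = memory / (np.linalg.norm(memory, axis=1, keepdims=True) + 1e-8)
    weights = softmax(f @ m.T, axis=1)

    # Retrieved content is a convex combination of the memory slots.
    retrieved = weights @ memory

    # Assumption: enhance each feature by concatenating what was retrieved.
    enhanced = np.concatenate([features, retrieved], axis=1)
    return enhanced, weights


rng = np.random.default_rng(0)
features = rng.normal(size=(5, 8))   # 5 clips, 8-dimensional features
memory = rng.normal(size=(4, 8))     # 4 memory slots
enhanced, weights = read_memory(features, memory)
```

In a sketch like this, a rare test case whose own feature is weak can still pick up relevant semantics from whichever slots it matches best, which is the intuition behind retrieving stored memories at test time.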