Synchronized Audio-Visual Generation with a Joint Generative Diffusion Model and Contrastive Loss

Deep learning has driven significant advances in multimedia generation and synthesis. However, generating coherent, temporally aligned audio and video remains challenging because of the complex relationships between visual and auditory information. In this work, we propose a joint generative diffusion model that addresses this challenge by generating video and audio simultaneously, which enables better synchronization and temporal alignment. Our approach is based on guided sampling, which allows more flexible conditional generation and improves the overall quality of the generated content. We further introduce a joint contrastive loss, inspired by prior work that successfully applied contrastive losses to conditional diffusion models. With this loss, our model achieves better quality and temporal alignment. Extensive evaluations with both subjective and objective metrics demonstrate that the proposed model generates high-quality, temporally aligned audio and video.
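
The abstract does not spell out the form of the joint contrastive loss. As a rough illustration of the general idea only, the sketch below uses a symmetric InfoNCE-style objective over paired video and audio embeddings, where matching pairs in a batch are treated as positives and all other pairings as negatives. The function name `joint_contrastive_loss`, the `temperature` value, and the embedding shapes are assumptions made for this example and are not taken from the paper.

```python
import numpy as np


def log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def joint_contrastive_loss(video_emb, audio_emb, temperature=0.1):
    """Symmetric InfoNCE-style loss (illustrative assumption, not the paper's exact loss).

    video_emb, audio_emb: arrays of shape (batch, dim); row i of each is a matching pair.
    """
    # L2-normalize so that the dot product is a cosine similarity.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)

    # Pairwise similarities; the diagonal holds the positive (matching) pairs.
    logits = v @ a.T / temperature
    idx = np.arange(len(v))

    # Video-to-audio: each video should pick out its own audio among the batch.
    loss_v2a = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Audio-to-video: each audio should pick out its own video among the batch.
    loss_a2v = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_v2a + loss_a2v)
```

Under these assumptions, the loss is low when each video embedding is closest to its own audio embedding and rises when the pairing is broken, for example by shifting the audio rows by one position within the batch.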