High Fidelity Video Background Music Generation with Transformers

19 Sept 2024, 12:34
13m
Hannah-Vogt-Saal

Session 6. Multi-Modal Foundation Models

Speaker

Yongli Mou (RWTH Aachen University)

Description

The recent success of large language models (LLMs) such as GPT and BERT has demonstrated the immense capabilities of transformer-based architectures on natural language processing (NLP) tasks, such as text generation, translation, and summarization, setting new benchmarks in Artificial Intelligence (AI) performance. Building on this momentum, the AI research community is increasingly extending LLMs to multi-modal data, giving rise to multi-modal foundation models. Generative models for music generation have also been gaining popularity. In this study, we present an application of multi-modal foundation models to video background music generation. Current music generation models are predominantly controlled by a single input modality: text. Video is another such modality, with markedly different requirements for generating the background music that accompanies it. Although alternative methods for generating video background music exist, none achieve music quality and diversity comparable to the text-based models. We adapt text-based models to accept video as an alternative input modality for controlling the audio generation process, and we evaluate our approach quantitatively and qualitatively: we analyze exemplary results in terms of audio quality and conduct a case study on the users' perspective of the video-audio correspondence of our results.
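
To illustrate the kind of adaptation described above, the following is a minimal, hypothetical PyTorch sketch of how a text-conditioned, MusicGen-style autoregressive music transformer could be conditioned on video instead: per-frame embeddings from a pretrained image encoder are projected into the decoder's conditioning space and used as the cross-attention memory in place of text embeddings. All names and dimensions (VideoConditioner, MusicDecoder, d_model, vocabulary size) are illustrative assumptions and are not taken from the talk.

# Hypothetical sketch (not the authors' code): video-conditioned music generation.
import torch
import torch.nn as nn


class VideoConditioner(nn.Module):
    """Projects per-frame visual embeddings (e.g., from a frozen image encoder)
    into the conditioning space expected by the music decoder."""

    def __init__(self, frame_dim: int = 512, d_model: int = 1024):
        super().__init__()
        self.proj = nn.Linear(frame_dim, d_model)
        self.temporal = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2,
        )

    def forward(self, frame_emb: torch.Tensor) -> torch.Tensor:
        # frame_emb: (batch, n_frames, frame_dim) -> (batch, n_frames, d_model)
        return self.temporal(self.proj(frame_emb))


class MusicDecoder(nn.Module):
    """Autoregressive transformer over audio codec tokens that cross-attends
    to the conditioning sequence (originally text, here video)."""

    def __init__(self, vocab: int = 2048, d_model: int = 1024):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=4,
        )
        self.head = nn.Linear(d_model, vocab)

    def forward(self, tokens: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        x = self.embed(tokens)  # (batch, seq, d_model)
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        return self.head(self.decoder(x, cond, tgt_mask=mask))


# Usage: video frame embeddings replace text embeddings as the cross-attention
# memory; the autoregressive decoder itself is left unchanged.
cond = VideoConditioner()(torch.randn(1, 64, 512))          # 64 video frames
logits = MusicDecoder()(torch.randint(0, 2048, (1, 128)), cond)
print(logits.shape)  # (1, 128, 2048): next-token distribution per position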

Primary author

Yongli Mou (RWTH Aachen University)

Co-authors

Mr Niklas Schulte (RWTH Aachen University)
Prof. Stefan Decker (RWTH Aachen University)

Presentation materials

There are no materials yet.