Speaker
Description
The recent success of large language models (LLMs) such as GPT and BERT has demonstrated the immense capabilities of transformer-based architectures on natural language processing (NLP) tasks, such as text generation, translation, and summarization, setting new benchmarks in artificial intelligence (AI) performance. Building on this momentum, the AI research community is increasingly focusing on extending the capabilities of LLMs to multimodal data, giving rise to multimodal foundation models. The use of generative models for music generation has also been gaining popularity. In this study, we present an application of multimodal foundation models to video background music generation. Current music generation models are predominantly controlled by a single input modality: text. Video is an alternative input modality with remarkably different requirements for the generation of accompanying background music. Although alternative methods for generating video background music exist, none achieve musical quality and diversity comparable to that of the text-based models. We adapt text-based models to accept video as an alternative input modality for controlling the audio generation process, and we evaluate our approach quantitatively, by analyzing exemplary results in terms of audio quality, and qualitatively, through a case study examining users’ perspectives on the video-audio correspondence of our results.