(简体中文/EN)
This project uses Wav2CLIP and VQGAN-CLIP to generate AI music videos and images from any song. We named our first-year project group MetaMusic, reflecting the diversity of transformations and expressions of meaning that music can carry.
The core code is based on VQGAN-CLIP, and audio is embedded into the CLIP space with Wav2CLIP. After reviewing the literature, we found that our project is similar in spirit to Music2Video, on top of which we have made several original modifications and optimizations.
For technical details of this mechanism, refer to the architecture diagrams:
*(architecture / pipeline diagrams)*
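As a rough, hedged illustration of the idea (a sketch, not the project's actual code): Wav2CLIP maps audio into the same embedding space as CLIP, so an audio embedding can replace the text embedding that VQGAN-CLIP normally optimizes toward. The sketch assumes OpenAI's CLIP package and librosa are installed; the file names and the 16 kHz mono loading are placeholders.

```python
# Sketch: compare a Wav2CLIP audio embedding with a CLIP image embedding.
# A VQGAN-CLIP-style loop would update the VQGAN latents to push this
# cosine similarity up, so the generated image comes to "match" the music.
import librosa
import torch
import wav2clip
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the audio (placeholder file name) as 16 kHz mono.
audio, sr = librosa.load("song.mp3", sr=16000, mono=True)

# Embed the audio into the CLIP space with Wav2CLIP.
w2c = wav2clip.get_model()
audio_emb = torch.from_numpy(wav2clip.embed_audio(audio, w2c)).float().to(device)

# Embed a candidate image with CLIP (placeholder file name).
clip_model, preprocess = clip.load("ViT-B/32", device=device)
image = preprocess(Image.open("frame.png")).unsqueeze(0).to(device)
with torch.no_grad():
    image_emb = clip_model.encode_image(image).float()

# The guidance signal: cosine similarity between audio and image embeddings.
print(torch.cosine_similarity(audio_emb.view(1, -1), image_emb).item())
```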
- We tried integrating the AudioCLIP model, but it did not work well for music because its training data focuses on real-world sound recognition. [×]
- We tried integrating other VQGAN models, but the realism of the ImageNet dataset is a poor fit for generating emotionally expressive images for songs, so the results were poor. [×]
- We attempted to convert the whole project to a CLIP-guided Diffusion model, but video generation became too slow, significantly increasing generation time; the video module that generates frames with large models has been set aside for now. [√]
- We made compatibility improvements to the code and use mixed-precision computation to reduce GPU memory usage for some models (see the sketch after this list). [√]
- We wrapped the two systems into APIs for easy use. [√]
- We implemented a visual interface using gradio. [ ]
- We will try to build our own dataset to train the Wav2CLIP model (or possibly VQGAN). [ ]
- We will continue to explore other image-generation models, including but not limited to Diffusion architectures, and integrate them into the MetaMusic impression-image generation module.
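The mixed-precision point above can be illustrated with a minimal sketch using PyTorch's `torch.cuda.amp`; `model`, `latents`, and `compute_loss` are placeholders, not MetaMusic's real names.

```python
# Hedged sketch of mixed-precision optimization with torch.cuda.amp.
import torch

scaler = torch.cuda.amp.GradScaler()

def training_step(model, latents, compute_loss, optimizer):
    optimizer.zero_grad()
    # Run the forward pass in float16 where safe to reduce GPU memory use.
    with torch.cuda.amp.autocast():
        image = model(latents)
        loss = compute_loss(image)
    # Scale the loss so small fp16 gradients do not underflow.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```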
The currently supported environments for this model are:
- Windows (verified) / Linux (unverified) systems
- Nvidia GPU with CUDA support / CPU
Because the model code is complex and may have compatibility issues, setup may not go smoothly. Please proceed with caution, and good luck!
This guide assumes the usage of Anaconda for managing virtual Python environments.
- Create a new virtual Python environment for VQGAN-CLIP:

```bash
conda create --name metamusic python=3.9
conda activate metamusic
```
- Next, download the CUDA and cuDNN versions supported by your Nvidia GPU from CUDA Toolkit - Free Tools and Training | NVIDIA Developer (generally, for 30-series GPUs and newer, CUDA 12 is sufficient).
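If you want to check what your driver and toolkit support before installing, the standard NVIDIA tools can be used (generic checks, not project-specific):

```bash
# Shows the driver version and the highest CUDA version the driver supports.
nvidia-smi

# Shows the installed CUDA toolkit version (available after installation).
nvcc --version
```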
- Then, install PyTorch in the new environment. Visit Start Locally | PyTorch, choose the build that matches your system, and copy and run the command it provides (for example, we used the cu118 build):

```bash
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
```
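To quickly confirm that PyTorch was installed with CUDA support (a generic check):

```bash
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```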
- Install the necessary system support libraries, including but not limited to ffmpeg and MSVC, and configure the corresponding system environment variables.
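To confirm that ffmpeg is installed and on your PATH (a generic check):

```bash
ffmpeg -version
```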
- Install the other required Python packages:

```bash
pip install ftfy regex tqdm omegaconf pytorch-lightning IPython kornia imageio imageio-ffmpeg einops torch_optimizer wav2clip
```
- Additionally, clone the required repository:

```bash
git clone 'https://github.com/SMARK2022/MetaMusic.git'
```
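After cloning, switch into the project directory (the name follows from the repository URL above):

```bash
cd MetaMusic
```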
- Due to version compatibility issues, it is recommended to install from the `requirements.txt` file, which pins package versions. In addition, install `taming-transformers` separately via pip (see the example below).
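A hedged example of what this could look like (the exact PyPI package name is an assumption; some forks publish it as `taming-transformers-rom1504`):

```bash
# Install the version-pinned dependencies shipped with the repository.
pip install -r requirements.txt

# Install taming-transformers separately via pip.
pip install taming-transformers
```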
- You will also need at least one pretrained VQGAN model (after testing, VQGAN ImageNet 16384 is recommended):
```bash
mkdir checkpoints

# Note: capital -O writes the download to the given file (lowercase -o would only redirect wget's log).
wget -O checkpoints/vqgan_imagenet_f16_16384.yaml 'https://heibox.uni-heidelberg.de/d/a7530b09fed84f80a887/files/?p=%2Fconfigs%2Fmodel.yaml&dl=1'
wget -O checkpoints/vqgan_imagenet_f16_16384.ckpt 'https://heibox.uni-heidelberg.de/d/a7530b09fed84f80a887/files/?p=%2Fckpts%2Flast.ckpt&dl=1'
```
For more information about VQGAN pretrained models, including download links, refer to https://github.com/CompVis/taming-transformers#overview-of-pretrained-models.
By default, the `.yaml` and `.ckpt` files are expected to be placed in the `checkpoints` directory. For more information about the dataset and models, refer to https://github.com/CompVis/taming-transformers.
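With the commands above, the checkpoint files should be laid out as follows:

```
checkpoints/
├── vqgan_imagenet_f16_16384.yaml
└── vqgan_imagenet_f16_16384.ckpt
```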
- To generate music impression images/videos, you can directly run `gradiogui.py`, the visual interface script, and select your music on the web page (see the example below).
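For example, from the repository root:

```bash
python gradiogui.py
```

Gradio then prints a local URL (typically http://127.0.0.1:7860) that you can open in a browser.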
- If you need fine-grained parameter adjustments, use either of the two APIs, `api_picture` or `api_video`, as needed.
Example:

```python
import api_picture

# Placeholder file name; pass the path to your own audio file.
api_picture.generate(filemusic="your_song.mp3", ...)
```
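`api_video` is presumably used in the same way; the exact signature is not shown here, so treat this as an assumption and check the module's own parameters:

```python
import api_video

# Assumed to mirror api_picture.generate; verify the actual parameter names in api_video.
api_video.generate(filemusic="your_song.mp3", ...)
```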
The model and process for generating music impression images are still being optimized, so the results may vary.
Additionally, the video generation process may take some time, so please be patient (approximately 2 hours).
@misc{unpublished2021clip,
title = {CLIP: Connecting Text and Images},
author = {Alec Radford and Ilya Sutskever and Jong Wook Kim and Gretchen Krueger and Sandhini Agarwal},
year = {2021}
}
@misc{esser2020taming,
title={Taming Transformers for High-Resolution Image Synthesis},
author={Patrick Esser and Robin Rombach and Björn Ommer},
year={2020},
eprint={2012.09841},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
@article{wu2021wav2clip,
title={Wav2CLIP: Learning Robust Audio Representations From CLIP},
author={Wu, Ho-Hsiang and Seetharaman, Prem and Kumar, Kundan and Bello, Juan Pablo},
journal={arXiv preprint arXiv:2110.11499},
year={2021}
}