An ODE-based generative neural vocoder using Rectified Flow
Recently, ODE-based generative models have become a hot topic in machine learning and image generation, achieving remarkable performance. However, due to differences in the data distributions of images and waveforms, it is not clear how well these models perform on speech tasks. In this project, I implement an ODE-based generative neural vocoder called WaveODE, using Rectified Flow [4] as the backbone, and hope to contribute to the generalization of ODE-based generative models to speech tasks.
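The core of Rectified Flow is a simple regression: sample a noise endpoint, interpolate linearly toward the data, and train the network to predict the constant velocity along that line. A minimal, framework-agnostic sketch (names are illustrative; the real WaveODE network would also take a mel-spectrogram condition):

```python
import numpy as np

def rectified_flow_loss(velocity_fn, x1, rng):
    """One step of the 1-Rectified Flow objective (illustrative sketch).

    `velocity_fn` is a hypothetical stand-in for the vocoder network;
    `x1` is a batch of target waveforms.
    """
    x0 = rng.standard_normal(x1.shape)                     # noise endpoint x0 ~ N(0, I)
    t = rng.random((x1.shape[0],) + (1,) * (x1.ndim - 1))  # per-sample time in [0, 1]
    xt = t * x1 + (1.0 - t) * x0                           # straight-line interpolation
    target = x1 - x0                                       # constant velocity along the line
    v = velocity_fn(xt, t)                                 # predicted velocity field
    return np.mean((v - target) ** 2)                      # regression loss
```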
- The testdata folder contains some example files that allow the project to run directly.
- If you want to run with your own dataset:
- Replace the feature_dirs and fileid_list in config.yaml with the paths for your own dataset.
- Modify the acoustic parameters to match your data, and adjust the batch size as needed.
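For orientation, the relevant part of the config might look like the fragment below; apart from `feature_dirs` and `fileid_list`, all field names and values here are assumptions about the schema, not the actual file:

```yaml
# Illustrative fragment of config.yaml — adapt to the real schema.
data:
  feature_dirs: ["/path/to/your/mels", "/path/to/your/wavs"]
  fileid_list: "/path/to/your/fileid_list.txt"
audio:
  sampling_rate: 22050   # match your corpus
  hop_size: 256          # mel frame shift in samples
train:
  batch_size: 16         # adjust to your GPU memory
```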
python3 -u train.py -c config.yaml -l logdir -m waveode_1-rectified_flow
- RK45 solver:
python3 inference.py --hparams config.yaml --checkpoint logdir/waveode_1-rectified_flow/xxx.pth --input test_mels_dir --output out_dir
- Euler solver:
python3 inference.py --hparams config.yaml --checkpoint logdir/waveode_1-rectified_flow/xxx.pth --input test_mels_dir --output out_dir --sampling_method euler --sampling_steps N
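The Euler path amounts to fixed-step integration of the learned ODE from noise (t = 0) to audio (t = 1); a sketch, where `velocity_fn` stands in for the trained network. The RK45 path presumably delegates to an adaptive solver such as SciPy's `solve_ivp`.

```python
import numpy as np

def euler_sample(velocity_fn, x0, num_steps):
    """Fixed-step Euler integration of dx/dt = v(x, t) over t in [0, 1],
    mirroring --sampling_method euler --sampling_steps N.
    """
    x = x0.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x = x + dt * velocity_fn(x, t)  # one explicit Euler step
    return x
```

Because a well-trained rectified flow has nearly straight trajectories, small N can already give usable samples, which is the main appeal of this family of models.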
- Generate (noise, audio) tuples using 1-Rectified Flow:
python3 inference.py --hparams config.yaml --checkpoint logdir/waveode/xxx.pth --input all_mels_dir --output testdata/generate
- Train 2-Rectified Flow using generated data
python3 -u train_reflow.py -c config_reflow.yaml -l logdir -m waveode_2-rectified_flow
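The reflow step can be sketched as follows; the key difference from 1-rectified-flow training is that (z, x) are the fixed (noise, audio) pairs generated above, rather than independently sampled couplings. Names are illustrative:

```python
import numpy as np

def reflow_loss(velocity_fn, z, x, rng):
    """2-Rectified Flow ('reflow') objective sketch: regress the velocity
    on straight lines between *paired* endpoints, where z is the exact
    noise the 1-rectified-flow model mapped to audio sample x.
    """
    t = rng.random((z.shape[0],) + (1,) * (z.ndim - 1))  # per-pair time
    xt = t * x + (1.0 - t) * z                           # same linear path
    target = x - z                                       # paired-coupling velocity
    v = velocity_fn(xt, t)
    return np.mean((v - target) ** 2)
```

Training on these rewired couplings straightens the ODE trajectories, which is what lets 2-rectified flow sample with fewer Euler steps.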
- Upload demos of WaveODE on open-source speech corpora such as LJSpeech and VCTK
An ODE-based generative model (also known as a continuous normalizing flow) is a generative model that uses an ODE to model the data distribution: the trajectory from an initial distribution, such as a Gaussian, to the target distribution follows an ordinary differential equation.
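Concretely, the trajectory described above solves an initial value problem of the form

```latex
\frac{\mathrm{d}x_t}{\mathrm{d}t} = v_\theta(x_t, t), \qquad
x_0 \sim \mathcal{N}(0, I), \quad t \in [0, 1],
```

where $v_\theta$ is a neural network and sampling amounts to integrating the ODE from $t = 0$ to $t = 1$; Rectified Flow additionally trains $v_\theta$ so that these trajectories are as straight as possible.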
There are some relevant papers:
[1] Neural Ordinary Differential Equations (Chen et al. 2018) Paper
[2] FFJORD: Free-Form Continuous Dynamics for Scalable Reversible Generative Models (Grathwohl et al. 2018) Paper
[3] Score-Based Generative Modeling through Stochastic Differential Equations (Song et al. 2021) Paper
[4] Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow (Liu et al. 2023) Paper
[5] Stochastic Interpolants: A Unifying Framework for Flows and Diffusions (Albergo et al. 2023) Paper
[6] Action Matching: Learning Stochastic Dynamics From Samples (Neklyudov et al. 2022) Paper
[7] Riemannian Flow Matching on General Geometries (Chen et al. 2023) Paper
[8] Conditional Flow Matching: Simulation-Free Dynamic Optimal Transport (Tong et al. 2023) Paper
[9] Minimizing Trajectory Curvature of ODE-based Generative Models (Lee et al. 2023) Paper
[10] Flow Matching for Generative Modeling (Lipman et al. 2023) Paper
Because ODE-based models are simpler in theory and implementation, they have become very popular recently.
Since Rectified Flow was proposed for image generation, it may need to be modified or improved for speech tasks. Moreover, glitches in generated images (e.g., unnatural hands) are less likely to affect overall image quality, whereas glitches in speech are easy to perceive.
[5] argued that the loss function of Rectified Flow is biased, and [9] argued that Rectified Flow only estimates an upper bound on the degree of intersection of the independent coupling rather than truly minimizing it; improvements to the loss function might therefore improve its quality.