ncnn-hifi-GAN

VULKAN support ...

HiFi-GAN is a GAN-based, high-speed neural vocoder for efficient, high-fidelity speech synthesis in TTS pipelines and for realistic voice conversion.

HiFi-GAN addresses the poor voice quality of earlier GAN-based vocoders.

The HiFi-GAN paper reports that a small-footprint version of the model generates 22.05 kHz speech 13.4 times faster than real time on a CPU, with quality comparable to an autoregressive counterpart.

In TTS based on deep learning, there are two stages to generate speech from text:

1. Generate a mel-spectrogram from text, typically with a model such as Tacotron or FastSpeech.
2. Generate speech from the mel-spectrogram, with a vocoder such as WaveNet or WaveRNN.

WaveNet produces speech that is almost indistinguishable from human speech, but its generation speed is too slow. More recent GAN-based vocoders, such as MelGAN, aim to speed up speech generation, but they sacrifice quality to gain efficiency. HiFi-GAN was designed to provide both efficiency and quality in a single vocoder.

Demo video: output.mp4

How to use.

Download the hifivoice model and place it in the /models folder, then run:
hifivoice.exe -i melgram_flipped.jpg
The vocoder expects mel-spectrogram values in the range of approximately -11 to 2. A mel-spectrogram saved as a regular JPG file has pixel values in the range 0..255, so the values must be rescaled before use: Mel_Image = Mel_Image * (1/255) * 13 - 11. This maps 0..255 to the range -11 to 2.
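The rescaling above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the code used by hifivoice.exe: the function name `rescale_mel_pixels` and its `lo`/`hi` parameters are hypothetical, and loading the JPG into a 2-D list of pixel values is assumed to happen elsewhere.

```python
def rescale_mel_pixels(pixels, lo=-11.0, hi=2.0):
    """Map 8-bit pixel values (0..255) to the approximate vocoder input range.

    `pixels` is a 2-D list (rows of pixel values). The defaults reproduce
    Mel_Image * (1/255) * 13 - 11 from the text; the name and signature
    are illustrative assumptions.
    """
    span = hi - lo  # 13 for the default range
    return [[p / 255.0 * span + lo for p in row] for row in pixels]


# Tiny example: the darkest and brightest pixels hit the ends of the range.
mel = rescale_mel_pixels([[0, 255]])
print(mel)
```

Pixel value 0 maps to -11 and pixel value 255 maps to 2; everything in between is scaled linearly.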
Input mel-spectrogram parameters:
- n_fft = 1024
- num_mels = 80
- sampling_rate = 22050
- hop_size = 256
- win_size = 1024
- fmin = 0
- fmax = 8000
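To get a feel for what these parameters imply, the sketch below estimates how much audio a mel-spectrogram of a given width produces. It assumes that the generator emits hop_size samples per mel frame, as in the standard HiFi-GAN configurations; this has not been checked against this port, and the helper names are hypothetical.

```python
SAMPLING_RATE = 22050  # Hz, from the parameter list above
HOP_SIZE = 256         # samples between consecutive mel frames


def samples_for_frames(num_frames, hop_size=HOP_SIZE):
    """Number of audio samples for `num_frames` mel frames.

    Assumption: one mel frame is upsampled to `hop_size` samples.
    """
    return num_frames * hop_size


def duration_seconds(num_frames, hop_size=HOP_SIZE, sampling_rate=SAMPLING_RATE):
    """Approximate audio duration in seconds for `num_frames` mel frames."""
    return samples_for_frames(num_frames, hop_size) / sampling_rate


# Example: a spectrogram 100 frames wide.
print(samples_for_frames(100))
print(round(duration_seconds(100), 3))
```

Under that assumption each frame covers 256 / 22050 seconds, or roughly 11.6 ms of audio.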

ncnn is a high-performance neural network inference framework.

HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis.
