
Why is the output different from Vibe's work? #9

Open

Sing303 opened this issue Oct 18, 2024 · 5 comments

Sing303 commented Oct 18, 2024

I checked the output of the library and the result of the Vibe application (which uses it). Why are their results different?

Vibe:
[attached output — Fhli3JHyMk]

This lib:
[attached output — nxnN2EAiRU]

altunenes (Contributor) commented Oct 18, 2024

Have you used any audio normalization? It is almost as important a step as the models themselves. Unfortunately, when I reviewed the original papers/repos (pyannote, Whisper, etc.), I could not find a "general" normalization method that should be used to get the best results; what I did was mostly experimental.

Note that Vibe uses:

https://github.com/thewh1teagle/vibe/blob/276a6a20b711ecb6c1aa080d1906ab83269626a0/core/src/audio.rs#L54C1-L75C8

```rust
pub fn normalize(input: PathBuf, output: PathBuf) -> Result<()> {
    let ffmpeg_path = find_ffmpeg_path().context("ffmpeg not found")?;
    tracing::debug!("ffmpeg path is {}", ffmpeg_path.display());

    let mut cmd = Command::new(ffmpeg_path);
    let cmd = cmd.stderr(Stdio::piped()).args([
        "-i",
        input.to_str().context("tostr")?,
        "-ar",
        "16000", // resample to 16 kHz
        "-ac",
        "1", // downmix to mono
        "-c:a",
        "pcm_s16le", // 16-bit PCM
        "-af", // normalize loudness
        "loudnorm=I=-16:TP=-1.5:LRA=11",
        output.to_str().context("tostr")?,
        "-hide_banner",
        "-y",
        "-loglevel",
        "error",
    ]);
    // The quoted snippet is truncated here; the linked source continues by
    // running the command and checking its exit status, roughly like this:
    let out = cmd.output().context("failed to run ffmpeg")?;
    if !out.status.success() {
        tracing::error!("ffmpeg stderr: {}", String::from_utf8_lossy(&out.stderr));
    }
    Ok(())
}
```

From what I've learned about the speaker-identification/Whisper pipeline, audio normalization plays a crucial part. I have no idea what the best normalization is; it's mostly experimental, and different normalizations can give different results in different situations. This is especially obvious with overlapping speech. For my tests, though, I generally use GStreamer's audio normalization; it works really nicely.

https://github.com/sdroege/gstreamer-rs/tree/main/gstreamer-audio
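To make the idea concrete, here is a minimal peak-normalization sketch in plain Rust. This is my own illustration, not Vibe's or GStreamer's method; real loudness normalization (like ffmpeg's loudnorm above) is considerably more involved than rescaling to a peak.

```rust
/// Scale samples so the loudest one reaches `target_peak` (linear, 0.0..=1.0).
/// Hypothetical helper for illustration only.
fn peak_normalize(samples: &mut [f32], target_peak: f32) {
    // Find the largest absolute sample value.
    let peak = samples.iter().fold(0.0f32, |m, s| m.max(s.abs()));
    if peak > 0.0 {
        let gain = target_peak / peak;
        for s in samples.iter_mut() {
            *s *= gain;
        }
    }
}

fn main() {
    let mut samples = vec![0.1f32, -0.5, 0.25];
    peak_normalize(&mut samples, 1.0);
    println!("{:?}", samples); // [0.2, -1.0, 0.5]: loudest sample now at full scale
}
```

The point is just that any such gain change alters the signal the models see, so two frontends applying different normalization can diarize/transcribe the same file differently.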

Sing303 (Author) commented Oct 18, 2024

After the same normalization, the results are still different :)

```sh
ffmpeg -i "6_speakers1.wav" -ar 16000 -ac 1 -c:a pcm_s16le -af "loudnorm=I=-16:TP=-1.5:LRA=11" "6_speakers.wav" -hide_banner -y -loglevel error
```
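(For reference on those filter parameters: `I` is the integrated loudness target in LUFS, `TP` the true-peak ceiling in dBTP, `LRA` the loudness range. dB-style values map to linear amplitude via 10^(dB/20); a quick sketch of that conversion, added here purely as an illustration:)

```rust
/// Convert a decibel value to a linear amplitude ratio: 10^(dB/20).
fn db_to_linear(db: f64) -> f64 {
    10f64.powf(db / 20.0)
}

fn main() {
    // loudnorm's TP=-1.5 caps true peaks at roughly 0.84 of full scale.
    println!("{:.3}", db_to_linear(-1.5));
}
```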

[attached screenshot of the differing outputs]

altunenes (Contributor) commented:

Strange. Which one is more accurate?

Sing303 (Author) commented Oct 18, 2024

Vibe is more accurate.

Sing303 (Author) commented Oct 25, 2024

@thewh1teagle Any idea what the difference is?
