
Why is the output different from Vibe's work? #9

Open

Sing303 opened this issue Oct 18, 2024 · 5 comments

Sing303 commented Oct 18, 2024

I checked the output of the library and the result of the Vibe application (which uses it). Why are their results different?

Vibe:
[attached output — Fhli3JHyMk]

This lib:
[attached output — nxnN2EAiRU]

altunenes (Contributor) commented Oct 18, 2024

Have you used any audio normalization? It is almost as important a step as the models themselves. Unfortunately, when I reviewed the original papers/repos (pyannote, Whisper, etc.), I could not find a "general" normalization method that should be used to get the best results; what I did was mostly experimental.

Note that Vibe uses:

https://github.com/thewh1teagle/vibe/blob/276a6a20b711ecb6c1aa080d1906ab83269626a0/core/src/audio.rs#L54C1-L75C8

```rust
pub fn normalize(input: PathBuf, output: PathBuf) -> Result<()> {
    let ffmpeg_path = find_ffmpeg_path().context("ffmpeg not found")?;
    tracing::debug!("ffmpeg path is {}", ffmpeg_path.display());

    let mut cmd = Command::new(ffmpeg_path);
    let cmd = cmd.stderr(Stdio::piped()).args([
        "-i",
        input.to_str().context("tostr")?,
        "-ar",
        "16000", // resample to 16 kHz
        "-ac",
        "1", // downmix to mono
        "-c:a",
        "pcm_s16le", // 16-bit PCM
        "-af", // normalize loudness
        "loudnorm=I=-16:TP=-1.5:LRA=11",
        output.to_str().context("tostr")?,
        "-hide_banner",
        "-y",
        "-loglevel",
        "error",
    ]);
    // The quoted snippet is truncated here; the linked source continues by
    // running the command and checking its exit status, roughly like this:
    let out = cmd.output().context("failed to run ffmpeg")?;
    if !out.status.success() {
        tracing::error!("ffmpeg stderr: {}", String::from_utf8_lossy(&out.stderr));
    }
    Ok(())
}
```

From what I've learned about the speaker-identification/Whisper pipeline, audio normalization plays a crucial part. I have no idea what the best normalization is; it's mostly experimental, and different normalizations can give different results in different situations. This is especially obvious with overlapping speech. For my tests, though, I generally use GStreamer's audio normalization; it works really nicely.

https://github.com/sdroege/gstreamer-rs/tree/main/gstreamer-audio
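To make the idea concrete, here is a minimal peak-normalization sketch in plain Rust. This is my own illustration, not Vibe's or GStreamer's method; real loudness normalization (like ffmpeg's loudnorm above) is considerably more involved than rescaling to a peak.

```rust
/// Scale samples so the loudest one reaches `target_peak` (linear, 0.0..=1.0).
/// Hypothetical helper for illustration only.
fn peak_normalize(samples: &mut [f32], target_peak: f32) {
    // Find the largest absolute sample value.
    let peak = samples.iter().fold(0.0f32, |m, s| m.max(s.abs()));
    if peak > 0.0 {
        let gain = target_peak / peak;
        for s in samples.iter_mut() {
            *s *= gain;
        }
    }
}

fn main() {
    let mut samples = vec![0.1f32, -0.5, 0.25];
    peak_normalize(&mut samples, 1.0);
    println!("{:?}", samples); // [0.2, -1.0, 0.5]: loudest sample now at full scale
}
```

The point is just that any such gain change alters the signal the models see, so two frontends applying different normalization can diarize/transcribe the same file differently.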

Sing303 (Author) commented Oct 18, 2024

After the same normalization, the results are still different :)

```sh
ffmpeg -i "6_speakers1.wav" -ar 16000 -ac 1 -c:a pcm_s16le -af "loudnorm=I=-16:TP=-1.5:LRA=11" "6_speakers.wav" -hide_banner -y -loglevel error
```
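(For reference on those filter parameters: `I` is the integrated loudness target in LUFS, `TP` the true-peak ceiling in dBTP, `LRA` the loudness range. dB-style values map to linear amplitude via 10^(dB/20); a quick sketch of that conversion, added here purely as an illustration:)

```rust
/// Convert a decibel value to a linear amplitude ratio: 10^(dB/20).
fn db_to_linear(db: f64) -> f64 {
    10f64.powf(db / 20.0)
}

fn main() {
    // loudnorm's TP=-1.5 caps true peaks at roughly 0.84 of full scale.
    println!("{:.3}", db_to_linear(-1.5));
}
```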

[attached screenshot of the differing outputs]

altunenes (Contributor) commented:

Strange. Which one is more accurate?

Sing303 (Author) commented Oct 18, 2024

Vibe is more accurate.

Sing303 (Author) commented Oct 25, 2024

@thewh1teagle Any idea what the difference is?
