
Version 1.1.0 has onnxruntime thread affinity crash #1169

Closed
Appfinity-development opened this issue Nov 23, 2024 · 9 comments

Appfinity-development commented Nov 23, 2024

Updated from 1.0.3 to 1.1.0; now an onnxruntime thread affinity crash occurs every time. Both versions run on an NVIDIA A40 with 4 CPU cores, 48 GB VRAM and 16 GB RAM (on a private Replicate server), so it shouldn't be a hardware issue. Our model config:

self.whisper_model = WhisperModel(
    "large-v2",
    device="cuda",
    compute_type="float16",
    cpu_threads=4,
    num_workers=1,
)

...

options = dict(
    vad_filter=True,
    vad_parameters=dict(min_silence_duration_ms=1000),
    initial_prompt=prompt,
    word_timestamps=True,
    language=language,
    log_progress=True,
    hotwords=prompt,
)

segments, transcript_info = self.whisper_model.transcribe(audio=audio_file, **options)
        

Also tried this:

import os
os.environ["ORT_DISABLE_CPU_AFFINITY"] = "1"
os.environ["OMP_NUM_THREADS"] = "4"
os.environ["OPENBLAS_NUM_THREADS"] = "4"
os.environ["MKL_NUM_THREADS"] = "4"
os.environ["VECLIB_MAXIMUM_THREADS"] = "4"
os.environ["NUMEXPR_NUM_THREADS"] = "4"

But to no avail. Any suggestions? The crash log is below.

Loading large-v2 model...
Done loading large-v2 model, took: 75.503 seconds
Starting transcribing
INFO:faster_whisper:Processing audio with duration 03:25.706
2024-11-22 19:33:53.322733977 [E:onnxruntime:Default, env.cc:234 ThreadMain] pthread_setaffinity_np failed for thread: 785, index: 1, mask: {2, }, error code: 22 error msg: Invalid argument. Specify the number of threads explicitly so the affinity is not set.
INFO:faster_whisper:VAD filter removed 00:19.722 of audio
DEBUG:faster_whisper:VAD filter kept the following audio segments: [00:00.048 -> 01:07.440], [01:07.984 -> 03:06.576]
0%| | 0/185.98 [00:00<?, ?seconds/s]DEBUG:faster_whisper:Processing segment at 00:00.000
Traceback (most recent call last):
  File "/root/.pyenv/versions/3.10.15/lib/python3.10/site-packages/cog/server/runner.py", line 417, in _handle_done
    f.result()
  File "/root/.pyenv/versions/3.10.15/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/root/.pyenv/versions/3.10.15/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
cog.server.exceptions.FatalWorkerException: Prediction failed for an unknown reason. It might have run out of memory? (exitcode -6)

The cog.yaml with dependencies looks like this:

build:
  gpu: true
  system_packages:
    - "ffmpeg"
    - "libmagic1"
  python_version: "3.10"
  python_packages:
    # Core ML packages
    - "torch==2.3.0"
    - "torchaudio==2.3.0"
    - "faster-whisper==1.1.0"
    - "pyannote-audio==3.3.1"
    - "onnxruntime"

    # API and utility packages
    - "requests==2.31.0"
    - "firebase-admin==6.4.0"
    - "google-generativeai==0.3.2"
    - "babel==2.14.0"
    - "openai==1.12.0"
    - "supabase==2.10.0"
    - "kalyke-apns==1.0.3"
    - "numpy<2.0.0"

  run:
    - "pip install --upgrade pip"
    - "echo env is ready!"

predict: "predict.py:Predictor"

Also tried removing the onnxruntime dependency or pinning it to a specific GPU version, but nothing fixes the issue. Anyone with ideas (@MahmoudAshraf97)?

If the CPU is used as the device for WhisperModel, the onnxruntime error still shows up in the logs, but there is no crash and transcription finishes successfully.
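For reference, this is roughly the CPU configuration that completes without crashing (a sketch; compute_type is an assumption here, since float16 is a GPU compute type and int8 or float32 are the usual CPU choices):

from faster_whisper import WhisperModel

# Workaround sketch: same model, but on CPU. The affinity message still shows up
# in the logs, yet transcription finishes successfully.
whisper_model = WhisperModel(
    "large-v2",
    device="cpu",
    compute_type="int8",  # assumption: int8 (or float32) instead of float16 on CPU
    cpu_threads=4,
    num_workers=1,
)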

Appfinity-development changed the title from "Version 1.1.0 has onxyruntime crash" to "Version 1.1.0 has onnxruntime thread affinity crash" on Nov 23, 2024.
MahmoudAshraf97 (Collaborator) commented:

Appfinity-development (Author) commented Nov 23, 2024

Which API is available to set SileroVADModel SessionOptions parameters?

Purfview (Contributor) commented Nov 23, 2024

Which API is available to set SileroVADModel SessionOptions parameters?

Just change it in vad.py to:

        opts.inter_op_num_threads = 1
        opts.intra_op_num_threads = 1

Appfinity-development (Author) commented:

I'm running the code in a Docker environment that just pulls the faster_whisper package from PyPI, so local changes I make to the package in PyCharm won't propagate to the Replicate server. The only two options I see are monkey patching or forking the whole lib, neither of which I'm really keen on doing.

Or am I missing a third option?

MahmoudAshraf97 (Collaborator) commented:

No third option currently; I just want you to test the fix first before we actually take any steps to fix it.

Appfinity-development (Author) commented Nov 28, 2024

Tried monkey patching; this does remove the onnxruntime error, but the OOM error still persisted. It turned out that ctranslate2 version 4.5.0 was incompatible with the cog Docker env on Replicate. After downgrading to 4.4.0 it worked again. I did keep the monkey patch, though, since it keeps the logs clean, and the error seems like something that should be addressed in 1.1.1.

I'm now using large-v2 with the BatchedInferencePipeline, which speeds up processing by around 2x. Very nice for the same model.
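For anyone curious, a minimal sketch of that setup (based on the faster-whisper 1.1.0 API; batch_size=16 is just an assumption to tune against available VRAM):

from faster_whisper import WhisperModel, BatchedInferencePipeline

model = WhisperModel("large-v2", device="cuda", compute_type="float16")
batched_model = BatchedInferencePipeline(model=model)

# Batched transcription; segments are still yielded lazily as a generator.
segments, info = batched_model.transcribe("audio.wav", batch_size=16)
for segment in segments:
    print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text}")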

These are my current packages, in case someone else runs into the issue:

    - "torch==2.3.0"
    - "torchaudio==2.3.0"
    - "faster-whisper==1.1.0"
    - "pyannote-audio==3.3.2"
    - "ctranslate2==4.4.0"

monkey patch:

import faster_whisper.vad
from faster_whisper.vad import SileroVADModel

# to prevent "Invalid argument. Specify the number of threads explicitly so the affinity is not set" onnxruntime error

class PatchedSileroVADModel(SileroVADModel):
    def __init__(self, encoder_path, decoder_path):
        try:
            import onnxruntime
        except ImportError as e:
            raise RuntimeError(
                "Applying the VAD filter requires the onnxruntime package"
            ) from e

        # Custom modification for SessionOptions
        opts = onnxruntime.SessionOptions()
        opts.inter_op_num_threads = 4
        opts.intra_op_num_threads = 4
        opts.log_severity_level = 3

        # Initialize sessions with modified options
        self.encoder_session = onnxruntime.InferenceSession(
            encoder_path,
            providers=["CPUExecutionProvider"],
            sess_options=opts,
        )
        self.decoder_session = onnxruntime.InferenceSession(
            decoder_path,
            providers=["CPUExecutionProvider"],
            sess_options=opts,
        )

faster_whisper.vad.SileroVADModel = PatchedSileroVADModel
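To apply it, the patch just has to be imported before the first transcribe() call with vad_filter=True, so the patched class is already in place when faster_whisper loads the VAD model. A usage sketch, assuming the code above is saved as patch_vad.py (hypothetical module name):

import patch_vad  # noqa: F401  hypothetical module containing the monkey patch above

from faster_whisper import WhisperModel

model = WhisperModel("large-v2", device="cuda", compute_type="float16", cpu_threads=4)
segments, info = model.transcribe("audio.wav", vad_filter=True, word_timestamps=True)
for segment in segments:
    print(segment.text)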

Purfview (Contributor) commented:

I think it should be

        opts.inter_op_num_threads = 1
        opts.intra_op_num_threads = 1

MahmoudAshraf97 (Collaborator) commented:

I think it should be

        opts.inter_op_num_threads = 1
        opts.intra_op_num_threads = 1

The error he's mentioning is only caused when the value is 0, since that means onnx must infer the actual number of threads, and it fails to do so. Any fixed number should fix the error; setting it to 1 is the safest but not the fastest.
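In plain onnxruntime terms, that boils down to this (a minimal sketch; the model path is a placeholder):

import onnxruntime

opts = onnxruntime.SessionOptions()
# Leaving these at 0 tells onnxruntime to infer the thread count and set CPU
# affinity itself, which is what fails inside a restricted container with
# "pthread_setaffinity_np ... error code: 22". Any explicit value avoids it;
# 1 is the safest choice, larger values may be faster.
opts.inter_op_num_threads = 1
opts.intra_op_num_threads = 1

session = onnxruntime.InferenceSession(
    "model.onnx",  # placeholder path
    providers=["CPUExecutionProvider"],
    sess_options=opts,
)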

Also, the VAD encoder now benefits from GPU acceleration, if anyone needs it.

Purfview added a commit to Purfview/faster-whisper that referenced this issue Dec 10, 2024
Reported problems:
SYSTRAN#1193
SYSTRAN#1169

The VAD implementation consumes humongous amounts of memory [the original Silero doesn't have this problem].

This PR should fix the OOM problem.
An alternative solution could be removing 'lru_cache'.

Purfview (Contributor) commented:

@Appfinity-development
Try this fix -> #1198

MahmoudAshraf97 pushed a commit to Purfview/faster-whisper that referenced this issue Dec 12, 2024