Faster video inference script. #650
Conversation
Tested. This really works! Thanks!! Test results (480p, upscale parameter 2): 25% faster for me!
May I ask a question?
I am running the animevideov3 model without FP32 and the outputs are correct.
Sorry, I tried with and without fp32 and there's no difference: the output is entirely white either way.
I'm simply running python inference_realesrgan_video_fast.py with or without --fp32 on the general x4v3 model (the tiny denoise one; the master branch works fine with fp32).
This command is working fine on my machine: python inference_realesrgan_video_fast.py --model_name=realesr-general-x4v3 -i "videos\2022-12-24 17-53-30.mp4" -s 2 Did I understand your input correctly?
I think you are right. I did it with -dn 0; I will try again without it.
It also works here with -dn 0.
So I guess I need to debug into it... Wait, I did use the no-nb_frames video patch, and my input is a webm file (the original script still works, though). Okay, testing with demo.mp4, I find that the fp16 output has some detail while the fp32 output is just color blocks...
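A quick way to narrow down this kind of precision bug is to push a single frame through the network in both precisions and compare the outputs directly. A minimal sketch, assuming a loaded torch super-resolution model and a BGR uint8 frame (the function and variable names here are illustrative, not code from this PR):

```python
import numpy as np
import torch

@torch.no_grad()
def compare_precisions(model, frame_bgr):
    """Run one frame through the model in fp32 and fp16 and report the divergence."""
    # HWC uint8 BGR -> NCHW float RGB in [0, 1]
    img = frame_bgr[:, :, ::-1].astype(np.float32) / 255.0
    x = torch.from_numpy(img).permute(2, 0, 1).unsqueeze(0).cuda()

    out32 = model.float()(x.float()).clamp(0, 1)
    out16 = model.half()(x.half()).float().clamp(0, 1)

    diff = (out32 - out16).abs()
    print(f'max abs diff: {diff.max().item():.4f}, mean abs diff: {diff.mean().item():.6f}')
    # Normal fp16 rounding noise is tiny; large structured differences
    # (solid color blocks, all-white frames) point to a weight-loading or
    # normalization bug in one of the precision paths rather than rounding.
    return out32, out16
```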
FYI, I observe a further speedup with the channels_last memory format. Might be worth a shot? I'm not sure how easy it would be to integrate, though.
Thanks for your comments! Which model are you using? On my side, using channels_last seems to reduce performance by half.
"Official" Real-ESRGAN x4. I suspect the channels_last / channels_first gain will vary by device? Without channels_last, I get about a 1.5x speedup on an A4000.
pytorch/pytorch#92542
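For anyone who wants to try it: switching to channels_last in PyTorch only takes converting both the model and the input tensors, and as the mixed results above show, the gain (or loss) depends on the GPU and model. A minimal sketch, using a toy stand-in network rather than the real Real-ESRGAN weights:

```python
import torch
import torch.nn as nn

# Toy stand-in for the SR network; in practice the real weights would be loaded.
model = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 3, 3, padding=1),
).cuda().half()

# Convert the model weights to the channels_last (NHWC) memory layout.
model = model.to(memory_format=torch.channels_last)

# Inputs must use the same memory format to benefit.
x = torch.rand(4, 3, 480, 854, device='cuda', dtype=torch.half)
x = x.to(memory_format=torch.channels_last)

with torch.no_grad():
    # With NHWC tensors, cuDNN can select NHWC kernels, which are often
    # faster for fp16 convolutions on tensor-core GPUs -- but not always.
    out = model(x)
```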
You could also add an option to change the default libx264 encoder to h264_nvenc for ffmpeg, which would give an additional performance boost. It requires ffmpeg compiled with CUDA support, hence keeping it as an option.
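The change itself would be small: only the -c:v value in the ffmpeg arguments differs between CPU and GPU encoding. A hypothetical sketch, assuming upscaled frames are piped to ffmpeg as raw BGR bytes (the --encoder flag, frame size, and frame source are illustrative, not the PR's actual code):

```python
import argparse
import subprocess

parser = argparse.ArgumentParser()
parser.add_argument('--encoder', default='libx264',
                    help="ffmpeg video encoder; try 'h264_nvenc' with an NVENC-capable ffmpeg build")
args = parser.parse_args()

# Pipe raw upscaled frames to ffmpeg; only the encoder name changes.
cmd = [
    'ffmpeg', '-y',
    '-f', 'rawvideo', '-pix_fmt', 'bgr24',
    '-s', '3840x2160', '-r', '30',   # output frame size and fps (illustrative)
    '-i', 'pipe:',
    '-c:v', args.encoder,
    '-pix_fmt', 'yuv420p',
    'output.mp4',
]
writer = subprocess.Popen(cmd, stdin=subprocess.PIPE)
# for frame in upscaled_frames:           # hypothetical frame source
#     writer.stdin.write(frame.tobytes())
# writer.stdin.close()
# writer.wait()
```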
How to use this on images instead of videos?
Insane performance increase! This took the library from unrealistic on my 4070 to nearly realtime: a 25 fps source video went from 6.1 fps to 21.5 fps! Beautiful, mate!
Changes:
- Batched inference (--batch parameter, default is 4). Pushes CUDA GPU utilization to 100%. (A minimal sketch of the batching idea follows below.)

The metrics above are measured on a 1920x1080 30 fps anime video. On an AMD R9-5900HX CPU (8 cores, 16 threads) and a 3080 LP (16 GB), with FP16, the processing rate goes from 0.8 fps to 4.6 fps with the optimizations (a 575% speed-up!), using about 7.6 GB of VRAM. You also get 4.4 fps (a 550% speed-up) at batch size 2, which requires only about 4.4 GB of VRAM.
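The core of the batching change is collecting several decoded frames into one tensor and running the network once per batch instead of once per frame. A minimal sketch of the idea, assuming a torch model and an iterable of HWC uint8 BGR frames (the helper names are illustrative, not the PR's actual code):

```python
import numpy as np
import torch

@torch.no_grad()
def upscale_batched(model, frames, batch_size=4, half=True):
    """Yield upscaled frames, running the model on batch_size frames at a time."""
    buf = []
    for frame in frames:
        buf.append(frame)
        if len(buf) == batch_size:
            yield from _run(model, buf, half)
            buf = []
    if buf:  # flush the final partial batch
        yield from _run(model, buf, half)

def _run(model, buf, half):
    # Stack HWC uint8 BGR frames -> NCHW float RGB batch in [0, 1]
    batch = np.stack([f[:, :, ::-1] for f in buf]).astype(np.float32) / 255.0
    x = torch.from_numpy(batch).permute(0, 3, 1, 2).cuda()
    x = x.half() if half else x
    out = model(x).float().clamp(0, 1)
    # Back to HWC uint8 BGR for the video writer
    out = (out.permute(0, 2, 3, 1).cpu().numpy() * 255.0).round().astype(np.uint8)
    for img in out:
        yield img[:, :, ::-1]
```

Batching amortizes kernel-launch and Python overhead across frames, which is why GPU utilization climbs; VRAM use grows roughly linearly with batch size, matching the 4.4 GB vs 7.6 GB figures above.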
The script is not yet extensively tested (I'm not sure how best to test it and would appreciate some advice), and it does not support extracting frames first, face enhancement, or alpha/grayscale images. Frame extraction and face enhancement go through very different workflows, so these optimizations may not apply to them. Alpha and grayscale should not be an issue for almost all videos to be processed.
See #619, #634, #531.