help understanding system memory usage #6140
eanopolsky asked this question in Q&A (unanswered, 0 replies)
I'm experiencing higher than expected system memory usage when attempting to load a model and would like to understand why.
I am already aware that prequantized models exist, that they are an easy way to use less memory, and that it's best to stuff the whole model into VRAM whenever possible. My goal in making this post is to improve my understanding of the load-in-4bit and use_double_quant toggles.
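For concreteness, here is my rough mental model of what those two toggles translate to when a model is loaded through the transformers loader (a sketch based on the transformers/bitsandbytes documentation, not on text-generation-webui's source, so the exact mapping is an assumption on my part):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# load-in-4bit / use_double_quant as I understand them:
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,               # store each weight as a 4-bit value
    bnb_4bit_use_double_quant=True,  # also quantize the per-block quantization constants
    bnb_4bit_quant_type="nf4",       # assumed default quant type
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "Open-Orca/Mistral-7B-OpenOrca",  # the model in question
    quantization_config=quant_config,
    device_map="auto",
)
```

If my expectation below is right, the weights in this configuration should take roughly half a byte per parameter, plus a small overhead for the quantization constants.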
Steps to reproduce:

1. Start text-generation-webui (`python server.py`).
2. Before loading any model, check the server's resident memory:

   ```
   ps aux|grep 'python server.py$'|awk '{print $6}'
   ```

   In my case, it was using 691468 KB of resident memory (about 0.66 GB).
3. Load Mistral-7B-OpenOrca with the load-in-4bit and use_double_quant toggles enabled.

Expected results: Because Mistral-7B-OpenOrca claims to be a 7 billion parameter model, and I believe I have instructed text-generation-webui to load each parameter in 4-bit precision, I would expect text-generation-webui's resident memory to grow by roughly 7,000,000,000 parameters * (4 bits / parameter) * (1 byte / 8 bits) * (1 gigabyte / 1,000,000,000 bytes) = 3.5 GB, for a grand total of 4 or 5 GB of resident memory once model loading is complete.
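Spelled out as a quick sanity check (plain arithmetic, nothing webui-specific):

```python
params = 7_000_000_000        # advertised parameter count
bits_per_param = 4            # load-in-4bit
expected_gb = params * bits_per_param / 8 / 1e9
print(expected_gb)            # 3.5
```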
Actual results: Rerunning

```
ps aux|grep 'python server.py$'|awk '{print $6}'
```

after the model loads reports that server.py is using 30578844 KB of resident memory (about 29 GB).
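To see when during loading the growth happens, I can also poll the process instead of rerunning ps by hand (a sketch assuming psutil is installed; the PID is a placeholder):

```python
import time
import psutil

def rss_gb(pid: int) -> float:
    """Resident set size of a process, in GB."""
    return psutil.Process(pid).memory_info().rss / 1e9

pid = 12345  # hypothetical PID of `python server.py`
for _ in range(60):
    print(f"{rss_gb(pid):.2f} GB")
    time.sleep(5)
```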
Other information that may be helpful:

Troubleshooting steps taken:
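One check I have not yet run (a minimal sketch, assuming the model object is reachable from a Python session and was loaded through the transformers loader as in the config sketch above) would be to confirm the weights really ended up 4-bit; with bitsandbytes, quantized weights should appear as Params4bit with a packed uint8 storage dtype rather than float16/float32:

```python
# Inspect each parameter's type, dtype, and device after loading.
for name, p in model.named_parameters():
    print(name, type(p).__name__, p.dtype, p.device)
```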