help understanding system memory usage #6140
eanopolsky asked this question in Q&A (unanswered, 0 replies)
I'm experiencing higher than expected system memory usage when attempting to load a model and would like to understand why.
I am already aware that prequantized models exist, that they are an easy way to use less memory, and that it's best to stuff the whole model into VRAM whenever possible. My goal in making this post is to improve my understanding of the load-in-4bit and use_double_quant toggles.
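For concreteness, here is my rough mental model of what those two toggles translate to when a model is loaded through the transformers loader (a sketch based on the transformers/bitsandbytes documentation, not on text-generation-webui's source, so the exact mapping is an assumption on my part):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# load-in-4bit / use_double_quant as I understand them:
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,               # store each weight as a 4-bit value
    bnb_4bit_use_double_quant=True,  # also quantize the per-block quantization constants
    bnb_4bit_quant_type="nf4",       # assumed default quant type
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "Open-Orca/Mistral-7B-OpenOrca",  # the model in question
    quantization_config=quant_config,
    device_map="auto",
)
```

If my expectation below is right, the weights in this configuration should take roughly half a byte per parameter, plus a small overhead for the quantization constants.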
Steps to reproduce:

1. Start text-generation-webui (`python server.py`).
2. Before loading any model, check the server's resident memory:

   ```
   ps aux|grep 'python server.py$'|awk '{print $6}'
   ```

   In my case, it was using 691468 KB of resident memory (about 0.66 GB).
3. Load Mistral-7B-OpenOrca with the load-in-4bit and use_double_quant toggles enabled.

Expected results: Because Mistral-7B-OpenOrca claims to be a 7 billion parameter model, and I believe I have instructed text-generation-webui to load each parameter in 4-bit precision, I would expect text-generation-webui's resident memory to grow by roughly 7,000,000,000 parameters * (4 bits / parameter) * (1 byte / 8 bits) * (1 gigabyte / 1,000,000,000 bytes) = 3.5 GB, for a grand total of 4 or 5 GB of resident memory once model loading is complete.
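Spelled out as a quick sanity check (plain arithmetic, nothing webui-specific):

```python
params = 7_000_000_000        # advertised parameter count
bits_per_param = 4            # load-in-4bit
expected_gb = params * bits_per_param / 8 / 1e9
print(expected_gb)            # 3.5
```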
Actual results: Rerunning

```
ps aux|grep 'python server.py$'|awk '{print $6}'
```

after the model loads reports that server.py is using 30578844 KB of resident memory (about 29 GB).
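To see when during loading the growth happens, I can also poll the process instead of rerunning ps by hand (a sketch assuming psutil is installed; the PID is a placeholder):

```python
import time
import psutil

def rss_gb(pid: int) -> float:
    """Resident set size of a process, in GB."""
    return psutil.Process(pid).memory_info().rss / 1e9

pid = 12345  # hypothetical PID of `python server.py`
for _ in range(60):
    print(f"{rss_gb(pid):.2f} GB")
    time.sleep(5)
```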
Other information that may be helpful:

Troubleshooting steps taken:
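One check I have not yet run (a minimal sketch, assuming the model object is reachable from a Python session and was loaded through the transformers loader as in the config sketch above) would be to confirm the weights really ended up 4-bit; with bitsandbytes, quantized weights should appear as Params4bit with a packed uint8 storage dtype rather than float16/float32:

```python
# Inspect each parameter's type, dtype, and device after loading.
for name, p in model.named_parameters():
    print(name, type(p).__name__, p.dtype, p.device)
```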