Motivation.

In vLLM there is already support for two model configuration formats and several weight formats. However, there are other, less common use cases that aren't covered by the existing code base, for example #12250 and #10647.
The purpose of this RFC is to enable two use cases:

- Custom configuration or weight formats.
- Loading configurations and weights from custom storage back-ends, such as KV stores.
Proposed Change.
Currently the configuration format can be controlled by the following flag:
```
--config-format {auto,hf,mistral}
    The format of the model config to load. * "auto" will
    try to load the config in hf format if available else
    it will try to load in mistral format
```
The proposal of this RFC is to expand it to:
```
--config-format {auto,hf,mistral} or name registered in --config-format-plugin
    The format of the model config to load. * "auto" will
    try to load the config in hf format if available else
    it will try to load in mistral format

--config-format-plugin CONFIG_FORMAT_PLUGIN
    A config-format plugin that loads the model
    configuration from custom formats or custom storage backends.
    The name registered by this plugin can be used in --config-format.
```
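To make the intent concrete, here is a minimal sketch of what a config-format plugin might look like. Everything below is an illustrative assumption: the registration hook `register_config_format`, the loader signature, and the KV client are hypothetical, since the RFC deliberately leaves the exact plugin interface open.

```python
# Hypothetical sketch only: `register_config_format` and the loader
# signature are illustrative assumptions, not an existing vLLM API.
import json

from transformers import PretrainedConfig


class InMemoryKVStore:
    """Stand-in for a real KV backend (etcd, Redis, ...)."""

    def __init__(self) -> None:
        self._data: dict[str, bytes] = {}

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: str) -> bytes:
        return self._data[key]


KV_STORE = InMemoryKVStore()


def load_config_from_kv(model: str, **kwargs) -> PretrainedConfig:
    """Fetch the raw config for `model` from the KV store and turn it
    into the PretrainedConfig object that the rest of vLLM consumes."""
    raw = json.loads(KV_STORE.get(f"configs/{model}"))
    return PretrainedConfig.from_dict(raw)


# The plugin package would expose its loader under a registered name, e.g.:
#
#     register_config_format("kv_store", load_config_from_kv)
#
# after which users could select it with:
#
#     --config-format kv_store --config-format-plugin my_plugin_pkg
```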
In the same way, the weight format is currently controlled by:
```
--load-format {auto,pt,safetensors,npcache,dummy,tensorizer,sharded_state,gguf,bitsandbytes,mistral,runai_streamer}
    The format of the model weights to load. * "auto" will
    try to load the weights in the safetensors format and
    fall back to the pytorch bin format if safetensors
    format is not available. * "pt" will load the weights
    in the pytorch bin format. * "safetensors" will load
    the weights in the safetensors format. * "npcache"
    will load the weights in pytorch format and store a
    numpy cache to speed up the loading. * "dummy" will
    initialize the weights with random values, which is
    mainly for profiling. * "tensorizer" will load the
    weights using tensorizer from CoreWeave. See the
    Tensorize vLLM Model script in the Examples section
    for more information. * "runai_streamer" will load the
    Safetensors weights using Run:ai Model Streamer. *
    "bitsandbytes" will load the weights using
    bitsandbytes quantization.
```
The proposal of this RFC is to expand it to:
```
--load-format {auto,pt,safetensors,npcache,dummy,tensorizer,sharded_state,gguf,bitsandbytes,mistral,runai_streamer} or name registered in --load-format-plugin

--load-format-plugin LOAD_FORMAT_PLUGIN
    A weight-format loader plugin that loads the model
    weights from custom formats or custom storage backends.
    The name registered by this plugin can be used in --load-format.
```
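Similarly, here is a minimal sketch of a weight-format loader plugin under the same assumptions: the `register_load_format` hook and the simplified loader interface below are hypothetical, and a real loader would also have to handle sharding, dtype conversion, and quantized checkpoints.

```python
# Hypothetical sketch only: `register_load_format` and this simplified
# loader interface are illustrative assumptions, not an existing vLLM API.
import torch
from torch import nn


class KVStoreModelLoader:
    """Loads one tensor per parameter from a custom storage backend."""

    def __init__(self, kv_client) -> None:
        self.kv = kv_client  # any object with a .get(key) -> bytes method

    @torch.no_grad()
    def load_weights(self, model: nn.Module, model_name: str) -> None:
        for name, param in model.named_parameters():
            blob = self.kv.get(f"weights/{model_name}/{name}")
            # torch.frombuffer gives a flat tensor over the raw bytes;
            # reshape it to the parameter's shape before copying it in.
            tensor = torch.frombuffer(bytearray(blob), dtype=param.dtype)
            param.copy_(tensor.view(param.shape))


# The plugin package would register the loader under a name usable in
# --load-format, e.g.:
#
#     register_load_format("kv_store", KVStoreModelLoader)
#
#     --load-format kv_store --load-format-plugin my_plugin_pkg
```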
Feedback Period.
No response
CC List.
@njhill, @tjohnson31415, @fialhocoelho
Any Other Things.
No response