lm_eval --model vllm did not work when data_parallel_size > 1 #2379

Open
wukaixingxp opened this issue Oct 3, 2024 · 4 comments
Labels
bug: Something isn't working.

Comments

@wukaixingxp

We noticed that lm_eval --model vllm does not work when data_parallel_size > 1: it fails with Error: No available node types can fulfill resource request from Ray. After some research, I believe that when tensor_parallel_size=1, the latest vLLM should use multiprocessing instead of Ray (in this line). My command works with data_parallel_size=1 but fails when data_parallel_size > 1; the logs are below, please help!
Log:

(llama) $ pip list | grep vllm
vllm                                     0.6.2
vllm-flash-attn                          2.5.9.post1
(llama) $ CUDA_VISIBLE_DEVICES=4,5,6,7 lm_eval --model vllm   --model_args pretrained=meta-llama/Llama-3.1-8B,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.9,data_parallel_size=4,max_model_len=8192,add_bos_token=True,seed=42 --tasks meta_pretrain --batch_size auto --output_path eval_results --include_path ./work_dir --seed 42  --log_samples
2024-10-02:13:21:55,591 INFO     [__main__.py:272] Verbosity set to INFO
2024-10-02:13:21:55,591 INFO     [__main__.py:303] Including path: ./work_dir
2024-10-02:13:21:59,000 INFO     [__main__.py:369] Selected Tasks: ['meta_pretrain']
2024-10-02:13:21:59,093 INFO     [evaluator.py:152] Setting random seed to 42 | Setting numpy seed to 42 | Setting torch manual seed to 42
2024-10-02:13:21:59,093 INFO     [evaluator.py:189] Initializing vllm model, with arguments: {'pretrained': 'meta-llama/Llama-3.1-8B', 'tensor_parallel_size': 1, 'dtype': 'auto', 'gpu_memory_utilization': 0.9, 'data_parallel_size': 4, 'max_model_len': 8192, 'add_bos_token': True, 'seed': 42}
2024-10-02:13:21:59,093 WARNING  [vllm_causallms.py:105] You might experience occasional issues with model weight downloading when data_parallel is in use. To ensure stable performance, run with data_parallel_size=1 until the weights are downloaded and cached.
2024-10-02:13:21:59,093 INFO     [vllm_causallms.py:110] Manual batching is not compatible with data parallelism.
2024-10-02:13:22:01,155 WARNING  [task.py:325] [Task: meta_bbh] has_training_docs and has_validation_docs are False, using test_docs as fewshot_docs but this is not recommended.
2024-10-02:13:22:01,159 WARNING  [task.py:325] [Task: meta_bbh] has_training_docs and has_validation_docs are False, using test_docs as fewshot_docs but this is not recommended.
2024-10-02:13:22:02,690 WARNING  [task.py:325] [Task: meta_mmlu_pro_pretrain] has_training_docs and has_validation_docs are False, using test_docs as fewshot_docs but this is not recommended.
2024-10-02:13:22:02,696 WARNING  [task.py:325] [Task: meta_mmlu_pro_pretrain] has_training_docs and has_validation_docs are False, using test_docs as fewshot_docs but this is not recommended.
2024-10-02:13:22:03,420 INFO     [evaluator.py:261] Setting fewshot random generator seed to 42
2024-10-02:13:22:03,420 INFO     [evaluator.py:261] Setting fewshot random generator seed to 42
2024-10-02:13:22:03,422 INFO     [task.py:411] Building contexts for meta_mmlu_pro_pretrain on rank 0...
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12032/12032 [00:00<00:00, 29428.92it/s]
2024-10-02:13:22:04,566 INFO     [task.py:411] Building contexts for meta_bbh on rank 0...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6511/6511 [00:00<00:00, 153755.41it/s]
2024-10-02:13:22:04,876 INFO     [evaluator.py:438] Running generate_until requests
Running generate_until requests:   0%|                                                                                                                  | 0/18543 [00:00<?, ?it/s]2024-10-02 13:22:32,034 INFO worker.py:1783 -- Started a local Ray instance.
(run_inference_one_model pid=1394830) Calling ray.init() again after it has already been called.
(autoscaler +46s) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
(autoscaler +46s) Error: No available node types can fulfill resource request {'node:2401:db00:23c:1314:face:0:34f:0': 0.001, 'GPU': 1.0}. Add suitable node types to this cluster to resolve this issue.
(run_inference_one_model pid=1394830) INFO 10-02 13:22:47 ray_utils.py:183] Waiting for creating a placement group of specs for 10 seconds. specs=[{'node:2401:db00:23c:1314:face:0:34f:0': 0.001, 'GPU': 1.0}]. Check `ray status` to see if you have enough resources.
(run_inference_one_model pid=1394830) INFO 10-02 13:23:07 ray_utils.py:183] Waiting for creating a placement group of specs for 30 seconds. specs=[{'node:2401:db00:23c:1314:face:0:34f:0': 0.001, 'GPU': 1.0}]. Check `ray status` to see if you have enough resources. [repeated 4x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(autoscaler +1m21s) Error: No available node types can fulfill resource request {'node:2401:db00:23c:1314:face:0:34f:0': 0.001, 'GPU': 1.0}. Add suitable node types to this cluster to resolve this issue.
(run_inference_one_model pid=1394830) INFO 10-02 13:23:47 ray_utils.py:183] Waiting for creating a placement group of specs for 70 seconds. specs=[{'node:2401:db00:23c:1314:face:0:34f:0': 0.001, 'GPU': 1.0}]. Check `ray status` to see if you have enough resources. [repeated 4x across cluster]
(autoscaler +1m56s) Error: No available node types can fulfill resource request {'node:2401:db00:23c:1314:face:0:34f:0': 0.001, 'GPU': 1.0}. Add suitable node types to this cluster to resolve this issue.
(autoscaler +2m31s) Error: No available node types can fulfill resource request {'node:2401:db00:23c:1314:face:0:34f:0': 0.001, 'GPU': 1.0}. Add suitable node types to this cluster to resolve this issue.

Meanwhile, ray status shows 4 GPUs available:

ray status
2024-10-02 13:25:26,488 - INFO - Note: detected 384 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2024-10-02 13:25:26,488 - INFO - Note: NumExpr detected 384 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
2024-10-02 13:25:26,488 - INFO - NumExpr defaulting to 16 threads.
======== Autoscaler status: 2024-10-02 13:25:26.672321 ========
Node status
---------------------------------------------------------------
Active:
 1 node_497fbea7f83f313a8d6a3894bfaffdc98e398f9ee2ea8078803286cd
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/384.0 CPU
 0.0/4.0 GPU
 0B/1.85TiB memory
 52.77MiB/186.26GiB object_store_memory

Demands:
 {'GPU': 1.0, 'node:2401:db00:23c:1314:face:0:34f:0': 0.001} * 1 (PACK): 4+ pending placement groups
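
For reference, here is a simplified sketch of the data-parallel pattern I believe is in play, based on the run_inference_one_model actor visible in the log. This is only an illustration, not the harness's actual code; the model name, prompt, and sampling settings are placeholders:

import ray
from vllm import LLM, SamplingParams

# Illustration only (not lm_eval's real code): one Ray task per data-parallel
# replica, each reserving a single GPU and building its own single-GPU vLLM engine.
@ray.remote(num_gpus=1, num_cpus=1)
def run_inference_one_model(model_name, prompts):
    llm = LLM(model=model_name, tensor_parallel_size=1, gpu_memory_utilization=0.9)
    return llm.generate(prompts, SamplingParams(max_tokens=16))

ray.init()
futures = [
    run_inference_one_model.remote("meta-llama/Llama-3.1-8B", ["Hello world"])
    for _ in range(4)  # data_parallel_size=4
]
results = ray.get(futures)

With tensor_parallel_size=1 I would expect each engine to run inside its own actor without asking Ray for additional GPUs, which is why the pending placement-group demand above looks wrong to me.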

baberabb commented Oct 4, 2024

Hi! I cannot reproduce this (on an A40 node). Are you already running a Ray instance? I think that might be the issue, as I don't get the autoscaler messages that appear in your log. I also haven't been able to initialize multiple models inside a multiprocessing context, since vLLM wants to create child processes of its own and that isn't allowed.
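
To rule that out, a quick check along these lines (rough sketch, using only standard Ray calls; run it in the same environment as the eval) shows whether an existing cluster would be picked up:

import ray

# Sketch: try to attach to an already-running Ray cluster. If this connects,
# lm_eval's workers may be scheduled onto that cluster instead of the fresh
# local instance that would otherwise be started for the run.
try:
    ray.init(address="auto")
    print("Existing cluster found:", ray.cluster_resources())
except ConnectionError:
    print("No running Ray cluster detected.")
finally:
    if ray.is_initialized():
        ray.shutdown()

If a cluster does show up there, stopping it with `ray stop` (or running in a clean environment) would be worth trying first.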

cc: @mgoin in case they have any tips!

baberabb added the bug label Oct 4, 2024
@wukaixingxp
Author

I think vLLM will use "mp" (multiprocessing) by default for single-GPU inference, as stated in this line, but lm_eval is still using Ray. Correct me if I am wrong.
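
For comparison, plain single-GPU vLLM outside the harness does not need a Ray cluster at all; something like this (sketch, model and prompt just for illustration):

from vllm import LLM, SamplingParams

# Sketch: a standalone single-GPU engine. With tensor_parallel_size=1 no Ray
# placement group should be requested; the engine runs locally.
llm = LLM(
    model="meta-llama/Llama-3.1-8B",
    tensor_parallel_size=1,
    gpu_memory_utilization=0.8,
)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)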


wukaixingxp commented Oct 4, 2024

My command: lm_eval --model vllm --model_args pretrained=meta-llama/Meta-Llama-3.1-8B,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.8,data_parallel_size=4 --tasks lambada_openai --batch_size auto
Log:

lm_eval --model vllm  --model_args pretrained=meta-llama/Meta-Llama-3.1-8B,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.8,data_parallel_size=4 --tasks lambada_openai --batch_size auto
2024-10-04:14:46:18,784 INFO     [__main__.py:279] Verbosity set to INFO
2024-10-04:14:46:18,815 INFO     [__init__.py:491] `group` and `group_alias` keys in TaskConfigs are deprecated and will be removed in v0.4.5 of lm_eval. The new `tag` field will be used to allow for a shortcut to a group of tasks one does not wish to aggregate metrics across. `group`s which aggregate across subtasks must be only defined in a separate group config file, which will be the official way to create groups that support cross-task aggregation as in `mmlu`. Please see the v0.4.4 patch notes and our documentation: https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#advanced-group-configs for more information.
2024-10-04:14:46:23,062 INFO     [__main__.py:376] Selected Tasks: ['lambada_openai']
2024-10-04:14:46:23,152 INFO     [evaluator.py:161] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
2024-10-04:14:46:23,152 INFO     [evaluator.py:198] Initializing vllm model, with arguments: {'pretrained': 'meta-llama/Meta-Llama-3.1-8B', 'tensor_parallel_size': 1, 'dtype': 'auto', 'gpu_memory_utilization': 0.8, 'data_parallel_size': 4}
2024-10-04:14:46:23,152 WARNING  [vllm_causallms.py:105] You might experience occasional issues with model weight downloading when data_parallel is in use. To ensure stable performance, run with data_parallel_size=1 until the weights are downloaded and cached.
2024-10-04:14:46:23,152 INFO     [vllm_causallms.py:110] Manual batching is not compatible with data parallelism.
Downloading data: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.16M/1.16M [00:00<00:00, 5.40MB/s]
Generating test split: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 5153/5153 [00:00<00:00, 654272.83 examples/s]
2024-10-04:14:46:27,058 WARNING  [task.py:337] [Task: lambada_openai] has_training_docs and has_validation_docs are False, using test_docs as fewshot_docs but this is not recommended.
2024-10-04:14:46:27,059 WARNING  [task.py:337] [Task: lambada_openai] has_training_docs and has_validation_docs are False, using test_docs as fewshot_docs but this is not recommended.
2024-10-04:14:46:27,099 INFO     [evaluator.py:279] Setting fewshot random generator seed to 1234
2024-10-04:14:46:27,099 WARNING  [model.py:422] model.chat_template was called with the chat_template set to False or None. Therefore no chat template will be applied. Make sure this is an intended behavior.
2024-10-04:14:46:27,100 INFO     [task.py:423] Building contexts for lambada_openai on rank 0...
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5153/5153 [00:05<00:00, 969.30it/s]
2024-10-04:14:46:32,456 INFO     [evaluator.py:465] Running loglikelihood requests
Running loglikelihood requests:   0%|                                                                                                                    | 0/5153 [00:00<?, ?it/s]2024-10-04 14:46:35,488 INFO worker.py:1783 -- Started a local Ray instance.
(run_inference_one_model pid=70741) WARNING 10-04 14:46:42 arg_utils.py:930] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False.
(run_inference_one_model pid=70741) INFO 10-04 14:46:42 config.py:1010] Chunked prefill is enabled with max_num_batched_tokens=512.
(run_inference_one_model pid=70741) Calling ray.init() again after it has already been called.
(autoscaler +26s) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
(autoscaler +26s) Error: No available node types can fulfill resource request {'node:2401:db00:23c:1314:face:0:34f:0': 0.001, 'GPU': 1.0}. Add suitable node types to this cluster to resolve this issue.
(run_inference_one_model pid=70951) INFO 10-04 14:46:52 ray_utils.py:183] Waiting for creating a placement group of specs for 10 seconds. specs=[{'node:2401:db00:23c:1314:face:0:34f:0': 0.001, 'GPU': 1.0}]. Check `ray status` to see if you have enough resources.
(run_inference_one_model pid=70814) WARNING 10-04 14:46:43 arg_utils.py:930] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False. [repeated 3x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(run_inference_one_model pid=70814) INFO 10-04 14:46:43 config.py:1010] Chunked prefill is enabled with max_num_batched_tokens=512. [repeated 3x across cluster]
(run_inference_one_model pid=70951) INFO 10-04 14:47:12 ray_utils.py:183] Waiting for creating a placement group of specs for 30 seconds. specs=[{'node:2401:db00:23c:1314:face:0:34f:0': 0.001, 'GPU': 1.0}]. Check `ray status` to see if you have enough resources. [repeated 4x across cluster]
(autoscaler +1m1s) Error: No available node types can fulfill resource request {'node:2401:db00:23c:1314:face:0:34f:0': 0.001, 'GPU': 1.0}. Add suitable node types to this cluster to resolve this issue.
(run_inference_one_model pid=70951) INFO 10-04 14:47:52 ray_utils.py:183] Waiting for creating a placement group of specs for 70 seconds. specs=[{'node:2401:db00:23c:1314:face:0:34f:0': 0.001, 'GPU': 1.0}]. Check `ray status` to see if you have enough resources. [repeated 4x across cluster]
(autoscaler +1m37s) Error: No available node types can fulfill resource request {'node:2401:db00:23c:1314:face:0:34f:0': 0.001, 'GPU': 1.0}. Add suitable node types to this cluster to resolve this issue.


baberabb commented Oct 4, 2024

I think vLLM will use "mp" (multiprocessing) for 1gpu inference by default as stated in this line, but lm_eval is still using ray. Correct me if I am wrong.

It should use mp when data_parallel_size=1. Otherwise we need to initialize multiple vLLM instances, and I haven't found a way to do that outside of a Ray context.

You could try changing this here:

@ray.remote(num_gpus=1, num_cpus=1)

and maybe also calling ray.init(...) with the appropriate args beforehand, but IIRC this didn't work properly with tensor_parallel_size > 1.
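
Roughly like this (untested sketch; it drives the harness from Python via lm_eval.simple_evaluate so that ray.init can run first, with num_gpus matching data_parallel_size):

import ray
import lm_eval

# Untested sketch of the suggestion above: start Ray with explicit resources
# before the harness spawns its data-parallel workers. This needs the Python
# entry point rather than the CLI so ray.init runs in the same process.
ray.init(num_gpus=4, ignore_reinit_error=True)

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=meta-llama/Meta-Llama-3.1-8B,tensor_parallel_size=1,data_parallel_size=4,gpu_memory_utilization=0.8",
    tasks=["lambada_openai"],
    batch_size="auto",
)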

Alternatively, you could serve the models separately and use local-completions to send in the requests.
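
For example (untested; the port, GPU assignment, and request flags are illustrative, with one server per GPU):

CUDA_VISIBLE_DEVICES=4 vllm serve meta-llama/Meta-Llama-3.1-8B --port 8000
lm_eval --model local-completions --tasks lambada_openai --model_args model=meta-llama/Meta-Llama-3.1-8B,base_url=http://localhost:8000/v1/completions,num_concurrent=8,max_retries=3,tokenized_requests=False,batch_size=16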
