[Bug]: Running inference on multiple LLMs one by one with multi-TP always hangs (pending) on the second model in the list #12337

alexhegit opened this issue Jan 23, 2025 · 0 comments
Labels
bug Something isn't working

alexhegit commented Jan 23, 2025

Your current environment

The output of `python collect_env.py`
Neuron SDK Version: N/A
vLLM Version: 0.6.6.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV18    NV18    NV18    NV18    NV18    NV18    NV18    0-95,192-287    0               N/A
GPU1    NV18     X      NV18    NV18    NV18    NV18    NV18    NV18    0-95,192-287    0               N/A
GPU2    NV18    NV18     X      NV18    NV18    NV18    NV18    NV18    0-95,192-287    0               N/A
GPU3    NV18    NV18    NV18     X      NV18    NV18    NV18    NV18    0-95,192-287    0               N/A
GPU4    NV18    NV18    NV18    NV18     X      NV18    NV18    NV18    96-191,288-383  1               N/A
GPU5    NV18    NV18    NV18    NV18    NV18     X      NV18    NV18    96-191,288-383  1               N/A
GPU6    NV18    NV18    NV18    NV18    NV18    NV18     X      NV18    96-191,288-383  1               N/A
GPU7    NV18    NV18    NV18    NV18    NV18    NV18    NV18     X      96-191,288-383  1               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NVIDIA_VISIBLE_DEVICES=all
NVIDIA_REQUIRE_CUDA=cuda>=12.1 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471 brand=tesla,driver>=525,driver<526 brand=unknown,driver>=525,driver<526 brand=nvidia,driver>=525,driver<526 brand=nvidiartx,driver>=525,driver<526 brand=geforce,driver>=525,driver<526 brand=geforcertx,driver>=525,driver<526 brand=quadro,driver>=525,driver<526 brand=quadrortx,driver>=525,driver<526 brand=titan,driver>=525,driver<526 brand=titanrtx,driver>=525,driver<526
NVIDIA_DRIVER_CAPABILITIES=compute,utility
VLLM_USAGE_SOURCE=production-docker-image
CUDA_VERSION=12.1.0
LD_LIBRARY_PATH=/usr/local/lib/python3.12/dist-packages/cv2/../../lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
CUDA_MODULE_LOADING=LAZY

Model Input Dumps

No response

🐛 Describe the bug

I use my test tool, which can be obtained from https://github.com/alexhegit/vLLM_ModelCoverageTest/

The Python code is copied here:

import os
import shutil
from datetime import datetime
import torch
import logging
from argparse import ArgumentParser
import pandas as pd
from vllm import LLM, SamplingParams

current_date = datetime.now().strftime('%Y%m%d')
log_file = f"mct-{current_date}.log"
logging.basicConfig(filename=log_file, level=logging.DEBUG, 
                    format='%(asctime)s - %(levelname)s - %(message)s')

class InferenceEngine:
    def infer_with_model(self, model_id, gpus):
        try:
            if not isinstance(gpus, list) or len(gpus) == 0:
                logging.error(f"<vLLM-CMT> Provided GPUs list is invalid for model {model_id}: {gpus}")
                return "FAILED"
            
            # Restrict the visible GPUs for this model; TP size is the number of GPUs listed
            os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(map(str, gpus))
            tp = len(gpus)
            logging.info(f"<vLLM-CMT> Inference Model {model_id}, TP {tp}")
            # A new vLLM engine is created in the same Python process for every CSV row
            llm = LLM(model=model_id,
                      tensor_parallel_size=tp,
                      trust_remote_code=True,
                      gpu_memory_utilization=0.95,
                      max_model_len=1024,
                      enforce_eager=True,
                      load_format="dummy"
                     )
            prompts = ["The capital of France is"]
            outputs = llm.generate(prompts, SamplingParams(temperature=0.8, top_p=0.9))
            if not outputs or len(outputs) == 0:
                raise ValueError("<vLLM-CMT> No outputs received from the model.")
            for output in outputs:
                prompt = output.prompt
                generated_text = output.outputs[0].text
                logging.info(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
            logging.info(f"<vLLM-CMT> Model {model_id} inference status: PASS")
            return "PASS"
        except Exception as e:
            logging.error(f"<vLLM-CMT> Error during inference for model {model_id}: {e}")
            return "FAILED"

def delete_model_cache():
    try:
        cache_dir = os.path.expanduser("~/.cache/huggingface/hub/")
        if os.path.exists(cache_dir):
            shutil.rmtree(cache_dir)
            logging.info(f"<vLLM-CMT> Model cache directory deleted: {cache_dir}")
        else:
            logging.warning(f"<vLLM-CMT> Model cache directory does not exist: {cache_dir}")
    except Exception as e:
        logging.error(f"<vLLM-CMT> Error occurred while deleting model cache: {e}")

def main():
    parser = ArgumentParser(description="Run inference on models specified in a CSV file.")
    parser.add_argument("--csv", type=str, required=True, help="Path to the input CSV file")
    args = parser.parse_args()
    csv_file = args.csv
    if not os.path.isfile(csv_file):
        logging.error(f"<vLLM-CMT> Provided CSV file does not exist: {csv_file}")
        parser.print_help()
        return
    
    try:
        df = pd.read_csv(csv_file)
        if 'model_id' not in df.columns or 'gpus' not in df.columns:
            logging.error("<vLLM-CMT> CSV file must contain 'model_id' and 'gpus' columns.")
            raise ValueError("CSV file must contain 'model_id' and 'gpus' columns.")
        
        engine = InferenceEngine()
        results = []
        for index, row in df.iterrows():
            model_id = row['model_id']
            gpus = [int(gpu) for gpu in str(row['gpus']).split(',')]
            status = engine.infer_with_model(model_id, gpus)
            logging.info(f"<vLLM-CMT> Model {model_id} inference status: {status}")
            results.append(status)
            
            # Save intermediate results after each model
            df.loc[index, 'status'] = status
            base_name, ext = os.path.splitext(csv_file)
            output_csv_file = f"{base_name}_results{ext}"
            df.to_csv(output_csv_file, index=False)
            logging.info(f"<vLLM-CMT> Intermediate results saved to: {output_csv_file}")
            
            delete_model_cache()
        
    except Exception as e:
        logging.error(f"<vLLM-CMT> Error occurred while processing CSV file or models: {e}")

if __name__ == "__main__":
    main()
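
In essence, the script above constructs a brand-new LLM() engine for every CSV row inside the same Python process. A distilled sketch of the sequence that triggers the problem (a hypothetical standalone reproducer, not part of the test tool, using the same vLLM 0.6.6.post1 setup and the models/TP sizes from the list below):

from vllm import LLM, SamplingParams

# First row: TP=2 works fine.
llm_a = LLM(model="facebook/opt-125m", tensor_parallel_size=2,
            enforce_eager=True, load_format="dummy", max_model_len=1024)
print(llm_a.generate(["The capital of France is"],
                     SamplingParams(temperature=0.8, top_p=0.9))[0].outputs[0].text)

# Second row in the same process: engine init either hangs/pends (TP=2 again) or
# fails with "tensor parallel group already initialized, but of unexpected size"
# (TP=1), because the distributed state created for llm_a is still alive.
del llm_a
llm_b = LLM(model="BAAI/Aquila-7B", tensor_parallel_size=2,
            trust_remote_code=True, enforce_eager=True,
            load_format="dummy", max_model_len=1024)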

The reproduction steps are:

Steps:

Step 1: set up a multi-TP LLM model list in a CSV file, e.g.

Here is the multi-TP batch test CSV file:

# cat dev.csv
model_id,gpus
facebook/opt-125m,"0,1"
BAAI/Aquila-7B,"0,1"
BAAI/Aquila-7B,"0"

Step 2: run the test

python ModelCoverageTest.py --csv dev.csv

Step 3: check the results

root@titan:/ws/vLLM_ModelCoverageTest# cat dev_results.csv
model_id,gpus,status
facebook/opt-125m,"0,1",PASS
BAAI/Aquila-7B,"0,1",FAILED
BAAI/Aquila-7B,0,FAILED

If every entry in the model list is set to a single GPU (TP=1), as below, all of them run with PASS:

model_id,gpus,status
facebook/opt-125m,"0",PASS
BAAI/Aquila-7B,"0",PASS
BAAI/Aquila-7B,0,PASS

Step 4: check the log

root@titan:/ws/vLLM_ModelCoverageTest# tail mct-20250122.log
2025-01-22 19:19:41,374 - DEBUG - https://huggingface.co:443 "HEAD /BAAI/Aquila-7B/resolve/main/generation_config.json HTTP/1.1" 200 0
2025-01-22 19:19:41,375 - DEBUG - Attempting to acquire lock 127590362120848 on /root/.cache/huggingface/hub/.locks/models--BAAI--Aquila-7B/684bc56cb1fb502fe6bfecbc2bb6713f2db918d7.lock
2025-01-22 19:19:41,375 - DEBUG - Lock 127590362120848 acquired on /root/.cache/huggingface/hub/.locks/models--BAAI--Aquila-7B/684bc56cb1fb502fe6bfecbc2bb6713f2db918d7.lock
2025-01-22 19:19:41,461 - DEBUG - https://huggingface.co:443 "GET /BAAI/Aquila-7B/resolve/main/generation_config.json HTTP/1.1" 200 132
2025-01-22 19:19:41,462 - DEBUG - Attempting to release lock 127590362120848 on /root/.cache/huggingface/hub/.locks/models--BAAI--Aquila-7B/684bc56cb1fb502fe6bfecbc2bb6713f2db918d7.lock
2025-01-22 19:19:41,462 - DEBUG - Lock 127590362120848 released on /root/.cache/huggingface/hub/.locks/models--BAAI--Aquila-7B/684bc56cb1fb502fe6bfecbc2bb6713f2db918d7.lock
2025-01-22 19:19:41,701 - ERROR - <vLLM-CMT> Error during inference for model BAAI/Aquila-7B: tensor parallel group already initialized, but of unexpected size: get_tensor_model_parallel_world_size()=2 vs. tensor_model_parallel_size=1
2025-01-22 19:19:41,713 - INFO - <vLLM-CMT> Model BAAI/Aquila-7B inference status: FAILED
2025-01-22 19:19:41,715 - INFO - <vLLM-CMT> Intermediate results saved to: dev_results.csv
2025-01-22 19:19:41,718 - INFO - <vLLM-CMT> Model cache directory deleted: /root/.cache/huggingface/hub/
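
The "tensor parallel group already initialized, but of unexpected size" error above suggests that the distributed state set up by the previous LLM instance is never torn down before the next one is created in the same process. A possible workaround sketch (an assumption, not a verified fix: it relies on destroy_model_parallel and destroy_distributed_environment being available in vllm.distributed.parallel_state for 0.6.6.post1, and cleanup_vllm_state is a hypothetical helper name):

import gc
import torch
from vllm.distributed.parallel_state import (destroy_model_parallel,
                                             destroy_distributed_environment)

def cleanup_vllm_state():
    # Tear down the TP/PP process groups left over from the previous LLM so the
    # next engine can initialize with a different tensor_parallel_size.
    destroy_model_parallel()
    destroy_distributed_environment()
    gc.collect()
    torch.cuda.empty_cache()

# In infer_with_model(), after generation completes:
#     del llm               # drop the engine reference first
#     cleanup_vllm_state()  # then destroy the distributed state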

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.