RuntimeError: CUDA error: invalid device ordinal #40

roperi · 2023-08-18T11:02:13Z

Docker image: pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel
OS: Ubuntu 20.04.5 LTS
24 GB VRAM
29 GB RAM 4 vCPU

I tried batch sizes 1 and 2, both with and without multi-GPU support. Every combination fails.

$ bash train_sdxl.sh                                                                                                                                                   
[2023-08-18 10:54:39,821] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-18 10:54:44,720] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-18 10:54:44,758] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-18 10:54:44,778] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-18 10:54:44,799] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-18 10:54:44,821] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-18 10:54:44,852] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-18 10:54:44,880] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-18 10:54:44,883] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-18 10:54:44,885] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-18 10:54:44,894] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Traceback (most recent call last):
  File "/workspace/SimpleTuner/train_sdxl.py", line 1191, in <module>
    main()
  File "/workspace/SimpleTuner/train_sdxl.py", line 156, in main
    accelerator = Accelerator(
  File "/workspace/SimpleTuner/.venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 358, in __init__
    self.state = AcceleratorState(
  File "/workspace/SimpleTuner/.venv/lib/python3.10/site-packages/accelerate/state.py", line 720, in __init__
    PartialState(cpu, **kwargs)
  File "/workspace/SimpleTuner/.venv/lib/python3.10/site-packages/accelerate/state.py", line 198, in __init__
    torch.cuda.set_device(self.device)
  File "/workspace/SimpleTuner/.venv/lib/python3.10/site-packages/torch/cuda/__init__.py", line 350, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

[The same traceback is printed, interleaved, by each of the nine failing processes (local ranks 1 to 9).]

Downloading (…)lve/main/config.json: 100%|██████████| 631/631 [00:00<00:00, 3.44MB/s]
Downloading (…)ch_model.safetensors:  85%|████████▌ | 283M/335M [00:03<00:00, 86.9MB/s]
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 12254 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 12255) of binary: /workspace/SimpleTuner/.venv/bin/python
Traceback (most recent call last):
  File "/workspace/SimpleTuner/.venv/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/workspace/SimpleTuner/.venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/workspace/SimpleTuner/.venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 970, in launch_command
    multi_gpu_launcher(args)
  File "/workspace/SimpleTuner/.venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 646, in multi_gpu_launcher
    distrib_run.run(args)
  File "/workspace/SimpleTuner/.venv/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/workspace/SimpleTuner/.venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/workspace/SimpleTuner/.venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
train_sdxl.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2023-08-18_10:54:51
  host      : ee48d4679c9e
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 12256)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2023-08-18_10:54:51
  host      : ee48d4679c9e
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 12257)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2023-08-18_10:54:51
  host      : ee48d4679c9e
  rank      : 4 (local_rank: 4)
  exitcode  : 1 (pid: 12258)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
  time      : 2023-08-18_10:54:51
  host      : ee48d4679c9e
  rank      : 5 (local_rank: 5)
  exitcode  : 1 (pid: 12259)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
  time      : 2023-08-18_10:54:51
  host      : ee48d4679c9e
  rank      : 6 (local_rank: 6)
  exitcode  : 1 (pid: 12260)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[6]:
  time      : 2023-08-18_10:54:51
  host      : ee48d4679c9e
  rank      : 7 (local_rank: 7)
  exitcode  : 1 (pid: 12261)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[7]:
  time      : 2023-08-18_10:54:51
  host      : ee48d4679c9e
  rank      : 8 (local_rank: 8)
  exitcode  : 1 (pid: 12262)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[8]:
  time      : 2023-08-18_10:54:51
  host      : ee48d4679c9e
  rank      : 9 (local_rank: 9)
  exitcode  : 1 (pid: 12263)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-08-18_10:54:51
  host      : ee48d4679c9e
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 12255)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

roperi · 2023-08-18T11:32:39Z

Contents of sdxl-env.sh:

# Reproducible training.
export TRAINING_SEED=555

# Restart where we left off. Change this to "checkpoint-1234" to start from a specific checkpoint.
export RESUME_CHECKPOINT="latest"

# How often to checkpoint. Depending on your learning rate, you may wish to change this.

# For the default settings with 10 gradient accumulations, more frequent checkpoints might be preferable at first.
export CHECKPOINTING_STEPS=150
# This is how many checkpoints we will keep. Two is safe, but three is safer.
export CHECKPOINTING_LIMIT=2

export LEARNING_RATE=1e-6 #@param {type:"number"}

# Configure these values.
# Using a Huggingface Hub model:
export MODEL_NAME="stabilityai/stable-diffusion-xl-base-1.0"
# Using a local path to a huggingface hub model or saved checkpoint:
#export MODEL_NAME="/datasets/models/pipeline"

export TRACKER_RUN_NAME="simpletuner-sdxl"

# Use this to append an instance prompt to each caption, used for adding trigger words.
# This has not been tested in SDXL.
#export INSTANCE_PROMPT="lotr style "
# This will be used for WandB uploads.
export VALIDATION_PROMPT="close up photo of xxxxx"
# How frequently we will save and run a pipeline for validations.
export VALIDATION_STEPS=100

# Location of training data.
export BASE_DIR="/workspace/SimpleTuner/"
export INSTANCE_DIR="${BASE_DIR}/input"
export OUTPUT_DIR="${BASE_DIR}/models"

# Some data that we generate will be cached here.
export STATE_PATH="${BASE_DIR}/training_state.json"
# Store whether we've seen an image or not, to prevent repeats.
export SEEN_STATE_PATH="${BASE_DIR}/training_images_seen.json"

# Max number of steps OR epochs can be used. But we default to Epochs.
export MAX_NUM_STEPS=30000
# Will likely overtrain, but that's fine.
export NUM_EPOCHS=25

# Use any standard scheduler type.
export LR_SCHEDULE="constant"
# Whether this is used, depends on whether you have epochs or num_steps in use.
export LR_WARMUP_STEPS=$((MAX_NUM_STEPS / 10))
# Adjust this for your GPU memory size.
export TRAIN_BATCH_SIZE=2

# Validation image settings.
VALIDATION_GUIDANCE=7.5
VALIDATION_GUIDANCE_RESCALE=0.0


# Leave these alone unless you know what you are doing.
export RESOLUTION=1024
export GRADIENT_ACCUMULATION_STEPS=4          # Yes, it slows training down. No, you don't want to change this.

# SDXL text encoder training is not currently tested.
#export TEXT_ENCODER_LIMIT=101                # Train the text encoder for % of the process. Buggy.
#export TEXT_ENCODER_FREEZE_STRATEGY='before' # before, after, between.
#export TEXT_ENCODER_FREEZE_BEFORE=22         # Ignored when using 'after' strategy.
#export TEXT_ENCODER_FREEZE_AFTER=24          # Ignored when using 'before' strategy.

# Caption dropout probability. Set to 0.1 for 10% of captions dropped out. Set to 0 to disable.
export CAPTION_DROPOUT_PROBABILITY=0.1

# Mixed precision is the best. You honestly might need to YOLO it in fp16 mode for Google Colab type setups.
export MIXED_PRECISION="bf16"                # Might not be supported on all GPUs. fp32 will be needed for others.

# With Pytorch 2.1, you might have pretty good luck here.
# If you're using aspect bucketing however, each resolution change will recompile.
export TRAINING_DYNAMO_BACKEND='no'          # or 'inductor' if you want to brave PyTorch 2 compile issues

# This has to be changed if you're training with multiple GPUs.
export TRAINING_NUM_PROCESSES=10
export TRAINING_NUM_MACHINES=1

# These should remain empty if you remove their options.
export ACCELERATE_EXTRA_ARGS="--multi_gpu"                          # --multi_gpu or other similar flags for huggingface accelerate
export DEBUG_EXTRA_ARGS="--print_filenames --report_to=wandb"     # Removing print_filenames can ease on spam.
export TRAINER_EXTRA_ARGS="--allow_tf32 --use_8bit_adam --use_ema"  # anything you want to pass along extra to the actual train_sdxl.py script.

# These are pretty sketchy to change. --use_original_images can be removed to enable image cropping. Not tested for SDXL.
export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --enable_xformers_memory_efficient_attention --use_original_images=true"
export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --gradient_checkpointing --gradient_accumulation_steps=${GRADIENT_ACCUMULATION_STEPS}"

## For offset noise training:
#export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --offset_noise --noise_offset=0.02"

## For noise input perturbation - adds extra noise, randomly. This is separate from offset noise:
#export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --input_pertubation=0.01"

## For terminal SNR training:
#export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --prediction_type=v_prediction --rescale_betas_zero_snr"
#export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --training_scheduler_timestep_spacing=leading --inference_scheduler_timestep_spacing=trailing"

## For experimental min-SNR weighted loss training (5 is suggested value by the original researchers):
#export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --snr_gamma=5.0"

roperi · 2023-08-18T11:50:53Z

Tried with batch size 1 and 2, and with and without multi-gpu support. They all fail.

roperi · 2023-08-18T12:20:07Z

>>> import torch
>>> torch.__version__
'2.0.1+cu117'
>>> torch.cuda.is_available()
True
>>> torch.cuda.device_count()
1
>>> torch.cuda.get_device_name(0)
'NVIDIA GeForce RTX 4090'
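
One way to read these numbers, as a rough sketch: the traceback shows each launched process calling torch.cuda.set_device on its own device, and the failure summary lists local ranks 1 through 9. If each process binds to the device whose index equals its local rank (an assumption about the launcher, not something confirmed in this thread), then every rank at or above torch.cuda.device_count() has no GPU to bind to. The helper name invalid_local_ranks is hypothetical, and the code is plain Python that does not touch CUDA.

```python
def invalid_local_ranks(num_processes: int, device_count: int) -> list[int]:
    """Return the local ranks that would have no GPU to bind to.

    Assumes one process per GPU and device index == local rank,
    which is a simplification of what the launcher does.
    """
    if num_processes < 1 or device_count < 0:
        raise ValueError("num_processes must be >= 1 and device_count >= 0")
    return [rank for rank in range(num_processes) if rank >= device_count]


# TRAINING_NUM_PROCESSES=10 on a machine where device_count() == 1
print(invalid_local_ranks(10, 1))  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
# A single process on the same machine
print(invalid_local_ranks(1, 1))   # []
```

Under that assumption, only a process count no larger than the device count leaves the list empty.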

Maybe it's the same problem as ProGamerGov/neural-style-pt#70?

...If the devices exist and CUDA works, then it's probably just an issue with the ID you are using....
...You fix the GPU device order by CUDA_DEVICE_ORDER=PCI_BUS_ID before the command:...
... You can also use CUDA_VISIBLE_DEVICES before the command to make sure that PyTorch can only see the specified device..

If so, how do I make sure PyTorch sees the specified device? By setting either of those two variables before accelerate launch in train_sdxl.sh?

EDIT: OK, I tried that, but it didn't work. Same invalid device ordinal error.

roperi · 2023-08-18T12:29:17Z

OK, I solved it by running export CUDA_VISIBLE_DEVICES=1 before bash train_sdxl.sh.

bghira · 2023-08-18T12:42:40Z

you might have missed the --multi_gpu or num_processes=2

roperi · 2023-08-18T12:52:00Z

you might have missed the --multi_gpu or num_processes=2

I'm just using 1 GPU, so I commented out the export ACCELERATE_EXTRA_ARGS="--multi_gpu" line in sdxl-env.sh. Should I uncomment it then? It seems so. :/

I think my problem now is how to configure the following lines, given that I'm using a single 24 GB RTX 4090.

# This has to be changed if you're training with multiple GPUs.
export TRAINING_NUM_PROCESSES=2
export TRAINING_NUM_MACHINES=1

# These should remain empty if you remove their options.
export ACCELERATE_EXTRA_ARGS="--multi_gpu"                          # --multi_gpu or other similar flags for huggingface accelerate
export TRAINER_EXTRA_ARGS="--allow_tf32 --use_8bit_adam --use_ema"  # anything you want to pass along extra to the actual train_sdxl.py script.

# These are pretty sketchy to change. --use_original_images can be removed to enable image cropping. Not tested for SDXL.
export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --enable_xformers_memory_efficient_attention --use_original_images=true"
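
As a rough guide for the lines above, and assuming train_sdxl.sh forwards TRAINING_NUM_PROCESSES and ACCELERATE_EXTRA_ARGS to accelerate launch (the script itself is not shown in this thread), the process count would follow the number of visible GPUs, and --multi_gpu would only be set when there is more than one. The helper name launch_settings is hypothetical, and a single machine is assumed.

```python
def launch_settings(gpu_count: int) -> dict[str, str]:
    """Suggest sdxl-env.sh values for a given number of visible GPUs."""
    if gpu_count < 1:
        raise ValueError("at least one visible GPU is required")
    return {
        # One training process per GPU.
        "TRAINING_NUM_PROCESSES": str(gpu_count),
        # Assumption: everything runs on a single machine.
        "TRAINING_NUM_MACHINES": "1",
        # Left empty for a single GPU, following the comment in sdxl-env.sh.
        "ACCELERATE_EXTRA_ARGS": "--multi_gpu" if gpu_count > 1 else "",
    }


# Print the export lines for a single-GPU machine.
for name, value in launch_settings(1).items():
    print(f'export {name}="{value}"')
```

For one GPU this yields a process count of 1 and an empty ACCELERATE_EXTRA_ARGS.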

roperi · 2023-08-18T12:59:22Z

I uncommented the `export ACCELERATE_EXTRA_ARGS="--multi_gpu"` line and now I get this error:

Enabling xformers memory-efficient attention.                                                                                                                                            
Traceback (most recent call last):                                                                                                                                                                                               
  File "/workspace/SimpleTuner/train_sdxl.py", line 1191, in <module>                                                                                                                                                            
    main()                                                                                                                                                                                                                       
  File "/workspace/SimpleTuner/train_sdxl.py", line 237, in main                                                                                                                                                                 
    unet.enable_xformers_memory_efficient_attention()                                                                                                                                                                            
  File "/workspace/SimpleTuner/.venv/lib/python3.10/site-packages/diffusers/models/modeling_utils.py", line 263, in enable_xformers_memory_efficient_attention                                                                   
    self.set_use_memory_efficient_attention_xformers(True, attention_op)                                                                                                                                                         
  File "/workspace/SimpleTuner/.venv/lib/python3.10/site-packages/diffusers/models/modeling_utils.py", line 227, in set_use_memory_efficient_attention_xformers                                                                  
    fn_recursive_set_mem_eff(module)                                                                                                                                                                                             
  File "/workspace/SimpleTuner/.venv/lib/python3.10/site-packages/diffusers/models/modeling_utils.py", line 223, in fn_recursive_set_mem_eff                                                                                     
    fn_recursive_set_mem_eff(child)                                                                                                                                                                                              
  File "/workspace/SimpleTuner/.venv/lib/python3.10/site-packages/diffusers/models/modeling_utils.py", line 223, in fn_recursive_set_mem_eff                                                                                     
    fn_recursive_set_mem_eff(child)                                                                                                                                                                                              
  File "/workspace/SimpleTuner/.venv/lib/python3.10/site-packages/diffusers/models/modeling_utils.py", line 223, in fn_recursive_set_mem_eff                                                                                     
    fn_recursive_set_mem_eff(child)                                                                                                                                                                                              
  File "/workspace/SimpleTuner/.venv/lib/python3.10/site-packages/diffusers/models/modeling_utils.py", line 220, in fn_recursive_set_mem_eff                                                                                     
    module.set_use_memory_efficient_attention_xformers(valid, attention_op)                                                                                                                                                      
  File "/workspace/SimpleTuner/.venv/lib/python3.10/site-packages/diffusers/models/modeling_utils.py", line 227, in set_use_memory_efficient_attention_xformers                                                                  
    fn_recursive_set_mem_eff(module)                                                                                                                                                                                             
  File "/workspace/SimpleTuner/.venv/lib/python3.10/site-packages/diffusers/models/modeling_utils.py", line 223, in fn_recursive_set_mem_eff                                                                                     
    fn_recursive_set_mem_eff(child)                                                                                                                                                                                              
  File "/workspace/SimpleTuner/.venv/lib/python3.10/site-packages/diffusers/models/modeling_utils.py", line 223, in fn_recursive_set_mem_eff                                                                                     
    fn_recursive_set_mem_eff(child)                                                                                                                                                                                              
  File "/workspace/SimpleTuner/.venv/lib/python3.10/site-packages/diffusers/models/modeling_utils.py", line 220, in fn_recursive_set_mem_eff                                                                                     
    module.set_use_memory_efficient_attention_xformers(valid, attention_op)                                                                                                                                                      
  File "/workspace/SimpleTuner/.venv/lib/python3.10/site-packages/diffusers/models/attention_processor.py", line 201, in set_use_memory_efficient_attention_xformers                                                             
    raise ValueError(                                                                                                                                                                                                            
ValueError: torch.cuda.is_available() should be True but is False. xformers' memory efficient attention is only available for GPU

I'm pretty sure torch.cuda.is_available() was returning True before (see above). How did it go missing? 😵‍💫
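
A possible explanation, sketched under assumptions: CUDA_VISIBLE_DEVICES lists the physical device indices a process is allowed to see, and according to the CUDA documentation an invalid index hides itself and every entry listed after it. On a machine whose only GPU is index 0, a value of 1 would therefore leave no visible device. If the variable was still exported in the shell that ran the training script, that would be consistent with torch.cuda.is_available() being False there while an interpreter started without the variable reports True. The function below is a simplified, hypothetical model (integer indices only, no GPU UUIDs) and does not call CUDA.

```python
from typing import Optional


def visible_devices(env_value: Optional[str], physical_count: int) -> list[int]:
    """Model which physical GPU indices remain visible to a process.

    env_value is the content of CUDA_VISIBLE_DEVICES (None when unset).
    Simplified: handles integer indices only, not GPU UUIDs.
    """
    if env_value is None:
        # Variable unset: every physical device is visible.
        return list(range(physical_count))
    visible = []
    for token in env_value.split(","):
        token = token.strip()
        if not token.isdigit() or int(token) >= physical_count:
            # An invalid entry hides itself and everything after it.
            break
        visible.append(int(token))
    return visible


print(visible_devices(None, 1))  # variable unset -> [0]
print(visible_devices("0", 1))   # the only GPU stays visible -> [0]
print(visible_devices("1", 1))   # index 1 does not exist here -> []
```

In this model, CUDA_VISIBLE_DEVICES=0 keeps the single GPU visible, while CUDA_VISIBLE_DEVICES=1 hides it.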

roperi · 2023-08-18T14:52:08Z

Anyway, I started afresh. This time I commented out the following line:

#export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --enable_xformers_memory_efficient_attention --use_original_images=true"

...and this happens:

export CUDA_VISIBLE_DEVICES=0
bash train_sdxl.sh                                                                                                                                                             
[2023-08-18 14:47:04,068] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)                                                                                                          
[2023-08-18 14:47:18,861] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)                                                                                                          
[2023-08-18 14:47:18,945] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)                                                                                                          
Traceback (most recent call last):                                                                                                                                                                                               
  File "/workspace/SimpleTuner/train_sdxl.py", line 1191, in <module>                                                                                                                                                            
    main()                                                                                                                                                                                                                       
  File "/workspace/SimpleTuner/train_sdxl.py", line 156, in main                                                                                                                                                                 
    accelerator = Accelerator(                                                                                                                                                                                                   
  File "/workspace/SimpleTuner/.venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 358, in __init__                                                                                                              
    self.state = AcceleratorState(                                                                                                                                                                                               
  File "/workspace/SimpleTuner/.venv/lib/python3.10/site-packages/accelerate/state.py", line 720, in __init__                                                                                                                    
    PartialState(cpu, **kwargs)                                                                                                                                                                                                  
  File "/workspace/SimpleTuner/.venv/lib/python3.10/site-packages/accelerate/state.py", line 198, in __init__                                                                                                                    
    torch.cuda.set_device(self.device)                                                                                                                                                                                           
  File "/workspace/SimpleTuner/.venv/lib/python3.10/site-packages/torch/cuda/__init__.py", line 350, in set_device                                                                                                               
    torch._C._cuda_setDevice(device)                                                                                                                                                                                             
RuntimeError: CUDA error: invalid device ordinal

Immediately after the crash, I opened the Python interpreter (within the venv) and got:

>>> import torch

>>> torch.__version__
'2.0.1+cu117'
>>> torch.cuda.is_available()
True
>>> torch.cuda.device_count()
1
>>> torch.cuda.get_device_name(0)
'NVIDIA GeForce RTX 4090'

roperi · 2023-08-18T15:38:59Z

I think the root cause was num_processes: I was trying to run on a 1x GPU instance with this config:

export TRAINING_NUM_PROCESSES=2
export TRAINING_NUM_MACHINES=1

...so it was bound to fail. The thing is, I tried changing it to TRAINING_NUM_PROCESSES=1, but that failed with an error saying I needed a value of at least 2 (I didn't save that log, unfortunately). That is why I went back to num_processes=2.

Anyway, I solved the invalid device ordinal error just by choosing a 2x GPU 4090 24G instance instead, but then it failed with an out-of-memory error (23 GB already allocated, and it tried to allocate another 20 GB). I will try again with a >40GB memory device.
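
For anyone landing here with the same traceback: as far as I can tell, accelerate binds each launched process to the CUDA device whose index matches its local rank, so with TRAINING_NUM_PROCESSES=2 on a single-GPU machine the second process asks for device 1, which does not exist. Below is a minimal pre-flight sketch of that check, not something SimpleTuner ships. The helper names `visible_gpu_count` and `ranks_without_gpu` are hypothetical; only `torch.cuda.is_available()` and `torch.cuda.device_count()` are real PyTorch calls, and the one-GPU-per-rank mapping is an assumption about the default launch setup.

```python
import os


def visible_gpu_count() -> int:
    """Return how many CUDA devices PyTorch can see (0 if torch is missing)."""
    try:
        import torch
    except ImportError:
        return 0
    return torch.cuda.device_count() if torch.cuda.is_available() else 0


def ranks_without_gpu(num_processes: int, num_gpus: int) -> list:
    """List the local ranks that would have no matching device index.

    Assumes one GPU per process, with rank N bound to device N.
    """
    return [rank for rank in range(num_processes) if rank >= num_gpus]


if __name__ == "__main__":
    # TRAINING_NUM_PROCESSES is the variable from the env file quoted above.
    requested = int(os.environ.get("TRAINING_NUM_PROCESSES", "1"))
    available = visible_gpu_count()
    orphaned = ranks_without_gpu(requested, available)
    if orphaned:
        print(
            f"{requested} process(es) requested but {available} GPU(s) visible; "
            f"ranks {orphaned} would likely hit 'invalid device ordinal'."
        )
    else:
        print(f"{requested} process(es) fit on {available} visible GPU(s).")
```

Running this inside the venv before `bash train_sdxl.sh` should flag the mismatch up front instead of partway through the Accelerator setup.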

bghira mentioned this issue Aug 18, 2023

Default launch script settings enable MultiGPU #43

Closed

bghira added the bug and documentation labels Aug 18, 2023

roperi closed this as completed Aug 18, 2023
