
Data and model are not sent to the correct device when multiple devices are being used #575

Open
giacomoguiduzzi opened this issue Feb 24, 2025 · 1 comment
Labels
bug Something isn't working

Comments

@giacomoguiduzzi
Contributor

1. System Info

Hi @WenjieDu,

As you asked in PR #563, I am creating this issue for a bug I noticed when moving data and models to different GPUs. If the device passed to the model is not a torch.device object (e.g., a string such as "cuda:2" or an integer such as 2), the function _send_data_to_given_device() does not behave correctly:

def _send_data_to_given_device(self, data) -> Iterable:
    if isinstance(self.device, torch.device):  # single device
        data = map(lambda x: x.to(self.device), data)
    else:  # parallely training on multiple devices
        # randomly choose one device to balance the workload
        # device = np.random.choice(self.device)

        data = map(lambda x: x.cuda(), data)

    return data

You can see that the first if branch only handles the case where self.device is a torch.device; in every other case the data is moved with .cuda(), which without an index means cuda:0 (the first available device). The data therefore ends up on a different device than the model, e.g. the model on cuda:2 and the data on cuda:0, and training crashes.
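For illustration only, here is a minimal sketch of the kind of normalization I have in mind, written as a standalone helper (the name send_data_to_device and the fallback to the first listed device are my own assumptions, not code from the PyPOTS repository):

from typing import Iterable
import torch

def send_data_to_device(data, device) -> Iterable:
    # accept "cuda:2", 2, torch.device("cuda:2"), or a list/tuple of devices
    if isinstance(device, (str, int)):
        device = torch.device(device)
    if isinstance(device, torch.device):  # single device
        return map(lambda x: x.to(device), data)
    # multiple devices: send everything to the first configured one rather
    # than the implicit cuda:0 that a bare .cuda() call would pick
    return map(lambda x: x.to(device[0]), data)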

Let me know if there is any additional information you might need about this.

2. Information

  • The official example scripts
  • My own created scripts

3. Reproduction

Steps to reproduce the behavior:

  1. On a machine with multiple CUDA devices, create a new model instance (any model) passing device='cuda:1' or any other device that is not cuda:0. Any value that is not a torch.device instance triggers the crash; 0 or 'cuda:0' happens to work only because 0 is the default device.
  2. Train the model on any dataset.
  3. Training crashes with a device mismatch error (a minimal sketch of the failure mode follows these steps).
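The failure mode is an ordinary PyTorch device mismatch and can be reproduced outside PyPOTS; this sketch only illustrates it (the layer and batch sizes are arbitrary):

import torch

# model parameters live on cuda:1, mirroring device='cuda:1' passed to the model
model = torch.nn.Linear(8, 1).to("cuda:1")

# .cuda() without an index moves the batch to cuda:0, the default device
batch = torch.randn(4, 8).cuda()

# raises a RuntimeError about tensors being on different devices (cuda:1 vs cuda:0)
out = model(batch)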

4. Expected behavior

I would expect the model's internal code to handle an integer or string device value, converting it to a torch.device if necessary, so that both the model and the data subsequently passed in for training end up on the same device.
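In other words, converting the user-supplied value once at model initialization would keep model and data aligned. A sketch of that normalization, with setup_device as a hypothetical stand-in for wherever PyPOTS parses the device argument:

import torch

def setup_device(device):
    # hypothetical helper: normalize whatever the user passed into torch.device objects
    if device is None:
        return torch.device("cuda" if torch.cuda.is_available() else "cpu")
    if isinstance(device, (str, int)):
        return torch.device(device)
    if isinstance(device, (list, tuple)):
        return [d if isinstance(d, torch.device) else torch.device(d) for d in device]
    return device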

@WenjieDu
Owner

WenjieDu commented Feb 25, 2025

Thanks, Giacomo. I've merged PR #563. Regarding your commit to compute metrics even with NaN target values, the code is based on old commits (your modifications are in pypots.utils.metric.error, but we have migrated those functions to pypots.nn.functional.error), so they conflict with the code in the main repo and I removed them to make PR #563 mergeable. Could you please pull the latest updates from WenjieDu/PyPOTS:main into your code base, add back your previous modifications to the func _check_inputs(), and open a new PR? Also, please create another Feature Request issue to clarify the purpose ;-)
