
Data and model are not sent to the correct device when multiple devices are being used #575

Open
giacomoguiduzzi opened this issue Feb 24, 2025 · 1 comment
Labels
bug Something isn't working

Comments

@giacomoguiduzzi
Contributor

1. System Info

Hi @WenjieDu,

As you asked in PR #563, I am creating this issue for a bug I noticed when moving data and models to different GPUs. If the device passed to the model is not a torch.device object (e.g., a string such as "cuda:2" or an integer such as 2), the function _send_data_to_given_device() does not behave correctly:

def _send_data_to_given_device(self, data) -> Iterable:
    if isinstance(self.device, torch.device):  # single device
        data = map(lambda x: x.to(self.device), data)
    else:  # parallely training on multiple devices
        # randomly choose one device to balance the workload
        # device = np.random.choice(self.device)

        data = map(lambda x: x.cuda(), data)

    return data

You can see that the first if branch only handles the case where self.device is a torch.device; in every other case the data is moved with .cuda(), which without an index means cuda:0 (the first available device). The data therefore ends up on a different device than the model, e.g. the model on cuda:2 and the data on cuda:0, and training crashes.
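For illustration only, here is a minimal sketch of the kind of normalization I have in mind, written as a standalone helper (the name send_data_to_device and the fallback to the first listed device are my own assumptions, not code from the PyPOTS repository):

from typing import Iterable
import torch

def send_data_to_device(data, device) -> Iterable:
    # accept "cuda:2", 2, torch.device("cuda:2"), or a list/tuple of devices
    if isinstance(device, (str, int)):
        device = torch.device(device)
    if isinstance(device, torch.device):  # single device
        return map(lambda x: x.to(device), data)
    # multiple devices: send everything to the first configured one rather
    # than the implicit cuda:0 that a bare .cuda() call would pick
    return map(lambda x: x.to(device[0]), data)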

Let me know if there is any additional information you might need about this.

2. Information

  • The official example scripts
  • My own created scripts

3. Reproduction

Steps to reproduce the behavior:

  1. On a machine with multiple CUDA devices, create a new model instance (any model) passing device='cuda:1' or any other device that is not cuda:0. Any value that is not a torch.device instance triggers the crash; 0 or 'cuda:0' happens to work only because 0 is the default device.
  2. Train the model on any dataset.
  3. Training crashes with a device mismatch error (a minimal sketch of the failure mode follows these steps).
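The failure mode is an ordinary PyTorch device mismatch and can be reproduced outside PyPOTS; this sketch only illustrates it (the layer and batch sizes are arbitrary):

import torch

# model parameters live on cuda:1, mirroring device='cuda:1' passed to the model
model = torch.nn.Linear(8, 1).to("cuda:1")

# .cuda() without an index moves the batch to cuda:0, the default device
batch = torch.randn(4, 8).cuda()

# raises a RuntimeError about tensors being on different devices (cuda:1 vs cuda:0)
out = model(batch)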

4. Expected behavior

I would expect the model's internal code to handle an integer or string device value, converting it to a torch.device if necessary, so that both the model and the data subsequently passed in for training end up on the same device.
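In other words, converting the user-supplied value once at model initialization would keep model and data aligned. A sketch of that normalization, with setup_device as a hypothetical stand-in for wherever PyPOTS parses the device argument:

import torch

def setup_device(device):
    # hypothetical helper: normalize whatever the user passed into torch.device objects
    if device is None:
        return torch.device("cuda" if torch.cuda.is_available() else "cpu")
    if isinstance(device, (str, int)):
        return torch.device(device)
    if isinstance(device, (list, tuple)):
        return [d if isinstance(d, torch.device) else torch.device(d) for d in device]
    return device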

@WenjieDu
Owner

WenjieDu commented Feb 25, 2025

Thanks, Giacomo. I've merged PR #563. Regarding your commit to compute metrics even with NaN target values, the code is based on old commits (your modifications are in pypots.utils.metric.error, but we have migrated those functions to pypots.nn.functional.error), so they conflict with the code in the main repo and I removed them to make PR #563 mergeable. Could you please pull the latest updates from WenjieDu/PyPOTS:main into your code base, add back your previous modifications to the func _check_inputs(), and open a new PR? Also, please create another Feature Request issue to clarify the purpose ;-)
