DeepSensor with Pytorch Lightning #43
Hi @nilsleh, thank you for opening this issue and for all the details! Firstly, regarding the plum error, can you post the rest of the stack trace? The error means that multiple dispatch failed to find a matching method signature. Secondly, from a bigger-picture perspective, further thought is needed on whether integrating with PyTorch Lightning is in scope for deepsensor. But at the very least, there should be some documentation on how to get Lightning working. Once you've got your basic implementation working, let's discuss the pros and cons. Note that the code you've provided is specific to the PyTorch backend. Regarding wrapping the model: if the added functionality is useful and doesn't sacrifice too much flexibility, then that's a good argument to add it.
Hi @tom-andersson, thanks for your reply. I have tried to create a minimal reproducible example as a Google Colab notebook here that downloads some data from MeteoNet, just to try and get the "mechanics" to work. Maybe that can inform the discussion. At the moment, I have just implemented the PyTorch training loop that shows the mentioned error. Once that is resolved, I will add the Lightning training loop.
Hi @nilsleh, thank you for the Colab notebook, that's very helpful. I've had a look at the 'standard PyTorch training loop' part and found the cause of the plum error: it comes from the object you are passing into the model. Once you fix this, you'll bump into two other issues with a similar cause.
After all three of the above changes, the code runs. P.S. You should use the same data-processing setup for training and validation.
Thanks a lot for your feedback and for making it work; I updated the Colab notebook accordingly. Looking at the batching of tasks, I am using the `concat_tasks` function. Additionally, I have also implemented the logic for what I think is needed for a Lightning version of the training loop. Another question I have run into is how GPU training is supposed to work "out of the box" with a PyTorch or Lightning training procedure. I found the `set_gpu_default_device` function; however, this hard-codes the default device.
Thanks for the updates @nilsleh. I'm not sure I understand your question. IIUC, we just need to generalise the device-setting utility so it isn't hard-coded to a single device. Under the hood, this is handled by the backend.
In Lightning, normally the only thing one has to do is set the `devices` flag in the `Trainer`. Edit: and at the moment, just calling the training routine on GPU does not work for me.
@tom-andersson In PyTorch Lightning's DDP strategy, every GPU is managed by a separate Python process. This process holds the whole model in memory and computes the forward and backward passes for particular batch elements. After computing the forward and backward passes, the process communicates with all other processes to update and synchronise the model parameters. Importantly, in this Python process, the device is always set to a single GPU: the GPU associated with that process. But this is all very conceptual. In reality, all of this is abstracted away by PyTorch Lightning. So how does it work then? The answer is simple: the models are ordinary PyTorch modules, so PTL can manage their devices like any other model's. Now, this should work, but doesn't quite seem to, because of the forking problem. I think it's worthwhile diving into that. My guess is that somewhere the device is set to CUDA where that shouldn't be done: you'll need to completely hand over control to PTL.
@wesselb @tom-andersson From my perspective, my main aim was to find a PyTorch Lightning setup that could facilitate the training of models via DeepSensor. Single-GPU support, via just setting the `devices` flag in the `Trainer`, would already be a good start.
@nilsleh My impression is that single-GPU and multi-GPU training via PTL DDP should most certainly be possible! I think it's a matter of getting to the bottom of the CUDA multiprocessing error, where perhaps the right approach is to defer all device management 100% to PTL. |
@wesselb @tom-andersson I have updated the Colab notebook with my progress on GPU training so far, and can try to summarise what I have found. Maybe that is helpful. Writing a PyTorch training loop (no deepsensor training utilities) and moving tensors to the device manually works. With the Lightning approach, I observed that when setting the trainer to GPU training, it raises an error. When handling the data device movement in the datamodule by overriding `transfer_batch_to_device`, training runs, but I am not sure this is the intended way. From my understanding, moving the model weights to the device happens inside the `Trainer`.
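For reference, the data-side device movement described above could be sketched like this. The `task_to_device` helper is a hypothetical name (not deepsensor API) that recursively moves a task's tensors; `transfer_batch_to_device` is the real PTL hook one would override:

```python
import torch

def task_to_device(obj, device):
    """Recursively move all tensors in a nested dict/list/tuple to `device`."""
    if torch.is_tensor(obj):
        return obj.to(device)
    if isinstance(obj, dict):
        return {k: task_to_device(v, device) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(task_to_device(v, device) for v in obj)
    return obj  # leave non-tensor leaves (e.g. strings, timestamps) untouched

# Inside a LightningModule or LightningDataModule one would then override:
#
#     def transfer_batch_to_device(self, batch, device, dataloader_idx):
#         return task_to_device(batch, device)
```

This keeps the task structure intact while letting PTL decide which device each batch ends up on.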
Hey @nilsleh, sorry for the delay. Have you made any progress on this? You're more familiar with PTL than me, so I can't be of much help here; it's hard to know specifically what's going wrong under the hood. @wesselb suggested deferring all device management to PTL. In that case, you may want to avoid using `set_gpu_default_device`. Regarding the device handling in the datamodule, I'm not sure what the idiomatic PTL approach is.
Hi Tom, no worries. Yes, I have done some more work on this; however, I won't have time to properly update it until after the vacation.
Hi, thank you for the fantastic work, this is very exciting. I am looking to apply the DeepSensor library to different problems, but I aim to set up my experiments with PyTorch Lightning, because it reduces boilerplate code and offers lots of benefits with respect to code organisation, experiment logging, GPU training, etc. I am aware that DeepSensor aims to support both TensorFlow and PyTorch, and thus Lightning might not have any priority, but I think it would be a great addition to have a "template" for how to use Lightning with DeepSensor, especially for research code. I was hoping I could lay out my current idea for such a template, but I have realised that I need some pointers, for which I'd be very grateful.
For Lightning, you roughly need two puzzle pieces: a `LightningModule` that defines your training and validation steps, specifically how you compute the loss, and a `LightningDataModule` that gives you a PyTorch `DataLoader` for each stage.

Focusing on the latter first, I have taken the `task_loader_tour` to get a better understanding of what the `TaskLoader` can do. However, for training you need to generate a set of tasks first, which is done manually by calling `TaskLoader` repeatedly, and to create a training batch of tasks one uses the `concat_tasks` function, which seems sort of analogous to a `DataLoader` collate function. But there are also additional data-processing steps inside the `ConvNP` module, specifically the `loss_fn()`, before the `neuralprocess` library is used to both conduct the forward pass and compute the loss.

Before moving to Lightning, I was hoping to get a "standard" PyTorch training loop going. To this end, I was wondering whether one could leverage a PyTorch `IterableDataset` with something like this to give you tasks and subsequently have a `DataLoader`:
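A minimal sketch of what I mean, using a hypothetical stand-in task generator in place of deepsensor's `TaskLoader` (the `make_task` function and the task fields are illustrative, not deepsensor API):

```python
import torch
from torch.utils.data import DataLoader, IterableDataset

def make_task(i):
    # Hypothetical stand-in for calling a TaskLoader with a date and a
    # context/target sampling strategy; a real task would hold context and
    # target sets as tensors.
    return {"X_c": torch.rand(10, 2), "Y_c": torch.rand(10, 1), "id": i}

class TaskDataset(IterableDataset):
    """Yields tasks one at a time, generated on the fly."""
    def __init__(self, num_tasks):
        self.num_tasks = num_tasks

    def __iter__(self):
        for i in range(self.num_tasks):
            yield make_task(i)

def collate_tasks(tasks):
    # Stand-in for deepsensor's concat_tasks: here we simply keep the list
    # of tasks, deferring any real batching to the model side.
    return tasks

loader = DataLoader(TaskDataset(num_tasks=8), batch_size=4,
                    collate_fn=collate_tasks)
```

With `num_workers > 0`, `__iter__` would additionally need to shard `range(self.num_tasks)` across workers via `torch.utils.data.get_worker_info()`.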
This is of course assuming that a batch can be formed by the dataloader, and if not, one would have to define a collate function, which I am assuming would look something like `concat_tasks`. Additionally, for `num_workers > 0` one also has to define the distribution of work across workers within the `IterableDataset`.

The idea is then to use a "standard" PyTorch training scheme as follows:
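Roughly this shape, sketched here with a toy stand-in model and loss so the loop is self-contained (in the real setup, the model and loss would come from `ConvNP` / `neuralprocess`; `make_task` is again a hypothetical stand-in for the `TaskLoader`):

```python
import torch
from torch import nn

# Toy stand-in for a ConvNP: maps context observations to a prediction.
model = nn.Linear(2, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def make_task(i):
    # Hypothetical stand-in for a TaskLoader call.
    return {"X_c": torch.rand(10, 2), "Y_c": torch.rand(10, 1)}

losses = []
for epoch in range(2):
    for i in range(4):  # in reality: for batch in dataloader
        task = make_task(i)
        optimizer.zero_grad()
        pred = model(task["X_c"])
        # Stand-in for the loss_fn() computed inside ConvNP / neuralprocess.
        loss = nn.functional.mse_loss(pred, task["Y_c"])
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
```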
However, here I am running into a plum multiple-dispatch error, which I am not sure how to get around.
But if PyTorch training could be conducted in that way, one could then move to Lightning by defining a `DataModule`, which could have more init arguments to support different tasks etc., but would basically look something like this:
And a `LightningModule`, here just shown in a very basic form:
Because then, one can leverage the `Trainer` with all its flexibility to conduct training and evaluation.

Apologies for the long post, and also if all of this is something you have tried or are working on already. However, I would be grateful for any pointers on what you think about this idea. If you find this interesting, I would be happy to provide more details or implement this more formally in a PR, in whatever form you see fit.