Eval error: CUDA_ERROR_OUT_OF_MEMORY #8
Comments
Hi Scott, Thanks for reporting this bug. I've run into some memory issues when saving outputs as well (with the LSTM, which, like UniRep, is a recurrent model). We'll look into it. I just wanted to note that it looks like you're running out of system RAM (not GPU RAM), which I believe on Google Colab is 26 GB.
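(If it helps to confirm which pool is actually filling up, a quick check along these lines should work on Colab; psutil comes preinstalled there, and this is just a diagnostic, not part of tape:)

```python
import subprocess
import psutil

# System RAM: if this climbs toward 100% during eval, host memory is the culprit.
print(f"RAM used: {psutil.virtual_memory().percent:.1f}%")

# GPU RAM: ask the driver directly.
print(subprocess.run(
    ["nvidia-smi", "--query-gpu=memory.used,memory.total", "--format=csv"],
    capture_output=True, text=True).stdout)
```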
Hi, I switched to using the Transformer to see if the memory issue is isolated to the UniRep model, but unfortunately I got the same error, although it made it much farther: 795 iterations. [I'm not sure what the batch size is, so I don't know how much of the ~27k examples in the test set it got through.]
Thanks for the quick responses and looking into this. Scott
So this is an issue because we're really holding every batch's outputs in memory until the end of the epoch. What we should do is change save_outputs to write results to disk incrementally as they come in.
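Roughly, the incremental-write idea would look something like this; append_batch, the key names, and the loop are all illustrative sketches, not the actual tape internals:

```python
import h5py
import numpy as np

def append_batch(h5file, name, batch):
    """Append one batch of outputs to a resizable dataset on disk,
    rather than accumulating every batch in host memory."""
    batch = np.asarray(batch)
    if name not in h5file:
        # First batch: create a dataset that can grow along axis 0.
        h5file.create_dataset(
            name, data=batch, maxshape=(None,) + batch.shape[1:], chunks=True)
    else:
        ds = h5file[name]
        ds.resize(ds.shape[0] + batch.shape[0], axis=0)
        ds[-batch.shape[0]:] = batch

# Demo with dummy batches standing in for real model outputs: each batch
# goes straight to disk, so host memory stays flat regardless of test-set size.
dummy_batches = [{'predictions': np.random.rand(8, 1)} for _ in range(3)]
with h5py.File('outputs.h5', 'w') as f:
    for batch_outputs in dummy_batches:
        for key, value in batch_outputs.items():
            append_batch(f, key, value)
```

One wrinkle: variable-length per-position outputs wouldn't stack like this, so they'd need padding or vlen datasets; the sketch only covers the fixed-shape case.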
When writing the paper I just added a hack to delete the keys of the output that we didn't need, which is why we were able to actually get results. We need a more robust solution if we're going to expose this feature to other people.
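The hack amounted to something like this (the key names below are made up; the real ones depend on which model's outputs you're saving):

```python
# Drop the large per-position tensors from each batch's output dict
# before they accumulate across the epoch. Key names are hypothetical.
for key in ('encoder_output', 'attention_weights'):
    outputs.pop(key, None)  # default of None avoids a KeyError
```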
Would it be possible/make sense to be able to pass an argument such as save_outputs=False? I'm not sure how disabling the output saving would affect the rest of the evaluation, but the above might be a quick fix in terms of getting it running (albeit without saving the outputs) until your better solution is implemented. Scott
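i.e., something like the call below, assuming run_epoch treats a falsy save_outputs as "don't save" (I haven't checked the internals, so this is just a sketch):

```python
# Hypothetical quick fix: disable saving so nothing accumulates in RAM.
test_metrics = test_graph.run_epoch(save_outputs=None)

# Or, guarded inside run_epoch itself (sketch):
#     if save_outputs:
#         collected_outputs.append(batch_outputs)
```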
@CaptainCapsaicin I'm not sure what the progress on the h5py option is. It's tricky because we don't want to write a non-general solution into the core evaluation code. As for writing a quick fix, we could probably do something like that.
I have trained the fluorescence task model with:
!tape with model=unirep tasks=fluorescence load_from='pretrained_models/unirep_weights.h5' freeze_embedding_weights=True steps_per_epoch=100 datafile='data/fluorescence/fluorescence_train.tfrecords'
[Note: I used a very small steps_per_epoch so it would train in a reasonable time, just to get something working.]
Next I tried to evaluate the model using:
!tape-eval results/fluorescence_unirep_2019-07-30--17-22-15/ --datafile data/fluorescence/fluorescence_test.tfrecord
but after only a few iterations the GPU memory use just explodes.
I'm running on a Tesla T4 with 14GB of memory (Google Colab).
The memory explosion would appear to be in:
test_metrics = test_graph.run_epoch(save_outputs=outfile)
Any suggestions on how to resolve?
Thanks.
Scott