Detecting RAM killed Dorado runs #320
Thank you for reaching out with this detailed information about the issue you're facing with Dorado. Gradually increasing memory utilization during methylation calling is a known issue. We've received similar reports, and our team is actively investigating it. As for detecting whether a Dorado run was killed due to running out of memory, this is dependent on your operating system and cluster manager. Here are a couple of specific approaches you might consider:
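One generic OS-level check on Linux (a sketch of a common approach, not necessarily what was suggested in the original reply) is to look for OOM-killer messages in the kernel log after a job dies. The log line matched below is the standard Linux OOM-killer format:

```shell
#!/bin/sh
# Sketch: scan kernel-log text for evidence that a named process was
# OOM-killed. Matches the usual Linux OOM-killer message, e.g.
#   "Out of memory: Killed process 1234 (dorado) total-vm:..."
was_oom_killed() {
    # $1 = process name; kernel-log text is read from stdin
    grep -i "killed process" | grep -q "($1)"
}

# Typical usage after a failed run (reading the ring buffer may need root):
# dmesg | was_oom_killed dorado && echo "dorado was OOM-killed"
```

The exact message wording can vary between kernel versions, so treat the pattern as a starting point rather than a guarantee.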
However, addressing the root cause might be more effective. Running multiple smaller Dorado jobs on subsets of the POD5s might be a strategic approach. This would not only align better with typical cluster management practices (where long-running jobs can be problematic) but also provide more resilience and manageability. Here's how you could proceed:
This approach might require adjustments to your current pipeline, but it will be much more resilient. Best wishes,
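A rough sketch of that split-into-smaller-jobs strategy (the model name, paths, modified-bases option, and per-file granularity are illustrative assumptions, not taken from this thread):

```shell
#!/bin/sh
# Sketch: one dorado job per POD5 file instead of one long run over
# everything. Placeholders throughout; tune the granularity (per file,
# per N files) to your RAM budget.
MODEL=dna_r10.4.1_e8.2_400bps_hac@v4.1.0   # illustrative model name

run_batches() {
    mkdir -p out
    for f in pod5s/*.pod5; do
        base=$(basename "$f" .pod5)
        # Skip subsets that already produced output, so a crashed batch
        # of jobs can simply be re-launched from the top.
        [ -s "out/${base}.bam" ] && continue
        dorado basecaller "$MODEL" "$f" --modified-bases 5mC > "out/${base}.bam"
    done
}

# Usage, then merge the per-subset BAMs for downstream tools:
# run_batches && samtools merge -f merged.bam out/*.bam
```

The skip-if-output-exists check is what makes re-launching after a crash cheap: only the unfinished subsets are redone.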
Another method could be to check the exit code of dorado. When dorado completes successfully, the exit code will be 0. If it errors out for any reason (being killed by the OS due to OOM gives code 137 on Linux), the exit code will be non-zero, so your script could check for a non-zero exit code. This of course is not an ideal solution. As Mike mentioned, we're looking into addressing some memory issues. However, 60GB is not a lot of memory, so it's possible the peak memory is just hitting that number. Are you running single or multi GPU basecalling? Is 60GB the maximum memory your job can request?
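A sketch of that exit-code check (137 = 128 + SIGKILL is the standard Linux convention; the driver commands, paths, and the `--resume-from` flag are illustrative and should be checked against your dorado version):

```shell
#!/bin/sh
# Sketch: classify dorado's exit code and decide whether to resume.
#   0             -> run completed
#   137 (128+9)   -> killed with SIGKILL, typically the Linux OOM killer
#   anything else -> some other failure
classify_exit() {
    case "$1" in
        0)   echo ok ;;
        137) echo resume ;;
        *)   echo fail ;;
    esac
}

# Illustrative driver (placeholders, not from this thread):
# dorado basecaller "$MODEL" pod5s/ > calls.bam
# case "$(classify_exit $?)" in
#     resume) dorado basecaller "$MODEL" pod5s/ --resume-from calls.bam > calls2.bam ;;
#     fail)   echo "dorado failed for another reason" >&2; exit 1 ;;
# esac
```

Note that 137 only indicates SIGKILL; confirming it was the OOM killer specifically still requires checking the kernel log or your scheduler's records.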
Are you also running alignment during basecalling?
Thanks a ton all! A lot of good thoughts here. We definitely understand our current cluster config is NOT memory optimized, and was built for previous pipelines that were not RAM intensive. That said, we have recently ordered more RAM and will be upgrading the nodes to ~84GB RAM each, which, while not a huge increase, will hopefully alleviate some of these issues. We are running single-GPU basecalling, and doing this with mapping to a reference as well, since our end goal is a bedMethyl file mapped to a genome, created using modbamtobed. As far as short-term options, we are not using SLURM, so that won't do the trick, but we do think that looking up the exit code might be the easiest solution. If this doesn't work we will divide the data and proceed that way! Either way, it would be great if there was some way that Dorado could work on smaller-RAM machines, or at least give a warning or standard output that allows the user to detect a failed run in an automated way. Best,
We have seen that running alignment during basecalling increases the memory footprint. We will check if the memory during alignment can be capped, but it may take us a while to get to it. If you have some spare cycles, it would be useful to check what the memory footprint is without alignment. It may make sense to do alignment as a downstream step in your pipeline using minimap2 to make dorado runs more stable.
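A sketch of that split pipeline (paths and the preceding dorado call are placeholders; the key detail is carrying the MM/ML base-modification tags through alignment, via `samtools fastq -T` and minimap2's `-y`):

```shell
#!/bin/sh
# Sketch: basecall without a reference, then align as a separate step so
# dorado's own memory use stays lower. Assumes samtools and minimap2 are
# on PATH and ref.fa is your reference.
align_modbam() {
    # $1 = unaligned modBAM from dorado, $2 = reference FASTA, $3 = output BAM
    # -T MM,ML copies the modification tags into the FASTQ header comment;
    # -y writes those comments back out as SAM tags after alignment.
    samtools fastq -T MM,ML "$1" \
        | minimap2 -ax map-ont -y "$2" - \
        | samtools sort -o "$3"
    samtools index "$3"
}

# Usage, after something like:
#   dorado basecaller "$MODEL" pod5s/ --modified-bases 5mC > unaligned.bam
# align_modbam unaligned.bam ref.fa aligned.bam
```

Without `-T MM,ML` and `-y`, the methylation calls are silently dropped during FASTQ conversion, which would make the downstream bedMethyl step impossible.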
We have recently been trying out running Dorado as part of an automated pipeline to do methylation calling using the all-cytosine model on some recent PromethION runs. We have noticed that on our current cluster configuration, with only ~60GB of RAM, our Dorado runs consistently fail after running for about 36 hours. We have tracked this, and it is clear that Dorado gets killed due to it running out of RAM. I assume this is related to the way that Dorado holds the large .bam or intermediate files in working memory.
Luckily, we have been able to get past this by using the new resume function (thanks a ton for this feature!), but we have been unable to come up with a simple way to detect if a dorado run failed and needs to be re-run using --resume. We were hoping that it would be possible to have the fact that a dorado run was "killed" written to a standard output file, or alternatively to have the progress bar saved as described in #307. Sadly we have been unable to figure out a way to automatically detect if a run failed in the middle and needs to be re-run with resume.
Does anyone have recommendations on how to (A) run Dorado such that it does not get as RAM hungry and can be run on a machine with about 60GB of RAM without crashing, or (B) automatically detect if a dorado run did get killed and needs to be resumed?
Thanks,
Jack