Segmentation Fault With Large Runs of gen_lib.py #111
Comments
Hmmm. I've been able to run up to ~20k samples with up to ~1k realizations, but the code has changed since then. Still, I'm guessing it isn't something intrinsic to the large numbers themselves. My guess would be a memory error or something like that, although generally the number of samples shouldn't increase the amount of memory being used. If you check the outputs and logs from the individual processors, do you see any other error messages, or hints at where the segmentation fault is coming from?
This is the only line indicating an error that I have been able to find. Here is the longer version of this error message (top part truncated because it's long):
The other outputs/logs are just paused after the most recent simulation and the output/sims files look as expected, except there aren't as many as there should be. For completeness, here are the last few lines of the output file of the most recent run to return this type of error:
I have run some tests to try and narrow down exactly where this issue might be coming from, but broadly there is no reliable pattern that I can find yet. The only consistent thing is that models will always fail if I am using 4 nodes, but I have had models with 2 nodes fail in this way as well. Recently, I had one 2-node run complete all 20,000 samples (and it seems like a second might make it). The runs that fail (regardless of how many nodes / how much memory I request) always run for 40-50 minutes of wall time. The one consistency I am seeing is that they either go for ~45 minutes and fail, or they complete, never in between. I'm still trying to dig into this and I will update if I have a breakthrough.
Thanks for the additional info. These types of issues are always a huge pain to debug! It still sounds like it could conceivably be a memory issue of some sort. In my experience, memory errors with Python can be sporadic/chaotic, possibly because of the complex garbage collection system it uses. If you're already using the full memory on each node, you could try using full nodes but not all of the processors (so that the memory per processor is higher). You could also try adding additional memory-usage logging; there's a function holodeck.librarian.libraries.log_mem_usage that does this. And, just to check, you know about the
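For reference, a minimal sketch of what per-rank memory logging could look like, using psutil and mpi4py rather than holodeck's own helper (the actual holodeck.librarian function may differ in detail); calling something like this once per sample makes memory growth over time visible in each rank's log:

```python
# Minimal sketch of per-rank memory logging (assumes psutil and mpi4py are
# installed; holodeck's own log_mem_usage helper may differ from this).
import logging

import psutil
from mpi4py import MPI


def log_mem_usage(log: logging.Logger) -> None:
    """Log resident memory of this process and overall memory use on the node."""
    rss_gb = psutil.Process().memory_info().rss / 1024**3   # memory held by this rank
    node = psutil.virtual_memory()                          # node-wide memory statistics
    log.info(
        "rank %d: process RSS = %.2f GB, node memory used = %.1f%% of %.1f GB",
        MPI.COMM_WORLD.Get_rank(), rss_gb, node.percent, node.total / 1024**3,
    )
```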
I typically request memory per CPU and have seen no difference between, e.g., 3 GB vs 5 GB per CPU; I will look into the memory logging to get more information. It does seem to be somewhat random, but given the all-or-nothing pattern it might be something that happens only in the early stages... Thanks for checking, I am able to resume jobs and pick up where things left off, but they all still run into the segmentation fault error after the same amount of time. I resumed one run a few times and got up to about ~7,000 samples, but I would have to resume ~10 times at that rate to get to the full 20,000 samples 🤷
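As an illustration only, the kind of resume behaviour being relied on here amounts to a skip-if-output-exists loop along the lines of the sketch below; the directory layout, file naming, and run_sample placeholder are hypothetical and not holodeck's actual output scheme:

```python
# Illustrative resume pattern only: skip samples whose output already exists.
# The file naming and run_sample() below are hypothetical placeholders.
from pathlib import Path


def run_sample(samp: int, out_file: Path) -> None:
    # Stand-in for the real per-sample simulation; here it just records completion.
    out_file.write_text(f"sample {samp} done\n")


def run_all_samples(output_dir: Path, nsamps: int) -> None:
    output_dir.mkdir(parents=True, exist_ok=True)
    for samp in range(nsamps):
        out_file = output_dir / f"sample_{samp:06d}.txt"
        if out_file.exists():
            continue        # completed by an earlier (later resumed) job, so skip it
        run_sample(samp, out_file)
```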
I don't have much useful to suggest, sorry! But a few grasping-at-straws thoughts:
Running gen_lib.py with large values of NSAMPS (or NREALS) on a cluster leads to a segmentation fault. The definition of "large" here will vary depending on the cluster, but for me the limit is just under NSAMPS = 2500 (with NREALS = 100).
The error message returned by the Great Lakes cluster reads: mpiexec noticed that process rank 26 with PID 0 on node gl3160 exited on signal 11 (Segmentation fault).
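One generic way to learn more about where signal 11 originates (not something discussed in the thread itself, just a standard Python facility) is the built-in faulthandler module, which prints a Python-level traceback when the interpreter receives a fatal signal such as SIGSEGV; a minimal sketch, assuming it is added near the top of the driver script:

```python
# Minimal sketch: enable faulthandler so a segmentation fault inside compiled
# extension code still produces a Python traceback in each rank's stderr/log.
import faulthandler
import sys

faulthandler.enable(file=sys.stderr, all_threads=True)
```

The same effect can be had without editing the script by setting PYTHONFAULTHANDLER=1 in the job environment or running Python with -X faulthandler.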