Worker getting removed #850
Comments
Are you submitting the managers themselves as jobs, or running a manager on the head node which delegates out to the SLURM queue? If it's the former, there may be a misconfiguration somewhere. The first error here (
Either way, is there anything in the logs for the QCFractal instance? That will usually have a little bit more info, although I'm guessing it's just managers not reporting back.
Hi @bennybp, the configuration I set up is a server on the head node plus a manager on the head node that delegates the workers out to the SLURM queue. I'll restart the server with DEBUG logging, as I didn't see anything useful at the INFO level.
Yep, for some reason it spins up the workers correctly (SLURM allocates the jobs), but they get put on hold because their idle time always exceeds Parsl's max_idletime. About the config: I used the same config on another SLURM cluster and it worked pretty much correctly. (I also suspect that I have some I/O latency or other issues, as the SPC seems rather slow. Any advice on how I can benchmark the retrieval of the jobs and the queue?) This is the config that I'm currently using:
cluster: slurm_manager
loglevel: DEBUG
logfile: qcfractal-manager.log
update_frequency: 180.0 # I updated this
server:
  fractal_uri: xxxxx
  username: xxxxx
  password: xxxxx
  verify: False
executors:
  slurm_executor:
    type: slurm
    scratch_directory: ${HOME}/project/qcfractal
    walltime: "72:00:00"
    exclusive: false
    partition: compute
    account: null
    workers_per_node: 20
    max_nodes: 30
    cores_per_worker: 8
    memory_per_worker: 8
    queue_tags:
      - '*'
    environments:
      use_manager_environment: False
      conda:
        - qcfractal-worker
    worker_init:
      - source worker-init.sh # abs path for the init for the worker

with a pretty standard worker-init.sh:

#!/bin/bash
source ~/.bashrc
conda init
ulimit -s unlimited
mkdir -p "${HOME}/project/qcfractal"
cd "${HOME}/project/qcfractal"
source ${HOME}/conda/bin/activate qcfractal_manager

I know that Parsl would require full exclusivity of the node. Could it be a big issue that I'm currently not enabling it?
Edit: no info in the server logs at DEBUG. Even when the workers are "killed", the manager is still on.
Ok I've been thinking about this a bit more. There could be a couple of issues, although it's not entirely clear. Below are some semi-random thoughts.
One part is that the slurm jobs might be getting killed out from under parsl. One thing to check is the raw outputs from slurm. Parsl keeps these in the
The more mysterious part is that the qcfractal server is determining your manager is dead, when it is very much alive still. An update frequency of 10 is a little low, but not so bad (especially if your tasks are small). A heartbeat frequency (on the server config) of 10 is probably a bit low - 60-180 seconds is typically what I run with, with a
Side note: I realized the docs are a bit wrong. The update frequency is only the time between a qcfractal manager requesting new tasks. It is not related to the heartbeat mechanism at all, which is handled automatically and not configurable on the manager side.
You can look at the qcfractal server log - with log level DEBUG it will print response times for the accesses as they come in. It's possible some request takes so long that the manager chokes on it (although the heartbeat mechanism and the job fetching mechanism run in different threads). Two other random ideas:
If you feel comfortable with sharing any logs (either here or by email), I could have a look.
Hey @bennybp, first of all, thank you for all the help that you are giving me.
Yes, I checked the output from both the workers (no issues in the worker, manager, or interchange logs) and the SLURM files. The SLURM shell file for the compute allocation is correct and, for the workers that get cancelled, the stdout just shows the usual
While the stderr has a
I'll change them to better values, but even increasing the server heartbeat to around 180 seconds (and also the manager's update_frequency) didn't change much.
When I checked the DEBUG logs of the manager, I didn't see any response times logged besides the idle time from Parsl, but maybe I'm looking at the wrong things. I can share part of the DEBUG logs (or even the full log, but it is 1 GB right now, so better by email).
The database is in the cloud and, after some debugging and benchmarking, I found that retrieving tasks and making POST requests from the ManagerClient takes the bulk of the time and is the bottleneck (compared with the SPC calculation, which is rather fast for now) (for a low number of workers a single POST request in
Nope.
One last question: I'm doing some massive data generation and, for the type of data that I'm handling (1 molecule is basically always 1 new entry), I fear that querying will become a bottleneck as the database and its tables grow a lot. Do you think that is something I need to take into account?
I meant the fractal server logs (not the manager log). There's so many log files :) The server logs should contain something like this, with the response time at the end
That definitely seems excessive, although 3.5s shouldn't really cause things to break (that tends to happen around 60 seconds). I guess the question is how much slower it is for a 'real' payload. I know there can be a bit of slowness when returning tasks to a busy server, which is something I need to look into.
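A quick way to put numbers on this from the client side is to time the call directly. The sketch below is only illustrative: `time_calls` and `fake_request` are hypothetical names, and `fake_request` just sleeps instead of talking to a real server. Swap in whatever request you actually want to measure.

```python
import statistics
import time


def time_calls(func, repeats=5):
    """Call func() `repeats` times and return the wall-clock duration of each call."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return durations


def fake_request():
    # Hypothetical stand-in for a real claim/return request to the server.
    time.sleep(0.01)


durations = time_calls(fake_request, repeats=3)
print(f"median {statistics.median(durations):.3f}s, max {max(durations):.3f}s")
```

Looking at the median and the maximum separately helps distinguish a uniformly slow request from occasional long stalls.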
Whatever you are willing to send. They typically compress nicely, too :)
Depends on how massive. We have >100M records and 4-5TB of data in one server, and it works well. There are certain operations that I need to improve (mostly dataset operations related to modifying existing records), so you might hit some rough edges there. We typically keep individual datasets to around 100k entries for that reason.
Strange: my current server log at DEBUG doesn't have that kind of logging. I'll take a look at the db logs tomorrow too.
Edit: ok I think I know why I don't have that logging. I need to use
I'm currently splitting my data into multiple datasets of roughly 75k records each, but the total will scale to around 1B entries.
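For reference, splitting a large stream of entries into fixed-size batches could be sketched like this. The helper name `chunked` and the 200,000-entry example are made up for illustration; 75,000 matches the dataset size mentioned above. Nothing here calls QCPortal.

```python
from itertools import islice


def chunked(items, size):
    """Yield successive lists of at most `size` items from any iterable."""
    if size <= 0:
        raise ValueError("size must be positive")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Hypothetical example: 200,000 entry indices split into 75k-sized datasets.
sizes = [len(batch) for batch in chunked(range(200_000), 75_000)]
print(sizes)  # [75000, 75000, 50000]
```

Because it works on any iterable, the entries never need to be held in memory all at once, only one batch at a time.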
Small log with
|
Thanks for the logs. I have a bit of a hypothesis now. The manager code is fairly serial. Updates (returning finished tasks and claiming new tasks) happen on a regular schedule (controlled by update frequency). If these updates take too long, the manager may actually miss sending its heartbeat and the server will assume it is dead. Here's what I see: For some reason, the time it takes to claim or return tasks is a bit long (which is something I need to look into more).
That's 3.7 seconds to return 10 tasks. The size of the data sent to the server is 130,562 bytes. This is pretty long for so little data.
So putting it together, it looks like you have lots and lots of small, fast jobs. You are limited to claiming 200 tasks in a single request to the server, and returning 10 in a single request (these are configurable - see
Returning tasks or claiming tasks did not count as a heartbeat, but I have a PR #851 that does that. So that should help. But there's still an underlying issue of returning & claiming being kind of slow.
This is still somewhat a hypothesis, but seems logical from what I am seeing in the logs. Or it could be something completely different :)
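To make the hypothesis concrete, here is a rough back-of-the-envelope sketch in Python. It is not QCFractal code: the function names, the backlog of 2,000 finished tasks, and the tolerance of 5 missed heartbeats are all assumptions for illustration. Only the limit of 10 tasks per return request, the 3.7 s per request, and the heartbeat frequencies of 10 and 180 seconds come from this thread.

```python
import math


def serial_return_time(n_finished, tasks_per_request, seconds_per_request):
    """Seconds a serial manager spends returning n_finished tasks."""
    n_requests = math.ceil(n_finished / tasks_per_request)
    return n_requests * seconds_per_request


def misses_heartbeat(busy_seconds, heartbeat_frequency, max_missed):
    """True if the manager stays busy longer than the server tolerates silence.

    max_missed is an assumed tolerance, not a value taken from the thread.
    """
    return busy_seconds > heartbeat_frequency * max_missed


# Hypothetical backlog: 2,000 finished tasks, 10 per request, 3.7 s per request.
busy = serial_return_time(2000, 10, 3.7)
print(f"busy for about {busy:.0f} s")

# Compare against the two heartbeat frequencies tried in this issue.
print(misses_heartbeat(busy, 10, 5))
print(misses_heartbeat(busy, 180, 5))
```

Under these assumed numbers the manager is silent for roughly 740 s, far longer than a 10 s heartbeat would tolerate, which is why counting claims and returns as heartbeats should help even before the requests themselves get faster.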
I agree with everything you pointed out (and yes, in these logs I'm calculating some xtb semiempirical values, so it is indeed quite fast).
I had already tried modifying the manager's claim and return limits, and I noticed that the API request time increases quite a bit. As of now, I'm claiming 400 tasks and returning 20. I'll post the time increase in this issue soon.
Thank you for pointing this out. I'm now using this branch and will update with the results. I'm also testing things out on the Parsl side, switching the type of launcher and modifying the DataFlowKernel a bit. I'll update here soon with some extra info and results. Side note, I should probably start helping a bit with this project 😅
I overrode
Sounds like that might help with the Parsl issues. Let me know how it goes! I've started looking at the task claiming/returning code to see if I can speed it up. I definitely see a way to reduce the time to claim tasks (hopefully by an order of magnitude or more), but that requires a database change. Task returning code might be more difficult, but should also be doable.
So far it is looking good. There are some slight issues with automatic requeue on preemption, but the workers are not getting killed.
Is there any way I can help with it? I am also wondering if trying AlloyDB is worth it instead of a normal PostgreSQL.
Any sort of benchmarking would be helpful. I made a quick attempt at getting parts of the server to run with cProfile, but it doesn't run well since waitress seems to cause it to swallow the profiling info. I didn't look too much at it. Attempt here: https://github.com/MolSSI/QCFractal/commits/bench/
For claiming, the issue is that the query is just not optimal (mostly due to the filtering by
Yes exactly, although it's always nice to see people push the boundaries so that it can be improved!
I don't have any experience with that. My guess is that it might not help (because it's the query that's just inefficient), but it would be nice to see if QCFractal works on that. It's fairly restricted to things compatible with PostgreSQL, so AlloyDB might work.
Yes, that will be done instantly. I'm already running the qcfractal PR for the heartbeat with some slight modifications.
In my profiling I used py-spy and it worked ok-ish. I am still trying to get a complete profile and will keep you updated.
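For profiling a single function in isolation (outside the waitress server, where the cProfile output was reportedly getting swallowed), a minimal standard-library sketch could look like the following. `profile_call` and `slow_sum` are hypothetical names used only for illustration; this is not how the QCFractal bench branch does it.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=5, **kwargs):
    """Run func under cProfile and return (result, text report of the top entries)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(top)
    return result, stream.getvalue()


def slow_sum(n):
    # Hypothetical stand-in for the code path being investigated.
    return sum(i * i for i in range(n))


result, report = profile_call(slow_sum, 10_000)
print(report)
```

Sorting by cumulative time puts the outermost expensive calls first, which is usually the quickest way to see whether the time goes to the query, the serialization, or something else.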
In time I will also push the number of entries to >1B.
I've improved the claim behavior in #852 if you would like to try it out and see how much it improves that side! Next is improving when managers return tasks, which is going to be a bit more difficult. But judging from the times above, also pretty important.
Thank you a lot @bennybp! I will migrate the db and start some testing and profiling with the new branch.
Edit: upgraded the db and now submitting some extra jobs.
With the new branch I'm getting timed out while submitting new calculations (chunked at around 35k entries; I keep getting timed out after a bunch of those submissions):
I'll downgrade the db to check whether it is related to the branch, but from some small tests there is definitely a speedup in the claiming of tasks. I'll follow up with proper benchmarking to report the time gain.
Are you still getting those timeouts? It's almost certainly unrelated to my recent changes. I have added some internal batching before, but maybe I missed somewhere that could use it |
I'm running some SPC with the qcportal suite and I'm using the SlurmExecutor to scale up the number of computations on SLURM. The issue is that not all the workers requested in the config get allocated (it could be that my cluster is at capacity), but after a while the workers that were correctly allocated get CANCELLED. Looking at the interchange.log, I can see that the managers have compatible Parsl and Python versions, but at some point I get:
The workers/managers get removed from the SLURM queue and new ones get allocated (in smaller numbers).
Some additional info: I'm using a qcfractal manager config with update_frequency: 10, while the server config has heartbeat_frequency: 10 too. I tried increasing both update_frequency and heartbeat_frequency, to no avail.
Edit: after a while the workers are not doing any tasks and I had to manually kill the jobs
or
On further investigation I noticed that the first batch of workers/managers is killed because their idle time exceeds the max_idletime for the SLURM executor in Parsl, even when they are running.