No Limit on retries for failed jobs #1228
Comments
This is intentional behavior. If we were to stop trying to replicate, the user would be forced to recreate the RS or take some other action to get it started again. Instead, we just retry indefinitely. The underlying Job is just an artifact of how VolSync works and isn't meant to be a monitoring point. To monitor replication status, you should look at VolSync's metrics instead.
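For example, a minimal alerting sketch built on VolSync's metrics rather than Job status; the `volsync_volume_out_of_sync` gauge name is taken from the VolSync metrics documentation and should be verified against the deployed version:

```yaml
# Sketch: alert when a replication has been out of sync for a while.
# Assumes the volsync_volume_out_of_sync gauge (1 = out of sync) is being scraped.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: volsync-out-of-sync
spec:
  groups:
    - name: volsync
      rules:
        - alert: VolSyncVolumeOutOfSync
          expr: volsync_volume_out_of_sync == 1
          for: 1h
          labels:
            severity: warning
          annotations:
            summary: A VolSync replication has been out of sync for over an hour
```

Because the alert only fires after an hour of continuous failure, transient retries do not generate noise.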
My concern arises from the possibility of the RS remaining out of sync for an extended duration or necessitating manual intervention for resolution. This scenario poses a potential issue for PVCs with substantial size and resource consumption, as they would remain unavailable during this period. Additionally, if the RS fails to synchronize after multiple retries, it is highly likely to continue failing in subsequent attempts, correct?
Yes, it could remain out of sync indefinitely. However, we want VolSync to auto-recover when whatever condition is causing the failure finally clears; this is why we continue to retry. For monitoring, check out VolSync's metrics.
The issue with letting VolSync spawn jobs indefinitely is cluster capacity management. One of the problems we have seen is that every now and then there will be a job pod that uses memory in the GBs, and although we can put limits on the job pods, we can't really be that conservative about the limit since we have pods with PVC sizes going up to hundreds of GBs. So if a job pod really does come up and use memory in the GBs, as a cluster administrator I'd prefer that after a certain number of retries the job stops being spawned. Because I have alerts built from the VolSync metrics, I'd go and check in the logs why the last job failed, apply a remedy, and then wait for the next scheduled replication.

Consider a scenario where the PVC has a lot of files, so the job that comes up takes around 2 GB of memory for replication, and for some reason there is an issue with the network bandwidth between DC and DR overnight. This would practically mean that overnight the job pod has been running continuously, consuming 2 GB of memory that ideally should have been available to the actual applications running in the cluster for most of that time.

To be clear, I see the point in letting VolSync run in auto-recover mode and keep spawning jobs until the volume is in sync; I am just contemplating whether having the choice to limit the number of retries could be handy in scenarios where the resources available in a cluster are not abundant. If that makes sense, I think something along those lines could be worth considering.
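For reference, native Kubernetes Jobs already cap retries via `spec.backoffLimit`; the sketch below only illustrates the kind of limit being asked for and is not an option VolSync currently exposes on the Jobs it creates. The image name and memory figures are placeholders:

```yaml
# Plain Kubernetes Job with a retry cap (illustrative only; VolSync does not
# currently let users set backoffLimit on its mover Jobs).
apiVersion: batch/v1
kind: Job
metadata:
  name: example-sync
spec:
  backoffLimit: 3            # stop retrying after 3 failed attempts
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: sync
          image: registry.example.com/sync:latest   # hypothetical image
          resources:
            limits:
              memory: 2Gi    # cap mover memory, per the concern above
```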
Thank you both for your thoughts. I agree that a case can be made for VolSync to "give up" in certain scenarios and wait for intervention. The most compelling is probably the case of a resource-constrained cluster where the mover pod is repeatedly being killed due to OOM on the node. In this case, it really could be denying other workloads that have the potential to run successfully. There are essentially an unlimited number of configuration options that could be provided, and with each one there is both a benefit and a cost. At this time, we believe that the costs, in terms of additional complexity for both dev/test and for users (understanding how to configure and fix), outweigh the benefit.
Describe the bug
Trying to create a Grafana dashboard for VolSync, and one of the things I want to capture is a list of failed jobs. I'm using the default kube_job_status_failed metric for this, but realised that the value is never set to 1 even after the job fails. Upon investigation I found that VolSync keeps retrying and creates a new job upon failure. Is there a way to limit the number of retries, or is there a reason it is designed this way?
Steps to reproduce
Create a ReplicationSource without creating a corresponding ReplicationDestination, as in the sketch below.
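A minimal sketch of the kind of manifest that reproduces this, assuming the rsync mover; the names, address, and schedule are placeholders, and the referenced destination intentionally does not exist:

```yaml
# ReplicationSource pointing at a destination that was never created, so every
# sync attempt fails and VolSync keeps recreating the mover Job.
apiVersion: volsync.backube/v1alpha1
kind: ReplicationSource
metadata:
  name: example-source
spec:
  sourcePVC: example-pvc
  trigger:
    schedule: "*/15 * * * *"   # attempt a sync every 15 minutes
  rsync:
    copyMethod: Snapshot
    address: nonexistent-destination.example.com   # no ReplicationDestination here
```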
Expected behavior
The job should stop retrying after a couple of failures.
Actual results
VolSync keeps retrying, and there seems to be no limit set on the number of retries.
Additional context
I would really appreciate it if you already have a Grafana dashboard JSON that you can share.
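In case it helps with the dashboard, a sketch of recording rules built on VolSync's own metrics rather than Job status; the metric names are assumptions based on the VolSync metrics documentation and should be checked against the version in use:

```yaml
# Dashboard-oriented recording rules. Metric names (volsync_missed_intervals_total,
# volsync_volume_out_of_sync) are assumptions; verify what your VolSync version exposes.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: volsync-dashboard-rules
spec:
  groups:
    - name: volsync-dashboard
      rules:
        # Replications that missed at least one scheduled sync in the last day.
        - record: volsync:missed_intervals:increase1d
          expr: increase(volsync_missed_intervals_total[1d])
        # Number of replications currently reporting out-of-sync.
        - record: volsync:volumes_out_of_sync:count
          expr: sum(volsync_volume_out_of_sync)
```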