
No Limit on retries for failed jobs #1228

Closed
winkrs opened this issue Apr 22, 2024 · 5 comments
Labels
bug (Something isn't working), wontfix (This will not be worked on)

Comments

@winkrs
Contributor

winkrs commented Apr 22, 2024

Describe the bug

I'm trying to create a Grafana dashboard for VolSync, and one of the things I want to capture is the list of failed jobs. I'm using the default kube_job_status_failed metric for this, but I realised the value is never set to 1 even after a job fails. Upon investigation I found that VolSync keeps retrying and creates a new job after each failure. Is there a way to limit the number of retries? Or is there a reason it is designed this way?

Steps to reproduce

Create a replication source object without a corresponding destination object.
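
For anyone reproducing this, here is a minimal sketch of the kind of object involved (all names and the rsync address are placeholders; the address points at a ReplicationDestination that was never created, so every sync attempt fails):

```yaml
# Illustrative only: a ReplicationSource whose rsync target does not exist,
# so the mover job fails and VolSync keeps creating replacement jobs.
apiVersion: volsync.backube/v1alpha1
kind: ReplicationSource
metadata:
  name: example-source        # placeholder
  namespace: example-ns       # placeholder
spec:
  sourcePVC: example-pvc      # placeholder PVC to replicate
  trigger:
    schedule: "*/5 * * * *"   # attempt a sync every 5 minutes
  rsync:
    # No ReplicationDestination/service exists at this address.
    address: nonexistent-destination.example-ns.svc.cluster.local
    copyMethod: Snapshot
```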

Expected behavior

Expecting the job to stop retrying after a couple of failures.

Actual results

VolSync keeps on retrying, and there seems to be no limit set on the number of retries.

Additional context

I would really appreciate it if you already have a Grafana dashboard JSON that you could share.

winkrs added the bug label on Apr 22, 2024
@JohnStrunk
Member

This is intentional behavior.

If we were to stop trying to replicate, the user would be forced to recreate the RS or take some other action to get it started again. Instead, we just retry indefinitely. The underlying Job is just an artifact of how VolSync works and isn't meant to be a monitoring point.

To monitor replication status, you should look at VolSync's metrics instead.

@winkrs
Contributor Author

winkrs commented Apr 23, 2024

My concern arises from the possibility of the RS remaining out of sync for an extended duration or necessitating manual intervention for resolution. This scenario poses a potential issue for PVCs with substantial size and resource consumption, as they would remain unavailable during this period. Additionally, if the RS fails to synchronize after multiple retries, it is highly likely to continue failing in subsequent attempts, correct?

@JohnStrunk
Member

Yes, it could remain out of sync indefinitely. However, we want VolSync to auto-recover once whatever condition is causing the failure finally clears; this is why we continue to retry.

For monitoring, check out volsync_volume_out_of_sync in the link I posted previously. You should be able to trigger an alert based on this. If not, we'd like to hear more; neither @tesshuflower nor I are experts with monitoring/metrics.
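
In case it helps, here is a minimal PrometheusRule sketch built on that metric (assuming the prometheus-operator CRDs are installed; the rule name, namespace, and 15-minute window are placeholders to tune for your sync schedule):

```yaml
# Illustrative only: fire when a VolSync-managed replication reports out of sync.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: volsync-out-of-sync   # placeholder
  namespace: monitoring       # placeholder
spec:
  groups:
    - name: volsync
      rules:
        - alert: VolSyncVolumeOutOfSync
          # volsync_volume_out_of_sync is 1 while a replication is out of sync.
          expr: volsync_volume_out_of_sync == 1
          for: 15m            # tolerate short blips; tune to your schedule
          labels:
            severity: warning
          annotations:
            summary: A VolSync replication has been out of sync for 15 minutes.
```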

@Shashankft9

The issue with letting VolSync spawn jobs indefinitely is cluster capacity management. One of the problems we have seen is that every now and then there is a job pod that uses memory in the GBs, and although we can put limits on the job pods, we can't really be that conservative about the limit since we have pods with PVC sizes going up to hundreds of GBs. So if a job pod really does come up and use GBs of memory, as a cluster administrator I'd prefer that the job stops being spawned after a certain number of retries. Because I have alerts built on the VolSync metric, I'd go and check the logs to see why the last job failed, apply a remedy, and then wait for the next schedule to kick off the replication.

Let's consider a scenario where the PVC has a lot of files, so the job that comes up takes around 2GB of memory for replication, and for some reason there is an issue with the network bandwidth between DC and DR overnight. In practice this means the job pod runs continuously overnight, consuming 2GB of memory that ideally should have been available to the actual applications running in the cluster for most of that time.

To be clear, I see the point in letting VolSync run in auto-recover mode and keep spawning jobs until the volume is in sync, but I am contemplating whether the option to limit the number of retries could be handy in scenarios where the resources available in a cluster are not abundant.

FWIW, we have seen similar issues in some enterprise backup-and-restore solutions, where one job pod runs continuously (or keeps respawning) with excessive memory usage, and to fix it we would sometimes just kill the pod or the job and wait for it to be created again per the schedule.

If that makes sense, I think something like a retries field in the source replication CR would solve what we are looking for; of course, if the value is not stated in the CR, the default would be infinite. wdyt?
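
A hypothetical sketch of that proposal (the retries field below is not part of the current VolSync API; it only illustrates the idea):

```yaml
# Illustrative only: "retries" is a hypothetical field, not implemented today.
apiVersion: volsync.backube/v1alpha1
kind: ReplicationSource
metadata:
  name: example-source
spec:
  sourcePVC: example-pvc
  trigger:
    schedule: "0 * * * *"
  retries: 5                  # hypothetical: give up after 5 failed sync attempts;
                              # omitted => retry indefinitely (today's behavior)
  rsync:
    address: destination.example-ns.svc.cluster.local
    copyMethod: Snapshot
```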

@JohnStrunk
Member

Thank you both for your thoughts. I agree that a case can be made for VolSync to "give up" in certain scenarios and wait for intervention. The most compelling is probably the case of a resource-constrained cluster where the mover pod is repeatedly being killed due to OOM on the node. In this case, it really could be denying other workloads that have the potential to run successfully.
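
(For reference, the native way a plain Kubernetes Job caps retries is spec.backoffLimit. This is not how VolSync behaves today, since it creates a new Job after a failure, but it shows the kind of "give up" mechanism being discussed; everything below is a generic, illustrative example.)

```yaml
# Illustrative only: a generic Job that is marked Failed after 3 retries.
apiVersion: batch/v1
kind: Job
metadata:
  name: example-bounded-job   # placeholder
spec:
  backoffLimit: 3             # stop retrying after 3 failed pods
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: busybox                    # placeholder image
          command: ["sh", "-c", "exit 1"]   # always fails, to demonstrate the cap
```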

There are essentially an unlimited number of configuration options that could be provided, and each one has both a benefit and a cost. At this time, we believe that the costs, in terms of additional complexity both for dev/test and for users (understanding how to configure and fix things), outweigh the benefit.
While we may revisit this in the future as things change, I'm going to close it for now.

JohnStrunk closed this as not planned on Apr 30, 2024
JohnStrunk added the wontfix label on Apr 30, 2024