
No Limit on retries for failed jobs #1228

Closed
winkrs opened this issue Apr 22, 2024 · 5 comments
Labels
bug (Something isn't working), wontfix (This will not be worked on)

Comments

@winkrs
Contributor

winkrs commented Apr 22, 2024

Describe the bug

I'm trying to create a Grafana dashboard for VolSync, and one of the things I want to capture is the list of failed jobs. I'm using the default kube_job_status_failed metric for this, but I realised the value is never set to 1 even after a job fails. Upon investigation I found that VolSync keeps retrying and creates a new job after each failure. Is there a way to limit the number of retries? Or is there a reason it is designed this way?

Steps to reproduce

Create a replication source object without a corresponding destination object.
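
For anyone reproducing this, here is a minimal sketch of the kind of object involved (all names and the rsync address are placeholders; the address points at a ReplicationDestination that was never created, so every sync attempt fails):

```yaml
# Illustrative only: a ReplicationSource whose rsync target does not exist,
# so the mover job fails and VolSync keeps creating replacement jobs.
apiVersion: volsync.backube/v1alpha1
kind: ReplicationSource
metadata:
  name: example-source        # placeholder
  namespace: example-ns       # placeholder
spec:
  sourcePVC: example-pvc      # placeholder PVC to replicate
  trigger:
    schedule: "*/5 * * * *"   # attempt a sync every 5 minutes
  rsync:
    # No ReplicationDestination/service exists at this address.
    address: nonexistent-destination.example-ns.svc.cluster.local
    copyMethod: Snapshot
```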

Expected behavior

Expecting the job to stop retrying after a couple of failures.

Actual results

VolSync keeps on retrying, and there seems to be no limit set on the number of retries.

Additional context

I would really appreciate it if you already have a Grafana dashboard JSON that you could share.

winkrs added the bug label on Apr 22, 2024
@JohnStrunk
Member

This is intentional behavior.

If we were to stop trying to replicate, the user would be forced to recreate the RS or take some other action to get it started again. Instead, we just retry indefinitely. The underlying Job is just an artifact of how VolSync works and isn't meant to be a monitoring point.

To monitor replication status, you should look at VolSync's metrics instead.

@winkrs
Contributor Author

winkrs commented Apr 23, 2024

My concern arises from the possibility of the RS remaining out of sync for an extended duration or necessitating manual intervention for resolution. This scenario poses a potential issue for PVCs with substantial size and resource consumption, as they would remain unavailable during this period. Additionally, if the RS fails to synchronize after multiple retries, it is highly likely to continue failing in subsequent attempts, correct?

@JohnStrunk
Member

Yes, it could remain out of sync indefinitely. However, we want VolSync to auto-recover once whatever condition is causing the failure finally clears; this is why we continue to retry.

For monitoring, check out volsync_volume_out_of_sync in the link I posted previously. You should be able to trigger an alert based on this. If not, we'd like to hear more; neither @tesshuflower nor I are experts with monitoring/metrics.
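
In case it helps, here is a minimal PrometheusRule sketch built on that metric (assuming the prometheus-operator CRDs are installed; the rule name, namespace, and 15-minute window are placeholders to tune for your sync schedule):

```yaml
# Illustrative only: fire when a VolSync-managed replication reports out of sync.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: volsync-out-of-sync   # placeholder
  namespace: monitoring       # placeholder
spec:
  groups:
    - name: volsync
      rules:
        - alert: VolSyncVolumeOutOfSync
          # volsync_volume_out_of_sync is 1 while a replication is out of sync.
          expr: volsync_volume_out_of_sync == 1
          for: 15m            # tolerate short blips; tune to your schedule
          labels:
            severity: warning
          annotations:
            summary: A VolSync replication has been out of sync for 15 minutes.
```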

@Shashankft9

The issue with letting VolSync spawn jobs indefinitely is cluster capacity management. One of the problems we have seen is that every now and then there is a job pod that uses memory in the GBs, and although we can put limits on the job pods, we can't really be that conservative about the limit since we have pods with PVC sizes going up to hundreds of GBs. So if a job pod really does come up and use GBs of memory, as a cluster administrator I'd prefer that the job stops being spawned after a certain number of retries. Because I have alerts built on the VolSync metric, I'd go and check the logs to see why the last job failed, apply a remedy, and then wait for the next schedule to kick off the replication.

Let's consider a scenario where the PVC has a lot of files, so the job that comes up takes around 2GB of memory for replication, and for some reason there is an issue with the network bandwidth between DC and DR overnight. In practice this means the job pod runs continuously overnight, consuming 2GB of memory that ideally should have been available to the actual applications running in the cluster for most of that time.

To be clear, I see the point in letting VolSync run in auto-recover mode and keep spawning jobs until the volume is in sync, but I am contemplating whether the option to limit the number of retries could be handy in scenarios where the resources available in a cluster are not abundant.

FWIW, we have seen similar issues in some enterprise backup-and-restore solutions, where one job pod runs continuously (or keeps respawning) with excessive memory usage, and to fix it we would sometimes just kill the pod or the job and wait for it to be created again per the schedule.

If that makes sense, I think something like a retries field in the source replication CR would solve what we are looking for; of course, if the value is not stated in the CR, the default would be infinite. wdyt?
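
A hypothetical sketch of that proposal (the retries field below is not part of the current VolSync API; it only illustrates the idea):

```yaml
# Illustrative only: "retries" is a hypothetical field, not implemented today.
apiVersion: volsync.backube/v1alpha1
kind: ReplicationSource
metadata:
  name: example-source
spec:
  sourcePVC: example-pvc
  trigger:
    schedule: "0 * * * *"
  retries: 5                  # hypothetical: give up after 5 failed sync attempts;
                              # omitted => retry indefinitely (today's behavior)
  rsync:
    address: destination.example-ns.svc.cluster.local
    copyMethod: Snapshot
```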

@JohnStrunk
Member

Thank you both for your thoughts. I agree that a case can be made for VolSync to "give up" in certain scenarios and wait for intervention. The most compelling is probably the case of a resource-constrained cluster where the mover pod is repeatedly being killed due to OOM on the node. In this case, it really could be denying other workloads that have the potential to run successfully.
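
(For reference, the native way a plain Kubernetes Job caps retries is spec.backoffLimit. This is not how VolSync behaves today, since it creates a new Job after a failure, but it shows the kind of "give up" mechanism being discussed; everything below is a generic, illustrative example.)

```yaml
# Illustrative only: a generic Job that is marked Failed after 3 retries.
apiVersion: batch/v1
kind: Job
metadata:
  name: example-bounded-job   # placeholder
spec:
  backoffLimit: 3             # stop retrying after 3 failed pods
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: busybox                    # placeholder image
          command: ["sh", "-c", "exit 1"]   # always fails, to demonstrate the cap
```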

There are essentially an unlimited number of configuration options that could be provided, and each one has both a benefit and a cost. At this time, we believe that the costs, in terms of additional complexity both for dev/test and for users (understanding how to configure and fix things), outweigh the benefit.
While we may revisit this in the future as things change, I'm going to close it for now.

JohnStrunk closed this as not planned on Apr 30, 2024
JohnStrunk added the wontfix label on Apr 30, 2024