[Feature][Failover]Add fault tolerance strategy #15271

Closed
3 tasks done
fuchanghai opened this issue Dec 4, 2023 · 3 comments
Labels
feature new feature

Comments

@fuchanghai
Member

Search before asking

  • I had searched in the issues and found no similar feature requirement.

Description

  • background:
When the master or worker is shut down and started again, unfinished process instances are executed from the beginning. Sometimes, though, I only want execution to resume from the tasks that need fault tolerance.

  • method
- restart
- start execution from the tasks that need fault tolerance

  • question
Should this fault tolerance policy be configured at the service level, or per process definition? I prefer the second option.
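
To make the two options concrete, here is a minimal sketch of how a per-process-definition setting could override a service-level default. Everything in it is an assumption for illustration: `FailoverStrategy`, `FailoverStrategyResolver`, and `resolve` are hypothetical names and are not part of the DolphinScheduler API.

```java
import java.util.Optional;

/** Hypothetical strategy names for illustration; not DolphinScheduler classes. */
enum FailoverStrategy {
    RESTART,                // rerun the whole process instance from the first task
    RESUME_FROM_TOLERANCE   // rerun only from the tasks that need fault tolerance
}

/** Hypothetical resolver: a per-definition value overrides the service default. */
final class FailoverStrategyResolver {
    private final FailoverStrategy serviceDefault;

    FailoverStrategyResolver(FailoverStrategy serviceDefault) {
        this.serviceDefault = serviceDefault;
    }

    /** Returns the per-definition strategy if present, else the service-level default. */
    FailoverStrategy resolve(Optional<FailoverStrategy> perDefinition) {
        return perDefinition.orElse(serviceDefault);
    }
}
```

With this shape, a process definition that sets nothing keeps the service-wide behavior, so both levels could coexist instead of being an either-or choice.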

cc @ruanwenjun @EricGao888

Use case

No response

Related issues

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@fuchanghai added the feature and Waiting for reply labels Dec 4, 2023
@fuchanghai changed the title from "[Feature][Failover]Add fault tolerance strategy" to "[Feature][Failover]Add fault tolerance strategy for k8s task" Dec 4, 2023
@fuchanghai changed the title from "[Feature][Failover]Add fault tolerance strategy for k8s task" to "[Feature][Failover]Add fault tolerance strategy" Dec 4, 2023
@EricGao888
Member

@fuchanghai
"When the master or worker is shut down and started again, the unfinished process instances will be executed from the beginning."

Is that how things work on the dev branch? As I remember, in 3.0.X, when dolphin performs failover for a process instance, it checks whether each task instance needs to be failed over.

@EricGao888
Member

If, as you say, dolphin starts from the very beginning when it performs failover for a process instance, regardless of whether some task instances have already succeeded, that should be a bug.

@EricGao888 removed the Waiting for reply label Dec 4, 2023
@fuchanghai
Member Author

Hi @EricGao888, when the server restarts, the process instance is set to RECOVER_TOLERANCE_FAULT_PROCESS:
image

Below is the logic for starting from the failed task and for recovering from fault tolerance. startNodeList is not set in the fault tolerance recovery case. Is there anything I have overlooked? Please correct me if so.

image image
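
To illustrate what setting the start nodes for a fault tolerance recovery could look like, here is a small sketch that picks only the tasks marked as needing fault tolerance. It is not the DolphinScheduler implementation: `FailoverStartNodes`, `TaskState`, and `startNodes` are hypothetical names, and the map of task code to state stands in for whatever the master actually loads for the process instance.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical helper for illustration; not a DolphinScheduler class. */
final class FailoverStartNodes {

    /** Assumed task states, simplified for this sketch. */
    enum TaskState { SUCCESS, NEED_FAULT_TOLERANCE, RUNNING, NOT_STARTED }

    /**
     * Returns the codes of the tasks that need fault tolerance,
     * in the iteration order of the given map.
     */
    static List<Long> startNodes(Map<Long, TaskState> taskStates) {
        return taskStates.entrySet().stream()
                .filter(e -> e.getValue() == TaskState.NEED_FAULT_TOLERANCE)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```

Tasks that already succeeded are left out of the result, so a recovery driven by this list would not rerun them.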
