fence_lpar: Handle machine that is stuck in powering off #610

Open
SchoolGuy opened this issue Feb 4, 2025 · 6 comments · May be fixed by #611

Comments

@SchoolGuy
Contributor

When you power off an LPAR with the fence agent and the machine is still in the process of powering down, the on action does not succeed but also does not report an error.

The desired behaviour would be that the on-command reports that the machine is still powering off.
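
To make the desired behaviour concrete, here is a minimal sketch of an on action that refuses to proceed while the LPAR is still shutting down. Everything in it is hypothetical: the function `power_on`, the exception `StillPoweringOff`, the `start_lpar` callable, and the transitional state name "Shutting Down" are assumptions for illustration, not taken from fence_lpar.

```python
class StillPoweringOff(Exception):
    """Raised when 'on' is requested while the LPAR is still powering down."""


# Assumed name of the transitional state; not taken from fence_lpar.
TRANSITIONAL_OFF_STATES = ("Shutting Down",)


def power_on(raw_state, start_lpar):
    """Start the LPAR unless it is still powering off.

    raw_state  -- state string as reported by the management console
    start_lpar -- hypothetical callable that issues the real power-on
    """
    if raw_state in TRANSITIONAL_OFF_STATES:
        # Report the problem instead of silently doing nothing.
        raise StillPoweringOff(
            "LPAR is still powering off (state: %s)" % raw_state
        )
    start_lpar()
    return True
```

With something along these lines, the caller would get an explicit error instead of a silent no-op and could decide whether to retry later.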

@hramrach
Contributor

hramrach commented Feb 4, 2025

It looks like when the state of the machine is 'Error', it is considered powered off. The power=off command then does nothing, but power=on does nothing either because it is waiting for the machine to become powered off.

@oalbrigt
Collaborator

oalbrigt commented Feb 4, 2025

Sounds like you might need to do some manual intervention if it's in the Error state.

You can see the on/off status-code handling here:
https://github.com/ClusterLabs/fence-agents/blob/main/agents/lpar/fence_lpar.py#L23-L26
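
As a rough illustration of that kind of mapping (a sketch, not a copy of the linked lines): the agent collapses the console's state string into a plain on/off value. The helper name `normalize_status` and the contents of `ON_STATES` are assumptions here; only 'Running' and 'Error' are state names mentioned in this thread.

```python
# Assumed set of states treated as "on"; the real list lives in the linked file.
ON_STATES = ("Running",)


def normalize_status(raw_state):
    """Collapse a raw LPAR state string into "on" or "off"."""
    return "on" if raw_state in ON_STATES else "off"


# Any state that is not listed, including 'Error', is reported as "off".
print(normalize_status("Running"))  # on
print(normalize_status("Error"))    # off
```

Under a mapping like this, an LPAR sitting in 'Error' looks powered off to the agent, which matches the behaviour described above.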

hramrach added a commit to hramrach/fence-agents that referenced this issue Feb 4, 2025
When the LPAR crashes it can stay in Error state.

It's not possible to transition to 'Running' state from 'Error' but it's
possible to power off the LPAR. Then for all practical purposes 'Error'
is an on state.

Fixes: ClusterLabs#610
@hramrach hramrach linked a pull request Feb 4, 2025 that will close this issue
@hramrach
Contributor

hramrach commented Feb 4, 2025

I think adding 'Error' to the list of 'on' states will resolve the problem.

@hramrach
Contributor

hramrach commented Feb 4, 2025

At least for error states that can be resolved by powering off the LPAR.
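
A minimal sketch of that idea, with illustrative names only: treating 'Error' as an on state means a power-off request is actually sent rather than skipped. `ON_STATES`, `normalize_status`, and `needs_power_off` are hypothetical, and the real state list in fence_lpar may contain more entries than shown.

```python
# Hypothetical state list; 'Error' is added alongside 'Running'.
ON_STATES = ("Running", "Error")


def normalize_status(raw_state):
    """Collapse a raw LPAR state string into "on" or "off"."""
    return "on" if raw_state in ON_STATES else "off"


def needs_power_off(raw_state):
    """An off request is only worth sending if the LPAR counts as on."""
    return normalize_status(raw_state) == "on"


print(needs_power_off("Error"))  # True
```

This only helps for error states that a power-off can clear, as noted above; other error conditions may still need manual intervention.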

@oalbrigt
Collaborator

oalbrigt commented Feb 5, 2025

Thanks. We'll do some testing and come back to you.

@hramrach
Contributor

hramrach commented Feb 5, 2025

The 'Error' state was observed after a kernel panic.

3 participants