Replies: 18 comments
-
@sbrueseke are you addressing (designing/implementing a solution to) this bug?
-
I'm not deep enough into the topic to suggest a solution. We noticed the bug because we were carrying out some tests with Linstor in our lab and suddenly all the KVM hosts restarted. Evaluating the logs, we saw that it was due to the KVMHAMonitor. However, we did not have HA enabled, neither on the cluster nor on the hosts.
-
@sbrueseke, we are also facing a similar issue, but as per our logs (on all KVM hosts) we are only seeing the log lines below -
and not the one which you have mentioned
-
heartbeat[13962]: kvmheartbeat.sh will reboot system because it was unable to write the heartbeat to the storage. I received the same error when my Primary NFS Storage was inaccessible.
-
Hey @sbrueseke, I finally got a chance to look at this again. If you are using NFS, this is how it is supposed to work, I think. If you are using LinStor, see the merged PR for 4.19 (#8670). Is this still an issue?
-
@DaanHoogland now I am confused! Why is KVMHAMonitor rebooting hosts when the host is unable to write a heartbeat file to storage, even though every HA setting is disabled in the UI? Do you know the reason for that?
-
@sbrueseke
-
I know this workaround. My question is whether this setting should be added by the management server instead of manually.
-
@sbrueseke
-
Can you explain the root cause of why KVMHAMonitor needs to reboot the host when a storage is read-only?
-
@sbrueseke there was a lot of discussion about this in the past. Since storage issues rarely happen, it may be OK for users to change the setting in agent.properties.
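For reference, a minimal sketch of that manual workaround on a KVM host (the file path and service name are the usual CloudStack defaults; verify them on your distribution):

```sh
# Turn off the automatic reboot on heartbeat write failure for this host.
echo "reboot.host.and.alert.management.on.heartbeat.timeout=false" \
  >> /etc/cloudstack/agent/agent.properties

# Restart the agent so the new property is picked up.
systemctl restart cloudstack-agent
```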
-
@weizhouapache I would suggest taking another look at this and changing the default. In my opinion, a host should not automatically reboot at all if no HA settings are configured. From what I know, this looks like old behavior left over in the code.
-
I agree with you. @sbrueseke cc @DaanHoogland @rohityadavcloud @andrijapanicsb @GutoVeronezi @wido
-
@sbrueseke, I beg to differ: when storage cannot be reached, all VMs will be running on read-only disks from there on in. This means the host is useless and the VMs need to be rebooted, according to VM-HA. This behaviour was implemented before Host-HA was conceived and is not related to it. Maybe a redesign is in order, but as of now this works as designed.
-
afaik, this has 0 things to do with host HA, VM HA or anything else. The only way (to my knowledge) to disable host reboots is to comment out the line that does an echo into the sysrq trigger under /proc, at the VERY end of the script (that echo triggers a forceful reboot) - just comment out that single echo line and you are good. Storage might be inaccessible (heartbeat fail) but nothing will happen (log messages will say "I'm rebooting" but the script won't do anything, due to the line being commented out).
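For context, the tail of kvmheartbeat.sh looks roughly like the sketch below (paraphrased, not the exact source; check the copy installed on your hosts, typically under /usr/share/cloudstack-common/scripts/vm/hypervisor/kvm/, before editing anything):

```sh
# Final lines of kvmheartbeat.sh (approximate): log the failure, flush
# buffers, then force an immediate reboot via the kernel sysrq trigger.
/usr/bin/logger -t heartbeat "kvmheartbeat.sh will reboot system because it was unable to write the heartbeat to the storage."
sync
sleep 5
echo b > /proc/sysrq-trigger   # commenting out this line disables the forced reboot
```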
-
Hi all,
-
@andrijapanicsb you can also disable this behavior with the agent property reboot.host.and.alert.management.on.heartbeat.timeout.
PR #4586 introduced this property with a default of true. A redesign of the feature might be needed; however, we need to discuss it first.
-
CloudStack 4.17.0.1. Recently I also started getting this issue:
This has happened a second time this month - after 2.5 years. My issue is that after the reboot all the VMs were down. In my case I don't see any issue with the NFS server, and it is accessible. Should we treat this as a temporary NFS reachability problem, or is there anything I should be checking related to the NFS host?
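To help rule out a transient NFS outage on the host side, a quick check like the sketch below may be useful (the mount point is illustrative; locate the real pool mount first):

```sh
# Find the NFS mounts that back primary storage.
mount | grep nfs

# Verify the pool mount is still writable - essentially what the
# heartbeat does. Replace <pool-uuid> with the actual mount directory.
touch /mnt/<pool-uuid>/hb-test && rm /mnt/<pool-uuid>/hb-test
```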
-
ISSUE TYPE
COMPONENT NAME
CLOUDSTACK VERSION
CONFIGURATION
OS / ENVIRONMENT
SUMMARY
Even when HA has been disabled at the cluster and/or host level, KVMHAMonitor gets initialized on the KVM hosts.
It looks like the code does not check whether HA is enabled or not:
https://github.com/apache/cloudstack/blob/8f6721ed4c4e1b31081a951c62ffbe5331cf16d4/plugins/hypervisors/kvm/src/main/java/com/cloud/hypervisor/kvm/resource/LibvirtComputingResource.java
STEPS TO REPRODUCE
EXPECTED RESULTS
When HA is disabled, I would expect the management server to add the following parameter to the agent.properties file of each host:
reboot.host.and.alert.management.on.heartbeat.timeout=false
or
KVMHAMonitor is not initialized at all.
ACTUAL RESULTS
Nothing happens when disabling HA at the cluster and/or host level. KVMHAMonitor gets initialized and will perform checks on the host. In some situations this leads to an automatic reboot of the host because KVMHAMonitor is not able to write the heartbeat for a pool to primary storage: