
[SNI] Upgrade of K8S nodes hosting haproxy and scylla-operator causes too long connectivity failures #1341

Closed
vponomaryov opened this issue Aug 19, 2023 · 7 comments
Labels
kind/bug: Categorizes issue or PR as related to a bug.
lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.
priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

Comments

@vponomaryov
Contributor

vponomaryov commented Aug 19, 2023

Issue description

  • This issue is a regression.
  • It is unknown if this issue is a regression.

We have a test for upgrading the K8S platform with the following steps:

  1. Provision operator
  2. Provision a 3-pod Scylla cluster
  3. Populate that Scylla cluster with data
  4. Start load with 4 different loaders
  5. Upgrade K8S control plane
  6. Upgrade the auxiliary node pool where, among other workloads, we run the haproxy and scylla-operator pods; each of these services has 2 pods, provisioned on different nodes
  7. Create additional node pool for Scylla pods
  8. Move Scylla pods 1 by 1 to new nodes
  9. Trigger creation of one more Scylla member
  10. Check data

So, during step 6, when we upgrade the auxiliary node pool that hosts the haproxy pods, our loaders lose connectivity for a long time, long enough to fail the load.
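For context, a minimal sketch of what step 6 amounts to on EKS, assuming eksctl-managed node groups (cluster and node group names below are placeholders; the real test drives this through SCT): upgrading the auxiliary pool rolls its nodes, which drains and reschedules the haproxy and scylla-operator pods.

# Placeholder names, for illustration only.
eksctl upgrade nodegroup --cluster my-eks-cluster --name auxiliary-pool --kubernetes-version 1.27
# Roughly equivalent manual rotation of a single node from that pool:
kubectl cordon ip-10-0-1-23.eu-north-1.compute.internal
kubectl drain ip-10-0-1-23.eu-north-1.compute.internal --ignore-daemonsets --delete-emptydir-data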

Notes

  • The problem is absent when using scylla-operator v1.9.0 with everything else the same. Proof: Argus, CI
  • Without SNI/haproxy, the upgrade scenario described above passes OK. Proof: Argus, CI.

Impact

Loss of network connectivity to the Scylla pods via SNI/haproxy for a significant amount of time.

How frequently does it reproduce?

100% using scylla-operator 1.10.0-rc.0

Installation details

Kernel Version: 5.10.186-179.751.amzn2.x86_64
Scylla version (or git commit hash): 2023.1.0~rc8-20230731.b6f7c5a6910c with build-id f6e718548e76ccf3564ed2387b6582ba8d37793c

Operator Image: scylladb/scylla-operator:1.10.0-rc.0
Operator Helm Version: 1.10.0-rc.0
Operator Helm Repository: https://storage.googleapis.com/scylla-operator-charts/latest
Cluster size: 3 pods (i4i.4xlarge)

OS / Image: `` (k8s-eks: eu-north-1)

Test: upgrade-platform-k8s-eks
Test id: 379e7cd3-3b74-4f39-bb7f-b561a8251126
Test name: scylla-operator/operator-1.10/upgrade/upgrade-platform-k8s-eks
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor 379e7cd3-3b74-4f39-bb7f-b561a8251126
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 379e7cd3-3b74-4f39-bb7f-b561a8251126

Logs:

Jenkins job URL
Argus

@vponomaryov vponomaryov added the kind/bug Categorizes issue or PR as related to a bug. label Aug 19, 2023
@vponomaryov vponomaryov changed the title Upgrade of K8S nodes hosting haproxy and scylla-operator causes too long connectivity failures [SNI] Upgrade of K8S nodes hosting haproxy and scylla-operator causes too long connectivity failures Aug 19, 2023
@tnozicka tnozicka added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Aug 21, 2023
@vponomaryov
Contributor Author

Hit the same overly long absence of SNI connectivity in one more CI job, during the drain_kubernetes_node_then_replace_scylla_node scenario in the 2-tenant setup.

Issue description

Applied nemesis:
Screenshot from 2023-08-21 16-56-36

haproxy pod-1:

2023/08/17 14:44:53 TRACE   service/endpoints.go:110 backend scylla_sct-cluster-us-east1-b-us-east1-1_cql-ssl: number of slots 1
2023/08/17 14:44:53 TRACE   controller.go:171 HAProxy config sync ended
[WARNING]  (433) : soft-stop running for too long, performing a hard-stop.
[WARNING]  (433) : Proxy ssl hard-stopped (31 remaining conns will be closed).
[WARNING]  (433) : Some tasks resisted to hard-stop, exiting now.
[WARNING]  (267) : Former worker (433) exited with code 0 (Exit)
2023/08/17 14:49:58 TRACE   controller.go:94 HAProxy config sync started

haproxy pod-2:

2023/08/17 14:44:53 TRACE   controller.go:171 HAProxy config sync ended
[WARNING]  (431) : soft-stop running for too long, performing a hard-stop.
[WARNING]  (431) : Proxy ssl hard-stopped (27 remaining conns will be closed).
[WARNING]  (431) : Some tasks resisted to hard-stop, exiting now.
[NOTICE]   (267) : haproxy version is 2.6.6-274d1a4
[WARNING]  (267) : Former worker (431) exited with code 0 (Exit)
2023/08/17 14:49:58 TRACE   controller.go:94 HAProxy config sync started

SCT.log:

2023-08-17 14:48:33,581 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:CRITICAL > java.io.IOException: Operation x10 on key(s) [50393330353137333530]: Error executing: (NoHostAvailableException): All host(s) tried for query failed (no host was tried)
2023-08-17 14:48:33,589 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 2023-08-17 14:48:33.588: (InfoEvent Severity.NORMAL) period_type=not-set event_id=451c00f7-8533-42d3-90f5-ce69fb8cbba2: message=TEST_END
2023-08-17 14:48:33,591 f:tester.py       l:2830 c:LongevityOperatorMultiTenantTest p:INFO  > TearDown is starting...
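The notable part in both haproxy pods is the roughly five-minute gap between "HAProxy config sync ended" (14:44:53) and the next "HAProxy config sync started" (14:49:58); the NoHostAvailableException at 14:48:33 falls right inside it. A rough way to eyeball that gap and to confirm which controller build is running; the namespace and deployment name below are assumptions, adjust them to the actual haproxy release used by the test:

# Hypothetical namespace/deployment names; substitute the ones used by the SCT haproxy deployment.
kubectl -n haproxy-controller logs deploy/haproxy-kubernetes-ingress | grep "HAProxy config sync"
kubectl -n haproxy-controller get deploy haproxy-kubernetes-ingress \
  -o jsonpath='{.spec.template.spec.containers[0].image}'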

Installation details

Kernel Version: 5.10.184-175.749.amzn2.x86_64
Scylla version (or git commit hash): 2023.1.0~rc8-20230731.b6f7c5a6910c with build-id f6e718548e76ccf3564ed2387b6582ba8d37793c

Operator Image: scylladb/scylla-operator:1.10.0-rc.0
Operator Helm Version: 1.10.0-rc.0
Operator Helm Repository: https://storage.googleapis.com/scylla-operator-charts/latest
Cluster size: 4 nodes (i4i.4xlarge)

Scylla Nodes used in this run:
No resources left at the end of the run

OS / Image: `` (k8s-eks: undefined_region)

Test: longevity-scylla-operator-3h-multitenant-eks
Test id: b6c6963a-a019-44dd-b8cc-97aea5bdc31f
Test name: scylla-operator/operator-1.10/eks/longevity-scylla-operator-3h-multitenant-eks
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor b6c6963a-a019-44dd-b8cc-97aea5bdc31f
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs b6c6963a-a019-44dd-b8cc-97aea5bdc31f

Logs:

Jenkins job URL
Argus

@vponomaryov
Contributor Author

Hit the same overly long absence of SNI connectivity in yet another CI job, during the disrupt_grow_shrink_cluster scenario.

Issue description

Applied nemesis:
Screenshot from 2023-08-21 17-11-56

haproxy pod-1:

2023/08/17 19:10:03 TRACE   service/endpoints.go:110 backend scylla_sct-cluster-us-east1-b-us-east1-3_cql-ssl: number of slots 1
2023/08/17 19:10:03 TRACE   controller.go:171 HAProxy config sync ended
2023/08/17 19:15:37 TRACE   store/events.go:98 Treating endpoints event {SliceName: Namespace:scylla Service:sct-cluster-client Ports:map[agent-api:0xc000ae82b0 agent-prometheus:0xc000ae8280 cql:0xc000ae8220 cql-shard-aware:0xc000ae8290 cql-ssl:0xc000ae8270 cql-ssl-shard-aware:0xc000ae8230 inter-node-communication:0xc000ae8250 jmx-monitoring:0xc000ae82a0 node-exporter:0xc000ae82d0 prometheus:0xc000ae8240 ssl-inter-node-communication:0xc000ae8260 thrift:0xc000ae82c0] Status:MODIFIED}
2023/08/17 19:15:37 TRACE   store/events.go:102 service sct-cluster-client : endpoints list map[agent-api:{Port:10001 Addresses:map[10.0.4.71:{} 10.0.5.47:{} 10.0.6.129:{} 10.0.6.153:{}]} agent-prometheus:{Port:5090 Addresses:map[10.0.4.71:{} 10.0.5.47:{} 10.0.6.129:{} 10.0.6.153:{}]} cql:{Port:9042 Addresses:map[10.0.4.71:{} 10.0.5.47:{} 10.0.6.129:{} 10.0.6.153:{}]} cql-shard-aware:{Port:19042 Addresses:map[10.0.4.71:{} 10.0.5.47:{} 10.0.6.129:{} 10.0.6.153:{}]} cql-ssl:{Port:9142 Addresses:map[10.0.4.71:{} 10.0.5.47:{} 10.0.6.129:{} 10.0.6.153:{}]} cql-ssl-shard-aware:{Port:19142 Addresses:map[10.0.4.71:{} 10.0.5.47:{} 10.0.6.129:{} 10.0.6.153:{}]} inter-node-communication:{Port:7000 Addresses:map[10.0.4.71:{} 10.0.5.47:{} 10.0.6.129:{} 10.0.6.153:{}]} jmx-monitoring:{Port:7199 Addresses:map[10.0.4.71:{} 10.0.5.47:{} 10.0.6.129:{} 10.0.6.153:{}]} node-exporter:{Port:9100 Addresses:map[10.0.4.71:{} 10.0.5.47:{} 10.0.6.129:{} 10.0.6.153:{}]} prometheus:{Port:9180 Addresses:map[10.0.4.71:{} 10.0.5.47:{} 10.0.6.129:{} 10.0.6.153:{}]} ssl-inter-node-communication:{Port:7001 Addresses:map[10.0.4.71:{} 10.0.5.47:{} 10.0.6.129:{} 10.0.6.153:{}]} thrift:{Port:9160 Addresses:map[10.0.4.71:{} 10.0.5.47:{} 10.0.6.129:{} 10.0.6.153:{}]}]
2023/08/17 19:15:37 TRACE   store/events.go:107 service sct-cluster-client : number of already existing backend(s) in this transaction for this endpoint: 12

haproxy pod-2:

2023/08/17 19:10:03 TRACE   service/endpoints.go:110 backend scylla_sct-cluster-us-east1-b-us-east1-3_cql-ssl: number of slots 1
2023/08/17 19:10:03 TRACE   controller.go:171 HAProxy config sync ended
2023/08/17 19:15:37 TRACE   store/events.go:98 Treating endpoints event {SliceName:sct-cluster-client-lt8jf Namespace:scylla Service:sct-cluster-client Ports:map[agent-api:0xc000683000 agent-prometheus:0xc000683010 cql:0xc000683020 cql-shard-aware:0xc000683080 cql-ssl:0xc0006830a0 cql-ssl-shard-aware:0xc000683060 inter-node-communication:0xc000683090 jmx-monitoring:0xc000683030 node-exporter:0xc000683050 prometheus:0xc0006830b0 ssl-inter-node-communication:0xc000683040 thrift:0xc000683070] Status:MODIFIED}
2023/08/17 19:15:37 TRACE   store/events.go:102 service sct-cluster-client : endpoints list map[agent-api:{Port:10001 Addresses:map[10.0.4.71:{} 10.0.5.47:{} 10.0.6.129:{} 10.0.6.153:{}]} agent-prometheus:{Port:5090 Addresses:map[10.0.4.71:{} 10.0.5.47:{} 10.0.6.129:{} 10.0.6.153:{}]} cql:{Port:9042 Addresses:map[10.0.4.71:{} 10.0.5.47:{} 10.0.6.129:{} 10.0.6.153:{}]} cql-shard-aware:{Port:19042 Addresses:map[10.0.4.71:{} 10.0.5.47:{} 10.0.6.129:{} 10.0.6.153:{}]} cql-ssl:{Port:9142 Addresses:map[10.0.4.71:{} 10.0.5.47:{} 10.0.6.129:{} 10.0.6.153:{}]} cql-ssl-shard-aware:{Port:19142 Addresses:map[10.0.4.71:{} 10.0.5.47:{} 10.0.6.129:{} 10.0.6.153:{}]} inter-node-communication:{Port:7000 Addresses:map[10.0.4.71:{} 10.0.5.47:{} 10.0.6.129:{} 10.0.6.153:{}]} jmx-monitoring:{Port:7199 Addresses:map[10.0.4.71:{} 10.0.5.47:{} 10.0.6.129:{} 10.0.6.153:{}]} node-exporter:{Port:9100 Addresses:map[10.0.4.71:{} 10.0.5.47:{} 10.0.6.129:{} 10.0.6.153:{}]} prometheus:{Port:9180 Addresses:map[10.0.4.71:{} 10.0.5.47:{} 10.0.6.129:{} 10.0.6.153:{}]} ssl-inter-node-communication:{Port:7001 Addresses:map[10.0.4.71:{} 10.0.5.47:{} 10.0.6.129:{} 10.0.6.153:{}]} thrift:{Port:9160 Addresses:map[10.0.4.71:{} 10.0.5.47:{} 10.0.6.129:{} 10.0.6.153:{}]}]
2023/08/17 19:15:37 TRACE   store/events.go:107 service sct-cluster-client : number of already existing backend(s) in this transaction for this endpoint: 12

SCT.log:

2023-08-17 19:16:33,601 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:CRITICAL > java.io.IOException: Operation x10 on key(s) [38504c4c333950343131]: Error executing: (NoHostAvailableException): All host(s) tried for query failed (no host was tried)
2023-08-17 19:16:33,606 f:tester.py       l:2830 c:LongevityTest        p:INFO  > TearDown is starting...
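Here again both haproxy pods go more than five minutes (19:10:03 to 19:15:37) without processing endpoint changes, and the load fails with NoHostAvailableException right around the time the MODIFIED endpoints event is finally treated. A simple way to inspect the EndpointSlices that feed those events; the namespace and service name are taken from the controller logs above:

# Namespace and service name come from the controller logs; adjust if the cluster is named differently.
kubectl -n scylla get endpointslices -l kubernetes.io/service-name=sct-cluster-client -o wide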

Installation details

Kernel Version: 5.10.184-175.749.amzn2.x86_64
Scylla version (or git commit hash): 2023.1.0-20230813.68e9cef1baf7 with build-id c7f9855620b984af24957d7ab0bd8054306d182e

Operator Image: scylladb/scylla-operator:1.10.0-rc.0
Operator Helm Version: 1.10.0-rc.0
Operator Helm Repository: https://storage.googleapis.com/scylla-operator-charts/latest
Cluster size: 4 nodes (i4i.4xlarge)

Scylla Nodes used in this run:
No resources left at the end of the run

OS / Image: `` (k8s-eks: undefined_region)

Test: longevity-scylla-operator-3h-eks-grow-shrink
Test id: 9dec2665-fd0e-4cba-8bcc-60d7b23ecd00
Test name: scylla-operator/operator-1.10/eks/longevity-scylla-operator-3h-eks-grow-shrink
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor 9dec2665-fd0e-4cba-8bcc-60d7b23ecd00
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 9dec2665-fd0e-4cba-8bcc-60d7b23ecd00

Logs:

Jenkins job URL
Argus

vponomaryov added a commit to vponomaryov/scylla-cluster-tests that referenced this issue Aug 24, 2023
With the scylla-operator-1.10 we started getting HA problems [1]
with the 'haproxy' service.
So, fix it by using the latest available ingress controller version for
haproxy.

[1] scylladb/scylla-operator#1341
vponomaryov added a commit to vponomaryov/scylla-cluster-tests that referenced this issue Aug 24, 2023
With the scylla-operator-1.10 we started getting HA problems [1]
with the 'haproxy' service.
So, fix it by using the latest available ingress controller version for
haproxy.

[1] scylladb/scylla-operator#1341
fruch pushed a commit to scylladb/scylla-cluster-tests that referenced this issue Aug 25, 2023
With the scylla-operator-1.10 we started getting HA problems [1]
with the 'haproxy' service.
So, fix it by using the latest available ingress controller version for
haproxy.

[1] scylladb/scylla-operator#1341
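The SCT-side workaround referenced in these commits was to bump the haproxy ingress controller to a newer release. As a rough sketch of such a bump, assuming the controller comes from the haproxytech Helm chart (the release name and namespace below are placeholders):

helm repo add haproxytech https://haproxytech.github.io/helm-charts
helm repo update
# Placeholder release name and namespace; pin --version to the desired newer chart release.
helm upgrade --install haproxy-ingress haproxytech/kubernetes-ingress --namespace haproxy-controller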
@zimnx
Collaborator

zimnx commented Sep 6, 2023

Reported an issue in haproxy ingress controller: haproxytech/kubernetes-ingress#564

scylla-operator-bot bot commented

The Scylla Operator project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 30d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out

/lifecycle stale

@scylla-operator-bot scylla-operator-bot bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 27, 2024
scylla-operator-bot bot commented

The Scylla Operator project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 30d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out

/lifecycle rotten

@scylla-operator-bot scylla-operator-bot bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 28, 2024
scylla-operator-bot bot commented

The Scylla Operator project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 30d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out

/close not-planned

@scylla-operator-bot scylla-operator-bot bot closed this as not planned (Won't fix, can't repro, duplicate, stale) Aug 27, 2024
scylla-operator-bot bot commented

@scylla-operator-bot[bot]: Closing this issue, marking it as "Not Planned".

In response to this:

The Scylla Operator project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 30d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
