
[SNI] Upgrade of K8S nodes hosting haproxy and scylla-operator causes too long connectivity failures #1341

Closed
vponomaryov opened this issue Aug 19, 2023 · 7 comments
Labels
kind/bug: Categorizes issue or PR as related to a bug.
lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.
priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

Comments

@vponomaryov
Contributor

vponomaryov commented Aug 19, 2023

Issue description

  • This issue is a regression.
  • It is unknown if this issue is a regression.

We have a test for upgrading the K8S platform with the following steps:

  1. Provision operator
  2. Provision a 3-pod Scylla cluster
  3. Populate that Scylla cluster with data
  4. Start load with 4 different loaders
  5. Upgrade K8S control plane
  6. Upgrade the auxiliary node pool where, among other workloads, we run the haproxy and scylla-operator pods; each of these services has 2 pods, provisioned on different nodes
  7. Create additional node pool for Scylla pods
  8. Move Scylla pods 1 by 1 to new nodes
  9. Trigger creation of one more Scylla member
  10. Check data

So, during step 6, when we upgrade the auxiliary node pool that hosts the haproxy pods, our loaders lose connectivity for a long time, long enough to fail the load.
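For context, a minimal sketch of what step 6 amounts to on EKS, assuming eksctl-managed node groups (cluster and node group names below are placeholders; the real test drives this through SCT): upgrading the auxiliary pool rolls its nodes, which drains and reschedules the haproxy and scylla-operator pods.

# Placeholder names, for illustration only.
eksctl upgrade nodegroup --cluster my-eks-cluster --name auxiliary-pool --kubernetes-version 1.27
# Roughly equivalent manual rotation of a single node from that pool:
kubectl cordon ip-10-0-1-23.eu-north-1.compute.internal
kubectl drain ip-10-0-1-23.eu-north-1.compute.internal --ignore-daemonsets --delete-emptydir-data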

Notes

  • The problem is absent when using scylla-operator v1.9.0 with everything else the same. Proof: Argus, CI
  • Without SNI/haproxy, the upgrade scenario described above passes OK. Proof: Argus, CI.

Impact

Loss of network connectivity to the Scylla pods via SNI/haproxy for a significant amount of time.

How frequently does it reproduce?

100% using scylla-operator 1.10.0-rc.0

Installation details

Kernel Version: 5.10.186-179.751.amzn2.x86_64
Scylla version (or git commit hash): 2023.1.0~rc8-20230731.b6f7c5a6910c with build-id f6e718548e76ccf3564ed2387b6582ba8d37793c

Operator Image: scylladb/scylla-operator:1.10.0-rc.0
Operator Helm Version: 1.10.0-rc.0
Operator Helm Repository: https://storage.googleapis.com/scylla-operator-charts/latest
Cluster size: 3 pods (i4i.4xlarge)

OS / Image: `` (k8s-eks: eu-north-1)

Test: upgrade-platform-k8s-eks
Test id: 379e7cd3-3b74-4f39-bb7f-b561a8251126
Test name: scylla-operator/operator-1.10/upgrade/upgrade-platform-k8s-eks
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor 379e7cd3-3b74-4f39-bb7f-b561a8251126
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 379e7cd3-3b74-4f39-bb7f-b561a8251126

Logs:

Jenkins job URL
Argus

@vponomaryov vponomaryov added the kind/bug Categorizes issue or PR as related to a bug. label Aug 19, 2023
@vponomaryov vponomaryov changed the title Upgrade of K8S nodes hosting haproxy and scylla-operator causes too long connectivity failures [SNI] Upgrade of K8S nodes hosting haproxy and scylla-operator causes too long connectivity failures Aug 19, 2023
@tnozicka tnozicka added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Aug 21, 2023
@vponomaryov
Contributor Author

Hit the same overly long absence of SNI connectivity in one more CI job, during the drain_kubernetes_node_then_replace_scylla_node scenario in the 2-tenant setup.

Issue description

Applied nemesis:
Screenshot from 2023-08-21 16-56-36

haproxy pod-1:

2023/08/17 14:44:53 TRACE   service/endpoints.go:110 backend scylla_sct-cluster-us-east1-b-us-east1-1_cql-ssl: number of slots 1
2023/08/17 14:44:53 TRACE   controller.go:171 HAProxy config sync ended
[WARNING]  (433) : soft-stop running for too long, performing a hard-stop.
[WARNING]  (433) : Proxy ssl hard-stopped (31 remaining conns will be closed).
[WARNING]  (433) : Some tasks resisted to hard-stop, exiting now.
[WARNING]  (267) : Former worker (433) exited with code 0 (Exit)
2023/08/17 14:49:58 TRACE   controller.go:94 HAProxy config sync started

haproxy pod-2:

2023/08/17 14:44:53 TRACE   controller.go:171 HAProxy config sync ended
[WARNING]  (431) : soft-stop running for too long, performing a hard-stop.
[WARNING]  (431) : Proxy ssl hard-stopped (27 remaining conns will be closed).
[WARNING]  (431) : Some tasks resisted to hard-stop, exiting now.
[NOTICE]   (267) : haproxy version is 2.6.6-274d1a4
[WARNING]  (267) : Former worker (431) exited with code 0 (Exit)
2023/08/17 14:49:58 TRACE   controller.go:94 HAProxy config sync started

SCT.log:

2023-08-17 14:48:33,581 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:CRITICAL > java.io.IOException: Operation x10 on key(s) [50393330353137333530]: Error executing: (NoHostAvailableException): All host(s) tried for query failed (no host was tried)
2023-08-17 14:48:33,589 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 2023-08-17 14:48:33.588: (InfoEvent Severity.NORMAL) period_type=not-set event_id=451c00f7-8533-42d3-90f5-ce69fb8cbba2: message=TEST_END
2023-08-17 14:48:33,591 f:tester.py       l:2830 c:LongevityOperatorMultiTenantTest p:INFO  > TearDown is starting...
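The notable part in both haproxy pods is the roughly five-minute gap between "HAProxy config sync ended" (14:44:53) and the next "HAProxy config sync started" (14:49:58); the NoHostAvailableException at 14:48:33 falls right inside it. A rough way to eyeball that gap and to confirm which controller build is running; the namespace and deployment name below are assumptions, adjust them to the actual haproxy release used by the test:

# Hypothetical namespace/deployment names; substitute the ones used by the SCT haproxy deployment.
kubectl -n haproxy-controller logs deploy/haproxy-kubernetes-ingress | grep "HAProxy config sync"
kubectl -n haproxy-controller get deploy haproxy-kubernetes-ingress \
  -o jsonpath='{.spec.template.spec.containers[0].image}'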

Installation details

Kernel Version: 5.10.184-175.749.amzn2.x86_64
Scylla version (or git commit hash): 2023.1.0~rc8-20230731.b6f7c5a6910c with build-id f6e718548e76ccf3564ed2387b6582ba8d37793c

Operator Image: scylladb/scylla-operator:1.10.0-rc.0
Operator Helm Version: 1.10.0-rc.0
Operator Helm Repository: https://storage.googleapis.com/scylla-operator-charts/latest
Cluster size: 4 nodes (i4i.4xlarge)

Scylla Nodes used in this run:
No resources left at the end of the run

OS / Image: `` (k8s-eks: undefined_region)

Test: longevity-scylla-operator-3h-multitenant-eks
Test id: b6c6963a-a019-44dd-b8cc-97aea5bdc31f
Test name: scylla-operator/operator-1.10/eks/longevity-scylla-operator-3h-multitenant-eks
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor b6c6963a-a019-44dd-b8cc-97aea5bdc31f
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs b6c6963a-a019-44dd-b8cc-97aea5bdc31f

Logs:

Jenkins job URL
Argus

@vponomaryov
Contributor Author

Hit the same overly long absence of SNI connectivity in yet another CI job, during the disrupt_grow_shrink_cluster scenario.

Issue description

Applied nemesis:
Screenshot from 2023-08-21 17-11-56

haproxy pod-1:

2023/08/17 19:10:03 TRACE   service/endpoints.go:110 backend scylla_sct-cluster-us-east1-b-us-east1-3_cql-ssl: number of slots 1
2023/08/17 19:10:03 TRACE   controller.go:171 HAProxy config sync ended
2023/08/17 19:15:37 TRACE   store/events.go:98 Treating endpoints event {SliceName: Namespace:scylla Service:sct-cluster-client Ports:map[agent-api:0xc000ae82b0 agent-prometheus:0xc000ae8280 cql:0xc000ae8220 cql-shard-aware:0xc000ae8290 cql-ssl:0xc000ae8270 cql-ssl-shard-aware:0xc000ae8230 inter-node-communication:0xc000ae8250 jmx-monitoring:0xc000ae82a0 node-exporter:0xc000ae82d0 prometheus:0xc000ae8240 ssl-inter-node-communication:0xc000ae8260 thrift:0xc000ae82c0] Status:MODIFIED}
2023/08/17 19:15:37 TRACE   store/events.go:102 service sct-cluster-client : endpoints list map[agent-api:{Port:10001 Addresses:map[10.0.4.71:{} 10.0.5.47:{} 10.0.6.129:{} 10.0.6.153:{}]} agent-prometheus:{Port:5090 Addresses:map[10.0.4.71:{} 10.0.5.47:{} 10.0.6.129:{} 10.0.6.153:{}]} cql:{Port:9042 Addresses:map[10.0.4.71:{} 10.0.5.47:{} 10.0.6.129:{} 10.0.6.153:{}]} cql-shard-aware:{Port:19042 Addresses:map[10.0.4.71:{} 10.0.5.47:{} 10.0.6.129:{} 10.0.6.153:{}]} cql-ssl:{Port:9142 Addresses:map[10.0.4.71:{} 10.0.5.47:{} 10.0.6.129:{} 10.0.6.153:{}]} cql-ssl-shard-aware:{Port:19142 Addresses:map[10.0.4.71:{} 10.0.5.47:{} 10.0.6.129:{} 10.0.6.153:{}]} inter-node-communication:{Port:7000 Addresses:map[10.0.4.71:{} 10.0.5.47:{} 10.0.6.129:{} 10.0.6.153:{}]} jmx-monitoring:{Port:7199 Addresses:map[10.0.4.71:{} 10.0.5.47:{} 10.0.6.129:{} 10.0.6.153:{}]} node-exporter:{Port:9100 Addresses:map[10.0.4.71:{} 10.0.5.47:{} 10.0.6.129:{} 10.0.6.153:{}]} prometheus:{Port:9180 Addresses:map[10.0.4.71:{} 10.0.5.47:{} 10.0.6.129:{} 10.0.6.153:{}]} ssl-inter-node-communication:{Port:7001 Addresses:map[10.0.4.71:{} 10.0.5.47:{} 10.0.6.129:{} 10.0.6.153:{}]} thrift:{Port:9160 Addresses:map[10.0.4.71:{} 10.0.5.47:{} 10.0.6.129:{} 10.0.6.153:{}]}]
2023/08/17 19:15:37 TRACE   store/events.go:107 service sct-cluster-client : number of already existing backend(s) in this transaction for this endpoint: 12

haproxy pod-2:

2023/08/17 19:10:03 TRACE   service/endpoints.go:110 backend scylla_sct-cluster-us-east1-b-us-east1-3_cql-ssl: number of slots 1
2023/08/17 19:10:03 TRACE   controller.go:171 HAProxy config sync ended
2023/08/17 19:15:37 TRACE   store/events.go:98 Treating endpoints event {SliceName:sct-cluster-client-lt8jf Namespace:scylla Service:sct-cluster-client Ports:map[agent-api:0xc000683000 agent-prometheus:0xc000683010 cql:0xc000683020 cql-shard-aware:0xc000683080 cql-ssl:0xc0006830a0 cql-ssl-shard-aware:0xc000683060 inter-node-communication:0xc000683090 jmx-monitoring:0xc000683030 node-exporter:0xc000683050 prometheus:0xc0006830b0 ssl-inter-node-communication:0xc000683040 thrift:0xc000683070] Status:MODIFIED}
2023/08/17 19:15:37 TRACE   store/events.go:102 service sct-cluster-client : endpoints list map[agent-api:{Port:10001 Addresses:map[10.0.4.71:{} 10.0.5.47:{} 10.0.6.129:{} 10.0.6.153:{}]} agent-prometheus:{Port:5090 Addresses:map[10.0.4.71:{} 10.0.5.47:{} 10.0.6.129:{} 10.0.6.153:{}]} cql:{Port:9042 Addresses:map[10.0.4.71:{} 10.0.5.47:{} 10.0.6.129:{} 10.0.6.153:{}]} cql-shard-aware:{Port:19042 Addresses:map[10.0.4.71:{} 10.0.5.47:{} 10.0.6.129:{} 10.0.6.153:{}]} cql-ssl:{Port:9142 Addresses:map[10.0.4.71:{} 10.0.5.47:{} 10.0.6.129:{} 10.0.6.153:{}]} cql-ssl-shard-aware:{Port:19142 Addresses:map[10.0.4.71:{} 10.0.5.47:{} 10.0.6.129:{} 10.0.6.153:{}]} inter-node-communication:{Port:7000 Addresses:map[10.0.4.71:{} 10.0.5.47:{} 10.0.6.129:{} 10.0.6.153:{}]} jmx-monitoring:{Port:7199 Addresses:map[10.0.4.71:{} 10.0.5.47:{} 10.0.6.129:{} 10.0.6.153:{}]} node-exporter:{Port:9100 Addresses:map[10.0.4.71:{} 10.0.5.47:{} 10.0.6.129:{} 10.0.6.153:{}]} prometheus:{Port:9180 Addresses:map[10.0.4.71:{} 10.0.5.47:{} 10.0.6.129:{} 10.0.6.153:{}]} ssl-inter-node-communication:{Port:7001 Addresses:map[10.0.4.71:{} 10.0.5.47:{} 10.0.6.129:{} 10.0.6.153:{}]} thrift:{Port:9160 Addresses:map[10.0.4.71:{} 10.0.5.47:{} 10.0.6.129:{} 10.0.6.153:{}]}]
2023/08/17 19:15:37 TRACE   store/events.go:107 service sct-cluster-client : number of already existing backend(s) in this transaction for this endpoint: 12

SCT.log:

2023-08-17 19:16:33,601 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:CRITICAL > java.io.IOException: Operation x10 on key(s) [38504c4c333950343131]: Error executing: (NoHostAvailableException): All host(s) tried for query failed (no host was tried)
2023-08-17 19:16:33,606 f:tester.py       l:2830 c:LongevityTest        p:INFO  > TearDown is starting...
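Here again both haproxy pods go more than five minutes (19:10:03 to 19:15:37) without processing endpoint changes, and the load fails with NoHostAvailableException right around the time the MODIFIED endpoints event is finally treated. A simple way to inspect the EndpointSlices that feed those events; the namespace and service name are taken from the controller logs above:

# Namespace and service name come from the controller logs; adjust if the cluster is named differently.
kubectl -n scylla get endpointslices -l kubernetes.io/service-name=sct-cluster-client -o wide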

Installation details

Kernel Version: 5.10.184-175.749.amzn2.x86_64
Scylla version (or git commit hash): 2023.1.0-20230813.68e9cef1baf7 with build-id c7f9855620b984af24957d7ab0bd8054306d182e

Operator Image: scylladb/scylla-operator:1.10.0-rc.0
Operator Helm Version: 1.10.0-rc.0
Operator Helm Repository: https://storage.googleapis.com/scylla-operator-charts/latest
Cluster size: 4 nodes (i4i.4xlarge)

Scylla Nodes used in this run:
No resources left at the end of the run

OS / Image: `` (k8s-eks: undefined_region)

Test: longevity-scylla-operator-3h-eks-grow-shrink
Test id: 9dec2665-fd0e-4cba-8bcc-60d7b23ecd00
Test name: scylla-operator/operator-1.10/eks/longevity-scylla-operator-3h-eks-grow-shrink
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor 9dec2665-fd0e-4cba-8bcc-60d7b23ecd00
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 9dec2665-fd0e-4cba-8bcc-60d7b23ecd00

Logs:

Jenkins job URL
Argus

vponomaryov added a commit to vponomaryov/scylla-cluster-tests that referenced this issue Aug 24, 2023
With the scylla-operator-1.10 we started getting HA problems [1]
with the 'haproxy' service.
So, fix it by using the latest available ingress controller version for
haproxy.

[1] scylladb/scylla-operator#1341
vponomaryov added a commit to vponomaryov/scylla-cluster-tests that referenced this issue Aug 24, 2023
With the scylla-operator-1.10 we started getting HA problems [1]
with the 'haproxy' service.
So, fix it by using the latest available ingress controller version for
haproxy.

[1] scylladb/scylla-operator#1341
fruch pushed a commit to scylladb/scylla-cluster-tests that referenced this issue Aug 25, 2023
With the scylla-operator-1.10 we started getting HA problems [1]
with the 'haproxy' service.
So, fix it by using the latest available ingress controller version for
haproxy.

[1] scylladb/scylla-operator#1341
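The SCT-side workaround referenced in these commits was to bump the haproxy ingress controller to a newer release. As a rough sketch of such a bump, assuming the controller comes from the haproxytech Helm chart (the release name and namespace below are placeholders):

helm repo add haproxytech https://haproxytech.github.io/helm-charts
helm repo update
# Placeholder release name and namespace; pin --version to the desired newer chart release.
helm upgrade --install haproxy-ingress haproxytech/kubernetes-ingress --namespace haproxy-controller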
@zimnx
Collaborator

zimnx commented Sep 6, 2023

Reported an issue in haproxy ingress controller: haproxytech/kubernetes-ingress#564

scylla-operator-bot bot commented

The Scylla Operator project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 30d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out

/lifecycle stale

@scylla-operator-bot scylla-operator-bot bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 27, 2024
scylla-operator-bot bot commented

The Scylla Operator project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 30d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out

/lifecycle rotten

@scylla-operator-bot scylla-operator-bot bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 28, 2024
scylla-operator-bot bot commented

The Scylla Operator project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 30d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out

/close not-planned

@scylla-operator-bot scylla-operator-bot bot closed this as not planned (Won't fix, can't repro, duplicate, stale) Aug 27, 2024
scylla-operator-bot bot commented

@scylla-operator-bot[bot]: Closing this issue, marking it as "Not Planned".

In response to this:

The Scylla Operator project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 30d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
