gocql panics with panic: scylla: <ip>:9042 invalid number of shards when restarting node with higher resources assigned #145
Comments
The comparison that triggers the panic is at lines 399 to 401 in 61be561.
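For context, here is a minimal, self-contained sketch of the kind of check involved; all identifiers are hypothetical and do not mirror gocql's actual code. A shard-aware connection picker caches the shard count learned from the first connection to a host and hard-asserts that every later connection agrees:

```go
package main

import "fmt"

// Hypothetical sketch: a conn carries the shard it serves and the total
// shard count the node reported during the handshake.
type conn struct{ shard, nrShards int }

// shardPicker caches the shard count from the first connection to a host.
type shardPicker struct {
	address  string
	nrShards int
	conns    []*conn
}

func (p *shardPicker) put(c *conn) {
	if p.nrShards != c.nrShards {
		// A hard assertion of this shape is what surfaces as the reported
		// panic when a node restarts with more CPUs and thus more shards.
		panic(fmt.Sprintf("scylla: %s invalid number of shards", p.address))
	}
	p.conns[c.shard] = c
}

func main() {
	p := &shardPicker{address: "10.0.0.1:9042", nrShards: 2, conns: make([]*conn, 2)}
	p.put(&conn{shard: 0, nrShards: 2}) // ok: shard count matches
	p.put(&conn{shard: 1, nrShards: 4}) // panics: node now reports 4 shards
}
```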
It seems like the node has changed the number of shards, but I'll have to double-check that this really occurred. In any case, the driver should not panic.
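Building on the hypothetical sketch above, a non-panicking alternative (a sketch only, not the fix that eventually landed) would be to treat a shard-count mismatch as a recoverable, per-connection error and let the pool rebuild its shard map:

```go
// Hypothetical sketch, not the actual fix: reject the stale connection with
// an error instead of crashing the whole process.
func (p *shardPicker) putChecked(c *conn) error {
	if p.nrShards != c.nrShards {
		return fmt.Errorf("scylla: %s shard count changed from %d to %d; connection pool must be rebuilt",
			p.address, p.nrShards, c.nrShards)
	}
	p.conns[c.shard] = c
	return nil
}
```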
I am working on that. It looks like the check that causes the panic was already present in the previous version that worked, so I need to bisect to determine the actual change that broke the test.
Any updates?
I found the commit that broke the test: 0a990b2. It is quite large, but I managed to narrow the search down to a single file.
@sylwiaszunejko any updates?
apache#1729 fixes the problem. We wanted to wait for upstream to merge it, but it looks like we'll only merge it to our fork. I will release the next version of gocql soon.
scylladb/scylla-operator#1528 merged, thanks @avelanarius @sylwiaszunejko!
One of Scylla Operator's E2E tests started failing after updating the gocql dependency from v1.7.3 to v1.11.1, due to gocql panicking with the following logs:
https://github.com/scylladb/scylla-operator/actions/runs/6178751661/job/16772628746#step:3:3702
I bisected the repository and confirmed that, before we reverted to v1.7.3, the last good commit was 6b310ee0ce1c7a72e4d16555dedf4e1cf7058258, and that bumping gocql's version from there was enough to break the test.

The failing test is https://github.com/scylladb/scylla-operator/blob/master/test/e2e/set/scyllacluster/scyllacluster_updates.go.
It was failing quite consistently on our master (GitHub CI node with kubeadm and cri-o) before reverting. I was also able to consistently reproduce it locally with a similar setup.
Debug logs from a local run:
Test scenario:
Now with v1.11.1, gocql panics with panic: scylla: <node-ip>:9042 invalid number of shards, which is a regression from v1.7.3.

Prerequisites for reproducing:
Steps to reproduce:
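As a rough illustration of the scenario (the contact point below is a placeholder, not from the original report), a client loop like this sketch, kept running while the node is restarted with more CPUs and therefore more shards, is the kind of workload where the panic surfaces:

```go
package main

import (
	"log"
	"time"

	"github.com/gocql/gocql"
)

func main() {
	// Placeholder contact point; use a node of the ScyllaCluster under test.
	cluster := gocql.NewCluster("10.0.0.1")
	cluster.Timeout = 5 * time.Second

	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatalf("creating session: %v", err)
	}
	defer session.Close()

	// Keep querying while the node restarts with higher resources. With
	// gocql v1.11.1 the process eventually dies with
	// "panic: scylla: <node-ip>:9042 invalid number of shards".
	for {
		if err := session.Query("SELECT now() FROM system.local").Exec(); err != nil {
			log.Printf("query error: %v", err)
		}
		time.Sleep(time.Second)
	}
}
```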
Additional context:
The issue never occurred in our presubmits, which run in a different environment with Prow CI, where we use GKE nodes for both master and worker nodes. My only guess was a different networking setup: GKE Dataplane V2 is implemented using Cilium, while our GitHub CI uses cri-o's default network configuration.
For this reason I set up a local kubeadm installation with Cilium v1.14.2 without kube-proxy. Unfortunately, the issue still reproduced, so this didn't help narrow it down, but maybe you'll find the information helpful.
What version of Scylla or Cassandra are you using?
ScyllaDB OS 5.2.7
What version of Gocql are you using?
1.11.1
What version of Go are you using?
1.20
Cross reference: scylladb/scylla-operator#1399
@avelanarius please let me know if you need any additional information or if you could use any help with reproducing the issue.
cc @tnozicka @mykaul