Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gocql panics with panic: scylla: <ip>:9042 invalid number of shards when restarting node with higher resources assigned #145

Closed
rzetelskik opened this issue Sep 20, 2023 · 8 comments · Fixed by #150 · May be fixed by apache/cassandra-gocql-driver#1729

Comments

@rzetelskik
Copy link
Member

One of Scylla Operator's E2E tests started failing after updating gocql dependency from v1.7.3 to v1.11.1 due to gocql panicking with the following logs:

        panic: scylla: 10.105.211.70:9042 invalid number of shards

        goroutine 997 [running]:
        github.com/gocql/gocql.(*scyllaConnPicker).Put(0xc0005b1440, 0xc000904000)
                github.com/gocql/[email protected]/scylla.go:400 +0x448
        github.com/gocql/gocql.(*hostConnPool).connect(0xc00058b1f0)
                github.com/gocql/[email protected]/connectionpool.go:554 +0x2b0
        github.com/gocql/gocql.(*hostConnPool).fill(0xc00058b1f0)
                github.com/gocql/[email protected]/connectionpool.go:397 +0x159
        github.com/gocql/gocql.(*policyConnPool).addHost(0xc0003decd0, 0xc000b849c0)
                github.com/gocql/[email protected]/connectionpool.go:255 +0x1cf
        github.com/gocql/gocql.(*Session).startPoolFill(0xc000685000, 0x0?)
                github.com/gocql/[email protected]/events.go:214 +0x2d
        created by github.com/gocql/gocql.(*controlConn).setupConn
                github.com/gocql/[email protected]/control.go:301 +0x26a

https://github.com/scylladb/scylla-operator/actions/runs/6178751661/job/16772628746#step:3:3702

I bisected the repository and confirmed that, before we reverted to v1.7.3, the last good commit was 6b310ee0ce1c7a72e4d16555dedf4e1cf7058258, and that bumping gocql's version was then enough to break the test.

The failing test is https://github.com/scylladb/scylla-operator/blob/master/test/e2e/set/scyllacluster/scyllacluster_updates.go.
It was failing quite consistently on our master (GitHub CI node with kubeadm and cri-o) before reverting. I was also able to consistently reproduce it locally with a similar setup.

Debug logs from a local run:

2023/09/21 00:24:31 gocql: pool connection error "10.96.238.108:9042": EOF
2023/09/21 00:24:31 scylla: 10.96.238.108:9042 remove shard 0 connection
2023/09/21 00:24:31 gocql: unable to dial control conn 10.96.238.108:9042: dial tcp :0->10.96.238.108:9042: connect: connection refused
2023/09/21 00:24:31 gocql: unable to connect to any ring node: dial tcp :0->10.96.238.108:9042: connect: connection refused
2023/09/21 00:24:31 gocql: control falling back to initial contact points.
...
2023/09/21 00:25:25 Session.ring:[10.96.238.108:UP]
2023/09/21 00:25:25 Session.ring:[10.96.238.108:UP]
2023/09/21 00:25:25 scylla: connecting to shard 0
panic: scylla: 10.96.238.108:9042 invalid number of shards
...
Test scenario:
  1. Deploy ScyllaDB cluster with a single node
  2. Wait for node's readiness
  3. Initialize a gocql session and write data
  4. Restart node with increased assigned resources (see https://github.com/scylladb/scylla-operator/blob/e87c45c17831e93d915a8114ab350e881778361c/test/e2e/set/scyllacluster/scyllacluster_updates.go#L81 for details)
  5. Wait for node's readiness
  6. Read the data written earlier

Now with v1.11.1, gocql is going to panic with panic: scylla: <node-ip>:9042 invalid number of shards, which is a regression from v1.7.3.

Prerequisites for reproducing:
Steps to reproduce:
$ git clone [email protected]:scylladb/scylla-operator.git && cd ./scylla-operator
$ go mod edit -replace github.com/gocql/gocql=github.com/scylladb/[email protected] && go mod tidy && go mod vendor
$ go run ./cmd/scylla-operator-tests run all --loglevel=2 --parallelism=1 --progress --focus="ScyllaCluster should reconcile resource changes"

Additional context:

The issue never occurred in our presubmits, ran in a different environment with Prow CI, for which we're using GKE nodes for both master and worker nodes. My only guess was a different networking setup: GKE Dataplane V2 is implemented using Cilium, while in our GitHub CI we're using a default cri-o's network configuration.

For this reason I've setup a local kubeadm installation with Cilium v1.14.2 without kube-proxy. Unfortunately the issue reproduced, so it didn't help narrowing it down, but maybe you'll find this information helpful.

What version of Scylla or Cassandra are you using?

ScyllaDB OS 5.2.7

What version of Gocql are you using?

1.11.1

What version of Go are you using?

1.20


Cross reference: scylladb/scylla-operator#1399

@avelanarius please let me know if you need any additional information or if you could use any help with reproducing the issue.

cc @tnozicka @mykaul

@avelanarius
Copy link

The comparison that triggers the panic:

gocql/scylla.go

Lines 399 to 401 in 61be561

if nrShards != p.nrShards {
panic(fmt.Sprintf("scylla: %s invalid number of shards", p.address))
}

occurs with nrShards = 2 and p.nrShards = 1.

@avelanarius
Copy link

It seems like the node has changed the number of shards, but I'll have to double check that this really occured.

In any case, the driver should not panic.

@sylwiaszunejko
Copy link
Collaborator

I am working on that, looks like check that is causing panic was there in previous version that worked, I need to do bisect to determine the actual change that broke the test

@mykaul
Copy link

mykaul commented Oct 11, 2023

Any updates?

@sylwiaszunejko
Copy link
Collaborator

I found the commit that broke the test: 0a990b2, it is quite large, but I managed to narrow down the error search to one file (control.go) and now I should quickly find out the reason why the test fails

@rzetelskik
Copy link
Member Author

@sylwiaszunejko any updates?

@avelanarius
Copy link

apache#1729 fixes the problem. We wanted to wait for upstream to merge it, but it looks like we’ll only merge it to our fork.

I will release the next version of gocql soon.

@rzetelskik
Copy link
Member Author

scylladb/scylla-operator#1528 merged, thanks @avelanarius @sylwiaszunejko !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants