Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DSP-23956: DSE 5.1 upgradability #1314

Open
wants to merge 6 commits into
base: main
Choose a base branch
from
Open

DSP-23956: DSE 5.1 upgradability #1314

wants to merge 6 commits into from

Conversation

szymon-miezal
Copy link
Collaborator

@szymon-miezal szymon-miezal commented Oct 1, 2024

The goal of this patch is to resolve the bugs encountered while testing the DSE 5.1 to HCD upgrade. Details are in commit messages.

@@ -289,6 +290,8 @@ protected void setup()

setupVirtualKeyspaces();

LegacySystemKeyspaceToNodes.convertToNodesFilesIfNecessary();
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would need to double-check whether it doesn't introduce any regression in astra 🤞

@bereng
Copy link

bereng commented Oct 4, 2024

https://jenkins-stargazer.aws.dsinternal.org/job/ds-cassandra-pr-gate/job/PR-1314/2/testReport/junit/org.apache.cassandra.distributed.test/NodeToolEnableDisableBinaryTest/tests_stage_1___jvm_distributed___testMaybeChangeDocs/ deserves attention. It is probably a silly thing but equally silly to fix.

I would like a quick high level explanation of the thought of train and the problems you found as I lack the context. Wdyt?

@szymon-miezal
Copy link
Collaborator Author

https://jenkins-stargazer.aws.dsinternal.org/job/ds-cassandra-pr-gate/job/PR-1314/2/testReport/junit/org.apache.cassandra.distributed.test/NodeToolEnableDisableBinaryTest/tests_stage_1___jvm_distributed___testMaybeChangeDocs/ deserves attention. It is probably a silly thing but equally silly to fix.

Seems like a flaky test:

worth having a closer look though.

I would like a quick high level explanation of the thought of train and the problems you found as I lack the context. Wdyt?

Indeed, I have approached it in a very test-driven way. The upgrade dtest has been guiding me in the process. Firstly the system.local and system.peers tables turned out to contain incompatible schema - CC schema and DSE schema differs. To resolve it I have decided to ignore the DSE-specific columns that CC couldn't use in any way, like workload, dse_version etc.. It has been committed as part of 1fd42c2.
Resolving the schema mismatch did allow the node to start but at the same time the node wasn't able to advertise itself using the old host_id - that was when I discovered CC reads local info from a flat file that was generated from scratch at startup - effectively becoming a new node in the cluster. The migration code implemented as part of 6a4c377 aims to resolve this issue in a similar fashion it has been done in DSE 5.1 -> 6.8 upgrade path.
Combining these two changes makes the upgrade dtest pass therefore I decided it's about time to move this patch to the review stage.

@bereng
Copy link

bereng commented Oct 7, 2024

^I took a look and the test failure is quite easy to understand. A certain output is expected but some logging gets in the way. So nothing critical or related to this PR.

@JeremiahDJordan
Copy link
Member

Adding @jacek-lewandowski as a reviewer as IIRC he ported the existing implementation.

@szymon-miezal
Copy link
Collaborator Author

Posting some of the observations/conclusions that might be useful when assessing the Nodes class which became the discussion point of the PR:

  • The initial design of the Nodes class added in CC4 offered the implementation agnostic approach but it was scrapped later on (likely the timeline was a factor).
  • We cannot simply get rid of the current flat-file-based implementation altogether as Astra depends on CC (Astra assumes the storing itself is taken care of by CC) to store these files.
  • In CC 5, there is an improved Nodes interface that allows the storage implementation to be replaced - the way it was initially planned in CC4.
  • The info stored in the peers table isn’t as critical (having some stale data in it is like falling behind gossip a bit), what is critical is the local table namely the host_id and truncated_at columns.
    • While the host_id isn’t that important in astra now, it soon will be as they work on implementing the writer restart feature instead of replacing them each time - therefore we cannot just switch the flat-file with sstable storage impl now.
  • There are two options to proceed further:
    • Port the flexible Nodes storage interface from CC5 to CC4 and specify the implementation to use in HCD. That has the advantage that we get rid of the flat-file impl in CC but has two consequences, 1. CC4 based astra has to be updated to specify the storage impl as well, 2. DSE 6.8 upgrade will require flat-file -> sstable migration code.
    • Keep the current shape of the patch with sstable -> flat file migration.
  • One more conclusion is that going with flat-files now does not prevent us from reverting from them in the future -> it is a matter of a reverse migration which we would need to implement anyway for DSE 6.8 upgrade (flat-file -> sstable) if we decide to go with the flexible approach.

DSE 5.1 and CC have different system.local/peers schema, mainly as DSE adds columns like
dse_version, graph, server_id, workload, which CC doesn't use and can safely ignore.
These columns have been marked as deprecated and will be skipped while reading the sstable.
CC uses flat files to store system.local/peers info and exposes them via virtual tables.
DSE 5.1 uses regular C* sstables for storing these tables. This patch
adds the migration procedure that detects the existence of the mentioned flat
files and rewrites necessary data from sstables.
Copy link

sonarcloud bot commented Nov 6, 2024

@cassci-bot
Copy link

❌ Build ds-cassandra-pr-gate/PR-1314 rejected by Butler


10 new test failure(s) in 15 builds
See build details here


Found 10 new test failures

Test Explanation Branch history Upstream history
...TimeoutTest.prepareRPCTimeout[SEQUENTIAL/false] regression 🔴🔵🔵🔵 🔵🔵🔵🔵🔵🔵🔵
...oParseClientMessageTest.badHeader[version=5/v5] regression 🔴🔵🔵🔵 🔵🔵🔵🔵🔵🔵🔵
...tions[aa_CompoundKeyDataModel{primaryKey=p, c}] regression 🔴🔵🔵🔵 🔵🔵🔵🔵🔵🔵🔵
...positePartitionKeyDataModel{primaryKey=p1, p2}] failing 🔴🔴🔴🔴🔴🔴🔴 🔵🔵🔵🔵🔵🔵🔵
...positePartitionKeyDataModel{primaryKey=p1, p2}] failing 🔴🔴🔴🔴🔴 🔵🔵🔵🔵🔵🔵🔵
...i.s.c.VectorSiftSmallTest.testMultiSegmentBuild failing 🔴🔴🔴🔴🔴🔴🔴 🔵🔵🔵🔵🔵🔵🔵
...=smallint,wide=false,scenario=POST_BUILD_QUERY] flaky 🔵🔴🔵🔵 🔵🔵🔵🔵🔵🔵🔵
o.a.c.i.s.d.v.VectorCompressionTest.testAda002 regression 🔴🔵🔵🔵🔵🔵🔴 🔵🔵🔵🔵🔵🔵🔵
o.a.c.i.s.d.v.VectorCompressionTest.testBert regression 🔴🔵🔵🔴🔵🔵🔴 🔵🔵🔵🔵🔵🔵🔵
...i.s.d.v.VectorCompressionTest.testOpenAiV3Large regression 🔴🔵🔵🔵🔴🔵🔴 🔵🔵🔵🔵🔵🔵🔵

Found 53 known test failures

@szymon-miezal
Copy link
Collaborator Author

I have gone through the latest test failures reported in https://jenkins-stargazer.aws.dsinternal.org/job/ds-cassandra-pr-gate/view/change-requests/job/PR-1314/23/ and it seems that all but one of them were reported as failed by the fast CI build: https://jenkins-stargazer.aws.dsinternal.org/job/ds-cassandra-build-fast-ci/:

The only test failure that I haven't found in fast CI is SmallintTest but this one passed in the last CI run of this branch https://jenkins-stargazer.aws.dsinternal.org/job/ds-cassandra-pr-gate/view/change-requests/job/PR-1314/24/testReport/org.apache.cassandra.index.sai.cql.types/SmallintTest/ which makes me think it is rather a fluke rather than a genuine regression.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants