storage/adapter: Opt-in migration of sources to the new table model #30483

Open · wants to merge 55 commits into main from the source-table-migration branch
Conversation

@jkosh44 (Contributor) commented Nov 14, 2024

Motivation

The subsequent PR will implement https://github.com/MaterializeInc/database-issues/issues/8678, which will also disable use of the 'old style' source statements using the same feature flag introduced here. Once this PR and that PR land, enabling the force_source_table_syntax flag will completely switch users over to the new syntax.
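
For context, the migration converts each 'old style' source, where data is read from the source object itself, into a source plus a separate table created on it. Roughly, as a purely illustrative sketch (object names are placeholders and the exact clauses vary by source type):

-- 'Old style': subsources/data are defined on the source itself.
CREATE SOURCE mz_source
  FROM MYSQL CONNECTION mysql_conn
  FOR ALL TABLES;

-- 'New style': the source only ingests, and data is exposed through
-- tables created on the source.
CREATE SOURCE mz_source
  FROM MYSQL CONNECTION mysql_conn;
CREATE TABLE t_in_mysql FROM SOURCE mz_source (REFERENCE mysql.t_in_mysql);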

Tips for reviewer

To test this I've added a new platform-checks scenario called ActivateSourceVersioningMigration, which runs Materialize on an existing version for each check's initialize() method and then restarts Materialize on the latest version with force_source_table_syntax enabled, activating the migration of any sources created using the 'old style' syntax. The validate() step is then run on this new version, confirming that all the queries continue to work.

There are existing platform-checks Checks that use the 'old style' source syntax: TableFromPgSource, TableFromMySqlSource, LoadGeneratorAsOfUpTo, and one I added called UpsertLegacy. Together they cover the four source types we need to test. There are also many other checks that use the old syntax when running on 'old' versions before 0.119, but I wasn't sure how to make the ActivateSourceVersioningMigration scenario target a specific version rather than just the 'previous' version for the base run. @def- @nrainer-materialize let me know if you have ideas on doing that.

I've also updated the legacy upgrade tests to activate this migration after the upgrade, which should provide additional coverage.

Nightly

https://buildkite.com/materialize/nightly/builds?branch=rjobanp%3Asource-table-migration

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.

@jkosh44 jkosh44 marked this pull request as ready for review November 14, 2024 18:04
@jkosh44 jkosh44 requested review from a team as code owners November 14, 2024 18:04
@jkosh44 jkosh44 requested a review from ParkMyCar November 14, 2024 18:04
shepherdlybot (bot) commented Nov 14, 2024

Risk Score: 81 / 100 · Bug Hotspots: 4 · Resilience Coverage: 50%

Mitigations

Completing required mitigations increases Resilience Coverage.

  • (Required) Code Review 🔍 Detected
  • (Required) Feature Flag
  • (Required) Integration Test 🔍 Detected
  • (Required) Observability 🔍 Detected
  • (Required) QA Review
  • (Required) Run Nightly Tests
  • Unit Test
Risk Summary:

The pull request has a high risk score of 81, driven by predictors such as the "Sum Bug Reports Of Files" and the "Delta of Executable Lines". Historically, PRs with these predictors are 115% more likely to cause a bug than the repo baseline. While the observed and predicted bug trends for the repository are decreasing, the presence of 4 file hotspots suggests a heightened risk.

Note: The risk score is not based on semantic analysis but on historical predictors of bug occurrence in the repository. The attributes above were deemed the strongest predictors based on that history. Predictors and the score may change as the PR evolves in code, time, and review activity.

Bug Hotspots:

File Percentile
../catalog/apply.rs 98
../src/coord.rs 100
../src/names.rs 92
../catalog/open.rs 99

@jkosh44 (Contributor, Author) commented Nov 14, 2024

I've re-opened #30168 with my own fork to avoid CI issues.

src/sql/src/names.rs: outdated review thread (resolved)
@jkosh44 jkosh44 force-pushed the source-table-migration branch 2 times, most recently from 1972e06 to db98078 Compare November 14, 2024 20:11
@ParkMyCar (Member) left a comment

Mostly focused on migrate.rs but overall the PR LGTM!

src/adapter/src/catalog/apply.rs: outdated review thread (resolved)
src/sql/src/names.rs: outdated review thread (resolved)
src/sql/src/names.rs: outdated review thread (resolved)
@jkosh44 jkosh44 force-pushed the source-table-migration branch from 02ad145 to 098739e Compare December 12, 2024 21:05
@jkosh44 jkosh44 force-pushed the source-table-migration branch 3 times, most recently from 0f50e5f to cf8f451 Compare December 12, 2024 21:20
@jkosh44 jkosh44 marked this pull request as draft December 16, 2024 17:58
@jkosh44 jkosh44 force-pushed the source-table-migration branch 4 times, most recently from e8930e8 to 619116d Compare December 17, 2024 18:34
@jkosh44 (Contributor, Author) commented Dec 17, 2024

There seems to be an issue with the migration for at least Kafka sources.

If I have the following user objects:

=> SELECT * FROM mz_objects WHERE id LIKE 'u%' ORDER BY id ASC;
 id |  oid  | schema_id |     name      |    type    | owner_id | cluster_id |    privileges     
----+-------+-----------+---------------+------------+----------+------------+-------------------
 u1 | 20189 | u3        | csr_conn      | connection | u1       |            | {u1=U/u1}
 u2 | 20190 | u3        | kafka_conn    | connection | u1       |            | {u1=U/u1}
 u3 | 20191 | u3        | data_progress | source     | u1       |            | {s2=r/u1,u1=r/u1}
 u4 | 20192 | u3        | data          | source     | u1       | u1         | {u1=r/u1}
(4 rows)

Then I enable the migrations via:

=> ALTER SYSTEM SET enable_create_table_from_source TO on;
NOTICE:  variable "enable_create_table_from_source" was updated for the system, this will have no effect on the current session
ALTER SYSTEM
=> ALTER SYSTEM SET force_source_table_syntax TO on;
NOTICE:  variable "force_source_table_syntax" was updated for the system, this will have no effect on the current session
ALTER SYSTEM

Then I restart Materialize, and the following user objects exist:

=> SELECT * FROM mz_objects WHERE id LIKE 'u%' ORDER BY id ASC;
 id |  oid  | schema_id |     name      |    type    | owner_id | cluster_id |    privileges     
----+-------+-----------+---------------+------------+----------+------------+-------------------
 u1 | 20189 | u3        | csr_conn      | connection | u1       |            | {u1=U/u1}
 u2 | 20190 | u3        | kafka_conn    | connection | u1       |            | {u1=U/u1}
 u3 | 20191 | u3        | data_progress | source     | u1       |            | {s2=r/u1,u1=r/u1}
 u4 | 20192 | u3        | data_source   | source     | u1       | u1         | {u1=r/u1}
 u5 | 20193 | u3        | data          | table      | u1       |            | {u1=r/u1}
(5 rows)

That looks correct and is what we would expect to happen. However, if I restart Materialize one more time, then the following user objects will exist:

=> SELECT * FROM mz_objects WHERE id LIKE 'u%' ORDER BY id ASC;
 id |  oid  | schema_id |        name        |    type    | owner_id | cluster_id |    privileges     
----+-------+-----------+--------------------+------------+----------+------------+-------------------
 u1 | 20189 | u3        | csr_conn           | connection | u1       |            | {u1=U/u1}
 u2 | 20190 | u3        | kafka_conn         | connection | u1       |            | {u1=U/u1}
 u3 | 20191 | u3        | data_progress      | source     | u1       |            | {s2=r/u1,u1=r/u1}
 u4 | 20192 | u3        | data_source_source | source     | u1       | u1         | {u1=r/u1}
 u5 | 20193 | u3        | data               | table      | u1       |            | {u1=r/u1}
 u6 | 20194 | u3        | data_source        | table      | u1       |            | {u1=r/u1}
(6 rows)

As you can see, the already migrated objects were migrated again. If I restart Materialize once again, then I see the following:

=> SELECT * FROM mz_objects WHERE id LIKE 'u%' ORDER BY id ASC;
 id |  oid  | schema_id |           name            |    type    | owner_id | cluster_id |    privileges     
----+-------+-----------+---------------------------+------------+----------+------------+-------------------
 u1 | 20189 | u3        | csr_conn                  | connection | u1       |            | {u1=U/u1}
 u2 | 20190 | u3        | kafka_conn                | connection | u1       |            | {u1=U/u1}
 u3 | 20191 | u3        | data_progress             | source     | u1       |            | {s2=r/u1,u1=r/u1}
 u4 | 20192 | u3        | data_source_source_source | source     | u1       | u1         | {u1=r/u1}
 u5 | 20193 | u3        | data                      | table      | u1       |            | {u1=r/u1}
 u6 | 20194 | u3        | data_source               | table      | u1       |            | {u1=r/u1}
 u7 | 20195 | u3        | data_source_source        | table      | u1       |            | {u1=r/u1}
(7 rows)

I'm assuming that each time Materialize is restarted, the migration is re-run.
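
One quick way to spot this double migration after a restart is a sanity-check query against the naming pattern shown above (a hedged sketch only; the migration itself doesn't rely on this):

-- Any doubly-suffixed name indicates a source that was migrated twice.
SELECT id, name, type
FROM mz_objects
WHERE id LIKE 'u%' AND name LIKE '%source_source%'
ORDER BY id;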

@jkosh44 (Contributor, Author) commented Dec 18, 2024

I was able to fix the migration idempotency issue in the most recent batch of commits. Now there's another issue that is triggered in the *-compile-proto-sources.td legacy upgrade test. For some reason, after being migrated, the kafka_proto_source source has its upper and since stuck at 0.

materialize=> EXPLAIN TIMESTAMP FOR SELECT * FROM kafka_proto_source;
                                 Timestamp                                 
---------------------------------------------------------------------------
                 query timestamp: 1734549613466 (2024-12-18 19:20:13.466) +
           oracle read timestamp: 1734549613466 (2024-12-18 19:20:13.466) +
 largest not in advance of upper:             0 (1970-01-01 00:00:00.000) +
                           upper:[            0 (1970-01-01 00:00:00.000)]+
                           since:[            0 (1970-01-01 00:00:00.000)]+
         can respond immediately: false                                   +
                        timeline: Some(EpochMilliseconds)                 +
               session wall time: 1734549614451 (2024-12-18 19:20:14.451) +
                                                                          +
 source materialize.public.kafka_proto_source (u5, storage):              +
                   read frontier:[            0 (1970-01-01 00:00:00.000)]+
                  write frontier:[            0 (1970-01-01 00:00:00.000)]+
 
(1 row)

@jkosh44 jkosh44 force-pushed the source-table-migration branch from 5bb01ec to cb45af0 Compare December 18, 2024 20:16
@jkosh44 jkosh44 force-pushed the source-table-migration branch from cb45af0 to 254b6ed Compare December 26, 2024 15:01
@jkosh44 (Contributor, Author) commented Dec 26, 2024

I was able to fix the migration idempotency issue in the most recent batch of commits. Now there's another issue that is triggered in the *-compile-proto-sources.td legacy upgrade test. For some reason, after being migrated, the kafka_proto_source source has its upper and since stuck at 0.

The issue also repros with *-kafka-source.td, which suggests that it's an issue with all Kafka sources. I was also able to repro this with counter load generator sources, which suggests that it's an issue with all single-output sources. I haven't tried multi-output sources yet.

@jkosh44 (Contributor, Author) commented Dec 26, 2024

The load gen output looks like this, which is slightly different from the Kafka source above:

materialize=> EXPLAIN TIMESTAMP FOR SELECT COUNT(*) > 0 FROM lg_source;
                                 Timestamp                                 
---------------------------------------------------------------------------
                 query timestamp: 1735236290985 (2024-12-26 18:04:50.985) +
           oracle read timestamp: 1735236290985 (2024-12-26 18:04:50.985) +
 largest not in advance of upper: 1735236080000 (2024-12-26 18:01:20.000) +
                           upper:[1735236080001 (2024-12-26 18:01:20.001)]+
                           since:[            0 (1970-01-01 00:00:00.000)]+
         can respond immediately: false                                   +
                        timeline: Some(EpochMilliseconds)                 +
               session wall time: 1735236291320 (2024-12-26 18:04:51.320) +
                                                                          +
 source materialize.public.lg_source (u3, storage):                       +
                   read frontier:[1735236079000 (2024-12-26 18:01:19.000)]+
                  write frontier:[1735236080001 (2024-12-26 18:01:20.001)]+
 
(1 row)

EDIT: Actually, taking a closer look at this, I don't know how to explain it at all. Why would the since be different from the read frontier?

jkosh44 and others added 3 commits January 28, 2025 14:12
@jkosh44 (Contributor, Author) commented Jan 28, 2025

There are still some test failures in Nightly that I'm struggling with: https://buildkite.com/materialize/nightly/builds/10994#0194ae73-a255-49da-923c-47ff93d4216c

I'm convinced that they're test issues, because both failures happen before the migration and both expect a query to fail that actually succeeds.

@def- (Contributor) commented Jan 29, 2025

@jkosh44 Should I try to look into the nightly failures? I assume it's the ones in https://buildkite.com/materialize/nightly/builds/10996

@jkosh44 (Contributor, Author) commented Jan 29, 2025

@jkosh44 Should I try to look into the nightly failures? I assume it's the ones in https://buildkite.com/materialize/nightly/builds/10996

If you have the bandwidth, that would be helpful, and yes, those are the ones.

@def- (Contributor) commented Jan 29, 2025

New run, rebased and with my fixes for the tests: https://buildkite.com/materialize/nightly/builds/11002

@jkosh44 (Contributor, Author) commented Jan 29, 2025

Looks like "MySQL CDC tests (before source versioning)" is still failing with this error:

> CREATE SOURCE mz_source
FROM MYSQL CONNECTION mysql_conn
FOR ALL TABLES;
query error didn't match; sleeping to see if dataflow produces error shortly 50ms 75ms 113ms 169ms 253ms 380ms 570ms 854ms 1s 2s 3s 4s 6s 10s 15s 16s2025-01-29T17:39:10.383403Z  WARN mz_testdrive::action::consistency: No Catalog state on disk, skipping consistency check
table-in-mysql-schema.td:38:1: error: expected error containing "MySQL source must ingest at least one table, but FOR ALL TABLES matched none", got "source \"materialize.public.mz_source\" already exists"

This error usually means that the CREATE succeeded when we expected it to fail: the source already exists because we created it on the first attempt. It's possible that it was created somewhere else, but that's usually what this error means.

@jkosh44 (Contributor, Author) commented Jan 29, 2025

Looks like "MySQL CDC tests (before source versioning)" is still failing with this error:

> CREATE SOURCE mz_source
FROM MYSQL CONNECTION mysql_conn
FOR ALL TABLES;
query error didn't match; sleeping to see if dataflow produces error shortly 50ms 75ms 113ms 169ms 253ms 380ms 570ms 854ms 1s 2s 3s 4s 6s 10s 15s 16s2025-01-29T17:39:10.383403Z  WARN mz_testdrive::action::consistency: No Catalog state on disk, skipping consistency check
table-in-mysql-schema.td:38:1: error: expected error containing "MySQL source must ingest at least one table, but FOR ALL TABLES matched none", got "source \"materialize.public.mz_source\" already exists"

The error usually means that the CREATE was successful but we were expecting it to fail. It already exists because we created it on the first try. It's possible that it was created somewhere else, but that's usually what that error means.

The curious thing is that this is running (or at least is meant to run) before the source migration happens. I haven't been able to understand why the test expects that CREATE to fail, though; it looks fine to me.

@def- def- force-pushed the source-table-migration branch from a26c7aa to 26d1ea6 Compare January 29, 2025 17:57
@def- (Contributor) commented Jan 29, 2025

I think it's just a test problem; I pushed another attempt: https://buildkite.com/materialize/nightly/builds/11003

@jkosh44 (Contributor, Author) commented Jan 29, 2025

Looks like the same error happened:

 CREATE SOURCE mz_source
FROM MYSQL CONNECTION mysql_conn
FOR ALL TABLES;
query error didn't match; sleeping to see if dataflow produces error shortly 50ms 75ms 113ms 169ms 253ms 380ms 570ms 854ms 1s 2s 3s 4s 6s 10s 15s 16s2025-01-29T18:16:23.591008Z  WARN mz_testdrive::action::consistency: No Catalog state on disk, skipping consistency check
table-in-mysql-schema.td:38:1: error: expected error containing "MySQL source must ingest at least one table, but FOR ALL TABLES matched none", got "source \"materialize.public.mz_source\" already exists"
     |
  20 | > CREATE SECRET mysq ... [rest of line truncated for security]
  24 |     PASSWORD SECRET  ... [rest of line truncated for security]
  27 | $ mysql-connect name ... [rest of line truncated for security]
  37 |
  38 | ! CREATE SOURCE mz_source
     | ^

@jkosh44 (Contributor, Author) commented Jan 29, 2025

If I delete all test files except for table-in-mysql-schema.td, then the test passes. So it does seem like the environment is getting polluted by another test.

@jkosh44 (Contributor, Author) commented Jan 29, 2025

If I delete all test files except for table-in-mysql-schema.td, then the test passes. So it does seem like the environment is getting polluted by another test.

@def- Looking at the testdrive reset logic, it looks like we never actually reset postgres or mysql, only Materialize and Kafka:

if config.reset {
    // Delete any existing Materialize and Kafka state *before* the test
    // script starts. We don't clean up Materialize or Kafka state at the
    // end of the script because it's useful to leave the state around,
    // e.g., for debugging, or when using a testdrive script to set up
    // Materialize for further tinkering.
    state.reset_materialize().await?;
    // Only try to clean up Kafka state if the test script uses a Kafka
    // action. Tests that don't use Kafka likely don't have a Kafka
    // broker available.
    if has_kafka_cmd {
        state.reset_kafka().await?;
    }
}

We probably want to reset mysql and postgres too.

In the meantime, hopefully there's some manual cleanup I can add to the test.

@jkosh44 (Contributor, Author) commented Jan 29, 2025

If I delete all test files except for table-in-mysql-schema.td, then the test passes. So it does seem like the environment is getting polluted by another test.

Alternatively, it's possible that this test was implicitly relying on another test to create the mysql database in mysql. That no longer happens, so USE mysql fails, the tables are created in the public database, and creating the source succeeds.

EDIT: That doesn't really make sense, though, given that the test passes when all other tests are deleted.

@jkosh44
Copy link
Contributor Author

jkosh44 commented Jan 29, 2025

Comment on lines +32 to +37
DROP DATABASE IF EXISTS other;
CREATE DATABASE IF NOT EXISTS mysql;
USE mysql;

# Insert data pre-snapshot
CREATE TABLE t_in_mysql (f1 INT);
INSERT INTO t_in_mysql VALUES (1), (2);

> DROP SOURCE IF EXISTS mz_source;
CREATE TABLE mysql.t_in_mysql (f1 INT);
INSERT INTO mysql.t_in_mysql VALUES (1), (2);
@jkosh44 (Contributor, Author) commented:

I don't know which one of these changes caused the test to start passing, but one of them did.

@def- (Contributor) commented Jan 30, 2025

I'm a bit worried about the benchmark failures in the last run, so I reran just in case: https://buildkite.com/materialize/nightly/builds/11013

@jkosh44 (Contributor, Author) commented Jan 30, 2025

ci: Larger agent for td migration test

It looks like the test hung for 2 hours on a specific file (table.td).

I'm a bit worried about the benchmark failures in the last run, so I reran just in case: https://buildkite.com/materialize/nightly/builds/11013

I'm a bit surprised by this. Everything is supposed to be behind a feature flag, so how did it manage to slow down a benchmark?
