
How-To: Add Large Table Stream Recovery #22

Open
newfront opened this issue Sep 28, 2023 · 0 comments
Labels
baby-steps Gentle Introductions to Concepts. This is like the First Steps idea or Gentle Introductions
It is common to make mistakes, and these mistakes often rear their ugly heads as problems without an easy solution. Sometimes, for example, the right thing to do is to make a breaking change to our schema (the StructType representing our Delta Lake table).

Ideally, we've done our job well and have a proper buffer insulating our downstream data customers from the immediate change: say we are changing a bronze-level table, and our silver-level table (the middle tier of the medallion architecture) insulates the external data customers who read from our gold-level tables.

  1. (bronze:pre-change-table) exists up to a specific timestamp (milliseconds) represented by bronze.schema.table_a
  2. (bronze:post-change-table) exists after we cut over from the (bronze:pre-change-table) represented by bronze.schema.table_b
```python
from delta.tables import DeltaTable

dtA = DeltaTable.forName(spark, "bronze.schema.table_a")
dtB = DeltaTable.forName(spark, "bronze.schema.table_b")

# (note) dtA.toDF().schema != dtB.toDF().schema
```

There are a few options for where to go from here.

a.) We can ditch the old table table_a, say good riddance, and move forward with table_b. This is the simplest solution: we just drop the table bronze.schema.table_a and erase its historic data. If we don't need that data, that is fine. Don't hoard data just because you can.
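Option (a) can be sketched as a single DDL statement. A minimal example, assuming the table name from above; the helper function is hypothetical and simply builds the SQL that an active SparkSession would execute:

```python
# Hypothetical sketch for option (a): drop the legacy table outright.
# The table name comes from the example above.

def drop_legacy_table_sql(table_name: str) -> str:
    """Build the DROP TABLE statement for the table we no longer need."""
    return f"DROP TABLE IF EXISTS {table_name}"

# With an active SparkSession this would be executed as:
#   spark.sql(drop_legacy_table_sql("bronze.schema.table_a"))
```

Note that dropping a managed table removes the Delta log and data files together; for an external table the underlying files would also need to be cleaned up separately.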

b.) We can transform the non-conformant struct or struct fields, creating a forwarder from the table_a schema to the table_b schema. This way we read the old table, apply the transformations, and overwrite the same table, opting into overwriteSchema:true on the writer (table_a -> transformations -> table_a). This could be problematic for streaming applications reading from the table, since all of them would also have to cut over to the new format....
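A minimal sketch of option (b), assuming PySpark with Delta Lake installed. `conform_to_new_schema` is a hypothetical placeholder for the forwarding transformation you would write; the `spark` session is passed in rather than created here:

```python
# Sketch for option (b): read table_a, transform it, overwrite it in place
# with the new schema. `conform_to_new_schema` is a hypothetical transformation.

def conform_to_new_schema(df):
    # Placeholder: rename / cast / restructure columns here, e.g.
    #   return df.withColumnRenamed("old_name", "new_name")
    return df

def rewrite_table_in_place(spark, table_name: str):
    """Read the old table, apply the forwarder, and overwrite the same table."""
    df = spark.read.format("delta").table(table_name)
    (conform_to_new_schema(df)
        .write.format("delta")
        .mode("overwrite")
        .option("overwriteSchema", "true")  # allows the writer to replace the schema
        .saveAsTable(table_name))

# Usage (requires an active SparkSession):
#   rewrite_table_in_place(spark, "bronze.schema.table_a")
```

The overwrite replaces the table's schema in a single Delta transaction, which is exactly why downstream streaming readers will fail and need to restart against the new format.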

c.) We can use most of the techniques from (b), but write to an intermediate table instead: create a new table (table_c) with a DEEP CLONE of table_a, then apply the transformation with overwriteSchema. We now maintain both the old (table_a) and new (table_c) versions of the same table. Next we copy records from table_b into table_c until we catch up to real-time; then, based on the last committed transaction (delta log : version (long)), we cut the streaming application writing to table_b over to table_c. As long as the silver table continues to read from table_c, with some minor modifications, no downstream customers will notice that you transitioned to a new source of ingestion data.
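The steps of option (c) can be sketched as follows. This is a hedged outline, not a definitive implementation: the table names follow the example above, the `startingVersion` value is a placeholder for the last copied commit, and the checkpoint path is hypothetical:

```python
# Sketch for option (c), assuming Spark SQL with Delta Lake.

def deep_clone_sql(source: str, target: str) -> str:
    """DEEP CLONE copies the data files as well as the metadata into a new table."""
    return f"CREATE TABLE IF NOT EXISTS {target} DEEP CLONE {source}"

# 1. Clone the old table into the intermediate table:
#      spark.sql(deep_clone_sql("bronze.schema.table_a", "bronze.schema.table_c"))
# 2. Transform table_c and overwrite it with overwriteSchema, as in option (b).
# 3. Catch table_c up by streaming table_b's commits from a known version:
#      (spark.readStream.format("delta")
#           .option("startingVersion", 0)            # placeholder: last copied version
#           .table("bronze.schema.table_b")
#           .writeStream.format("delta")
#           .option("checkpointLocation", "/chk/table_c")  # hypothetical path
#           .toTable("bronze.schema.table_c"))
# 4. Once caught up, cut the ingestion writer and the silver reader over to table_c.
```

Using DEEP CLONE (rather than a shallow clone) matters here because table_c must own its data files once table_a is eventually retired.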
