
How-To: Add Large Table Stream Recovery #22

Open
newfront opened this issue Sep 28, 2023 · 0 comments
Labels
baby-steps Gentle Introductions to Concepts. This is like the First Steps idea or Gentle Introductions
It is common to make mistakes, and these mistakes often rear their ugly heads as problems without an easy solution. Sometimes, for example, the right thing to do is to make a breaking change to our schema (the StructType representing our Delta Lake table).

Ideally, we've done our job well and have a proper buffer insulating our downstream data customers from the immediate change: say we are changing a bronze-level table, and our silver-level table (the middle tier of the medallion architecture) insulates the external data customers who read from our gold-level tables.

  1. (bronze:pre-change-table) exists up to a specific timestamp (milliseconds) represented by bronze.schema.table_a
  2. (bronze:post-change-table) exists after we cut over from the (bronze:pre-change-table) represented by bronze.schema.table_b
```python
from delta.tables import DeltaTable

dtA = DeltaTable.forName(spark, "bronze.schema.table_a")
dtB = DeltaTable.forName(spark, "bronze.schema.table_b")

# (note) dtA.toDF().schema != dtB.toDF().schema
```

There are a few options for where to go from here.

a.) We can ditch the old table table_a, say good riddance, and move forward with table_b. This is the simplest solution: we just drop the table bronze.schema.table_a and erase its historic data. If we don't need that data, that is fine. Don't hoard data just because you can.
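Option (a) can be sketched as a single DDL statement. A minimal example, assuming the table name from above; the helper function is hypothetical and simply builds the SQL that an active SparkSession would execute:

```python
# Hypothetical sketch for option (a): drop the legacy table outright.
# The table name comes from the example above.

def drop_legacy_table_sql(table_name: str) -> str:
    """Build the DROP TABLE statement for the table we no longer need."""
    return f"DROP TABLE IF EXISTS {table_name}"

# With an active SparkSession this would be executed as:
#   spark.sql(drop_legacy_table_sql("bronze.schema.table_a"))
```

Note that dropping a managed table removes the Delta log and data files together; for an external table the underlying files would also need to be cleaned up separately.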

b.) We can transform the non-conformant struct or struct fields, creating a forwarder from the table_a schema to the table_b schema. This way we read the old table, apply the transformations, and overwrite the same table, opting into overwriteSchema:true on the writer (table_a -> transformations -> table_a). This could be problematic for streaming applications reading from the table, since all of them would also have to cut over to the new format....
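A minimal sketch of option (b), assuming PySpark with Delta Lake installed. `conform_to_new_schema` is a hypothetical placeholder for the forwarding transformation you would write; the `spark` session is passed in rather than created here:

```python
# Sketch for option (b): read table_a, transform it, overwrite it in place
# with the new schema. `conform_to_new_schema` is a hypothetical transformation.

def conform_to_new_schema(df):
    # Placeholder: rename / cast / restructure columns here, e.g.
    #   return df.withColumnRenamed("old_name", "new_name")
    return df

def rewrite_table_in_place(spark, table_name: str):
    """Read the old table, apply the forwarder, and overwrite the same table."""
    df = spark.read.format("delta").table(table_name)
    (conform_to_new_schema(df)
        .write.format("delta")
        .mode("overwrite")
        .option("overwriteSchema", "true")  # allows the writer to replace the schema
        .saveAsTable(table_name))

# Usage (requires an active SparkSession):
#   rewrite_table_in_place(spark, "bronze.schema.table_a")
```

The overwrite replaces the table's schema in a single Delta transaction, which is exactly why downstream streaming readers will fail and need to restart against the new format.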

c.) We can use most of the techniques from (b), but write to an intermediate table instead: create a new table (table_c) with a DEEP CLONE of table_a, then apply the transformation with overwriteSchema. We now maintain both the old (table_a) and new (table_c) versions of the same table. Next we copy records from table_b into table_c until we catch up to real-time; then, based on the last committed transaction (delta log : version (long)), we cut the streaming application writing to table_b over to table_c. As long as the silver table continues to read from table_c, with some minor modifications, no downstream customers will notice that you transitioned to a new source of ingestion data.
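The steps of option (c) can be sketched as follows. This is a hedged outline, not a definitive implementation: the table names follow the example above, the `startingVersion` value is a placeholder for the last copied commit, and the checkpoint path is hypothetical:

```python
# Sketch for option (c), assuming Spark SQL with Delta Lake.

def deep_clone_sql(source: str, target: str) -> str:
    """DEEP CLONE copies the data files as well as the metadata into a new table."""
    return f"CREATE TABLE IF NOT EXISTS {target} DEEP CLONE {source}"

# 1. Clone the old table into the intermediate table:
#      spark.sql(deep_clone_sql("bronze.schema.table_a", "bronze.schema.table_c"))
# 2. Transform table_c and overwrite it with overwriteSchema, as in option (b).
# 3. Catch table_c up by streaming table_b's commits from a known version:
#      (spark.readStream.format("delta")
#           .option("startingVersion", 0)            # placeholder: last copied version
#           .table("bronze.schema.table_b")
#           .writeStream.format("delta")
#           .option("checkpointLocation", "/chk/table_c")  # hypothetical path
#           .toTable("bronze.schema.table_c"))
# 4. Once caught up, cut the ingestion writer and the silver reader over to table_c.
```

Using DEEP CLONE (rather than a shallow clone) matters here because table_c must own its data files once table_a is eventually retired.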
