It is common to make mistakes, and these mistakes often rear their ugly heads as problems without an easy solution. Sometimes, for example, the right thing to do is to make a breaking change to our schema (the StructType representing our Delta Lake table).
If we've done our job well, we have a proper buffer insulating our downstream data customers from the immediate changes. Say we are changing a bronze-level table: our silver-level table (one tier of the medallion architecture) insulates the external data customers who read off of our gold-level tables.
(bronze:pre-change-table) exists up to a specific cutover timestamp (milliseconds) and is represented by bronze.schema.table_a
(bronze:post-change-table) exists after we cut over from the (bronze:pre-change-table) and is represented by bronze.schema.table_b
There are a few options for where to go from here.
a.) We can ditch the old table table_a, say good riddance, and move forward with table_b. This is a simple solution: we just drop table bronze.schema.table_a and erase the historic data. If we don't need that history, that is fine; don't hoard data just because you can.
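If we go with (a), the drop itself is a one-liner. A minimal sketch, assuming an active SparkSession whose catalog resolves the three-part name bronze.schema.table_a to a managed Delta table:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("drop-old-bronze").getOrCreate()

# For a managed table, dropping it also deletes the underlying data files,
# so be certain the history really is disposable before running this.
spark.sql("DROP TABLE IF EXISTS bronze.schema.table_a")
```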
b.) We can forward the non-conformant struct or struct fields, creating a forwarder from the table_a schema to the table_b schema. This way we can read and rewrite the old table, opting into overwriteSchema=true on the writer back to the same table (table_a -> transformations -> table_a). This could be problematic for streaming applications reading from the table, since we would require all of them to also cut over to the new format.
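A rough sketch of option (b). This rewrites table_a in a batch pass, since overwriteSchema takes effect on a batch overwrite; transform_to_new_schema is a hypothetical forwarder standing in for the real field mapping:

```python
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("rewrite-in-place").getOrCreate()

def transform_to_new_schema(df: DataFrame) -> DataFrame:
    # Hypothetical forwarder: map the non-conformant fields onto the
    # table_b layout; swap in the real renames/casts here.
    return df.withColumn("event_ts", F.col("ts").cast("timestamp")).drop("ts")

# Delta's snapshot isolation lets us overwrite a table from a read of itself.
source = spark.read.table("bronze.schema.table_a")

(transform_to_new_schema(source)
    .write.format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")  # replace the table schema in place
    .saveAsTable("bronze.schema.table_a"))
```

Every reader of table_a then has to pick up the new schema at once, which is exactly the cut-over problem described above.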
c.) We can use most of the techniques from (b), but write to an intermediate table instead: a DEEP CLONE of table_a, followed by the transformation and overwriteSchema. This gives us a new table (table_c), so we maintain both the old (table_a) and new (table_c) versions of the same table. Once the clone is done, we read from the new table (table_b), copying into table_c until we catch up to real-time; then, based on the last committed transaction (the delta log version, a long), we can cut over the streaming application that writes to table_b so it writes to table_c instead. As long as the silver table continues to read from table_c, with some minor modifications, no downstream customers will notice that you transitioned to a new source of ingestion data.
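A sketch of option (c), under a few assumptions: DEEP CLONE is available (Databricks SQL; OSS Delta may only offer SHALLOW CLONE), the checkpoint path is hypothetical, and transform_to_new_schema is the same hypothetical forwarder as in the option (b) sketch:

```python
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("migrate-via-clone").getOrCreate()

def transform_to_new_schema(df: DataFrame) -> DataFrame:
    # Same hypothetical forwarder as in the option (b) sketch.
    return df.withColumn("event_ts", F.col("ts").cast("timestamp")).drop("ts")

# 1. Snapshot table_a's data and history into table_c; table_a stays untouched.
spark.sql("CREATE TABLE bronze.schema.table_c DEEP CLONE bronze.schema.table_a")

# 2. Rewrite the clone under the new schema.
(transform_to_new_schema(spark.read.table("bronze.schema.table_c"))
    .write.format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .saveAsTable("bronze.schema.table_c"))

# 3. Copy table_b's commits into table_c until caught up. availableNow
#    (Spark 3.3+) processes everything available and then stops; note the
#    last committed delta log version before cutting the writer over.
(spark.readStream.table("bronze.schema.table_b")
    .writeStream.format("delta")
    .option("checkpointLocation", "/tmp/_checkpoints/table_c_catchup")
    .trigger(availableNow=True)
    .toTable("bronze.schema.table_c")
    .awaitTermination())
```

From there the ingestion application flips its sink from table_b to table_c, and the silver job repoints its source to table_c.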