Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CT-2468] [Feature] For versioned models, automatically create view/clone of latest version in unsuffixed database location #7442

Open
Tracked by #7979 ...
jtcohen6 opened this issue Apr 24, 2023 · 23 comments
Milestone

Comments

@jtcohen6
Copy link
Contributor

jtcohen6 commented Apr 24, 2023

Thanks @dbeatty10 @joellabes for working through this with me!

As the producer of a versioned model, I want the unsuffixed database relation (<db>.<schema>.<model_name>) to always point to the latest version of that model. This provides downstream consumers with the same optionality that's available to ref: "pin" to a specific version (suffix), or "unpin" (no suffix) and always get the latest.

dbt should handle this for me, by creating an additional "pointer" view (select * from <latest>) or table clone (depending on the relation type and data platform support) whenever I run the latest version of my model.

Implementation options

I'm not in love with any of these options:

  1. Create a new node in the manifest with this logic, materialized as a view (or clone?). Not sure if/how we'd be able to set configurations on that additional node/object (e.g. persisting descriptions).
  2. Update the view, table, and incremental materializations to produce an additional relation, if the model is versioned (and this is the latest version)
  3. Provide users with a macro that they can "set and forget" as a model post-hook

Of those, (3) is the least magical, and it's what we'll propose to users in the short term (v1.5). We still don't like:

  • Hooks in general, and especially hooks that run DDL to create database objects
  • Lacking the robustness of an actual materialization
  • Requiring users to copy-paste a macro and update their project, when this is the Right Behavior that should just happen out of the box
  • Making the proposed logic for schema management more complicated
@jtcohen6 jtcohen6 added enhancement New feature or request model_versions labels Apr 24, 2023
@github-actions github-actions bot changed the title [Feature] For versioned models, automatically create view/clone of latest version in unsuffixed database location [CT-2468] [Feature] For versioned models, automatically create view/clone of latest version in unsuffixed database location Apr 24, 2023
@jtcohen6 jtcohen6 added the Refinement Maintainer input needed label Apr 24, 2023
@jtcohen6 jtcohen6 self-assigned this May 3, 2023
@jtcohen6
Copy link
Contributor Author

jtcohen6 commented Jun 9, 2023

We have three implementation options. Each has trade-offs around user limitations and complexity of implementation.

Next step: Let's spike each of the options, and make a call.

@jtcohen6 jtcohen6 added spike and removed Refinement Maintainer input needed labels Jun 9, 2023
@jtcohen6 jtcohen6 removed their assignment Jun 9, 2023
@nicohein
Copy link

Hi All,

I would like to propose a fourth option, that is a little bit more verbose but adds flexibility:

Option 4

The main model itself remains unchanged. It is expected to be implemented in dim_customers.sql. The materialization with the version number of the main model becomes optional. If you want to have this materialization, then you create an item in the versions.

- models:
  - name: dim_customers
     # it essentially becomes obsolete to specify the main/latest version
    config:
      materialized: table
      meta:
        key: value
    columns:
      - name: customer_id
        description: This is the primary key
        data_type: int
      - name: ...
    versions:
      - version: 1
        implemented_in: dim_customers_OLD
        deprecation_date: '2022-01-01'
        config:
          meta: {} #remove the meta from v2
        columns:
          include: ...
          exclude: ...
      - version: 2
        implemented_in: dim_customers
      - version: 3
        description: A dramatic reimagining of our customers model with no column overlap? What are they thinking?!
        implemented_in: dim_customers_NEW_BUT_NOT_READY
        # deprecation_date is optional and not included for this version
        columns:
          - name: totally_new_column_1
          - name: totally_new_column_2
          - name: totally_new_column_3
          ...

This would result in a second table being created called dim_customers_v2. If the developer instead choses to materialize _v2 as a view thats possible by instead writing:

      - version: 2
        # implemented_in: dim_customers_v2   # expected default 

And adding the corresponding .sql file:

{{ config(materialized="view") }}

select * from {{ ref('dim_customers') }}

If a team decides to have _v2 materialized as a table and define dim_customers as a view, that is now equally possible.

One remaining challenge I can think of is that a developer could introduce a drift between the main model and v2 because there is simply no guarantee of those models being equal. I regard this risk as low because whoever will make use of models, will put some thought behind the versioning model.

@Spince
Copy link

Spince commented Jun 14, 2023

this is concerning to me.. the whole purpose of versioning an api is to NOT introduce breaking changes. unless every model is postfixed with _v1 from the beginning the proposed solutions would cause breaking changes. ie- dim_customers was created but is now being modified, the above solution places a view where dim_customers "v1" originally existed

i need to give it more thought but it seems like the pointer object that routes a query to the latest version should instead be postfixed with _latest. i think this needs to be considered in terms of implemention on an existing project that's been around for years rather than a clean slate project that's perfectly configured according to dbt features and standards that weren't even discussed or codified until years after the original dim_customers table was created.

@nicohein
Copy link

I think I still would expect dim_customers to behave just like like dim_customers_latest. Nevertheless would the proposal for option 4 handle different ways of implementation:

dim_customers unchanged & latest materialized as table

- models:
  - name: dim_customers
    config:
      materialized: table
      meta:
        key: value
    columns:
       ...
    versions:
      - version: latest  # just like a tag in docker 
        implemented_in: dim_customers_v2
        alias: dim_customers_latest  # just in case default behavior results in dim_customers_vlatest
      ...
      - version: 2
        # implemented_in: dim_customers_v2  
      ...

dim_customers unchanged & latest as view

- models:
  - name: dim_customers
    config:
      materialized: table
      meta:
        key: value
    columns:
       ...
    versions:
      - version: latest
        alias: dim_customers_latest
      ...
      - version: 2
        # implemented_in: dim_customers_v2
      ...

and dim_customers_latest.sql:

{{ config(materialized="view") }}

select * from {{ ref('dim_customers', version="2") }}

dim_customers like latest & v2 + latest as view

- models:
  - name: dim_customers
    config:
      materialized: table
      meta:
        key: value
    columns:
       ...
    versions:
      - version: latest
        alias: dim_customers_latest
      - version: 2
        implemented_in: dim_customers_latest
     ...

and dim_customers_latest.sql:

{{ config(materialized="view") }}

select * from {{ ref('dim_customers') }}

Happy to clarify any questions around this idea and receive some more feedback.

@Spince
Copy link

Spince commented Jun 15, 2023

@nicohein I guess I should have been more specific. I don't agree with I want the unsuffixed database relation (<db>.<schema>.<model_name>) to always point to the latest version of that model from the original comment here.. if I was implementing such a feature I would not do it that way.

jtcohen6 added a commit to dbt-labs/docs.getdbt.com that referenced this issue Jun 30, 2023
This is still just a placeholder "YMMV" solution until we decide on a real implementation:
- dbt-labs/dbt-core#7442
joellabes added a commit to dbt-labs/docs.getdbt.com that referenced this issue Jul 4, 2023
This is still just a placeholder "YMMV" solution until we decide on a
real implementation:
- dbt-labs/dbt-core#7442

---

If you're versioning `some_model` for the first time:
- that model is materialized as a `table`
- `my_db.my_schema.some_model` will exist as a table in the database
- needs to be dropped before running `create or replace view`

(NOTE: I need to actually test this code)

Co-authored-by: Joel Labes <[email protected]>
@jtcohen6
Copy link
Contributor Author

jtcohen6 commented Jul 18, 2023

From discussion yesterday:

  • We're unlikely to move forward with option 1. Adding this as a new node to the manifest doesn't feel quite right, even if it would in a similar case for metricflow_time_spine
  • While this makes sense as a default/common pattern, we think enough users will want a way to opt out of or modify this pattern, that the explicitness of option 3 is appealing. For example, there might not be a need for this behavior if you're defining a versioned model with a single version, as for the use case described in [CT-2787] [Bug] dbt is too aggressive about preventing contracted changes during CI  #8028 (comment)
  • Whether we move forward with option 2 (implicit) or option 3 (explicit), the thing to figure out is what code we'd be calling (whether in our materialization/task or in the users' hook). Now that there's a clone materialization, which abstracts over DWH support for table clones and the object relation type to produce a clone or "pointer view" accordingly — can we call that as a macro in a hook, in a way that's cleaner than the code we currently have documented?

Next step:

@joellabes
Copy link
Contributor

My preference is Option 2.5, where it happens automatically but the logic is implemented in an overridable macro in the vein of generate_alias_name so that you can modify it if you want to. You could even do something like

models:
  - name: some_model
     versions: 
       - v: 1
         ...
       - v: 2
         config: 
           generate_pointer_view: false

if you wanted model-by-model control, right?

@alison985
Copy link

Link to a Slack message with the way I think about this. https://getdbt.slack.com/archives/C2JRRQDTL/p1692185439979629?thread_ts=1692185332.030239&cid=C2JRRQDTL Specifically:

Rename foobar to foobar_v1
Create foobar as view (it should know that foobar in after is not the same as foobar in before)
Create foobar_v2

I also have to make sure foobar(view/MV), foobar_v1(table/MV), and foobar(table/MV) are all getting refreshed data. Assume there are some objects that I could be maintaining 3 versions of at the same time for months.

@Spince is 💯 correct - you can't have an object name change version. 1 object name == 1 version == 1 model. If I'm reading correctly, I think that means that Option 4 from @nicohein with @Spince's requirements is the closest thinking. I'm not sure I understand @joellabes's latest comment.

Right now, I'm handling this in naming convention as opposed to anything in dbt syntax because I need explicit control and dbt doesn't support ALTERs and changing materialization type of an object. (I haven't thought through incremental models for this yet, but I'm sure there's interesting cases there.) If you have the benefit of starting a code base from scratch everything has a _v1 suffix, _v2 suffix, and so on. Sometimes I put the version on the schema, sometimes on the individual model. When you don't have that luxury, the non-suffixed version is _v1 implicitly.

Think of this like API version control. I have to give downstream(e.g. BI) a chance to update. The downstream customer is pinned to a version, it's just a matter if they know that explicitly(vX suffix) or not(no _vX suffix). The version append suffixes are important for organizational expectations and communications.

If I want to have dim_customers_latest, that can be an optional object creation so people can choose to be on the bleeding edge that might break at anytime(e.g. a developer release).

@jtcohen6
Copy link
Contributor Author

@Spince @nicohein @alison985 Thank you for the feedback on this! We've been working on the technical spiking that might make this mechanism possible. I'm coming back around to the questions about the desirable product / end user experience.

After reading through the comments above, this is what I'd propose:

  • This "latest" pointer view/clone should have a suffix _latest, creating a database object named <modelname>_latest.
  • This will be a pointer view / clone of whichever version the dbt model's maintainer has specified via latest_version, or the default if not specified.
  • This is instead of having an unsuffixed representation (<modelname>) by default.
  • dbt should automatically create this view/clone for all versioned models.
  • It should be possible to disable dbt's automatic creation of the "latest" view/clone.
  • If disabled, you could then recreate the "latest" view/clone, using all your own custom configs, as @nicohein was showing above:
versions:
  - v: latest  # this is an actual version of the model called 'latest'
    defined_in: dim_customers_latest  # it needs a custom definition
    config:
      alias: dim_customers_latest  # instead of default, which would be '_vlatest'

In this case, we are defining another real version of the model — not just a pointer to an existing (latest) version. It has all the same flexibility as any other model version.


How should users disable dbt's automatic creation of the "latest" view/clone? Ideas:

  • By passing an additional config, as @joellabes suggested (generate_pointer_view: false), to the top-level model definition
  • By creating a v: latest and explicitly disabling it — though this doesn't feel quite right to me:
models:
  - name: dim_customers
    versions:
      - v: latest
        config: {enabled: false}

I will give that last piece a bit more thought - but I'm curious to hear how this lands overall!

@alison985
Copy link

Sounds right to me. I would vote for disabling via configuration as @joellabes suggested, but I don't think it can be specified at the version level(see below).

New item: you're repeatedly saying "view/clone" instead of just view. But I don't see a proposed spec on where to define which it is in YML. I learned about dbt clone today and I don't want to use it, but presumably others will. Also, since it's a choice, I removed "view" from the keyword being proposed. I propose the following, Option A:

models:
  - name: some_model
    config:
      materialized: table

      ########
      generate_pointer: false
      # OR                                      # both can not be specified at the same time
      materialized_pointer: [view | clone]      # default to view if not specified
      ########
    versions: 
       - v: 1
         ...
       - v: 2
         # How would we know what period of time an automatic alias of a version should be disabled?
         #config:   
         #  generate_pointer_view: false
       - v: 3
         #config:
         #  generate_pointer_view: true

I thought about doing (Option B)generate_pointer: [false | view | clone] because they are mutually exclusive, but it didn't feel right to mix a boolean and a string. I guess you could do (Option C)generate_pointer: [no | view | clone] but it would introduce new boolean keywords(yes/no) to dbt YML which also feels wrong.

@jtcohen6 jtcohen6 removed the spike label Sep 26, 2023
@jtcohen6
Copy link
Contributor Author

@alison985 Thanks for taking another look!

Re: "view/clone," I had been thinking that:

  1. If the "latest version" of this model corresponds to a table in the database (because it's materialized as table or incremental), and this DWH supports table clones, we would create a table clone. This shares all metadata attributes of the cloned table.
  2. If the "latest version" of this model is materialized as a view (or materialized view, or something else), or if this DWH does not support table clones, we would create a pointer view (create view ... as select * from ...).

The benefit of this approach is that metadata queries expecting this object to be a table would still work, if the conditions for (1) are met. But it is more complex as an approach. I get that it would be simpler to just say: This is always a pointer view.

So to answer all the outstanding questions here:

  1. Let's keep things simple, by always creating a view.
  2. The view should be named _latest.
  3. We check the config generate_latest_pointer: true|false. It is true by default, the user can set to false in any of three places: dbt_project.yml for many/all models; in the top-level model config for this versioned model; or in the config for the "latest" version of a versioned model.
  4. For more custom behavior, users could add a v: latest with their own alias config, contents, etc.

@SBurwash
Copy link

SBurwash commented Jan 5, 2024

Hey everyone 👋 very much in love with this feature.

Question - with the pointer, would we be able to pass documentation? The goal here would be to have the pointer have the same documentation as the latest table (through the persist_docs feature)

@smitsrr
Copy link

smitsrr commented Mar 13, 2024

Just had a production bug this morning because the new _v1 model didn't have a database object (unsuffixed) which triggered a full-refresh 😬 .
I am emphatically upvoting this issue for development!

@AnotherGuitar
Copy link

*upvote

1 similar comment
@shahinsaleh
Copy link

*upvote

@geoHeil
Copy link

geoHeil commented Aug 20, 2024

upvote

@rapha-pereira
Copy link

upvote!

@frannydelaney
Copy link

upvote (is there a better way to do this?)

@elyobo
Copy link

elyobo commented Sep 4, 2024

upvote (is there a better way to do this?)

Yes, use 👍 on the issue

@sigveks
Copy link

sigveks commented Sep 16, 2024

Regarding the implementation of post-hooks for triggering the view generation of latest model. I think that on-run end hook is better than project-wide post-hook as any model specific post-hooks will run after the generation of a view. This means that if you have a post-hook that is manipulating the data of the model, the view on top will not be reflected by these changes. Consider changing the documentation here https://docs.getdbt.com/docs/collaborate/govern/model-versions#configuring-database-location-with-alias

@ChristianP89
Copy link

Upvoting this as well 👍

@graciegoheen graciegoheen added this to the v1.10 milestone Nov 19, 2024
@lord-skinner
Copy link

Up doot 👍🏼

@dassowmd
Copy link

*upvote

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests