
feat: Inspect columns #386

Merged: 20 commits, Feb 1, 2024

Conversation

@gastlich (Contributor) commented Oct 7, 2023

This is a:

  • new functionality

Link to Issue

Closes #385 #199

Description & motivation

As mentioned in #385, it would be great if this tool could create a model using all the information gathered for columns. This would enable us to begin implementing basic validation rules for columns, such as:

  1. Checking for missing descriptions.
  2. Ensuring correct naming conventions for columns, which would prevent certain names from being used.

Furthermore, in the future, we can expand its use for more advanced validation purposes.
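The two rules above could be prototyped outside dbt as well. A minimal Python sketch, assuming column metadata has already been pulled out of dbt's manifest into plain dicts (the field names here are hypothetical; the actual package does this with Jinja macros and a staging model):

```python
import re

# Hypothetical column metadata, shaped like rows a columns model might expose.
COLUMNS = [
    {"node_unique_id": "model.jaffle_shop.orders", "name": "order_id", "description": "Primary key."},
    {"node_unique_id": "model.jaffle_shop.orders", "name": "OrderDate", "description": ""},
]

SNAKE_CASE = re.compile(r"^[a-z][a-z0-9_]*$")

def missing_descriptions(columns):
    """Rule 1: flag columns without a description."""
    return [c for c in columns if not c["description"].strip()]

def bad_names(columns):
    """Rule 2: flag columns that break a snake_case naming convention."""
    return [c for c in columns if not SNAKE_CASE.match(c["name"])]
```

Here `OrderDate` would be flagged by both rules; in the package itself these checks would be expressed as SQL models over the columns table rather than Python.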

This is my first PR in this project, and I'd appreciate any hints or suggestions on how to deliver the best outcome.

TODO

  • Integration Tests
  • Test on postgres
  • Check if sources.tables changes data structure from graph
  • Update documentation

Integration Test Screenshot

Checklist

  • I have verified that these changes work locally on the following warehouses (Note: it's okay if you do not have access to all warehouses, this helps us understand what has been covered)
    • BigQuery (screenshot attached)
    • Postgres
    • Redshift
    • Snowflake
    • Databricks
    • DuckDB
    • Trino/Starburst
  • I have updated the README.md (if applicable)
  • I have added tests & descriptions to my models (and macros if applicable)

@gastlich gastlich mentioned this pull request Oct 9, 2023
@dave-connors-3 dave-connors-3 marked this pull request as ready for review October 15, 2023 21:13
@dave-connors-3 (Collaborator) commented Oct 15, 2023

hey @gastlich!

this is looking great, really appreciate your work here! I was able to run this locally and it seems to be working great! the CI check failure seems to be on our side, not related to your changes as far as I can tell.

I think this is a great compromise to provide folks with a base to build column level checks into their projects without creating unnecessary noise for those who don't want or need column level checks!

Couple small things:

  1. could you add tests and descriptions to the new models (at least to stg_columns)? It would be great to ensure the grain is consistent and to have clear docs explaining the purpose of the model
  2. could you add a blurb about this model set into the actual documentation website? I think maybe a new markdown file in the docs/customization folder could be appropriate (and a corresponding addition to the config)

Let me know if you need a hand with either of these!

@gastlich gastlich force-pushed the get-column-values branch 2 times, most recently from 2716ff6 to 3572103 Compare October 24, 2023 22:07
@gastlich (Contributor, Author) commented Oct 24, 2023

hey @dave-connors-3

Thanks for your feedback and suggestions! I've pushed some changes including:

  1. I've added definitions for stg_columns.sql to graph.yml and int_all_columns.sql to core.yml. I hope this is what you meant by "adding descriptions to the new models"
  2. When it comes to "consistent" grain, would you prefer to use surrogate key, which is a new column with the value defined as <node.unique_id>-<column.name> or composite key with test implemented by dbt_utils.unique_combination_of_columns?
  3. I would like to ask you what's your preference regarding the naming of columns in the new models? Should I prefix all of the node's columns with node_ prefix to make it explicit, that it doesn't relate to column? For example node_unique_id instead of unique_id, and node_name instead of name?
  4. I've added a new documentation page querying-columns.md with a basic explanation of the model.
  5. I updated mkdocs.yml to include the new page.

I still need to figure out what's needed for the integration tests, but this is a task for another evening :) Could you review what has been recently published, please?
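The two grain options in point 2 can be sketched quickly. This is an illustrative Python sketch only (the row shapes and names are hypothetical); in the package itself the composite-key option would be asserted in SQL via `dbt_utils.unique_combination_of_columns`:

```python
# Hypothetical rows shaped like the output of a columns model.
rows = [
    {"unique_id": "model.proj.orders", "name": "order_id"},
    {"unique_id": "model.proj.orders", "name": "status"},
]

# Option A: a surrogate key "<node.unique_id>-<column.name>", tested with a
# plain uniqueness test on the new column.
def surrogate_key(row):
    return f"{row['unique_id']}-{row['name']}"

# Option B: a composite key -- uniqueness asserted over the pair of columns,
# which is what dbt_utils.unique_combination_of_columns checks in SQL.
def composite_is_unique(rows):
    pairs = [(r["unique_id"], r["name"]) for r in rows]
    return len(pairs) == len(set(pairs))
```

Both encode the same grain (one row per node/column pair); the surrogate key trades an extra generated column for a simpler single-column test.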

@dave-connors-3 (Collaborator) left a review comment:

this is looking really good! Given that this package runs on some really large customer projects, and therefore there's a risk that this model might be rather non-performant, I think I'd prefer to have the columns models disabled by default, and allow users to turn them on if they want to run them. Would you mind adding those configs to the dbt_project.yml?

Other than that, just a couple small suggestions!

models/marts/core/int_all_columns.sql (comment outdated, resolved)
docs/customization/querying-columns.md (comment outdated, resolved)
@b-per (Collaborator) left a review comment:

Thanks for the work on this one!

I added a couple of comments about small changes.

docs/customization/querying-columns.md (comment outdated, resolved)
macros/unpack/get_column_values.sql (comment outdated, resolved)
@b-per (Collaborator) commented Oct 27, 2023

I just ran some performance tests on our Internal Analytics project with a significant number of models.

The memory footprint was checked with /usr/bin/time -l dbt ..., looking at the maximum resident set size.
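For reference, a rough Python analogue of that measurement (an illustrative sketch, not part of the PR) using the standard library:

```python
import resource
import subprocess

def peak_child_rss(cmd):
    """Run a command and return the peak resident set size of child processes.

    A rough analogue of checking "maximum resident set size" with
    `/usr/bin/time -l <cmd>` (macOS) or `/usr/bin/time -v <cmd>` (GNU time).
    Note: ru_maxrss is reported in bytes on macOS but kilobytes on Linux.
    """
    subprocess.run(cmd, check=True)
    return resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
```

For example, `peak_child_rss(["dbt", "build", "--select", "int_all_columns"])` (hypothetical invocation) would capture the figure for just the new models.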

Running DPE with the new models

  • Memory: 427 MB
  • Time: 55s

Running DPE without the new models

  • Memory: 411 MB
  • Time: 43s

Running DPE just for the new models

  • Memory: 435 MB
  • Time: 24s

Conclusions

  • This new table has a negligible impact on the memory used by dbt (427 MB vs 411 MB)
  • Even on a large project, the runtime impact is only about 12s (55s vs 43s)

@dave-connors-3, what do you think about disabling the model by default? Does it still make sense? I'd be OK with having it enabled by default, especially if it unlocks further use cases (e.g. checking constraints + tests)

@gastlich (Contributor, Author) commented Oct 29, 2023

@dave-connors-3 @b-per

  1. I've updated docs
  2. Rewritten int_all_columns as an import CTE, joining with stg_nodes
  3. Made stg_columns a view

I haven't disabled the model yet. Once you make the final call, I will implement whatever is suggested. :)

Let me know what the next steps are. Thanks a lot for looking into this! 🙇

@b-per (Collaborator) left a review comment:

Thanks a lot for the changes. One last comment from me.

models/marts/core/int_all_columns.sql (comment outdated, resolved)
@gastlich (Contributor, Author)

@dave-connors-3 @b-per any other thoughts? I think I still need to add something to the integration tests? 🤔

@b-per (Collaborator) commented Dec 4, 2023

Hi @gastlich ! Yes, I think that if you add an integration test on the table we can do a final review and get it merged!

@dave-connors-3 (Collaborator)

hey @gastlich! any update on adding tests here? would love to get this in!

@gastlich (Contributor, Author) commented Jan 5, 2024

Hey @dave-connors-3, Happy New Year! Apologies for the delay; I was quite busy during the Christmas break. I've managed to find some time to work on integration tests.

Just to ensure we're aligned: as I haven't created any fct_ tests specifically based on the int_all_columns model, it's not straightforward to test fragments of that model. Instead, I've provided a single test covering the entire int_all_columns model. In the future, when you opt to introduce more specialised column-based tests to this library, you can substitute my generic tests with the newly developed ones.

I hope this meets your requirements. Looking forward to your review! 🙇

@dave-connors-3 (Collaborator) left a review comment:

thanks for your patience @gastlich! a couple things:

  • i think we can get rid of the core_seeds stuff -- apologies if i misled you, but we generally just use those seeds to assert the output of the fact models. doing a uniqueness test is probably enough.
  • i think we should rethink the join in int_all_columns -- we are excluding source columns in the current implementation, and I think we can get away with relatively little additional context in that column table, and people can join back on unique_id as needed. I can talk with the team and see if they agree on that last point, but i think the join will still need to be rethought!

models/marts/core/int_all_columns.sql (comment outdated, resolved)
macros/unpack/get_column_values.sql (comment resolved)
@gastlich (Contributor, Author)

thanks for your patience @gastlich! a couple things:

* i think we can get rid of the `core_seeds` stuff -- apologies if i misled you, but we generally just use those seeds to assert the output of the fact models. doing a uniqueness test is probably enough.

* i think we should rethink the join in `int_all_columns` -- we are excluding source columns in the current implementation, and I think we can get away with relatively little additional context in that column table, and people can join back on unique_id as needed. I can talk with the team and see if they agree on that last point, but i think the join will still need to be rethought!

@dave-connors-3 I've removed the test seed file and core_seeds.yml. When it comes to the join, I think you're right that you need to agree with the rest of the team on the best approach here :) I'm not familiar enough with the project to decide on the final shape! Let me know once you have an update.

wrap_string_with_quotes(node.unique_id),
wrap_string_with_quotes(dbt.escape_single_quotes(column.name)),
wrap_string_with_quotes(dbt.escape_single_quotes(column.description)),
'null' if not column.data_type else wrap_string_with_quotes(dbt.escape_single_quotes(column.data_type)),
A collaborator left a review comment:

I believe we can remove the 'null' if not column.data_type else because the wrap_string_with_quotes macro automatically handles NULLs

Suggested change:
- 'null' if not column.data_type else wrap_string_with_quotes(dbt.escape_single_quotes(column.data_type)),
+ wrap_string_with_quotes(dbt.escape_single_quotes(column.data_type)),
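The reasoning behind that suggestion can be illustrated outside Jinja. This is a hedged Python analogue, not the package's actual macro: `wrap_string_with_quotes_py` and `escape_single_quotes_py` are hypothetical stand-ins showing how NULL handling inside the wrapper removes the need for a guard at the call site:

```python
def escape_single_quotes_py(value):
    # Analogue of dbt's escape_single_quotes: double up quotes for SQL literals.
    return value.replace("'", "''")

def wrap_string_with_quotes_py(value):
    # Illustrative analogue of the package's wrap_string_with_quotes macro:
    # emitting a SQL NULL literal for missing values means callers no longer
    # need their own "'null' if not value else ..." guard.
    if value is None:
        return "null"
    return f"'{value}'"

# The simplified call site from the suggested change, in miniature:
data_type = None
rendered = wrap_string_with_quotes_py(data_type)  # "null", no guard needed
```

With a non-null value the wrapper still produces a properly escaped literal, e.g. `wrap_string_with_quotes_py(escape_single_quotes_py("int'l"))` yields `'int''l'`.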

macros/unpack/get_column_values.sql Outdated Show resolved Hide resolved
@graciegoheen (Collaborator) left a review comment:

Let's merge it! Thanks for all the hard work here, I know folks are going to be stoked!

@dave-connors-3 (Collaborator) left a review comment:

neat

@graciegoheen graciegoheen merged commit c9e9275 into dbt-labs:main Feb 1, 2024
6 of 7 checks passed
@gastlich gastlich deleted the get-column-values branch February 2, 2024 08:41

Merging this pull request closes the issue: Column-level granularity