
feat: Inspect columns #386

Merged: 20 commits, Feb 1, 2024

Conversation

@gastlich (Contributor) commented Oct 7, 2023

This is a:

  • new functionality

Link to Issue

Closes #385 #199

Description & motivation

As mentioned in #385, it would be great if this tool could create a model using all the information gathered for columns. This would enable us to begin implementing basic validation rules for columns, such as:

  1. Checking for missing descriptions.
  2. Ensuring correct naming conventions for columns, which would prevent certain names from being used.

Furthermore, in the future, we can expand its use for more advanced validation purposes.
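The two rules above could be prototyped outside dbt as well. A minimal Python sketch, assuming column metadata has already been pulled out of dbt's manifest into plain dicts (the field names here are hypothetical; the actual package does this with Jinja macros and a staging model):

```python
import re

# Hypothetical column metadata, shaped like rows a columns model might expose.
COLUMNS = [
    {"node_unique_id": "model.jaffle_shop.orders", "name": "order_id", "description": "Primary key."},
    {"node_unique_id": "model.jaffle_shop.orders", "name": "OrderDate", "description": ""},
]

SNAKE_CASE = re.compile(r"^[a-z][a-z0-9_]*$")

def missing_descriptions(columns):
    """Rule 1: flag columns without a description."""
    return [c for c in columns if not c["description"].strip()]

def bad_names(columns):
    """Rule 2: flag columns that break a snake_case naming convention."""
    return [c for c in columns if not SNAKE_CASE.match(c["name"])]
```

Here `OrderDate` would be flagged by both rules; in the package itself these checks would be expressed as SQL models over the columns table rather than Python.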

This is my first PR in this project, and I'd appreciate any hints or suggestions on how to deliver the best outcome.

TODO

  • Integration Tests
  • Test on postgres
  • Check if sources.tables changes data structure from graph
  • Update documentation

Integration Test Screenshot

Checklist

  • I have verified that these changes work locally on the following warehouses (Note: it's okay if you do not have access to all warehouses, this helps us understand what has been covered)
    • BigQuery (screenshot attached)
    • Postgres
    • Redshift
    • Snowflake
    • Databricks
    • DuckDB
    • Trino/Starburst
  • I have updated the README.md (if applicable)
  • I have added tests & descriptions to my models (and macros if applicable)

@gastlich gastlich mentioned this pull request Oct 9, 2023
@dave-connors-3 dave-connors-3 marked this pull request as ready for review October 15, 2023 21:13
@dave-connors-3 (Collaborator) commented Oct 15, 2023

hey @gastlich!

this is looking great, really appreciate your work here! I was able to run this locally and it seems to be working great! the CI check failure seems to be on our side, not related to your changes as far as I can tell.

I think this is a great compromise to provide folks with a base to build column level checks into their projects without creating unnecessary noise for those who don't want or need column level checks!

Couple small things:

  1. could you add tests and descriptions to the new models (at least to stg_columns)? It would be great to ensure the grain is consistent and to have clear docs explaining the purpose of the model
  2. could you add a blurb about this model set into the actual documentation website? I think maybe a new markdown file in the docs/customization folder could be appropriate (and a corresponding addition to the config)

Let me know if you need a hand with either of these!

@gastlich gastlich force-pushed the get-column-values branch 2 times, most recently from 2716ff6 to 3572103 Compare October 24, 2023 22:07
@gastlich (Contributor, Author) commented Oct 24, 2023

hey @dave-connors-3

Thanks for your feedback and suggestions! I've pushed some changes including:

  1. I've added definitions for stg_columns.sql to graph.yml and int_all_columns.sql to core.yml. I hope this is what you meant by "adding descriptions to the new models"
  2. When it comes to "consistent" grain, would you prefer to use surrogate key, which is a new column with the value defined as <node.unique_id>-<column.name> or composite key with test implemented by dbt_utils.unique_combination_of_columns?
  3. I would like to ask you what's your preference regarding the naming of columns in the new models? Should I prefix all of the node's columns with node_ prefix to make it explicit, that it doesn't relate to column? For example node_unique_id instead of unique_id, and node_name instead of name?
  4. I've added a new documentation page querying-columns.md with a basic explanation of the model.
  5. I updated mkdocs.yml to include the new page.

I still need to figure out what's needed for the integration tests, but this is a task for another evening :) Could you review what has been recently published, please?
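The two grain options in point 2 can be sketched quickly. This is an illustrative Python sketch only (the row shapes and names are hypothetical); in the package itself the composite-key option would be asserted in SQL via `dbt_utils.unique_combination_of_columns`:

```python
# Hypothetical rows shaped like the output of a columns model.
rows = [
    {"unique_id": "model.proj.orders", "name": "order_id"},
    {"unique_id": "model.proj.orders", "name": "status"},
]

# Option A: a surrogate key "<node.unique_id>-<column.name>", tested with a
# plain uniqueness test on the new column.
def surrogate_key(row):
    return f"{row['unique_id']}-{row['name']}"

# Option B: a composite key -- uniqueness asserted over the pair of columns,
# which is what dbt_utils.unique_combination_of_columns checks in SQL.
def composite_is_unique(rows):
    pairs = [(r["unique_id"], r["name"]) for r in rows]
    return len(pairs) == len(set(pairs))
```

Both encode the same grain (one row per node/column pair); the surrogate key trades an extra generated column for a simpler single-column test.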

@dave-connors-3 (Collaborator) left a review comment:

this is looking really good! Given that this package runs on some really large customer projects, and therefore there's a risk that this model might be rather non-performant, I think I'd prefer to have the columns models disabled by default, and allow users to turn them on if they want to run them. Would you mind adding those configs to the dbt_project.yml?

Other than that, just a couple small suggestions!

models/marts/core/int_all_columns.sql (comment outdated, resolved)
docs/customization/querying-columns.md (comment outdated, resolved)
@b-per (Collaborator) left a review comment:

Thanks for the work on this one!

I added a couple of comments about small changes.

docs/customization/querying-columns.md (comment outdated, resolved)
macros/unpack/get_column_values.sql (comment outdated, resolved)
@b-per (Collaborator) commented Oct 27, 2023

I just ran some performance tests on our Internal Analytics project with a significant number of models.

The memory footprint was checked with /usr/bin/time -l dbt ..., looking at the maximum resident set size.
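For reference, a rough Python analogue of that measurement (an illustrative sketch, not part of the PR) using the standard library:

```python
import resource
import subprocess

def peak_child_rss(cmd):
    """Run a command and return the peak resident set size of child processes.

    A rough analogue of checking "maximum resident set size" with
    `/usr/bin/time -l <cmd>` (macOS) or `/usr/bin/time -v <cmd>` (GNU time).
    Note: ru_maxrss is reported in bytes on macOS but kilobytes on Linux.
    """
    subprocess.run(cmd, check=True)
    return resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
```

For example, `peak_child_rss(["dbt", "build", "--select", "int_all_columns"])` (hypothetical invocation) would capture the figure for just the new models.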

Running DPE with the new models

  • Memory: 427 MB
  • Time: 55s

Running DPE without the new models

  • Memory: 411 MB
  • Time: 43s

Running DPE just for the new models

  • Memory: 435 MB
  • Time: 24s

Conclusions

  • This new table has a negligible impact on the memory used by dbt (427 MB vs 411 MB)
  • Even on a large project, the runtime impact is only about 12s (55s vs 43s)

@dave-connors-3, what do you think about disabling the model by default? Does it still make sense? I'd be OK with having it enabled by default, especially if it unlocks further use cases (e.g. checking constraints + tests)

@gastlich (Contributor, Author) commented Oct 29, 2023

@dave-connors-3 @b-per

  1. I've updated docs
  2. Rewritten int_all_columns as an import CTE, joining with stg_nodes
  3. Made stg_columns a view

I haven't disabled the model yet. Once you make the final call, I will implement whatever is suggested. :)

Let me know what the next steps are. Thanks a lot for looking into this! 🙇

@b-per (Collaborator) left a review comment:

Thanks a lot for the changes. One last comment from me.

models/marts/core/int_all_columns.sql (comment outdated, resolved)
@gastlich (Contributor, Author)

@dave-connors-3 @b-per any other thoughts? I think I still need to add something to the integration tests? 🤔

@b-per (Collaborator) commented Dec 4, 2023

Hi @gastlich ! Yes, I think that if you add an integration test on the table we can do a final review and get it merged!

@dave-connors-3 (Collaborator)

hey @gastlich! any update on adding tests here? would love to get this in!

@gastlich (Contributor, Author) commented Jan 5, 2024

Hey @dave-connors-3, Happy New Year! Apologies for the delay; I was quite busy during the Christmas break. I've managed to find some time to work on integration tests.

Just to ensure we're aligned: as I haven't created any fct_ tests specifically based on the int_all_columns model, it's not straightforward to test fragments of that model. Instead, I've provided a single test covering the entire int_all_columns model. In the future, when you opt to introduce more specialised column-based tests to this library, you can substitute my generic tests with the newly developed ones.

I hope this meets your requirements. Looking forward to your review! 🙇

@dave-connors-3 (Collaborator) left a review comment:

thanks for your patience @gastlich! a couple things:

  • i think we can get rid of the core_seeds stuff -- apologies if i misled you, but we generally just use those seeds to assert the output of the fact models. doing a uniqueness test is probably enough.
  • i think we should rethink the join in int_all_columns -- we are excluding source columns in the current implementation, and I think we can get away with relatively little additional context in that column table, and people can join back on unique_id as needed. I can talk with the team and see if they agree on that last point, but i think the join will still need to be rethought!

models/marts/core/int_all_columns.sql (comment outdated, resolved)
macros/unpack/get_column_values.sql (comment resolved)
@gastlich (Contributor, Author)

thanks for your patience @gastlich! a couple things:

* i think we can get rid of the `core_seeds` stuff -- apologies if i misled you, but we generally just use those seeds to assert the output of the fact models. doing a uniqueness test is probably enough.

* i think we should rethink the join in `int_all_columns` -- we are excluding source columns in the current implementation, and I think we can get away with relatively little additional context in that column table, and people can join back on unique_id as needed. I can talk with the team and see if they agree on that last point, but i think the join will still need to be rethought!

@dave-connors-3 I've removed the test seed file and core_seeds.yml. When it comes to the join, I think you're right that you need to agree with the rest of the team on the best approach here :) I'm not familiar enough with the project to decide on the final shape! Let me know once you have an update.

wrap_string_with_quotes(node.unique_id),
wrap_string_with_quotes(dbt.escape_single_quotes(column.name)),
wrap_string_with_quotes(dbt.escape_single_quotes(column.description)),
'null' if not column.data_type else wrap_string_with_quotes(dbt.escape_single_quotes(column.data_type)),
A collaborator left a review comment:

I believe we can remove the 'null' if not column.data_type else because the wrap_string_with_quotes macro automatically handles NULLs

Suggested change:
- 'null' if not column.data_type else wrap_string_with_quotes(dbt.escape_single_quotes(column.data_type)),
+ wrap_string_with_quotes(dbt.escape_single_quotes(column.data_type)),
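The reasoning behind that suggestion can be illustrated outside Jinja. This is a hedged Python analogue, not the package's actual macro: `wrap_string_with_quotes_py` and `escape_single_quotes_py` are hypothetical stand-ins showing how NULL handling inside the wrapper removes the need for a guard at the call site:

```python
def escape_single_quotes_py(value):
    # Analogue of dbt's escape_single_quotes: double up quotes for SQL literals.
    return value.replace("'", "''")

def wrap_string_with_quotes_py(value):
    # Illustrative analogue of the package's wrap_string_with_quotes macro:
    # emitting a SQL NULL literal for missing values means callers no longer
    # need their own "'null' if not value else ..." guard.
    if value is None:
        return "null"
    return f"'{value}'"

# The simplified call site from the suggested change, in miniature:
data_type = None
rendered = wrap_string_with_quotes_py(data_type)  # "null", no guard needed
```

With a non-null value the wrapper still produces a properly escaped literal, e.g. `wrap_string_with_quotes_py(escape_single_quotes_py("int'l"))` yields `'int''l'`.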

macros/unpack/get_column_values.sql Outdated Show resolved Hide resolved
@graciegoheen (Collaborator) left a review comment:

Let's merge it! Thanks for all the hard work here, I know folks are going to be stoked!

@dave-connors-3 (Collaborator) left a review comment:

neat

@graciegoheen graciegoheen merged commit c9e9275 into dbt-labs:main Feb 1, 2024
6 of 7 checks passed
@gastlich gastlich deleted the get-column-values branch February 2, 2024 08:41

Merging this pull request closes the issue: Column-level granularity