Problems with feature column names exceeding Postgres' max identifier length #739

adunmore · 2020-05-21T19:05:01Z

Triage can try to create column names longer than Postgres' 63 character max identifier length. When this happens, Postgres truncates the candidate column name.

I've seen this causing problems in the feature generation code where an expected column name doesn't match the actual name in database. For example, when triage applies imputation rules, it first queries a feature table for the list of columns containing nulls. If one of the column names returned was truncated on creation, then triage doesn't have an imputation rule to handle it.

This might also cause problems where triage tries to create duplicate columns, as when creating columns for different aggregations of the same quantity:

'[some_63_char_identifier]_sum'
and
'[some_63_char_identifier]_max'
both get truncated to the same identifier:
'[some_63_char_identifier]'

The text was updated successfully, but these errors were encountered:

ecsalomon · 2020-05-21T20:47:33Z

Thanks for flagging this. The temporary workaround is to shorten the name of your columns in your from_obj or when you are selecting values for categorical columns (this is where I tend to run into it). For example, for categoricals, I will sometimes use something like left(6) to produce shortened names.

thcrock · 2020-05-21T23:01:19Z

It's worth thinking about what Triage should do in this case. Just flag it in validation and throw an error as early as possible, or is there a reasonable way to autocorrect/autoshorten for the user?

In either case, it'll probably be easier to fix it if #608 is merged, because in it the full feature names (with intervals and aggregate functions, etc) are available much earlier in the process, before the database starts doing work.

adunmore · 2020-05-22T00:09:53Z

Could we

implement an option to specify an alias for the source column name (the middle part in the generated name) in the config file
throw an error and require that the user specify a shorter name in the alias

This has the advantage of being easier than trying to figure out an approach to automatically shortening the name, and allows the user to ensure that the resulting name is meaningful/readable while being unique (compared to something like generating a hash of the source column name)

thcrock · 2020-05-22T00:17:21Z

I think #1 is one of the things @ecsalomon was alluding to, if I read it right. The aliasing you can do now.

If you specify the quantity as a dict instead of a string you can call the output column whatever you want. Dirtyduckling uses this aliasing option:
https://github.com/dssg/triage/blob/master/example/config/dirty-duckling.yaml#L32

Instead of
quantity: <sourcecolname>
do

quantity:
   <alias>: <sourcecolname>

thcrock · 2020-05-22T00:24:05Z

As far as the validation goes, this ticket exists for that.

#517
#514

A bunch of people have hit this problem, so it would be very nice to fix! As of yet it hasn't been done because it's a bit complicated to catch this before feature creation given how the feature creation code is structured right now (hence my bringing up #608)

ecsalomon · 2020-05-22T14:55:07Z

To be slightly more useful, here is an example feature group with my categorical workaround:

 -
        prefix: 'dx'
        from_obj: |
            (SELECT entity_id,
                   knowledge_date,
                   left(diagnosis, 6) AS dx                     # shortened name
              FROM diagnosis_table) AS dx
        knowledge_date_column: 'knowledge_date'
        categoricals_imputation:
            all:
                type: 'zero'
        categoricals:
            - # first 6 letters of diagnosis with at least 150 occurrences
                column: dx
                choice_query: |
                    SELECT DISTINCT left(dx, 6)       # shorten each category to 6 letters
                      FROM (
                               SELECT left(diagnosis, 6) AS dx, 
                                      count(*)
                                 FROM diagnosis_table
                                GROUP BY dx
                           ) AS dx_counts
                     WHERE count > 150
                metrics: 
                    - max
                    - sum

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Problems with feature column names exceeding Postgres' max identifier length #739

Problems with feature column names exceeding Postgres' max identifier length #739

adunmore commented May 21, 2020 •

edited

Loading

ecsalomon commented May 21, 2020

thcrock commented May 21, 2020

adunmore commented May 22, 2020

thcrock commented May 22, 2020

thcrock commented May 22, 2020

ecsalomon commented May 22, 2020

Problems with feature column names exceeding Postgres' max identifier length #739

Problems with feature column names exceeding Postgres' max identifier length #739

Comments

adunmore commented May 21, 2020 • edited Loading

ecsalomon commented May 21, 2020

thcrock commented May 21, 2020

adunmore commented May 22, 2020

thcrock commented May 22, 2020

thcrock commented May 22, 2020

ecsalomon commented May 22, 2020

adunmore commented May 21, 2020 •

edited

Loading