-
Notifications
You must be signed in to change notification settings - Fork 61
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Problems with feature column names exceeding Postgres' max identifier length #739
Comments
Thanks for flagging this. The temporary workaround is to shorten the name of your columns in your |
It's worth thinking about what Triage should do in this case. Just flag it in validation and throw an error as early as possible, or is there a reasonable way to autocorrect/autoshorten for the user? In either case, it'll probably be easier to fix it if #608 is merged, because in it the full feature names (with intervals and aggregate functions, etc) are available much earlier in the process, before the database starts doing work. |
Could we
This has the advantage of being easier than trying to figure out an approach to automatically shortening the name, and allows the user to ensure that the resulting name is meaningful/readable while being unique (compared to something like generating a hash of the source column name) |
I think #1 is one of the things @ecsalomon was alluding to, if I read it right. The aliasing you can do now. If you specify the quantity as a dict instead of a string you can call the output column whatever you want. Dirtyduckling uses this aliasing option: Instead of
|
As far as the validation goes, this ticket exists for that. A bunch of people have hit this problem, so it would be very nice to fix! As of yet it hasn't been done because it's a bit complicated to catch this before feature creation given how the feature creation code is structured right now (hence my bringing up #608) |
To be slightly more useful, here is an example feature group with my categorical workaround: -
prefix: 'dx'
from_obj: |
(SELECT entity_id,
knowledge_date,
left(diagnosis, 6) AS dx # shortened name
FROM diagnosis_table) AS dx
knowledge_date_column: 'knowledge_date'
categoricals_imputation:
all:
type: 'zero'
categoricals:
- # first 6 letters of diagnosis with at least 150 occurrences
column: dx
choice_query: |
SELECT DISTINCT left(dx, 6) # shorten each category to 6 letters
FROM (
SELECT left(diagnosis, 6) AS dx,
count(*)
FROM diagnosis_table
GROUP BY dx
) AS dx_counts
WHERE count > 150
metrics:
- max
- sum |
Triage can try to create column names longer than Postgres' 63 character max identifier length. When this happens, Postgres truncates the candidate column name.
I've seen this causing problems in the feature generation code where an expected column name doesn't match the actual name in database. For example, when triage applies imputation rules, it first queries a feature table for the list of columns containing nulls. If one of the column names returned was truncated on creation, then triage doesn't have an imputation rule to handle it.
This might also cause problems where triage tries to create duplicate columns, as when creating columns for different aggregations of the same quantity:
'[some_63_char_identifier]_sum'
and
'[some_63_char_identifier]_max'
both get truncated to the same identifier:
'[some_63_char_identifier]'
The text was updated successfully, but these errors were encountered: