TypeError: update(): incompatible function arguments. when filtering and groupby #1933
Hi, I get the following error when I create a few expressions and then call groupby. I'm not quite sure how to get around it or what breaks the code. It looks to me like there is a misalignment in the number of records present after the expressions are actually evaluated. The following produces the error:
Invoked with: <vaex.superutils.ordered_set_float64 object at 0x000002B69C786870>, array(['2022-02-16', '2022-02-16', '2022-02-16', '2022-02-16',
Replies: 1 comment 2 replies
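Hi,

Thank you for providing a fully reproducible example! Very much appreciated. If I am not mistaken, your example uncovered two bugs, but I think I have workarounds for now at least.

Problem 1.1: the apply function returns an expression of dtype float instead of a string (?!). This was causing the biggest problem. I don't know why that is; something to investigate on our end, I suppose.

Problem 1.2: the second apply is also not returning a string. Maybe there is a problem with apply functions that need to return a string dtype.

Problem 2: df.x.dt.strftime(...) returns dtype null (??) instead of the expected dtype string. Luckily there is an easy "fix" for this.

I will open issues and make tests for the problems above. Also a general note: please avoid using […]

Anyway, here is my workaround. I've left some comments so you know what I am doing and why.

import vaex

df = vaex.open('test_file.parquet')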
df = df.dropna(column_names=['lDD']) # No need to do this since you drop it in the next line
df = df.drop(['future_order_date', 'dD', 'lDD']) # No real need to drop columns since vaex works out of core anyway. This just makes them "hidden", i.e. they are not deleted. The data is untouched on disk.
# The following substitutes the first apply function
df['tmp'] = df['LT'] + '-' + df['Province'] # Create a common "key"
mapper = {'W-ON': 'A1', 'W-BC': 'A2', 'WS-AB': 'A3'} # the mapping from key to value
df['tmpLT'] = df.tmp.map(mapper=mapper, default_value='--') # Do the mapping
df['LT2'] = df.func.where(df.tmpLT == '--', df.LT, df.tmpLT) # This satisfies the "else" statement in the apply function
# notice I am not over-writing the columns. The extra columns are virtual, i.e. they are evaluated on the fly and take no memory.
# I guess you can overwrite them, but execution time and memory will be the same.
# This way you get a bit more visibility for debugging, testing, etc for free.
results_df = df.copy()
# Both pandas and vaex have this method for casting a datetime to a string in your preferred format.
# The final astype is a workaround for problem 1.2 from above. It should not be needed, and we should fix that.
results_df['oD'] = results_df.oD.dt.strftime('%Y-%m-%d').astype('str')
# This should work now.
results_df.groupby(['oD', 'LT2'], agg={'qty': vaex.agg.sum('Q')})

I hope this helps a bit. If you wanna dig in the codebase and see if you can fix any of the things above, please feel free!
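For testing the workaround, it can help to have a plain-Python reference for what the virtual columns and the final groupby are meant to compute, row by row. The sketch below does not use vaex at all. The helper names remap_lt and grouped_qty and the sample rows are made up for illustration; only the column names and the mapper come from the code above.

```python
from collections import defaultdict
from datetime import date

# Same mapping as in the workaround above.
MAPPER = {'W-ON': 'A1', 'W-BC': 'A2', 'WS-AB': 'A3'}
SENTINEL = '--'  # plays the role of default_value in the map() call


def remap_lt(lt, province):
    """Row-wise equivalent of the tmp -> tmpLT -> LT2 columns (hypothetical helper)."""
    key = f'{lt}-{province}'                        # df['tmp']
    mapped = MAPPER.get(key, SENTINEL)              # df['tmpLT']
    return lt if mapped == SENTINEL else mapped     # df['LT2']


def grouped_qty(rows):
    """Sum Q per (formatted oD, LT2) pair, mirroring the final groupby (hypothetical helper)."""
    totals = defaultdict(int)
    for row in rows:
        group = (row['oD'].strftime('%Y-%m-%d'), remap_lt(row['LT'], row['Province']))
        totals[group] += row['Q']
    return dict(totals)


# Made-up sample rows, not taken from the original parquet file.
rows = [
    {'oD': date(2022, 2, 16), 'LT': 'W', 'Province': 'ON', 'Q': 3},
    {'oD': date(2022, 2, 16), 'LT': 'W', 'Province': 'ON', 'Q': 2},
    {'oD': date(2022, 2, 16), 'LT': 'X', 'Province': 'QC', 'Q': 5},
]
result = grouped_qty(rows)
```

Comparing a small slice of the vaex result against something like this is a cheap way to confirm that LT2 and oD really are strings with the expected values before running the groupby on the full file.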
Hi,
Thank you for providing a fully reproducible example! Very much appreciated.
If I am not mistaken, your example uncovered two bugs, but I think I have workarounds for now at least.
Ok, so problem 1.1: the apply function returns an expression of dtype float instead of a string (?!). This was causing the biggest problem. I don't know why that is. Something to investigate on our end, I suppose.
Problem 1.2: the second apply is also not returning a string. Maybe there is a problem with apply functions that need to return a string dtype.
Problem 2: df.x.dt.strftime(...) returns dtype null (??) instead of the expected dtype string. Luckily there is an easy "fix" for this.
I will open issues and m…