TypeError: update(): incompatible function arguments. when filtering and groupby #1933

dataovoxo · 2022-02-17T05:09:30Z

Hi,

I get the following error when I create a few expressions and then call groupby. I'm not sure what breaks the code or how to get around it. It looks to me like the number of records is misaligned once the expressions are actually evaluated. The following code reproduces the error:

import pandas as pd
import vaex
import datetime

d = {'future_order_date': ([pd.to_datetime(pd.to_datetime(datetime.datetime.now()).tz_localize(None))] * 10), \
        'dD': ([None, pd.to_datetime('2022-02-18'), pd.to_datetime('2022-02-18'), None, pd.to_datetime('2022-02-18'), pd.to_datetime('2022-02-18'), pd.to_datetime('2022-02-18'), pd.to_datetime('2022-02-18'), None, pd.to_datetime('2022-02-18')]), \
        'lDD': ([None, pd.to_datetime('2022-02-18').strftime('%Y-%m-%dT%H:%M.%f')[:-7] + 'Z', pd.to_datetime('2022-02-18').strftime('%Y-%m-%dT%H:%M.%f')[:-7] + 'Z', None, pd.to_datetime('2022-02-18').strftime('%Y-%m-%dT%H:%M.%f')[:-7] + 'Z', pd.to_datetime('2022-02-18').strftime('%Y-%m-%dT%H:%M.%f')[:-7] + 'Z', pd.to_datetime('2022-02-18').strftime('%Y-%m-%dT%H:%M.%f')[:-7] + 'Z', pd.to_datetime('2022-02-18').strftime('%Y-%m-%dT%H:%M.%f')[:-7] + 'Z', None, pd.to_datetime('2022-02-18').strftime('%Y-%m-%dT%H:%M.%f')[:-7] + 'Z']),\
        'oD': [pd.to_datetime('2022-02-16')] * 10,\
        'LT': [None, 'W', 'WS', None, 'W', 'W', 'W', 'W', None, 'WS'],\
        'Province': [None, 'ON', 'AB', None, 'ON', 'ON', 'BC', 'ON', None, 'AB'],\
        'run': [0]*10,\
        'sD':[None, pd.to_datetime('2022-02-17'), pd.to_datetime('2022-02-17'), None, pd.to_datetime('2022-02-17'), pd.to_datetime('2022-02-17'), pd.to_datetime('2022-02-17'), pd.to_datetime('2022-02-17'), None, pd.to_datetime('2022-02-17')] , 
        'Q': [None, 2.0, 8.0, None, 1.0, 10.0, 5.0, 4.0, None, 1.0]
}

df = pd.DataFrame(data = d )
df.to_parquet('test_file.parquet')

df = vaex.open('test_file.parquet')
df = df.dropna(column_names=['lDD'])
df = df.drop(['future_order_date', 'dD', 'lDD'])
def map_fl(LT, Province):
    name = None
    if LT == 'W':
        if Province == 'ON':
            name = 'A1'
        elif Province == 'BC':
            name = 'A2'
    elif LT == 'WS':
        name = 'A3'
    else:
        name = LT
    return name
df['LT'] = df.apply(map_fl, ['LT', 'Province'])

results_df = df.copy()
results_df['oD'] = results_df['oD'].apply(lambda x: str(x)[0:10])
results_df.groupby(['oD', 'LT'],agg={'qty':vaex.agg.sum('Q')})

Invoked with: <vaex.superutils.ordered_set_float64 object at 0x000002B69C786870>, array(['2022-02-16', '2022-02-16', '2022-02-16', '2022-02-16',
'2022-02-16', '2022-02-16', '2022-02-16'], dtype='<U10'), -1; kwargs: chunk_size=1048576, bucket_size=4194304

JovanVeljanoski (Maintainer) · 2022-02-17T10:09:32Z

Hi,

Thank you for providing a fully reproducible example! Very much appreciated.
If I am not mistaken, your example uncovered two bugs, but I think I have workarounds, for now at least.

OK, so problem 1.1: the apply function returns an expression of dtype float instead of string (?!). This was causing the biggest problem. I don't know why that happens; it is something to investigate on our end, I suppose.

Problem 1.2: the second apply is also not returning a string. Maybe there is a problem with apply functions that need to return a string dtype.

Problem 2: df.x.dt.strftime(...) returns dtype null (??) instead of the expected dtype string. Luckily there is an easy "fix" for this.

I will open issues and make tests for the problems above.

Also, a general note: please avoid apply as much as possible when using vaex. I know it is common in pandas, but in vaex it is the last thing to try, once you have run out of other options (depending on what you are trying to do).

Anyway, here is my workaround. I've left some comments so you know what I am doing and why.

df = vaex.open('test_file.parquet')
df = df.dropna(column_names=['lDD'])   # No need to do this since you drop it in the next line
df = df.drop(['future_order_date', 'dD', 'lDD'])   # No real need to drop columns since vaex works out of core anyway. This just makes them "hidden", i.e. they are not deleted. The data is untouched on disk.

# The following substitutes the first apply function
df['tmp'] = df['LT'] + '-' + df['Province']  # Create a common "key"
mapper = {'W-ON': 'A1', 'W-BC': 'A2', 'WS-AB': 'A3'}  # the mapping from key to value
df['tmpLT'] = df.tmp.map(mapper=mapper, default_value='--')   # Do the mapping
df['LT2'] = df.func.where(df.tmpLT == '--', df.LT, df.tmpLT)  # This satisfies the "else" statement in the apply function
# Notice that I am not overwriting the columns. The extra columns are virtual, i.e. they are evaluated on the fly and take no memory.
# You could overwrite them, but execution time and memory use would be the same.
# This way you get a bit more visibility for debugging, testing, etc. for free.

results_df = df.copy()
# Both pandas and vaex have this method for casting a datetime to a string in your preferred format.
# The final astype is a workaround for problem 2 from above; it should not be needed, and we should fix that.
results_df['oD'] = results_df.oD.dt.strftime('%Y-%m-%d').astype('str')
# This should work now.
results_df.groupby(['oD', 'LT2'],agg={'qty':vaex.agg.sum('Q')})

I hope this helps a bit. If you wanna dig in the codebase and see if you can fix any of the things above, please feel free!
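
To see how the key-based mapping in the workaround relates to the original map_fl, here is a minimal plain-Python sketch. It does not use vaex at all: map_via_key and MAPPER are hypothetical names that only imitate the key / map / where steps row by row, and the sample pairs are the LT / Province values left in the example data once the None rows are dropped.

```python
# Plain-Python imitation of the two mapping strategies (no vaex involved).

def map_fl(LT, Province):
    """The original row-wise function from the question."""
    name = None
    if LT == 'W':
        if Province == 'ON':
            name = 'A1'
        elif Province == 'BC':
            name = 'A2'
    elif LT == 'WS':
        name = 'A3'
    else:
        name = LT
    return name


# Same dictionary as in the workaround.
MAPPER = {'W-ON': 'A1', 'W-BC': 'A2', 'WS-AB': 'A3'}


def map_via_key(LT, Province, default='--'):
    """Hypothetical helper: build the key, look it up, fall back to LT."""
    key = LT + '-' + Province            # mirrors df['LT'] + '-' + df['Province']
    mapped = MAPPER.get(key, default)    # mirrors .map(mapper=..., default_value='--')
    return LT if mapped == default else mapped  # mirrors df.func.where(...)


# (LT, Province) pairs from the example data after removing the None rows.
sample = [('W', 'ON'), ('WS', 'AB'), ('W', 'ON'), ('W', 'ON'),
          ('W', 'BC'), ('W', 'ON'), ('WS', 'AB')]

agree = all(map_fl(lt, p) == map_via_key(lt, p) for lt, p in sample)
print(agree)

# Combinations that do not occur in the example data behave differently:
print(map_fl('W', 'AB'), map_via_key('W', 'AB'))    # original gives None, key lookup falls back to LT
print(map_fl('WS', 'ON'), map_via_key('WS', 'ON'))  # original gives 'A3', key lookup falls back to LT
```

On the example data the two approaches agree. If your real data contains other LT / Province combinations (for example 'WS' with a province other than 'AB'), the mapper dictionary would need extra entries to match the original function.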

2 replies

JovanVeljanoski Feb 17, 2022
Maintainer

A quicker way to solve the above issues while keeping your code the same is to call
df = df.extract() right after df = df.drop(['future_order_date', 'dD', 'lDD']).

That will solve all subsequent problems. It has to do with filtering... I need to work a bit harder to expose the problem.

But between these two answers, you should be good to go.

dataovoxo Feb 17, 2022
Author

Thanks a lot Jovan
