-
Notifications
You must be signed in to change notification settings - Fork 358
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enabling binary operations with list-like Python objects. #2054
base: master
Are you sure you want to change the base?
Conversation
@@ -495,7 +495,7 @@ def __rsub__(self, other) -> Union["Series", "Index"]: | |||
return -column_op(F.datediff)(self, F.lit(other)).astype("long") | |||
else: | |||
raise TypeError("date subtraction can only be applied to date series.") | |||
return column_op(Column.__rsub__)(self, other) | |||
return column_op(lambda left, right: right - left)(self, other) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI: Column.__rsub__
doesn't support pyspark.sql.column.Column
for second parameter.
>>> kdf = ks.DataFrame({"A": [1, 2, 3, 4], "B": [10, 20, 30, 40]})
>>> sdf = kdf.to_spark()
>>> col1 = sdf.A
>>> col2 = sdf.B
>>> Column.__rsub__(col1, col2)
Traceback (most recent call last):
...
TypeError: Column is not iterable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does support:
>>> Column.__rsub__(df.id, 1)
Column<'(1 - id)'>
It doesn't work in your case above because the instance is Spark column. In practice, that wouldn't happen because it will only be called when the first operand doesn't know how to handle Spark column e.g.) 1 - df.id
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it cause any exception?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we use column_op(Column.__rsub__)(self, other)
as it is, it raises TypeError: Column is not iterable
for the case below.
>>> kser = ks.Series([1, 2, 3, 4])
>>> [10, 20, 30, 40] - kser
Traceback (most recent call last):
...
TypeError: Column is not iterable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not that this case must be handled in lines 490-492. We can move back to Column.__rsub__
.
# other = tuple with the different length | ||
other = (np.nan, 1, 3, 4, np.nan) | ||
with self.assertRaisesRegex( | ||
ValueError, "operands could not be broadcast together with shapes" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error message looks weird. Is it matched with pandas'?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original error message from pandas looks like :
ValueError: operands could not be broadcast together with shapes (4,) (8,)
@ueshin , maybe we don't include the (4,) (8,)
part since it requires to compute length of both objects which can be expensive ??
Codecov Report
@@ Coverage Diff @@
## master #2054 +/- ##
==========================================
- Coverage 94.71% 93.26% -1.45%
==========================================
Files 54 54
Lines 11503 11735 +232
==========================================
+ Hits 10895 10945 +50
- Misses 608 790 +182
Continue to review full report at Codecov.
|
@@ -813,3 +813,31 @@ def compare_disallow_null(left, right, comp): | |||
|
|||
def compare_allow_null(left, right, comp): | |||
return left.isNull() | right.isNull() | comp(left, right) | |||
|
|||
|
|||
def check_same_length(left: "IndexOpsMixin", right: Union[list, tuple]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice utility! The function name might be misleading considering its return type. Would it be possible to annotate the return type or rename the function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also could you try to reduce the amount of test codes by using loop or parameterizing if there is no difference except for the operators?
if LooseVersion(pd.__version__) < LooseVersion("1.2.0"): | ||
right = pd.Index(right, name=pindex_ops.name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens with pandas<1.2?
Seems like it's working with pandas >= 1.0 in the test?
Actually it works:
>>> pd.__version__
'1.0.5'
>>> pd.Index([1,2,3]) + [4,5,6]
Int64Index([5, 7, 9], dtype='int64')
>>> [4,5,6] + pd.Index([1,2,3])
Int64Index([5, 7, 9], dtype='int64')
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ohh,,, seems like It doesn't work for only rmod
in pandas < 1.2.
>>> [4, 5, 6] % pd.Index([1,2,3])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'Int64Index' object has no attribute 'rmod'
Let me address for only this case.
Thanks!
raise ValueError( | ||
"operands could not be broadcast together with shapes ({},) ({},)".format( | ||
len_pindex_ops, len_right | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can show the length of left
if it's less than the length of right
, but if it's greater, the actual length is unknown.
@@ -321,6 +322,9 @@ def spark_column(self) -> Column: | |||
__neg__ = column_op(Column.__neg__) | |||
|
|||
def __add__(self, other) -> Union["Series", "Index"]: | |||
if isinstance(other, (list, tuple)): | |||
pindex_ops, other = check_same_length(self, other) | |||
return ks.from_pandas(pindex_ops + other) # type: ignore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we avoid using # type: ignore
as possible? We can use cast
instead.
if isinstance(other, (list, tuple)): | ||
other = ks.Index(other, name=self.name) # type: ignore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not needed?
@@ -495,7 +495,7 @@ def __rsub__(self, other) -> Union["Series", "Index"]: | |||
return -column_op(F.datediff)(self, F.lit(other)).astype("long") | |||
else: | |||
raise TypeError("date subtraction can only be applied to date series.") | |||
return column_op(Column.__rsub__)(self, other) | |||
return column_op(lambda left, right: right - left)(self, other) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not that this case must be handled in lines 490-492. We can move back to Column.__rsub__
.
if isinstance(other, (list, tuple)): | ||
other = ks.Index(other, name=self.name) # type: ignore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not needed?
So far, Koalas doesn't support list-like Python objects for Series binary operations.
This PR enables it.
ref #2022 (comment)