Error when filtering a Series using a condition from a DataFrame #2199

lamesjaidler · 2021-09-13T11:58:45Z

I want to filter a Koalas Series based on a condition from a related Koalas DataFrame:

import databricks.koalas as ks

X = ks.DataFrame({
    'A': [1,2,3,4,5]
})
y = ks.Series([1,0,1,0,1])
y[X['A']>3]

However, running the last line gives the following error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/var/folders/hw/y20855_146x8gbvbpqmtpdp00000gq/T/ipykernel_27136/1423666562.py in <module>
      3 })
      4 y = ks.Series([1,0,1,0,1])
----> 5 y[X['A']>3]

~/venvs/argov2_dev/lib/python3.8/site-packages/databricks/koalas/series.py in __getitem__(self, key)
   6134                 # with ints, searches based on index values when the value is int.
   6135                 return self.iloc[key]
-> 6136             return self.loc[key]
   6137         except SparkPandasIndexingError:
   6138             raise KeyError(

~/venvs/argov2_dev/lib/python3.8/site-packages/databricks/koalas/indexing.py in __getitem__(self, key)
    419 
    420                 kdf[temp_col] = key
--> 421                 return type(self)(kdf[self._kdf_or_kser.name])[kdf[temp_col]]
    422 
    423             cond, limit, remaining_index = self._select_rows(key)

~/venvs/argov2_dev/lib/python3.8/site-packages/databricks/koalas/frame.py in __getitem__(self, key)
  11705 
  11706         if key is None:
> 11707             raise KeyError("none key")
  11708         elif isinstance(key, Series):
  11709             return self.loc[key.astype(bool)]

KeyError: 'none key'

This syntax works as expected with Pandas Series/DataFrames:

import pandas as pd

X = pd.DataFrame({
    'A': [1,2,3,4,5]
})
y = pd.Series([1,0,1,0,1])
y[X['A']>3]

Gives:

3    0
4    1
dtype: int64

Note that I have the following option set:

ks.set_option('compute.ops_on_diff_frames', True)

This seems like a bug? Or does it need to be carried out in a different way?

xinrong-meng · 2021-09-20T22:59:56Z

Unfortunately, the exact use case above is not supported.

However, there is a workaround: give the original Series (y) the same name as the conditioning Series.

So in the above example,

>>> X = ks.DataFrame({
...     'A': [1,2,3,4,5]
... })
>>> y = ks.Series([1,0,1,0,1], name='A')
>>> y[X['A']>3]
3    0
4    1
Name: A, dtype: int64
>>>
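
Why naming the Series helps can be read off the traceback in the original report: `indexing.py` looks the Series up in a temporary frame by its name (`kdf[self._kdf_or_kser.name]`), and `frame.py` rejects a `None` key with `KeyError("none key")`. An unnamed Series has `name` equal to `None`, so the lookup appears to fail before any filtering happens. The sketch below is a toy model in plain Python, not Koalas code; `ToyFrame`, `toy_filter`, and `__temp_col__` are hypothetical names that mimic only those two lines.

```python
class ToyFrame:
    """Hypothetical stand-in for a frame: maps column names to lists."""

    def __init__(self, columns):
        self._columns = dict(columns)

    def __getitem__(self, key):
        # Mirrors the check shown in the traceback (frame.py).
        if key is None:
            raise KeyError("none key")
        return self._columns[key]


def toy_filter(values, name, mask):
    """Filter `values` by `mask`, looking the values up by `name` first."""
    frame = ToyFrame({name: values, "__temp_col__": mask})
    column = frame[name]  # fails here when name is None
    return [v for v, keep in zip(column, frame["__temp_col__"]) if keep]


mask = [a > 3 for a in [1, 2, 3, 4, 5]]

# Named series: the lookup by name succeeds and the mask is applied.
named_result = toy_filter([1, 0, 1, 0, 1], "A", mask)
print(named_result)  # [0, 1]

# Unnamed series: the lookup by None raises before filtering.
try:
    toy_filter([1, 0, 1, 0, 1], None, mask)
except KeyError as exc:
    unnamed_error = exc.args[0]
    print("KeyError:", unnamed_error)
```

In this toy model any non-`None` name avoids the error. Whether real Koalas also requires the name to match the conditioning Series, as stated above, is not something the sketch can confirm.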

We will improve that in pandas API on Spark, under https://issues.apache.org/jira/browse/SPARK-36394.

Thanks for letting us know!

FYI @ueshin @HyukjinKwon @itholic

### What changes were proposed in this pull request?

Fix filtering a Series (without a name) by a boolean Series.

### Why are the changes needed?

A bugfix. The issue is raised as databricks/koalas#2199.

### Does this PR introduce _any_ user-facing change?

Yes.

#### From

```py
>>> psser = ps.Series([0, 1, 2, 3, 4])
>>> ps.set_option('compute.ops_on_diff_frames', True)
>>> psser.loc[ps.Series([True, True, True, False, False])]
Traceback (most recent call last):
...
KeyError: 'none key'
```

#### To

```py
>>> psser = ps.Series([0, 1, 2, 3, 4])
>>> ps.set_option('compute.ops_on_diff_frames', True)
>>> psser.loc[ps.Series([True, True, True, False, False])]
0    0
1    1
2    2
dtype: int64
```

### How was this patch tested?

Unit test.

Closes #34061 from xinrong-databricks/filter_series.

Authored-by: Xinrong Meng <[email protected]>
Signed-off-by: Takuya UESHIN <[email protected]>
(cherry picked from commit 6a5ee02)
Signed-off-by: Takuya UESHIN <[email protected]>


xinrong-meng mentioned this issue Sep 21, 2021

[SPARK-36818][PYTHON] Fix filtering a Series by a boolean Series apache/spark#34061

Closed
